This version reads farther ahead in an attempt to cope
with greedy regular expressions.
=begin
Unlike Gawk and Mawk, Ruby won't accept a regular expression as a
record-separator. Let's fix that. The substring matched by the
record-separator is automatically removed from the record, but it
can be retrieved afterwards with RecSep#terminator.

Typical usage:

  File.open("stuff.txt"){|handle|
    reader = RecSep.new( handle, /^\d+\.\n/ )
    reader.each {|x| p x }
  }
=end
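class RecSep
  include Enumerable

  attr_reader :terminator, :buffer

  def initialize( file_handle, record_separator, chunk_size=10_000 )
    @handle = file_handle
    @rec_sep = record_separator
    @chunk_size = chunk_size
    @buffer = +""   # unary + keeps it mutable under frozen_string_literal
    @terminator = nil
  end

  ## Returns the next record without its separator, or nil at end of input.
  def get_rec
    ## The record-separator may be something like /\n\s*\n/, whose match
    ## could still grow when more input arrives, so we keep reading until
    ## something is left over in the buffer after the match (or until
    ## the input runs out).
    match = nil
    loop do
      match = @rec_sep.match( @buffer )
      break if match && !match.post_match.empty?
      s = @handle.read( @chunk_size )
      break if not s
      @buffer << s
    end
    if match
      @buffer = match.post_match
      @terminator = match.to_s
      match.pre_match
    else
      ## No separator left: hand back whatever remains as the last record.
      @terminator = nil
      return nil if @buffer.empty?
      s, @buffer = @buffer, +""
      s
    end
  end

  def each
    return enum_for( :each ) unless block_given?
    while s = get_rec
      yield s
    end
  end
end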
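
The read-ahead rule is easiest to see with a tiny chunk size. What follows is only a sketch: split_records is a hypothetical stand-alone helper, not part of RecSep, that applies the same rule to a StringIO and collects [record, terminator] pairs. With a chunk size of 3, the first chunk is "a\n\n", which already matches /\n\s*\n/ but leaves nothing after the match, so the helper reads on and ends up with the full three-newline separator.

```ruby
require "stringio"

# Hypothetical helper for illustration only (not part of RecSep).
# Returns an array of [record, terminator] pairs read from io.
def split_records( io, separator, chunk_size )
  buffer = +""
  pairs = []
  loop do
    match = separator.match( buffer )
    # Trust a match only when text follows it; otherwise a greedy
    # separator might still grow once the next chunk arrives.
    unless match && !match.post_match.empty?
      chunk = io.read( chunk_size )
      if chunk
        buffer << chunk
        next
      end
    end
    if match
      pairs << [match.pre_match, match.to_s]
      buffer = match.post_match
    else
      # End of input with no separator left: emit the remainder, if any.
      pairs << [buffer, nil] unless buffer.empty?
      break
    end
  end
  pairs
end

text = "a\n\n\nb\n\nc"
p split_records( StringIO.new( text ), /\n\s*\n/, 3 )
```

This should print [["a", "\n\n\n"], ["b", "\n\n"], ["c", nil]]. Like RecSep itself, the sketch assumes the separator never matches the empty string.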