On Thu, Jul 1, 2010 at 7:03 AM, Robert Klemme
<shortcutter / googlemail.com> wrote:
> 2010/7/1 Michael Fellinger <m.fellinger / gmail.com>:
>> On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke
>> <stuart.clarke1986 / gmail.com> wrote:
>>> Could anyone advise me on a fast way to search a single but very large
>>> file (1 GB) for a string of text? Also, is there a library to
>>> identify the file offset at which this string was found?
>>
>> You can use IO#grep like this:
>> File.open('qimo-2.0-desktop.iso', 'r:BINARY'){|io|
>> io.grep(/apiKey/){|m| p io.pos => m } }
>>
>> The pos is where the matching line (m) ended, so subtract m.bytesize and
>> add the match's index within m to get the offset of the match.
>> The above example was a 700 MB file; it took around 40s the first
>> time, 2s subsequently, so disk I/O is the limiting factor in terms of
>> speed (as usual).
>
> If you only need to know whether the string occurs in the file you can do
> found = File.foreach("foo").any? {|line| /apiKey/ =~ line}
> This will stop searching as soon as the sequence is found.
>
> "fgrep -l foo" is likely faster.

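On the offset part of the question: IO#grep comes from Enumerable and walks the file line by line, so inside the block io.pos sits just past the end of the current line, not at the match. A minimal sketch of tracking offsets yourself, assuming the pattern is a Regexp and the file actually contains newlines (line_match_offsets is a name I made up, not a library method):

```ruby
# Hypothetical helper (not part of Ruby's stdlib): returns the byte offset of
# every match of the Regexp `pattern`, scanning the file one line at a time.
def line_match_offsets(path, pattern)
  offsets = []
  File.open(path, 'rb') do |io|          # binary mode: char index == byte index
    line_start = 0                       # byte offset where the current line begins
    io.each_line do |line|
      # $~ is the MatchData for the current match inside the scan block
      line.scan(pattern) { offsets << line_start + $~.begin(0) }
      line_start += line.bytesize
    end
  end
  offsets
end
```

Something like line_match_offsets('qimo-2.0-desktop.iso', /apiKey/) would be the call. A binary file with no newlines is read as one giant line, so this only stays memory-friendly for line-oriented data.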
irb> `fgrep -l waters /usr/share/dict/words`.size > 0
=> true
irb> `fgrep -l watershed /usr/share/dict/words`.size > 0
=> true
irb> `fgrep -l watershedz /usr/share/dict/words`.size > 0
=> false
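If all you want is the yes/no answer, checking grep's exit status avoids capturing any output. This is only a sketch: file_contains? is a hypothetical name, and it assumes a POSIX grep on the PATH (grep -F is what fgrep is shorthand for, and -q makes it exit quietly at the first match).

```ruby
# Hypothetical helper: true if the fixed string `needle` occurs in the file.
# Arguments are passed as a list, so no shell is involved and `needle`
# needs no quoting; `--` keeps a needle starting with "-" from being
# parsed as an option.
def file_contains?(path, needle)
  # system returns true on exit status 0, false on non-zero,
  # and nil if grep could not be run at all.
  system('grep', '-qF', '--', needle, path) == true
end
```

Note that a missing grep also comes back as false here, so check for nil separately if you need to tell those cases apart.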

irb> `fgrep -ob waters /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> ["153088", "153102", "204143", "234643", "472357", "856441",
"913606", "913613", "913623", "913635", "913646", "913656", "913668",
"913679", "913690", "913703"]
irb> `fgrep -ob watershed /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> ["913613", "913623", "913635"]
irb> `fgrep -ob watershedz /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> []
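
If shelling out is not an option, roughly the same offsets can be collected in pure Ruby by reading the file in fixed-size chunks and carrying a short tail across chunk boundaries. This is a sketch under my own assumptions rather than anything from the thread: each_offset and the 1 MB default chunk size are made up, and unlike fgrep it also reports overlapping matches.

```ruby
# Hypothetical helper: yields the byte offset of every occurrence of the
# fixed string `needle` in the file at `path`, without loading the whole file.
def each_offset(path, needle, chunk_size = 1 << 20)
  return enum_for(:each_offset, path, needle, chunk_size) unless block_given?
  raise ArgumentError, 'needle must not be empty' if needle.empty?

  needle = needle.b                 # compare as raw bytes
  keep   = needle.bytesize - 1      # tail kept so boundary-spanning matches are seen
  File.open(path, 'rb') do |io|
    buf  = String.new               # binary buffer: kept tail + current chunk
    base = 0                        # file offset of buf's first byte
    while (chunk = io.read(chunk_size))
      buf << chunk
      i = 0
      while (i = buf.index(needle, i))
        yield base + i
        i += 1                      # step by one byte, so overlaps are reported
      end
      # The kept tail is shorter than the needle, so no match is yielded twice.
      drop = [buf.bytesize - keep, 0].max
      buf  = buf.byteslice(drop, keep)
      base += drop
    end
  end
end
```

Usage would be along the lines of each_offset('/usr/share/dict/words', 'watershed').to_a, which returns integers rather than the strings the fgrep one-liner gives you.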