On Apr 24, 2009, at 04:39, Adam Akhtar wrote:
>> I think regexp is the wrong way to do this.  Since this is a binary
>> file format a regexp is unlikely to give you real data.  Scanning
>> seems to work out better.  Where did you get this data?
>
> Can you tell me why regular expressions are bad for this? Although the
> text represents binary, its just text at the end of the day. And if i
> know in advance that the binary starts after a :20 and ends before a
> d\d+ is there any reason why
> /:20.+?d\d+/ wouldnt work?

(I think you mean "20:")

It will incorrectly match this stream of text, losing data:

"20:d20:d20:d20:d20:d20:d20:"

A /d\d/ could happen in the middle of that binary chunk.  You're just lucky that it hasn't shown up.
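
To make that concrete, here is a small sketch comparing the regexp with reading by length.  The sample string is made up for illustration (it is not from your file), and I'm assuming the format is "length, colon, that many bytes":

```ruby
require 'strscan'

# Hypothetical sample: a 20-byte payload that happens to contain "d2",
# followed by a "d" marker and the next length specifier.
payload = "abcd2efghijklmnopqrs" # exactly 20 bytes
stream  = "20:#{payload}d5:files"

# The non-greedy regexp stops at the first "d" + digit it finds,
# which here is inside the payload.
regexp_match = stream[/20:.+?d\d+/]

# Reading the length first and then taking that many bytes is not fooled.
s = StringScanner.new stream
len = s.scan(/\d+:/).to_i # #to_i ignores the ":"
by_length = s.peek len    # the next len bytes, wherever a "d" may be
s.pos += len              # move the scan pointer past the payload

p :regexp => regexp_match # the payload is cut short
p :length => by_length    # all 20 bytes
```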

> I looked at StringScanner but that seems to use regular experssion to
> scan though.

Yes, but they are all anchored at the front so you can choose what to do:

require 'strscan'

open 'mini-scrape.txt', 'rb' do |io|
   s = StringScanner.new io.read

   # look for any number of digits followed by a ":" at the scan pointer
   len = s.scan(/\d+:/).to_i # #to_i ignores the ":"

   # now the scan pointer has moved to the start of the binary data
   # so we can read the length of bytes out
   # m flag makes . match newlines, don't use the u flag
   data = s.scan(/.{#{len}}/m) # len is already an Integer

   p :data => data

   p :next => s.string[s.pos, 20]

   # what's next in the stream is a "d" followed by another length specifier,
   # so let's read in the "d" even though I don't know what to do with it
   case s.peek 1
   when 'd' then
     s.get_byte

   # add your own cases here for other thingys that show up.
   else
     raise "unknown thingy #{s.peek 1}"
   end

   # you'll probably want to put a loop around this, which will start over
   # reading another length specifier and a chunk of data
end

If you wrap this in a loop you can easily continue extending it until it handles your entire file.
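
A minimal sketch of what that loop might look like, assuming the file is nothing but "length:bytes" chunks with "d" markers in between (I haven't seen your whole file, so that is a guess).  read_chunks is a name I made up for illustration, it is not part of strscan:

```ruby
require 'strscan'

# Hypothetical helper: pulls every "<length>:<bytes>" chunk out of text,
# skipping any "d" markers that sit between chunks.
def read_chunks(text)
  s = StringScanner.new text
  chunks = []

  until s.eos?
    if len = s.scan(/\d+:/)
      len = len.to_i # #to_i ignores the ":"
      raise "truncated chunk at #{s.pos}" if s.rest_size < len
      chunks << s.peek(len) # the next len bytes
      s.pos += len          # move the scan pointer past them
    elsif s.peek(1) == 'd'
      s.get_byte # skip the "d"
    else
      # add your own cases above for other thingys that show up
      raise "unknown thingy #{s.peek(1).inspect} at #{s.pos}"
    end
  end

  chunks
end

p read_chunks("d5:filesd3:abc")
```

I used #peek plus moving #pos here instead of the /.{n}/m regexp so that the count is always in bytes, whatever the string's encoding.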

> What confuses me re: reg expressions is if I do something like
>
> File.open("some-file", "rb") do |data|
> text = data.read
> end
>
> text =~ /(.{20})/um
> $1
> => "d5:filesd20:\000\006ку
>
> Notice that the result doesnt show 20 characters and it doesnt end with
> the expected " that irb uses to enclose results...whys that?

This probably is the fault of your terminal.  Remember you're working in bytes (8 bits wide) not UTF-8 characters (which may be up to 6 bytes long).  One of the characters is probably overwriting the