On Apr 24, 2009, at 04:39, Adam Akhtar wrote:
>> I think regexp is the wrong way to do this. Since this is a binary
>> file format a regexp is unlikely to give you real data. Scanning
>> seems to work out better. Where did you get this data?
>
> Can you tell me why regular expressions are bad for this? Although the
> text represents binary, it's just text at the end of the day. And if I
> know in advance that the binary starts after a :20 and ends before a
> d\d+ is there any reason why
> /:20.+?d\d+/ wouldn't work?

(I think you mean "20:")

It will incorrectly match this stream of text, losing data:

"20:d20:d20:d20:d20:d20:d20:"

A /d\d/ could happen in the middle of that binary chunk. You're just
lucky that it hasn't shown up.

> I looked at StringScanner but that seems to use regular expressions to
> scan though.

Yes, but they are all anchored at the scan pointer, so you can choose
what to do at each step:

require 'strscan'

open 'mini-scrape.txt', 'rb' do |io|
  s = StringScanner.new io.read

  # look for any number of digits followed by a ":" at the scan pointer
  len = s.scan(/\d+:/).to_i # #to_i ignores the ":"

  # now the scan pointer has moved to the start of the binary data
  # so we can read len bytes out
  data = s.scan(/.{#{len}}/m) # m flag makes . match newlines, don't use the u flag

  p :data => data
  p :next => s.string[s.pos, 20]

  # what's next in the stream is a "d" followed by another length specifier,
  # so let's read in the "d" even though I don't know what to do with it
  case s.peek 1
  when 'd' then s.get_byte
  # add your own cases here for other thingys that show up.
  else
    raise "unknown thingy #{s.peek 1}"
  end

  # you'll probably want to put a loop around this, which will start over
  # reading another length specifier and a chunk of data
end

If you wrap this in a loop you can easily continue extending it until
it handles your entire file.
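
For what it's worth, here is roughly what that loop might look like. This
is only a sketch: read_chunks is a name I made up, it only knows about the
"d" marker from the sample above, and it assumes a Ruby new enough to have
String#force_encoding (1.9 or later). It takes the data with
StringScanner#peek and a pointer bump instead of a regexp, so the bytes
inside a chunk are never pattern-matched at all.

```ruby
require 'strscan'

# Hypothetical helper (not part of strscan): collects every
# length-prefixed chunk ("<len>:<bytes>") from a binary string.
def read_chunks(bytes)
  # work on a binary copy so positions and lengths are counted in bytes
  s = StringScanner.new(bytes.dup.force_encoding(Encoding::BINARY))
  chunks = []

  until s.eos?
    if (prefix = s.scan(/\d+:/))   # length specifier at the scan pointer
      len  = prefix.to_i           # String#to_i stops at the ":"
      data = s.peek(len)           # up to len bytes, without moving the pointer
      raise "truncated chunk at byte #{s.pos}" if data.bytesize < len
      s.pos += len                 # step over the chunk
      chunks << data
    else
      case s.peek(1)
      when 'd' then s.get_byte     # marker I don't know what to do with yet
      # add your own cases here for other markers that show up
      else raise "unknown thingy #{s.peek(1).inspect} at byte #{s.pos}"
      end
    end
  end

  chunks
end

p read_chunks("d5:filesd3:d12")   # => ["files", "d12"]
```

Note that the second chunk is "d12", which is exactly the kind of data
that would end a /.+?d\d+/ match too early.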

> What confuses me re: reg expressions is if I do something like
>
> File.open("some-file", "rb") do |data|
>   text = data.read
> end
>
> text =~ /(.{20})/um
> $1
> => "d5:filesd20:\000\006ку
>
> Notice that the result doesn't show 20 characters and it doesn't end
> with the expected " that irb uses to enclose results... why's that?

This is probably the fault of your terminal. Remember you're working
in bytes (8 bits wide), not UTF-8 characters (which may be up to 6
bytes long). One of the characters is probably overwriting the