On Apr 24, 2009, at 00:54, Adam Akhtar wrote:

> Thanks for all your responses.
>
>> I think regexp is the wrong way to do this. Since this is a binary
>> file format a regexp is unlikely to give you real data. Scanning
>> seems to work out better. Where did you get this data?
>
> I'm confused about binary file format. Are UTF-8 and binary file format
> two separate things? I thought binary was just represented by Unicode?

They are separate things. UTF-8 is a text encoding: a character that spans multiple bytes has a special bit pattern across those bytes. A binary file can have any format, so its bytes need not follow that pattern at all.

> Why would the regexp trip up at the binary part if I tell it the
> encoding is UTF-8?

It doesn't matter what the encoding is. In a binary file you don't have any guarantee that one of your markers won't show up in the middle of a binary chunk. There's no reason "20:" or "8:" or anything else couldn't show up inside a chunk of random data.

> Also with read() isn't that dangerous with Unicode text? Can I assume
> that all characters are only 1 byte wide?

Correct, that would be dangerous with Unicode text, where you can't assume every character is 1 byte wide. But I don't think this file is in any Unicode encoding. The individual chunks of binary data may be, but overall the file appears not to be.

> The file is bencoded (I think it's like YAML in some respects).

Yes, a binary file format can be like YAML in that respect. In this case you have the "20:", "8:", etc. that tell you how far to read (I'm guessing).
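
To make the scanning idea concrete, here is a minimal sketch. It is in Python purely for illustration, since no code appears in this message, and it assumes the data is just a run of "<length>:<bytes>" chunks as guessed above. The function name and the sample bytes are made up, and real bencode also has integers, lists and dictionaries, which this ignores. The sample payload deliberately contains "3:" so you can see a regexp pick up a marker that isn't one.

```python
import re


def read_byte_strings(data):
    """Scan a run of length-prefixed chunks such as b"4:spam".

    Hypothetical helper: handles only "<decimal length>:<raw bytes>".
    """
    chunks = []
    pos = 0
    while pos < len(data):
        colon = data.index(b":", pos)                    # end of the length prefix
        length = int(data[pos:colon].decode("ascii"))    # b"8" -> 8
        start = colon + 1
        end = start + length
        if end > len(data):
            raise ValueError("chunk runs past end of data")
        chunks.append(data[start:end])                   # raw bytes, not decoded as text
        pos = end                                        # skip the payload entirely
    return chunks


# Made-up sample: an 8-byte payload that happens to contain b"3:",
# followed by a 4-byte payload.
sample = b"8:ab3:cd\xff\x00" + b"4:spam"

scanned = read_byte_strings(sample)

# A regexp looking for "<digits>:" also matches inside the first payload.
regex_hits = re.findall(rb"(\d+):", sample)

# Character count and byte count differ for multi-byte UTF-8 text.
e_acute = "\u00e9"
char_count = len(e_acute)
byte_count = len(e_acute.encode("utf-8"))

print(scanned)
print(regex_hits)
print(char_count, byte_count)
```

The scanner never looks inside a payload, so whatever bytes are in there cannot be mistaken for a marker. The regexp sees three "markers" where there are only two chunks.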