On Apr 24, 2009, at 00:54, Adam Akhtar wrote:
> Thanks for all your responses.
>> I think regexp is the wrong way to do this.  Since this is a binary
>> file format a regexp is unlikely to give you real data.  Scanning
>> seems to work out better.  Where did you get this data?
>>
>
> I'm confused about binary file format. Are UTF-8 and binary file format
> two separate things? I thought binary was just represented by Unicode?

They are separate things.  UTF-8 is a text encoding: a character that
spans multiple bytes has a special bit pattern across those bytes.  A
binary file can contain any bytes in any layout.
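
To make the "special bit pattern" concrete, here is a small sketch in
Python (my choice of language for illustration; the helper names are made
up).  In UTF-8 the lead byte of a three-byte character starts with the
bits 1110 and each continuation byte starts with 10.  Arbitrary binary
bytes need not follow that pattern, so they may not decode at all.

```python
def bit_patterns(data: bytes) -> list:
    """Return each byte as an 8-character binary string."""
    return [format(b, "08b") for b in data]


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


euro = "\u20ac".encode("utf-8")   # the euro sign, three bytes in UTF-8
patterns = bit_patterns(euro)     # ['11100010', '10000010', '10101100']
binary_ok = is_valid_utf8(b"\xff\xfe")  # 0xFF never occurs in valid UTF-8
```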

> Why would the regexp trip up at the binary part if I tell it the
> encoding is UTF-8?

The encoding doesn't matter.  In a binary file you have no guarantee
that one of your markers won't show up in the middle of a binary chunk.
There's no reason "20:" or "8:" or anything else couldn't show up inside
a chunk of random data.
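
A quick sketch of the problem, again in Python and with made-up sample
bytes rather than anything from the real file: the payload below happens
to contain "3:", and a regexp looking for "digits followed by a colon"
reports it as if it were a marker.

```python
import re

# Hypothetical sample: one chunk announced by "8:", whose 8-byte payload
# happens to contain the bytes "3:" (made-up data for illustration).
data = b"8:ab3:cdef"

# A regexp hunting for "<digits>:" cannot tell a real marker from payload
# bytes that merely look like one.
marker_offsets = [m.start() for m in re.finditer(rb"\d+:", data)]
# Offset 0 is the real marker; offset 4 is inside the payload.
```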

> Also, with read(), isn't that dangerous with Unicode text? Can I assume
> that all characters are only 1 byte wide?

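Correct, that would be dangerous with Unicode text, since you can't
assume every character is one byte wide.  But I don't think this file is
in any Unicode encoding.  The individual chunks of binary data may be,
but overall the file appears not to be.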
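
For what it's worth, this is the danger in miniature.  The sketch uses
Python's byte-mode read(), which counts bytes; I'm assuming the read() you
are using counts bytes too.  The sample word is mine.

```python
import io

text = "na\u00efve"              # 5 characters; the "i" carries a diaeresis
raw = text.encode("utf-8")       # 6 bytes: that one character takes two
stream = io.BytesIO(raw)
first = stream.read(3)           # b"na\xc3": stops halfway through it

try:
    first.decode("utf-8")
    split_character = False
except UnicodeDecodeError:
    split_character = True       # the truncated character cannot decode
```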

> The file is bencoded (I think it's like YAML in some respects).

Yes, bencode is a binary file format that is like YAML in some respects.
In this case you have the "20:", "8:", etc. that tell you how many bytes
to read next (I'm guessing).
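
If that guess is right, reading the file means following the lengths
rather than searching for markers.  Here is a minimal sketch in Python,
assuming the data is nothing but a run of "<length>:<payload>" chunks.
The function name and sample bytes are made up, and real bencode also has
integers, lists and dictionaries, which this ignores.

```python
def read_chunks(data: bytes) -> list:
    """Split a run of '<length>:<payload>' chunks by following the lengths."""
    chunks = []
    pos = 0
    while pos < len(data):
        colon = data.index(b":", pos)      # end of the decimal length
        length = int(data[pos:colon])      # e.g. b"20" -> 20
        start = colon + 1
        if start + length > len(data):
            raise ValueError("chunk length runs past end of data")
        chunks.append(data[start:start + length])
        pos = start + length               # jump straight over the payload
    return chunks


# The "3:" inside the first payload is skipped, never mistaken for a marker.
chunks = read_chunks(b"8:ab3:cdef4:spam")
```

Because the reader jumps over each payload, it never looks at the bytes
inside it, which is why a stray "3:" in there does no harm.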