On 25/02/2008, Simone Carletti <weppos / gmail.com> wrote:
> > I use it as a fallback mechanism when I can't reliably get the original
> > charset from feeds.
>
> That's a great example, thank you.
> Unfortunately I don't have a real charset header to check. :( I must
> rely only on input string.

You can ask a crystal ball as well.

The multibyte encodings can often be distinguished by their structure -
UTF-8, perhaps UTF-16, the Asian encodings. If something passes for a
valid string in a multibyte encoding, it very likely is a string in that
encoding.

However, the Latin 8-bit encodings are all the same - 7-bit ASCII with
some mess attached in the upper 128 characters. By converting from any
of these you get perfectly valid UTF-8, but different gibberish each time.

You can sometimes tell the ISO variant from the Windows variant, because
the ISO encodings keep control characters in the range 0x80-0x9F where
the Windows code pages have printable characters - and control characters
should not appear in text. But that does not help you at all - you still
don't know which of the Latin encodings you got.

If you know the language (and it's one of the few supported) you can use
enca. If the language is not supported you can do the filter yourself -
basically you collect the set of accented (with the 8th bit set)
characters in your language, and encode them in the different encodings
(the DOS and Windows code pages, the ISO encoding, any other legacy
encodings). You get sets of bytes that will usually overlap but will
contain some unique bytes. When you see such a byte you know which
encoding you should use.

Good luck :-)

Michal
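
The do-it-yourself filter described above could be sketched in Python
roughly like this. It is only an illustration under assumptions: the
function names are made up, Czech is picked as the example language, and
the three candidate encodings (ISO-8859-2, Windows-1250, DOS code page
852) are just one plausible set of legacy encodings for it. The UTF-8
check comes first because, as noted, a string that decodes cleanly as a
multibyte encoding very likely is in that encoding.

```python
# Hypothetical sketch: names, sample letters and candidate encodings are
# assumptions chosen for illustration, not part of any library.

# Accented letters of the example language (Czech), lower and upper case.
ACCENTED = "áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ"

# Legacy 8-bit encodings the text might plausibly be in.
CANDIDATES = ["iso8859_2", "cp1250", "cp852"]


def is_valid_utf8(data: bytes) -> bool:
    """Structural check: does the input decode as strict UTF-8?"""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def unique_bytes(chars, encodings):
    """Map each encoding to the high bytes that only it produces for chars."""
    high = {}
    for enc in encodings:
        seen = set()
        for ch in chars:
            try:
                encoded = ch.encode(enc)
            except UnicodeEncodeError:
                continue  # this letter does not exist in this encoding
            seen.update(b for b in encoded if b >= 0x80)
        high[enc] = seen
    result = {}
    for enc, seen in high.items():
        # Bytes that any other candidate also uses for some accented letter.
        others = set()
        for other_enc, other_seen in high.items():
            if other_enc != enc:
                others |= other_seen
        result[enc] = seen - others
    return result


def guess_encoding(data: bytes, table):
    """Return the single candidate whose unique bytes occur in data, else None."""
    if is_valid_utf8(data):
        return "utf-8"  # also covers plain 7-bit ASCII input
    present = set(data)
    hits = [enc for enc, marks in table.items() if marks & present]
    return hits[0] if len(hits) == 1 else None


TABLE = unique_bytes(ACCENTED, CANDIDATES)
```

Usage would be something like guess_encoding(raw_bytes, TABLE). A result
of None means no telltale byte showed up (or more than one candidate
matched), so you are back to the crystal ball. The guess is only as good
as the assumption that the text really is in the chosen language.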