Simone Carletti wrote:
>
> If I'm right, both ISO-8859-1 and ISO-8859-15 belong to Latin1, thus I
> can convert them in the same way using Iconv.iconv('UTF-8', 'LATIN1',
> 'a string').join.
>

You'll probably lose the € (euro) sign from ISO-8859-15 sources, as LATIN1 is probably equivalent to ISO-8859-1.

> My goal is not to be able to detect each single different charset but
> to convert all strings from an input into UTF-8.
>

In fact... it's the same problem: if you don't know the original charset, you can't convert properly to UTF-8.

> In the meantime I was reading the code of rFeedParser, the Ruby
> implementation of Python FeedParser.
> I just discovered it depends on a project called https://rubyforge.org/projects/rchardet/
>
> I gave it a look and it seems to do exactly what I was looking for.
>
> Is anyone using this library?
>

I use chardet 0.9.0. I believe they work more or less the same. I use it as a fallback mechanism when I can't reliably get the original charset from feeds. Some feeds actually claim to be UTF-8 encoded but contain invalid code points (your database isn't happy when you try to feed it something like that...). This becomes a mess when you find out that each item in a feed may use a different charset, because people aggregate different sources without checking their charsets themselves...

The procedure I use is:

1/ Try the advertised charset with Iconv('utf-8', charset), even if charset =~ /^utf-?8$/i
   succeeds? -> END
   fails? (Exception) -> continue
2/ Use chardet to guess the charset,
3/ Iconv('utf-8', chardet_charset).

Good luck, you're in for a lot of pain...

Lionel
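
The three-step fallback described above could be sketched roughly as follows. This is only an illustration under some assumptions, not the code actually used in this thread: it relies on String#encode (built into Ruby 1.9 and later) instead of Iconv, and `detector` is a hypothetical callable standing in for chardet or rchardet, whose APIs are not shown here. The method names `to_utf8` and `transcode` are invented for the sketch.

```ruby
# Convert raw bytes to UTF-8 using the named charset.
# Raises ArgumentError or EncodingError when the conversion is not possible.
def transcode(bytes, charset)
  raise ArgumentError, "no charset given" if charset.nil?

  # force_encoding raises ArgumentError for an unknown charset name.
  source = bytes.dup.force_encoding(charset)

  if source.encoding == Encoding::UTF_8
    # UTF-8 to UTF-8 is a no-op for String#encode, so validate explicitly.
    raise ArgumentError, "invalid UTF-8 byte sequence" unless source.valid_encoding?
    return source
  end

  source.encode(Encoding::UTF_8)
end

# detector: hypothetical callable that takes the bytes and returns a
# charset name (a stand-in for a chardet-style guess).
def to_utf8(bytes, advertised_charset, detector)
  # 1/ Try the advertised charset, even when it claims to be UTF-8.
  begin
    return transcode(bytes, advertised_charset)
  rescue ArgumentError, EncodingError
    # Conversion failed: fall through to detection.
  end

  # 2/ Guess the charset, then 3/ convert using the guess.
  transcode(bytes, detector.call(bytes))
end

# Usage: a feed that claims UTF-8 but actually contains Latin-1 bytes.
always_latin1 = ->(_bytes) { "ISO-8859-1" }
puts to_utf8("caf\xE9".b, "utf-8", always_latin1)

# Byte 0xA4 is the euro sign in ISO-8859-15 but the generic currency
# sign in ISO-8859-1, which is why the two cannot be treated alike.
puts to_utf8("\xA4".b, "ISO-8859-15", always_latin1)
puts to_utf8("\xA4".b, "ISO-8859-1", always_latin1)
```

Passing the detector in as a callable keeps the sketch runnable without any gem installed; a real detector would be plugged in at that point, and its guess can of course still be wrong.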