On Sun, 23 Apr 2006, Mateo Barraza wrote:
> I'm fairly new to the Ruby scene.
> Is there any library that can read MS Word (.doc) files and extract the pure
> text...what about libs for PDF files?

Hi,

I don't know of a Ruby library that makes it easy to extract the plain
text from MS Word files, but the OLE suggestion is a good idea. Another
option, if you have Word 2003, is to save the document as WordprocessingML
(XML) and parse it with REXML or libxml-ruby (two Ruby XML libraries), or
transform it with XSLT. Once you have the XML, the nodes that matter (if
you really only want the text) are the 'w:t' ones.
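
Here is a rough sketch of the REXML route, in case it helps. It assumes
the file was saved from Word 2003 as XML and that the 'w' prefix maps to
the Word 2003 WordprocessingML namespace URI shown below (check the root
element of your own file). The method name extract_text is made up for
illustration, and the sketch assumes each 'w:p' element is one paragraph.

```ruby
require 'rexml/document'

# Assumed namespace URI for Word 2003 WordprocessingML; compare it with the
# xmlns:w attribute on the root element of your saved file.
WORDML_NS = 'http://schemas.microsoft.com/office/word/2003/wordml'

# Hypothetical helper: returns the plain text of a WordprocessingML string,
# one line per w:p (paragraph) element.
def extract_text(xml)
  doc = REXML::Document.new(xml)
  ns  = { 'w' => WORDML_NS }

  paragraphs = REXML::XPath.match(doc, '//w:p', ns)
  paragraphs.map do |para|
    # A paragraph is split into runs; each run keeps its text in w:t.
    REXML::XPath.match(para, './/w:t', ns).map { |t| t.text.to_s }.join
  end.join("\n")
end
```

You would call it with something like
puts extract_text(File.read('report.xml')), where report.xml is just a
placeholder name for whatever you saved from Word.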


HTH,
Keith