Hi,

In <b3e54e146d346d393b16b935800076bb / ruby-forum.com>
  "Pdf Parsing Challenge" on Wed, 18 May 2011 06:04:19 +0900,
  Felipe Espinoza <fespinozacast / gmail.com> wrote:

> The Problem:
>
> I have to extract some data from a paper in PDF format. I just need
> some data from page 1, like the title of the paper, the authors
> list, the universities of these authors, their e-mail addresses, the
> abstract and keywords.
>
> How can I extract this data from this paper?
> http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf
>
> A simple string that contains the information of a complete field
> (keywords, abstract, etc.) would help me.

% gem install poppler
% cat extract-data-from-paper.rb
require 'tempfile'
require 'open-uri'
require 'poppler'

ARGV.each do |url|
  pdf = Tempfile.new(["extract-data-from-paper", ".pdf"])
  pdf.binmode
  open(url) do |input|
    pdf.write(input.read)
  end
  pdf.close

  document = Poppler::Document.new(pdf.path)
  title_page = document.pages.first
  text = title_page.get_text
  lines = text.lines.to_a
  title = lines[0, 2].collect(&:strip).join(" ")
  puts title
  authors = lines[2, 2].collect(&:strip).join(" ")
  puts authors
  # ...
end
% ruby1.9 extract-data-from-paper.rb http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf
Query Routing Process for Adapted Information Retrieval using Agents
Angela Carrillo-Ramos2, Jérôme Gensel1, Marlène Villanova-Oliver1, Hervé Martin1, and Miguel Torres-Moreno2
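
The "# ..." part of the script could be filled in along the following lines for fields such as the abstract and the keywords. This is only a sketch: extract_section is a hypothetical helper, not part of poppler, and it assumes that the page text contains label lines such as "Abstract" and "Keywords". That is a guess about the layout and has to be checked against the real output of get_text.

```ruby
# Hypothetical helper (not part of poppler): returns the text that follows
# a label such as "Abstract", cut off at the first of the given stop labels.
# Assumes each label starts a line in the extracted page text.
def extract_section(text, label, stop_labels = [])
  # Find the label at the start of a line, with optional "." or ":" after it.
  match = text.match(/^\s*#{Regexp.escape(label)}\b[.:]?\s*/i)
  return nil unless match

  body = match.post_match
  # Trim the body at any stop label that starts a later line.
  stop_labels.each do |stop|
    stop_match = body.match(/^\s*#{Regexp.escape(stop)}\b/i)
    body = stop_match.pre_match if stop_match
  end

  # Collapse the remaining lines into a single string.
  body.lines.map(&:strip).reject(&:empty?).join(" ")
end
```

Inside the ARGV.each block it could be called as extract_section(text, "Abstract", ["Keywords"]). The stop label for the keywords (for example the heading of the first section) also depends on the paper, so it is left to the caller.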


Thanks,
--
kou