Hi,

In <b3e54e146d346d393b16b935800076bb / ruby-forum.com>
  "Pdf Parsing Challenge" on Wed, 18 May 2011 06:04:19 +0900,
  Felipe Espinoza <fespinozacast / gmail.com> wrote:

> The Problem:
> I have to extract some data from a paper in PDF format. I just need
> some data from page 1, like the title of the paper, the authors
> list, the universities of these authors, their mails, the abstract and
> keywords.
> How can I extract this data from this paper?
> http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf
> Even a simple string that contains the information of a complete field
> (keywords, abstract, etc.) would help me.

% gem install poppler
% cat extract-data-from-paper.rb
require 'tempfile'
require 'open-uri'
require 'poppler'

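ARGV.each do |url|
  # Download the PDF into a temporary file so that poppler can read it.
  pdf = Tempfile.new(["extract-data-from-paper", ".pdf"])
  pdf.binmode
  # open-uri lets Kernel#open read URLs here; on Ruby 3.0 and later
  # use URI.open(url) instead.
  open(url) do |input|
    pdf.write(input.read)
  end
  pdf.close

  document = Poppler::Document.new(pdf.path)
  title_page = document.pages.first
  text = title_page.get_text
  lines = text.lines.to_a
  # In this paper the title takes the first two lines of page 1
  # and the authors list takes the next two.
  title = lines[0, 2].collect(&:strip).join(" ")
  puts title
  authors = lines[2, 2].collect(&:strip).join(" ")
  puts authors
  # ...
end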
% ruby1.9 extract-data-from-paper.rb http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf
Query Routing Process for Adapted Information Retrieval using Agents
Angela Carrillo-Ramos2, Jérôme Gensel1, Marlène Villanova-Oliver1, Hervé Martin1, and Miguel Torres-Moreno2
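
The remaining fields (abstract, keywords, ...) could probably be cut
out of the same text with marker strings. The following is only a
sketch under assumptions: extract_field is a hypothetical helper (not
part of poppler), and it assumes that the page text contains literal
"Abstract" and "Keywords" markers, which may not hold for this paper.
The sample text is made up and stands in for title_page.get_text.

```ruby
# Hypothetical helper: returns the text between start_marker and the
# first of stop_markers that follows it (or the end of the text),
# with line breaks collapsed into single spaces.
# Returns nil if start_marker is not found.
def extract_field(text, start_marker, stop_markers = [])
  start = text.index(start_marker)
  return nil unless start
  body_start = start + start_marker.length
  stops = stop_markers.map { |marker| text.index(marker, body_start) }.compact
  body_end = stops.min || text.length
  # Drop punctuation right after the marker ("Abstract." or "Keywords:").
  text[body_start...body_end].sub(/\A[\s.:]+/, "").split.join(" ")
end

# Made-up sample text, NOT the content of the real paper.
sample = <<-EOT
Some Title
Abstract. First line of the abstract
second line of the abstract.
Keywords: alpha, beta,
gamma
1 Introduction
Body text.
EOT

abstract = extract_field(sample, "Abstract", ["Keywords"])
keywords = extract_field(sample, "Keywords", ["1 Introduction"])
puts abstract
puts keywords
```

In the script above, you would pass text instead of sample. The marker
strings will need to be adjusted to whatever the paper actually uses.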


Thanks,
--
kou