Hi,

In <b3e54e146d346d393b16b935800076bb / ruby-forum.com>
  "Pdf Parsing Challenge" on Wed, 18 May 2011 06:04:19 +0900,
  Felipe Espinoza <fespinozacast / gmail.com> wrote:

> The Problem:
> I have to extract some data from a paper in a pdf format. I just need
> some data from the page 1, like the title of the paper, the authors
> list, the universities of these authors, their mails, the abstract and
> keywords
>
> How can I extract this data from this paper?
> http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf
>
> With a simple string that contains the information of a complete field
> (keywords, abstract, etc) would help me

  % gem install poppler
  % cat extract-data-from-paper.rb
  require 'tempfile'
  require 'open-uri'
  require 'poppler'

  ARGV.each do |url|
    # Download the PDF into a temporary file so that Poppler can open it by path.
    pdf = Tempfile.new(["extract-data-from-paper", ".pdf"])
    pdf.binmode
    open(url) do |input|
      pdf.write(input.read)
    end
    pdf.close

    # Only the first page is needed.
    document = Poppler::Document.new(pdf.path)
    title_page = document.pages.first
    text = title_page.get_text
    lines = text.lines.to_a

    # In this paper the title is on the first two lines of the page...
    title = lines[0, 2].collect(&:strip).join(" ")
    puts title
    # ...and the author list is on the next two.
    authors = lines[2, 2].collect(&:strip).join(" ")
    puts authors
    # ...
  end
  % ruby1.9 extract-data-from-paper.rb http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf
  Query Routing Process for Adapted Information Retrieval using Agents
  Angela Carrillo-Ramos2, José Gensel1, Marlène Villanova-Oliver1, Hervé Martin1, and Miguel Torres-Moreno2

Thanks,
--
kou
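
For the fields left behind the `# ...` placeholder in the script (abstract, keywords), fixed line offsets are fragile, so one possible approach is to look for label lines instead. The following is only a rough sketch: `extract_fields` is a hypothetical helper, not part of poppler or of the script above, and it assumes that the page text has lines starting with "Abstract" and "Keywords" and that the keywords stop at the first numbered heading. The sample text is made up for illustration and is not taken from the paper.

```ruby
# Hypothetical helper: split the text of a title page into fields.
# Assumes label lines that start with "Abstract" and "Keywords";
# other papers may use different labels or a different layout.
def extract_fields(text)
  lines = text.lines.map(&:strip).reject(&:empty?)

  abstract_at = lines.index { |line| line =~ /\AAbstract\b/i }
  keywords_at = lines.index { |line| line =~ /\AKeywords\b/i }
  return nil unless abstract_at && keywords_at && abstract_at < keywords_at

  # Keywords end at the first numbered heading (e.g. "1 Introduction"),
  # or at the end of the page when there is no such heading.
  rest = lines[(keywords_at + 1)..-1]
  heading_offset = rest.index { |line| line =~ /\A\d+\.?\s+[A-Z]/ }
  keywords_end = heading_offset ? keywords_at + 1 + heading_offset : lines.size

  # Join each block into one string and drop the leading label.
  abstract = lines[abstract_at...keywords_at].join(" ")
  abstract = abstract.sub(/\AAbstract\s*[.:]?\s*/i, "")
  keywords = lines[keywords_at...keywords_end].join(" ")
  keywords = keywords.sub(/\AKeywords\s*[.:]?\s*/i, "")

  {
    # Title, authors and affiliations are whatever precedes "Abstract".
    header: lines[0...abstract_at],
    abstract: abstract,
    keywords: keywords.sub(/\.\z/, "").split(/\s*[,;]\s*/)
  }
end

# Made-up sample text standing in for title_page.get_text.
sample = <<TEXT
A Made-Up Paper Title
Alice Example1 and Bob Sample2

Abstract. This is a placeholder abstract
that spans two lines.
Keywords: parsing, pdf; text extraction.
1 Introduction
Body text starts here.
TEXT

fields = extract_fields(sample)
puts fields[:abstract]
puts fields[:keywords].inspect   # ["parsing", "pdf", "text extraction"]
```

In the script, `extract_fields(text)` could be called right after `text = title_page.get_text`. It returns nil when the labels are not found, so the caller can fall back to fixed line offsets in that case.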