> > What exactly are "header lines" in your case?  Without knowing the
> > format and your parsing / processing requirements it's difficult to
> > come up with suggestions.
>
> A header line starts with ">"
>
> First Entry, first file:
> >1_4_138_F5-P2
> 234234234234234
>
> First Entry, second file:
> >1_4_138_F3
> 234234234234234
>
> I have two large files (several gigs) that I need to read with a bunch of
> "entries" as shown above. I need to match the header lines; in this
> case I need to make sure "1_4_138_" is the same in both entries and, if
> it is, write those entries with matching headers to new separate files.

So do you want to use the first file only as a template for checking,
and write out only the matching sections from the second file?  In
other words, is this what you want conceptually?

valid_sections = read_headers(file_1)

for each section in file_2
  if section in valid_sections
    print to file_3


On Wed, Sep 14, 2011 at 3:26 AM, Cyril J. <cyril.varghese.jose / gmail.com> wrote:
>> Do you have guaranteed ordering for the header lines in each file?
>
> The headers should be in the same order in both files. Some headers are
> going to be missing in the second file however, which is why I need to
> do the check. If a header doesn't match one from the first file, then
> those two headers don't get written out to the file.

Here's an implementation of the algorithm above:

require 'set'

headers = Set.new

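# file_1, file_2 and file_3 are placeholders for the actual path names.
# Collect the key (e.g. "1_4_138_") of every header line in the first file.
File.foreach file_1 do |line|
  line =~ %r{^>((?:\d+_){3})} and headers << $1
end

File.open file_3, "w" do |out|
  do_print = false

  File.foreach file_2 do |line|
    # A header line decides whether the section it starts gets printed.
    if line =~ %r{^>((?:\d+_){3})}
      do_print = headers.include? $1
    end

    out.puts line if do_print
  end
end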
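
If you want to try this on a small sample first, here is a self-contained sketch of the same idea wrapped in a method. The method name filter_sections, the constant HEADER_KEY, the temporary file names and the sample data are all made up for illustration, and I am assuming that the key is always three "digits plus underscore" groups right after the ">".

```ruby
require 'set'
require 'tmpdir'

# Assumed key format: three "digits + underscore" groups after ">",
# e.g. "1_4_138_" in ">1_4_138_F3".
HEADER_KEY = /\A>((?:\d+_){3})/

# Hypothetical helper: copies every entry of input_path whose header key
# also occurs in template_path to output_path.
# Returns the number of entries written.
def filter_sections(template_path, input_path, output_path)
  keys = Set.new
  File.foreach(template_path) do |line|
    keys << $1 if line =~ HEADER_KEY
  end

  written = 0
  File.open(output_path, "w") do |out|
    keep = false
    File.foreach(input_path) do |line|
      if line =~ HEADER_KEY
        keep = keys.include?($1)
        written += 1 if keep
      end
      out.print line if keep
    end
  end
  written
end

# Made-up sample data: only the "1_4_138_" entry occurs in both files.
Dir.mktmpdir do |dir|
  first  = File.join(dir, "first.txt")
  second = File.join(dir, "second.txt")
  result = File.join(dir, "result.txt")

  File.write(first,  ">1_4_138_F5-P2\n1111\n>1_4_139_F5-P2\n2222\n")
  File.write(second, ">1_4_138_F3\n3333\n>1_4_200_F3\n4444\n")

  filter_sections(first, second, result)
  puts File.read(result)
end
```

With the sample data above only the ">1_4_138_F3" entry and its data line end up in the result.  Both files are read line by line, so only the set of keys from the first file has to fit in memory.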

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/