On Jul 6, 8:22 pm, Janus Bor <ja... / urban-youth.com> wrote:
> Hello everyone,
>
> I'm pretty new to Ruby and programming in general. Here's my problem:
>
> I'm writing a program that will automatically download protein sequences
> from a server and write them into the corresponding file. Every single
> sequence has a unique id and I have to eliminate duplicates. However, as
> the number of sequences might exceed 50 000, I can't simply save all
> sequences in a hash (with their id as key) and then write them to hd
> after downloading has finished. So my idea is to write every sequence to
> the corresponding file immediately, but first I have to check if it has
> been processed already.
>
> I could save all processed ids in an array and then check if the array
> includes my current id:
>
> sequences = []
> some kind of loop magic
>  unless sequences.include?(id)
>   process file
>   sequences << id
>  end
> end
>
> But I suspect that sequences.include?(id) would iterate over the whole
> array until it finds a match. As this array might have up to 50 000
> positions and I will have to do this check for every sequence, this
> would probably be very inefficient.
>
> I could also save all processed ids as keys of a hash, however I don't
> have any use for a value:
>
> sequences = {}
> some kind of loop magic
>  unless sequences[id]
>   process file
>   sequences[id] = true
>  end
> end
>
> Would this method be more efficient? Is there a more elegant way? Also,
> can Ruby handle arrays/hashes of this size?
>
> Thanks in advance!
> --
> Posted via http://www.ruby-forum.com/.

Have you looked at BioRuby + BioSQL?
You can fetch a sequence from a server and dump it directly into the
database. You can choose MySQL, PostgreSQL or SQLite.

OK, it's not well coded, but it works:
  require 'bio'
  require 'pp'

  # Assumes the BioSQL connection is already established, that `db` is the
  # target biodatabase record, and that ARGV.flags is provided by a
  # command-line option library loaded elsewhere.
  server = Bio::Fetch.new('http://www.ebi.ac.uk/cgi-bin/dbfetch')
  ARGV.flags.accession.split.each do |accession|
    puts accession
    if Bio::SQL.exists_accession(accession)
      puts "Entry #{accession} already exists!"
    else
      entry_str = server.fetch('embl', accession, 'raw', 'embl')

      if entry_str == "No entries found\. \n"
        $stderr.puts "Error: no entry #{accession} found. #{entry_str}"
      else
        puts "Downloaded!"
        puts "Loading..."
        puts "Converting EMBL obj..."
        entry = Bio::EMBL.new(entry_str)
        puts "Converting Biosequence obj..."
        biosequence = entry.to_biosequence
        puts "Saving Biosequence into Bio::SQL::Sequence database"
        # Check again by primary accession; result stays nil if the
        # sequence is already stored.
        result = unless Bio::SQL.exists_accession(biosequence.primary_accession)
                   Bio::SQL::Sequence.new(:biosequence => biosequence,
                                          :biodatabase_id => db.id)
                 end
        puts entry.entry_id
        if result.nil?
          pp "The sequence is already present in the BioSQL database"
        else
          pp "Stored."
        end
      end # not found on web
    end # bioentry exists
  end # list accession
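
If you would rather stay in plain Ruby without a database, the membership
check from your question can be done with a Set from the standard library,
which is backed by a hash, so each lookup does not scan the whole collection
and 50 000 ids is no problem. A minimal sketch, not tested against your
data: each_unique_sequence and the sample records are made up for
illustration, and I'm assuming your download loop can hand over
[id, sequence] pairs.

```ruby
require 'set'

# Hypothetical helper: yields each [id, sequence] pair only the first time
# its id is seen. `records` is any enumerable of [id, sequence] pairs.
# Returns the number of distinct ids.
def each_unique_sequence(records)
  seen = Set.new
  records.each do |id, sequence|
    # Set#add? returns nil when the id is already present, so the
    # membership test and the insert happen in a single call.
    next unless seen.add?(id)
    yield id, sequence
  end
  seen.size
end

# Made-up sample data standing in for the downloaded entries.
records = [['P12345', 'MKV'], ['Q67890', 'MTA'], ['P12345', 'MKV']]
each_unique_sequence(records) { |id, seq| puts "#{id}: #{seq}" }
```

Inside the block you would write the sequence to its file instead of
printing it. Only the ids are kept in memory, never the sequences.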

PS: I still need to write docs about BioSQL and Ruby. Sorry, my fault.

--
Ra