On Jul 6, 8:22 pm, Janus Bor <ja... / urban-youth.com> wrote:
> Hello everyone,
>
> I'm pretty new to Ruby and programming in general. Here's my problem:
>
> I'm writing a program that will automatically download protein sequences
> from a server and write them into the corresponding file. Every single
> sequence has a unique id and I have to eliminate duplicates. However, as
> the number of sequences might exceed 50 000, I can't simply save all
> sequences in a hash (with their id as key) and then write them to hd
> after downloading has finished. So my idea is to write every sequence to
> the corresponding file immediately, but first I have to check if it has
> been processed already.
>
> I could save all processed id's in an array and then check if the array
> includes my current id:
>
> sequences = []
> some kind of loop magic
>   if sequences.include?(id)
>     process file
>     sequences << id
>   end
> end
>
> But I suspect that sequences.include?(id) would iterate over the whole
> array until it finds a match. As this array might have up 50 000
> positions and I will have to do this check for every sequence, this
> would probably be very inefficient.
>
> I could also save all processed id's as keys of a hash, however I don't
> have any use for a value:
>
> sequences = {}
> some kind of loop magic
>   if sequences[id]
>     process file
>     sequences[id] = true
>   end
> end
>
> Would this method be more efficient? Is there a more elegant way? Also,
> can Ruby handle arrays/hashes of this size?
>
> Thanks in advance!
> --
> Posted via http://www.ruby-forum.com/.

BioRuby + BioSQL? You can fetch a sequence from a server and dump it
directly into the database. You can choose MySQL, PostgreSQL, or SQLite.

OK, it's not well coded, but it works:

    server = Bio::Fetch.new('http://www.ebi.ac.uk/cgi-bin/dbfetch')
    ARGV.flags.accession.split.each do |accession|
      puts accession
      if Bio::SQL.exists_accession(accession)
        puts "Entry #{accession} already exists!"
      else
        entry_str = server.fetch('embl', accession, 'raw', 'embl')
        if entry_str == "No entries found\. \n"
          $stderr.puts "Error: no entry #{accession} found. #{entry_str}"
        else
          puts "Downloaded!"
          puts "Loading..."
          puts "Converting EMBL obj..."
          entry = Bio::EMBL.new(entry_str)
          puts "Converting Biosequence obj..."
          biosequence = entry.to_biosequence
          puts "Saving Biosequence into Bio::SQL::Sequence database"
          # db is not defined in this snippet; it has to be set up beforehand.
          result = Bio::SQL::Sequence.new(:biosequence=>biosequence,:biodatabase_id=>db.id) unless Bio::SQL.exists_accession(biosequence.primary_accession)
          puts entry.entry_id
          if result.nil?
            pp "The sequence is already present into the biosql database"
          else
            pp "Stored."
          end
        end # not found on web
      end # bioentry exists
    end # list accession

PS: I need to write docs about BioSQL and Ruby. Sorry, my fault.

-- Ra
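
On the quoted question itself (Array#include? versus a Hash with unused values): if a database is more than you need, Ruby's standard library Set may be enough. The following is only a minimal sketch in plain Ruby, not part of the BioSQL code above; each_unique_id and the sample ids are hypothetical names made up for illustration. A Set is hash-backed, so the membership test does not scan all earlier ids the way Array#include? does, and 50 000 short id strings are a small amount of data to keep in memory.

```ruby
require 'set'

# Hypothetical helper (not from BioRuby): yields each id only the first
# time it is seen and returns the Set of ids encountered.
def each_unique_id(ids)
  seen = Set.new
  ids.each do |id|
    # Set#add? returns nil when the id is already in the set, so a
    # single call both tests membership and records the id.
    next unless seen.add?(id)
    yield id
  end
  seen
end

# Made-up ids standing in for downloaded sequence ids.
processed = []
each_unique_id(%w[P12345 Q67890 P12345]) { |id| processed << id }
p processed  # prints ["P12345", "Q67890"]
```

In a real download loop the block body would be where the sequence gets written to its file. A Hash with true as the value behaves the same way; Set just hides the unused value.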