On 20.10.2009 03:10, Rob Doug wrote:
> Hi all,
> I'm writing web crawler with threads support. And after working with
> some amount of link memory usage increase more and more. When program
> just started in use 20mb of mem. After crawling of 150-200 link, memory
> usage ~100. When 1000 link crawled my program may use up to 1GB of mem.
> Help me please find out why?
>
> require 'rubygems'
> require 'mechanize'
> require 'hpricot'
> require 'yaml'
> require 'net/http'
> require 'uri'
> require 'modules/common'
>
> Thread.abort_on_exception = true
> $config = YAML.load_file "config.yml"
> links = IO.read("bases/3+4.txt").split("\n")
> threads = []
>
> links.each do |link|
>   if Thread.list.size < 50 then
>     threads << Thread.new(link) { |myLink|
>       Common.post_it(myLink)
>     }
>   else
>     sleep(1)
>     threads.each { |t|
>       unless t.status then
>         t.join
>       end
>     }
>     puts "total threads: " + Thread.list.size.to_s
>     redo
>   end
> end
>
> threads.each { |t| t.join() }
>
>
> What in "Common" module:
> 1. Crawler (net/http or mechanize - I tried both, results the same)
> 2. HTML parser (Hpricot or Nokogir - I tried both again, with same bad
> results)
> so I extract some data from page and save it to the file. Nothing
> special as you see.
>
> When I start this program without threads I getting the same results :(
>
> Please help, is this my fault or something wrong in the libraries ?

As far as I can see you never clean up the Array threads. You only ever
append to it, so even terminated threads stay referenced in there (and
cannot be garbage collected), because you also do not reuse threads.

Design criticism: usually you create the threads up front and pass work to
them via a Queue. Then you don't need the sleep crutch; the workers simply
block on the queue instead. Also, you can make your code handle large
volumes of URLs better by using a more streamed approach along the lines
Ryan presented, limiting all resources (not only the number of threads but
also the size of the queue).

require 'thread'
...

THREADS = 50
q = SizedQueue.new(THREADS * 2)

threads = (1..THREADS).map do
  Thread.new q do |qq|
    # the queue itself serves as the terminator
    until qq.equal?(url = qq.deq)
      Common....
    end
  end
end

# read the file line by line instead of slurping it
File.foreach("bases/3+4.txt") do |line|
  line.chomp!
  q.enq(URI.parse(line))
end

# one terminator per worker, then wait for all of them
threads.size.times { q.enq q }
threads.each { |t| t.join }

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
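
For anyone who wants to try the worker-pool idea above in isolation, here is a minimal self-contained sketch. It is not the original crawler: `process` is a hypothetical stand-in for whatever the Common module does per URL, `crawl_all` is a name made up for this example, and the URLs are plain strings rather than parsed URI objects.

```ruby
require 'thread' # needed on older Rubies; harmless on current ones

# Hypothetical stand-in for the real per-URL work (fetching and parsing).
def process(url)
  url.to_s.length
end

# Pushes +urls+ through a fixed pool of +worker_count+ threads and
# returns an array of [url, result] pairs (in no particular order).
def crawl_all(urls, worker_count = 4)
  queue   = SizedQueue.new(worker_count * 2) # bounded: producer blocks when workers lag
  results = Queue.new                        # thread-safe collector

  workers = Array.new(worker_count) do
    Thread.new do
      # The queue object itself is the terminator, as in the sketch above.
      until queue.equal?(url = queue.deq)
        results << [url, process(url)]
      end
    end
  end

  urls.each { |u| queue.enq(u) }
  worker_count.times { queue.enq(queue) } # one terminator per worker
  workers.each(&:join)                    # wait until every worker has finished

  Array.new(results.size) { results.pop }
end
```

Calling `crawl_all(%w[http://a.example/ http://bb.example/], 2)` would return one pair per URL. The workers are created once and joined once, so there is no growing array of dead threads, and the bounded queue keeps the number of pending URLs in memory small.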