On 20.10.2009 03:10, Rob Doug wrote:
> Hi all,
> I'm writing a web crawler with thread support. After it has worked
> through some number of links, memory usage grows more and more. When
> the program has just started it uses 20 MB of memory. After crawling
> 150-200 links, memory usage is ~100 MB. When 1000 links have been
> crawled, my program may use up to 1 GB of memory.
> Please help me find out why.
> 
> require 'rubygems'
> require 'mechanize'
> require 'hpricot'
> require 'yaml'
> require 'net/http'
> require 'uri'
> require 'modules/common'
> 
> Thread.abort_on_exception = true
> $config = YAML.load_file "config.yml"
> links = IO.read("bases/3+4.txt").split("\n")
> threads = []
> 
> links.each do |link|
>   if Thread.list.size < 50 then
>     threads << Thread.new(link) { |myLink|
>       Common.post_it(myLink)
>     }
>   else
>     sleep(1)
>     threads.each { |t|
>       unless t.status then
>         t.join
>       end
>     }
>     puts "total threads: " + Thread.list.size.to_s
>     redo
>   end
> end
> 
> threads.each { |t| t.join() }
> 
> 
> What is in the "Common" module:
> 1. Crawler (net/http or mechanize - I tried both, the results are the same)
> 2. HTML parser (Hpricot or Nokogiri - I tried both again, with the same
> bad results)
> So I extract some data from the page and save it to a file. Nothing
> special, as you see.
> 
> When I start this program without threads I get the same results :(
> 
> Please help: is this my fault or is something wrong in the libraries?

As far as I can see you never clean up the Array threads.  You only
append to it, so even terminated threads stay referenced in there,
because you also do not reuse threads.
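
To illustrate the point, here is a minimal sketch (not taken from your
program; the block body is a hypothetical stand-in for the real work)
showing that joined threads remain in the Array until you remove them
yourself:

```ruby
threads = []

# Hypothetical stand-in for the real work (the crawl itself is omitted).
10.times do |i|
  threads << Thread.new(i) { |n| n * 2 }
end

threads.each { |t| t.join }        # every thread has finished now ...
finished_but_kept = threads.size   # ... yet the Array still references all of them

# Drop the references to terminated threads so they can be collected.
threads.reject! { |t| !t.alive? }
remaining = threads.size
```

Whether this alone explains the growth you see is another question, but
it is the first thing I would rule out.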

Design criticism: usually you create the threads upfront and pass work
to them via a Queue.  Then you do not need the sleep crutch and can rely
on blocking instead.  Also, your code will handle large volumes of URLs
better with a more streamed approach, along the lines Ryan presented,
that limits all resources (not only the threads but also the size of
the queue).

require 'thread'
...

THREADS = 50
q = SizedQueue.new(THREADS * 2)

threads = (1..THREADS).map do
   Thread.new q do |qq|
     until qq.equal?(url = qq.deq)
       Common....
     end
   end
end

File.foreach("bases/3+4.txt") do |line|
   line.chomp!
   q.enq(URI.parse(line))
end

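# one terminator per worker (the queue itself), then wait for the workers
threads.size.times { q.enq q }
threads.each { |t| t.join }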
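
For reference, the same pattern as a self-contained sketch you can run
without any crawler code.  Everything named here is an assumption made
for illustration: `process` is a hypothetical stand-in for whatever
Common does, `stop` is a dedicated terminator object instead of the
queue itself, and the URLs are dummy strings rather than the contents
of your file.

```ruby
require 'thread'   # needed on old Rubies, harmless on current ones

thread_count = 4
stop = Object.new                        # hypothetical end-of-work marker

# Hypothetical stand-in for the real per-URL work.
process = lambda { |url| url.upcase }

queue   = SizedQueue.new(thread_count * 2)  # producer blocks when the queue is full
results = Queue.new                         # thread-safe collection of the output

workers = Array.new(thread_count) do
  Thread.new do
    # Each worker blocks on deq until work or the terminator arrives.
    until stop.equal?(url = queue.deq)
      results << process.call(url)
    end
  end
end

urls = %w[http://a.example http://b.example http://c.example]
urls.each { |u| queue.enq(u) }

thread_count.times { queue.enq(stop) }   # one terminator per worker
workers.each { |t| t.join }              # wait until all work is done

processed = []
processed << results.deq until results.empty?
```

The number of threads and the number of pending URLs are both bounded
here, so memory use should not depend on the length of the input.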

Kind regards

	robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/