Every browser cleans up invalid markup, and each one does it differently. Firefox, for example, adds a <tbody> to every <table> that doesn't already have one. Firebug shows you the cleaned-up source, not the markup the server actually sent. I once had to download a website's source because its markup was so crappy, and I searched for the table entry by hand. It had a path like "\html\body\table\tr\td\tr\center\font\b\font". Quite annoying, but it sped up the scraping. If your regexes become too complex, you could try the hpricot gem to get data from websites.
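
To illustrate why the path in the raw source matters, here is a minimal sketch. It uses REXML, which ships with the standard Ruby distribution, as a stand-in for hpricot, so it only works on well-formed markup. The sample page, the `cell_texts` helper, and both paths are made up for the example.

```ruby
require "rexml/document"

# Hypothetical, well-formed page as the server sends it: the table has no <tbody>.
RAW_HTML = <<~HTML
  <html>
    <body>
      <table>
        <tr><td>first entry</td><td>second entry</td></tr>
      </table>
    </body>
  </html>
HTML

# Hypothetical helper: return the text of every element matched by an absolute path.
def cell_texts(markup, path)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, path).map(&:text)
end

# Path as it appears in the downloaded source.
raw_path = "/html/body/table/tr/td"
# Path as a browser inspector would report it after inserting <tbody>.
browser_path = "/html/body/table/tbody/tr/td"

p cell_texts(RAW_HTML, raw_path)
p cell_texts(RAW_HTML, browser_path)
```

The first lookup should find both cells, while the second should come back empty, because the <tbody> only exists in the browser's cleaned-up view. Real-world pages are rarely well-formed, which is where a forgiving parser such as hpricot would come in.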