On 03/29/2012 11:11 AM, Robert Klemme wrote:
> Jeremy Bopp wrote in post #1053948:
>> On 03/29/2012 07:55 AM, Robert Klemme wrote:
>>> But what about the dups?  What constitutes a duplicate?  If it is just
>>> raw content, you could use "sort -u" (standalone command).
>>
>> Again from the original example, the records to compare for uniqueness
>> are simple lines.  Of course that simplicity belies the issue of line
>> endings. ;-)
> 
> Ah, I overlooked the call to #uniq.  I think we should be able to fix 
> the original with a small insertion:
> 
> File.open("newf.txt", "w+") { |file| file.puts
> File.readlines("oldf.txt").each(&:chomp!).uniq }
> 
> Although from an efficiency point of view another approach would be 
> preferable:
> 
> File.open "oldf.txt" do |in|
>   File.open "newf.txt", "w" do |out|
>     last = nil
> 
>     input.each_line do |line|
>       line.chomp!
>       out.puts line unless line == last
>       last = line
>     end
>   end
> end

The limitation here is that duplicate lines are only removed when they
occur in consecutive runs.  If duplicates are interleaved with other
lines, this filter misses them.  For the general case, loading all the
lines and running #uniq over the array catches duplicates wherever they
appear, though I admit that isn't very memory-efficient for large files.
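
A compromise would be to track the lines already seen in a Set: you
still hold every unique line in memory, but the output is streamed and
interleaved duplicates are caught.  An untested sketch along the lines
of your second version:

require 'set'

seen = Set.new

File.open "oldf.txt" do |input|
  File.open "newf.txt", "w" do |out|
    input.each_line do |line|
      line.chomp!
      # Set#add? returns nil when the element is already present,
      # so only the first occurrence of each line is written.
      out.puts line if seen.add? line
    end
  end
end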

> Note that opening "newf.txt" before opening "oldf.txt" will create an
> empty "newf.txt" when "oldf.txt" does not exist, even though an
> exception is raised.

Good point.

>> Also, the OP appears to be running on Windows, so "sort -u" is not
>> available out of the box.
> 
> Right, I'm so used to cygwin that I keep forgetting not everybody has it 
> installed. :-)

When stuck on Windows, Cygwin is definitely a must-have!
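
That said, for this particular job you don't strictly need sort -u; a
Ruby one-liner works on a stock Windows install (unlike sort -u it
preserves the original line order rather than sorting, which here is
probably a feature):

ruby -e "puts File.readlines('oldf.txt').map(&:chomp).uniq" > newf.txt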

-Jeremy