On Jun 27, 2009, at 17:26 , Mrmaster Mrmaster wrote:

> Hello,
>
> I'm creating an application that will parse mbox files, extract the
> data, and put it into a db. I have a couple of problems. For those of
> you who are not familiar with mbox files, just think of one text file
> that stores all of the emails in text format.
>
> 1) mbox files keep updating, so how do I notify my script that new
> data has come in? Do I rerun the script with a placeholder where it
> last finished? That would require me to rescan the whole mbox file to
> find the placeholder, which is pretty bad design.

Is it bad design? What happens when the user deletes the first email  
in the mbox?
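
For what it's worth, a saved placeholder does not have to mean a rescan. If the placeholder is a byte offset, the script can seek straight to it. Below is a minimal sketch, assuming the mbox is only ever appended to between runs; `read_new_data` is a made-up helper name, not part of any library. It only notices the "file got smaller" case, so a rewrite that leaves the file larger than the saved offset (deleting the first email and then receiving a lot of new mail, say) would slip through.

```ruby
# Hypothetical helper: return whatever was appended to the mbox since the
# previous run, plus the new offset to save for the next run.
# `offset` is the byte position stored by the previous run (0 the first time).
def read_new_data(path, offset)
  # If the file is now shorter than the saved offset, something was
  # deleted or rewritten, so the offset is meaningless: start over.
  offset = 0 if offset > File.size(path)

  data = File.open(path, "rb") do |f|
    f.seek(offset) # jump directly to the placeholder, no scanning
    f.read         # everything from there to the end ("" if nothing new)
  end

  [data, offset + data.bytesize]
end
```

The caller would keep the returned offset somewhere (a small file, a db row) and pass it back in on the next run.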

> 2) What is the most efficient way to read the emails into memory
> before putting them into a db? Since there are multiple emails in each
> mbox file, will I just read one of the emails, store it in memory,
> dump it into the db, then replace the current email in memory with the
> new one?

Efficient? Why do you care about efficiency already? Get something
working first, and worry about efficiency AFTER you've measured stuff,
not blindly guessed. Oh... and measure only once you have efficiency
issues; until then it is Fast Enough(tm) (cousin of Just Works(tm)).

Luckily in this case, one of the most efficient ways (time-wise) is
also the cleanest (code-wise):

   File.read(path) #=> contents
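
Once the contents are in memory, getting at the individual emails is a few more lines. This is a rough sketch, assuming every message begins with a line starting with "From " and that body lines starting with "From " were escaped (as ">From ") by whatever wrote the file; `split_mbox` is a name invented here for illustration.

```ruby
# Hypothetical helper: split the contents of an mbox file into an array
# of strings, one per message.
# Assumption: a line beginning with "From " always starts a new message.
def split_mbox(contents)
  contents.each_line
          .slice_before { |line| line.start_with?("From ") }
          .map(&:join)
end

# Usage, with `path` pointing at the mbox file:
#   messages = split_mbox(File.read(path))
#   messages.each { |msg| ... }   # parse msg and insert it into the db
```

If the file does not start with a "From " line, whatever comes before the first separator ends up as the first element, so check for that if it matters.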