On Saturday 27 June 2009 07:26:40 pm Mrmaster Mrmaster wrote:
> I'm creating an application that will parse mbox files, extract the
> data, and put it into a db.

I'd address this on several levels.

First, mbox is not a good idea. Go with maildir or IMAP. Even better, write it 
as a "forward" script -- most modern mailservers allow you to configure a 
"forward" address to be a script, rather than an email address. Every time a 
new message comes in, it runs the script, piping the message to it over 
standard input.
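
A rough sketch of what such a forward script could look like in Ruby, assuming
the mailserver hands you one raw message on standard input. The names
split_message and store_in_db are mine, not part of any library, and the header
parsing is deliberately naive (a repeated header such as Received overwrites
the earlier one):

```ruby
# Splits one raw message into a hash of headers and a body string.
# Hypothetical helper for a forward script; not a full RFC 822 parser.
def split_message(raw)
  head, body = raw.split(/\r?\n\r?\n/, 2)
  headers = {}
  last = nil
  head.to_s.each_line do |line|
    line = line.chomp
    if line =~ /\A[ \t]/ && last
      # Continuation line: unfold it into the previous header.
      headers[last] << " " << line.strip
    elsif line =~ /\A([^:\s]+):\s*(.*)\z/
      last = $1.downcase
      headers[last] = $2.dup
    end
  end
  [headers, body.to_s]
end

# In the actual script you would do something like:
#   headers, body = split_message($stdin.read)
#   store_in_db(headers, body)   # placeholder for your own database code
```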

If you end up doing mbox, maildir, or IMAP, then:

> 1) mbox files keep updating so how do I notify my script that new data has
> come in?

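With POP3, I'm fairly sure you just have to poll. With IMAP you can poll too, 
though servers that support the IDLE extension can tell you when new mail 
arrives.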

With mbox or maildir, you can watch the file or the directory for changes 
(inotify on Linux, for example, has Ruby bindings). This is more efficient and 
responsive than polling, but harder to do.
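
If a watcher is more trouble than it's worth, plain polling is the easy
fallback. Here is a minimal sketch for maildir, assuming the standard layout
where undelivered-to-you messages sit in the "new" subdirectory. The function
name is made up, and this is polling, not a real file watcher:

```ruby
require "set"

# Returns the file names in <maildir>/new that were not seen on an earlier
# call, and records them in `seen` so they are only reported once.
def new_maildir_messages(maildir, seen)
  names = Dir.children(File.join(maildir, "new")).sort
  fresh = names.reject { |name| seen.include?(name) }
  seen.merge(fresh)
  fresh
end
```

You would call it in a loop with a sleep in between, keeping the same Set
around for the life of the process.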

> Do I rerun the script with a placeholder where it last
> finished? That would require me to rescan the whole mbox file to find
> the placeholder which is pretty bad design.

Well, you could seek, as others have said...

That works only if no one ever deletes anything. If someone does, the offsets 
shift, and I'm really not sure how you would know whether a given message was 
already in the database. You could look for a header such as Message-ID, but 
that would require you to, as you said, rescan the whole file.
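
For what it's worth, the seek approach might look roughly like this. It is only
a sketch: read_appended is a name I made up, the offset is assumed to be saved
somewhere between runs, and the sanity check is just "does the data at the
saved offset start with a From line", which catches some deletions but not all:

```ruby
# Reads whatever was appended to the mbox at `path` since byte `offset`.
# Returns [new_data, new_offset], or nil when the saved offset no longer
# looks valid (the file shrank, or the offset is not at a "From " line),
# in which case the caller has to fall back to a full rescan.
def read_appended(path, offset)
  File.open(path, "rb") do |f|
    size = f.size
    return nil if offset > size
    return ["", offset] if offset == size
    f.seek(offset)
    data = f.read
    return nil unless data.start_with?("From ")
    [data, offset + data.bytesize]
  end
end
```

Note that without locking you can still catch a message that is only half
written.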

Probably the best solution, if your script is the only thing processing that 
file, is to lock it (however your mailserver supports that) and rename it out 
of the way as you process it. That way, any new messages will come into a 
brand new mbox file.
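
Something along these lines, as a sketch only. It assumes your delivery agent
honours flock-style locks, which you need to check: many use dotlock files or
fcntl locks instead, and then this lock protects nothing. rotate_mbox is a
made-up name:

```ruby
# Locks the mbox, renames it out of the way, and returns the new path
# (or nil if there is no mbox to process yet).
def rotate_mbox(path, suffix = Time.now.strftime("%Y%m%d%H%M%S"))
  return nil unless File.exist?(path)
  aside = "#{path}.#{suffix}"
  File.open(path, "r+") do |f|
    f.flock(File::LOCK_EX)    # assumption: the mailserver also uses flock
    File.rename(path, aside)  # new mail will now go to a fresh file
  end                         # closing the file releases the lock
  aside
end
```

You then parse the renamed file at your leisure and delete it when the
database has everything.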

> 2) What is the most efficient way to read the emails into memory before
> putting it into a db? Since there are multiple emails in each mbox file
> will I just read one of the emails, store it into memory, dump it into
> db, then replace the current email in memory with the new one?

Probably something like that. If you use a library like TMail, it will 
probably take care of using a temporary file to store the mail if it gets too 
big -- but it also understands mbox already, so it may be doing more than you 
want.
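
If you would rather do it by hand, here is a minimal sketch of the
one-message-at-a-time loop. It is not how TMail does it; each_mbox_message is
my own name, and it assumes every line starting with "From " is a separator,
i.e. that the mbox writer escaped such lines inside bodies as ">From ":

```ruby
# Yields each message in an mbox stream as a String, one at a time, so only
# the current message is held in memory. The "From " separator line itself
# is dropped.
def each_mbox_message(io)
  return enum_for(:each_mbox_message, io) unless block_given?
  current = nil
  io.each_line do |line|
    if line.start_with?("From ")
      yield current if current   # hand over the finished message
      current = +""              # start collecting the next one
    elsif current
      current << line
    end
  end
  yield current if current       # last message in the file
end
```

Usage would be something like File.open("inbox") { |f| each_mbox_message(f) {
|msg| ... } }, with your database insert inside the inner block.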