On Thu, 15 Mar 2001 jjthrash / pobox.com wrote:

> On Thu, Mar 15, 2001 at 02:20:14AM +0900, David Fung wrote:
> > i would like to locate probable email addresses in a bunch of text files,
> > but don't know how to build a regexp for the search.  glad if somebody can
> > help.
>
> /\w+@\w+(\.\w+)+/
>
> should probably work.  That is
>
> stuff.grep(/\w+@\w+(\.\w+)+/)
>
> Now, this is assuming an email address can only be made up of
> characters from a-zA-Z0-9_.  I don't know if that's a real
> limitation.  I think it is a limitation on the domain name, so
> the following might work:
>
> /.+@\w+(\.\w+)+/

Short version: Take a look at the last example in "Mastering Regular
Expressions", then cringe. Translating the Perl that generates it is
probably the shortest and most painful route to a better understanding of
regexes (assuming you try to understand the pieces).

Long version:
Except for greediness issues this should work for most addresses. Caveats
are:
 - The .+ will sweep up leading junk unless you can rely on the address
   being the first thing in the subject of the match.[1]
 - The character limitations for addresses are (as you hypothesize)
   significantly looser than for domains.  In particular there are rarely
   used quoting rules (completely ignoring the quoting for the "Real Name"
   and other such ancillary information) that allow most characters in the
   leading part.  You can ignore this for a very high percentage of addresses
   with no lossage[2].
 - As DNS goes multilingual, the definition of "most" depends on the
   percentage of non-ASCII names versus Ruby's multilingualization and
   Unicode support.
 - An optional . at the end of the address, "(\.)?", is allowed since it
   represents the root node; though it's silently added by just about
   everything that uses DNS, it's perfectly legit.  The same address will
   function without it, so it's largely irrelevant (and matching it will
   sweep up periods when the address is at the end of a sentence).  In
   fact, if you have broken mail software/libraries you may have a
   problem with its presence[3].
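
To make the greediness and trailing-dot caveats concrete, here is a rough
Ruby sketch.  The pattern names, the helper method and the sample text are
all made up for illustration, and LOOSE is only an assumed approximation:
it ignores quoted local parts and non-ASCII names entirely.  The group is
written as (?:...) so that scan returns whole matches rather than captures.

```ruby
# All names below are hypothetical; none of this is Friedl's solution.

# The quoted suggestion: a greedy .+ in front of the @.
GREEDY = /.+@\w+(?:\.\w+)+/

# Assumed tightening: a common unquoted subset for the local part, and
# hyphens allowed in domain labels. Quoted local parts are not handled.
LOOSE = /[-\w.+]+@[-\w]+(?:\.[-\w]+)+/

# Same as LOOSE, but also accepting the optional root dot.
WITH_ROOT = /[-\w.+]+@[-\w]+(?:\.[-\w]+)+\.?/

# Return every substring of text that matches pattern.
def probable_addresses(text, pattern = LOOSE)
  text.scan(pattern)
end

text = "Please mail david@example.com or j.conway@mail.example.org."

p probable_addresses(text, GREEDY)     # one match, leading junk included
p probable_addresses(text)             # the two addresses on their own
p probable_addresses(text, WITH_ROOT)  # last match keeps the sentence period
```

On this sample GREEDY returns a single match that starts at "Please",
LOOSE picks out the two addresses, and WITH_ROOT swallows the final period
of the sentence along with the second address.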

Jonathan Conway

[1] I'd reproduce Friedl's solution for this part, but since I haven't
    written the Ruby equivalent I can't provide it.
[2] Of course you'll have complete lossage on the few you _can't_ ignore
    it for.
[3] Having gotten pedantic I apparently swung wildly the other way and
    talked myself out of it.