Daniel Brockman wrote:

>I just wanted to point out that if you add the /x flag to
>the regexp, you can insert whitespace and comments at will.
>That can sometimes make the expression pseudo-readable.
>  
>
Hello Daniel,

Thanks for pointing that out; I didn't know about it.

I will quickly explain this regex: it's supposed to match strings like:

"the hek2p , a2p , and b3p multidomain proteins".

A protein name usually contains a digit or an uppercase letter.

Here is the whole regex:

[tT]he                                                        => the
\s+

(([\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]*       => a protein
((\s*(\,|and|or)\s*)*                                         => a separator: a comma, "and" or "or"
[\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]*)*))     => another protein (together with the
                                                                 preceding separator, this makes up a
                                                                 comma-separated sequence of proteins)

\s+

((\w+\s+){0,3}\s*                                             => some adjectives (at most 3)

(proteins|genes|protein|gene))
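
For what it's worth, here is a rough sketch of the same expression written with whitespace and comments, as Daniel suggested. The regex syntax itself is not tied to one language, so this assumes Python, where re.VERBOSE plays the role of the /x flag. The names PROTEIN and PATTERN are made up for this sketch; the pattern text is copied unchanged from the breakdown.

```python
import re

# One protein name, copied unchanged from the breakdown.
# (PROTEIN and PATTERN are names invented for this sketch.)
PROTEIN = r"[\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]*"

# The whole expression in verbose mode: whitespace in the pattern is
# ignored and everything after a '#' is a comment.
PATTERN = re.compile(
    r"""
    [tT]he                          # the
    \s+
    ((  %(p)s                       # a protein
        (  (\s*(\,|and|or)\s*)*     # separators: comma, and, or
           %(p)s                    # another protein
        )*
    ))
    \s+
    (  (\w+\s+){0,3}\s*             # at most 3 words (adjectives)
       (proteins|genes|protein|gene)
    )
    """ % {"p": PROTEIN},
    re.VERBOSE,
)

text = "the hek2p , a2p , and b3p multidomain proteins"
m = PATTERN.search(text)
print(m.group(1))  # hek2p , a2p , and b3p
print(m.group(8))  # proteins
```

Group 1 holds the whole protein sequence and group 8 the final keyword; the numbering simply follows the order of the opening parentheses in the original expression.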



I'll also detail the protein regex, which occurs twice in the big regex:

[\w\d\_]+                               => one or more letters, digits or underscores
(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])     => a digit or uppercase letter, with a letter right before or after it
[\w\d\_]*                               => zero or more further letters, digits or underscores

The second part of the protein regex is this complicated in order to avoid matching a plain number (digits only).
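
To make that concrete, here is a small sketch that tries the protein regex alone on a few tokens. It assumes Python again (any engine with the same syntax should behave alike), and looks_like_protein is a hypothetical helper named only for this example.

```python
import re

# The protein regex on its own, copied unchanged from the breakdown.
PROTEIN = re.compile(r"[\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]*")

def looks_like_protein(token):
    """Hypothetical helper: True if the whole token matches the protein regex."""
    return PROTEIN.fullmatch(token) is not None

for token in ["hek2p", "a2p", "b3p", "2009", "domain", "p53"]:
    print(token, looks_like_protein(token))
# hek2p True
# a2p True
# b3p True
# 2009 False
# domain False
# p53 False
```

The plain number and the plain lowercase word are rejected as intended. Note that "p53" is rejected too: the leading [\w\d\_]+ needs at least one character before the two-character middle part, and "53" contains no letter.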


There it is; I hope it's clearer now. Please tell me if I should file a bug.


Best regards,
Adrian.