A complete dictionary shouldn't be necessary, just a list of exceptions. Look at how Rails handles pluralization. You could use this algorithm:

- if the word starts with a definite consonant, use "a"
- if the word matches an entry in the exception list, use the designated article
- else use "an" if it starts with a for-certain vowel ['a', 'e', 'i', 'o', 'u']

This way, you only do a lookup for words starting with possible vowels ['a', 'e', 'i', 'o', 'u', 'y', 'h'].
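
A rough Ruby sketch of the steps above. The method name, the constant names, and the handful of exception entries are placeholders made up for illustration, not a vetted list, and defaulting unlisted 'h'/'y' words to "a" is an assumption the steps don't spell out:

```ruby
# Hypothetical exception list: only words whose article breaks the
# first-letter rule need an entry. Illustrative and far from complete.
ARTICLE_EXCEPTIONS = {
  'hour'    => 'an',
  'honest'  => 'an',
  'heir'    => 'an',
  'unicorn' => 'a',
  'eulogy'  => 'a',
  'one'     => 'a'
}.freeze

POSSIBLE_VOWELS = %w[a e i o u y h].freeze
CERTAIN_VOWELS  = %w[a e i o u].freeze

def indefinite_article(word)
  w = word.downcase
  first = w[0]

  # Definite consonant: no lookup needed.
  return 'a' unless POSSIBLE_VOWELS.include?(first)

  # Possible vowel: the exception list gets the first say.
  return ARTICLE_EXCEPTIONS[w] if ARTICLE_EXCEPTIONS.key?(w)

  # No exception: "an" for certain vowels; assume "a" for 'y' and 'h'.
  CERTAIN_VOWELS.include?(first) ? 'an' : 'a'
end
```

With this, indefinite_article('hour') hits the exception list, while indefinite_article('house') never needed an entry in the first place.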

You might even extend the consonant-matching step to use some heuristics, as suggested in the email below:

'a' if word =~ /^[^aeiouyh]/ || word =~ /^u[^aeiou][aeiou]/ || word =~ /^y[aeiou]/
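
Spelled out a little further, the letter patterns from the email below might be sketched like this. It is only a guess at how those impressions translate into code: the method name is invented here, "consonant" is taken to mean any letter other than a, e, i, o, u, and the patterns are known to misfire (the VVC rule gives "a oak", for one), which is exactly why an exception list would still sit in front of it:

```ruby
# Hypothetical helper: picks an article from the first few letters only,
# following the uCC / uCV / yC / yV / VVC impressions in the quoted email.
def article_by_pattern(word)
  case word.downcase
  when /\Au[b-df-hj-np-tv-z]{2}/                      then 'an' # uCC: umbrella
  when /\Au[b-df-hj-np-tv-z][aeiou]/                  then 'a'  # uCV: unicorn
  when /\Ay[b-df-hj-np-tv-z]/                         then 'an' # yC: yttrium
  when /\Ay[aeiou]/                                   then 'a'  # yV: yak
  when /\A([aeiou])(?!\1)[aeiou][b-df-hj-np-tv-z]/    then 'a'  # VVC, two different vowels: eulogy
  when /\A[aeiou]/                                    then 'an' # any other leading vowel
  else 'a'                                                      # everything else
  end
end
```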

The problem is that the choice between 'a' and 'an' depends on how the word *sounds* in a given variety of English (e.g., American, British). It is unlikely you will capture all the cases with a dictionary, hence the suggestion that the algorithm use a set of commonly encountered exceptions, accepting that it will be incomplete and sometimes a bit embarrassing -- but no more so than the pronunciation of words by my nav system's text-to-speech :)


On Nov 9, 2011, at 5:10 AM, Gonçalo C. Justino wrote:

> google hasn't helped: does anyone have or know of a "complete" list of
> english words ?
> 
> On 8 November 2011 17:20, Chad Perrin <code / apotheon.net> wrote:
> 
>> On Tue, Nov 08, 2011 at 05:23:31PM +0900, Gonçalo C. Justino wrote:
>>> 
>>> does the different pronunciation come from the subsequent letters? i'm
>>> thinking uMBrella, uNCle, uRGent, uNDer, uGLy, uPPer, uRGe but uNIcorn,
>>> eULogy (or is this "an eulogy"? now i'm confused)... i'm wondering if two
>>> consonants make it "an" and at least one vowel makes it "a". Maybe I'm
>>> just rambling, this sounds so un-rubyesque :S
>> 
>> You're right about unicorn and eulogy.  I'm interested in checking out
>> the correlation between second-and-third letters and vowels that become
>> consonants in pronunciation now, to see how strong a correlation that is.
>> I'm pretty sure there are exceptions to these perceived rules, though, in
>> any case.
>> 
>> It seems likely that, most often, you'd get the following results, where
>> V means "vowel" and C means "consonant".  Lower case letters are
>> literals.  In each case, two adjacent vowels are assumed to be
>> *different* vowels.
>> 
>>   uCC: treat as vowel
>>   uCV: treat as consonant
>>   VVC: treat as consonant
>>   yC: treat as vowel
>>   yV: treat as consonant
>> 
>> These are only my immediate impressions, so far.  Assuming for argument's
>> sake that they're correct for the general case, though, there would
>> almost certainly be exceptions for every one of these correlations, and
>> the question that arises then is whether the exceptions are rare enough
>> to warrant using these correlations as rules with a set of exceptions
>> used to override them, or numerous enough for it to make more sense to
>> just use an extensive dictionary to handle such matters.
>> 
>> If I get really bored, I may put together a really extensive dictionary
>> to cover this, then use it to determine the strength of such
>> correlations some day (or week or month), but not today.
>> 
>> --
>> Chad Perrin [ original content licensed OWL: http://owl.apotheon.org ]
>>