A complete dictionary shouldn't be necessary, just a list of exceptions. Look at how Rails handles pluralization.

You can use this algorithm:

- if the word starts with a consonant, use "a"
- if the word matches an entry in the exception list, use the designated article
- else use "an" if it starts with a for-certain vowel ['a', 'e', 'i', 'o', 'u']

This way, you only do a lookup for words starting with possible vowels ['a', 'e', 'i', 'o', 'u', 'y', 'h'].

You might even extend the consonant check with some heuristics, as suggested in the email below:

    'a' if word =~ /^[^aeiouyh]/ || word =~ /^u[^aeiou][aeiou]/ || word =~ /^y[aeiou]/

The problem is that the choice between 'a' and 'an' depends on how the word *sounds* in a given variety of English (e.g., American or British). It is unlikely you will capture all the cases with a dictionary, hence the suggestion that the algorithm use a set of commonly encountered exceptions, accepting that it will be incomplete and sometimes a bit embarrassing, but no more so than the pronunciation of words by my nav system's text-to-speech :)

On Nov 9, 2011, at 5:10 AM, Gonçalo C. Justino wrote:

> google hasn't helped: does anyone have or know of a "complete" list of
> english words ?
>
> On 8 November 2011 17:20, Chad Perrin <code / apotheon.net> wrote:
>
>> On Tue, Nov 08, 2011 at 05:23:31PM +0900, Gonçalo C. Justino wrote:
>>>
>>> does the different pronunciation come from the subsequent letters ? i'm
>>> thinking uMBrella, uNCle, uRGent, uNDer, uGLy, uPPer, uRGe but uNIcorn,
>>> eULogy (or is this "an eulogy"? now i'm confused)... i'm wondering if two
>>> consonants make it "an" and at least one vowel makes it "a". Maybe I'm
>>> just rambling, this sounds so un-rubyesque :S
>>
>> You're right about unicorn and eulogy. I'm interested in checking out
>> the correlation between second-and-third letters and vowels that become
>> consonants in pronunciation now, to see how strong a correlation that is.
>> I'm pretty sure there are exceptions to these perceived rules, though, in
>> any case.
>>
>> It seems likely that, most often, you'd get the following results, where
>> V means "vowel" and C means "consonant". Lower case letters are
>> literals. In each case, two adjacent vowels are assumed to be
>> *different* vowels.
>>
>>     uCC: treat as vowel
>>     uCV: treat as consonant
>>     VVC: treat as consonant
>>     yC: treat as vowel
>>     yV: treat as consonant
>>
>> These are only my immediate impressions, so far. Assuming for argument's
>> sake that they're correct for the general case, though, there would
>> almost certainly be exceptions for every one of these correlations, and
>> the question that arises then is whether the exceptions are rare enough
>> to warrant using these correlations as rules with a set of exceptions
>> used to override them, or numerous enough for it to make more sense to
>> just use an extensive dictionary to handle such matters.
>>
>> If I get really bored, I may put together a really extensive dictionary
>> to cover this, then use it to determine the strength of such
>> correlations some day (or week or month), but not today.
>>
>> --
>> Chad Perrin [ original content licensed OWL: http://owl.apotheon.org ]
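
A minimal Ruby sketch of the rules-plus-exceptions approach from the top of this message, for anyone who wants something to poke at. It checks for a definite consonant first, then the exception list, then the uCV, yV and yC patterns from the quoted list, then plain vowels. The method name indefinite_article, the EXCEPTIONS hash and every entry in it are illustrative assumptions rather than anything from the thread, and so is the final fallback to "a" for leftover h- and y-words. The VVC pattern is left out because words such as "oil" and "ear" contradict it.

```ruby
# Hypothetical, deliberately incomplete exception list (an assumption for
# illustration). Keys are lower-case words, values are the article to use.
EXCEPTIONS = {
  "hour"    => "an",
  "honest"  => "an",
  "heir"    => "an",
  "one"     => "a",
  "eulogy"  => "a",
  "unusual" => "an"
}.freeze

# Returns "a" or "an" for the given word, based on spelling heuristics
# plus the exception list above. Pronunciation is only approximated.
def indefinite_article(word)
  w = word.downcase

  # Definite consonant (anything but a, e, i, o, u, y, h): no lookup needed.
  return "a" if w =~ /\A[^aeiouyh]/

  # Lookup only happens for words starting with a, e, i, o, u, y or h.
  return EXCEPTIONS[w] if EXCEPTIONS.key?(w)

  # Patterns from the quoted message.
  return "a"  if w =~ /\Au[^aeiou][aeiou]/ # uCV: "unicorn"
  return "a"  if w =~ /\Ay[aeiou]/         # yV:  "year"
  return "an" if w =~ /\Ay[^aeiou]/        # yC:  "yttrium"

  # For-certain vowel (this also covers uCC, e.g. "umbrella").
  return "an" if w =~ /\A[aeiou]/

  # Assumption: remaining h-words (and a bare "y") take "a".
  "a"
end

puts %w[dog umbrella unicorn hour year].map { |w| "#{indefinite_article(w)} #{w}" }.join(", ")
# prints: a dog, an umbrella, a unicorn, an hour, a year
```

The exception hash is checked before the patterns so that a single entry can override any of them; "unusual" is there because it fits uCV but still takes "an".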