I needed some code that could work out whether ‘a’ or ‘an’ (the indefinite article) should go before a specific word, so that I could generate headings and sentences automatically. At first I thought this was an easy problem to solve: surely I just needed to check whether the first letter of the word was a vowel or a consonant. That test is fairly accurate and works about 99.5% of the time. However, there is a set of exception words, such as:
- honest
- European
- NSA
- heir
for which the test fails. These make up about 0.5% of all words in the English language (i.e. 1 in 200 words fail the simple first-letter test). The correct rule in English is to check whether the word begins with a vowel sound or a consonant sound. If it begins with a vowel sound we put ‘an’ in front; otherwise we put ‘a’ in front.
To ‘an’, or not to ‘an’, that is the question:
Whether ’tis Nobler in the mind to suffer
The Slings and Arrows of a poor coding solution
Or to take Arms against a Sea of troubles
And by opposing end them
For example:
- honest – starts with a vowel sound even though its first letter is a consonant, so it’s ‘An honest person’ and not ‘A honest person’
- European – starts with a consonant sound even though its first letter is a vowel, so it’s ‘A European’ and not ‘An European’
These are two words which contradict the basic test. After researching this issue online, I came across an excellent discussion which covered most of the points that needed considering (https://stackoverflow.com/questions/1288291/how-can-i-correctly-prefix-a-word-with-a-and-an).
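A minimal Python sketch of the basic first-letter test shows how it goes wrong on the exception words listed earlier. The name `naive_article` is hypothetical and is used only for illustration; it is not taken from the published code.

```python
# Letters treated as vowels by the simple spelling-based test.
VOWEL_LETTERS = set("aeiou")


def naive_article(word: str) -> str:
    """Return 'an' if the first LETTER is a vowel, otherwise 'a'."""
    return "an" if word[:1].lower() in VOWEL_LETTERS else "a"


# The test is right for most words but wrong for the exceptions.
for word in ["apple", "banana", "honest", "European", "NSA", "heir"]:
    print(naive_article(word), word)
```

This gets ‘apple’ and ‘banana’ right, but gives ‘a honest’, ‘an European’, ‘a NSA’ and ‘a heir’, because it looks at spelling rather than sound.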
Different Accents and Versions of English
I decided to try to create a coding solution that worked for both American English and British English. The other issue I came across is that the sound dictionary I used (i.e. a dictionary that tells you how a word is pronounced) seemed to contain pronunciations from different accents. For example, most Americans pronounce the word ‘herb’ as ‘erb’ (with a silent ‘h’), but according to the sound dictionary some do pronounce the ‘h’, as in British English. In my solution I had to go for majority rule, i.e. the most common American pronunciation wins out for the American solution, and likewise the most common British pronunciation wins out for the British solution. This is a fair design decision as ‘proper’ English is usually the most commonly used English. In practice I always took the first entry for a word: herb (1) was ‘erb’ whereas herb (2) was ‘herb’, and I assumed the first entry was the most commonly accepted.
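The ‘first entry wins’ rule can be sketched in Python as follows. Everything here is an assumption made for illustration: the tiny hand-written dictionaries, the ARPAbet-style phoneme symbols (written without stress markers) and the function name `article_from_sounds`. The real sound dictionary is far larger and its format may differ.

```python
from typing import Dict, List, Optional

# ARPAbet-style vowel phonemes (assumed notation, stress digits omitted).
VOWEL_SOUNDS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}

# Hypothetical sample data: each word maps to a list of pronunciations,
# with the entry assumed to be the most common one listed first.
SOUND_DICT_US: Dict[str, List[List[str]]] = {
    "herb": [["ER", "B"], ["HH", "ER", "B"]],  # herb (1), herb (2)
    "honest": [["AA", "N", "AH", "S", "T"]],
    "european": [["Y", "UH", "R", "AH", "P", "IY", "AH", "N"]],
    "heir": [["EH", "R"]],
}
SOUND_DICT_GB: Dict[str, List[List[str]]] = {
    "herb": [["HH", "ER", "B"]],
}


def article_from_sounds(
    word: str, sound_dict: Dict[str, List[List[str]]]
) -> Optional[str]:
    """Return 'a' or 'an' from the FIRST listed pronunciation.

    Returns None when the word is not in the dictionary.
    """
    entries = sound_dict.get(word.lower())
    if not entries:
        return None
    first_phoneme = entries[0][0]
    return "an" if first_phoneme in VOWEL_SOUNDS else "a"


print(article_from_sounds("herb", SOUND_DICT_US))  # uses the 'erb' entry
print(article_from_sounds("herb", SOUND_DICT_GB))  # uses the 'herb' entry
```

Keeping one dictionary per variety of English means the same lookup code serves both, and only the data decides between ‘an herb’ and ‘a herb’.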
Acronyms
So what about acronyms? Easy again, right? Unfortunately not. NASA is colloquially pronounced as if it were a word rather than an acronym (i.e. it’s NASSER, not the separate letters EN-AY-ESS-AY). However, NSA is pronounced EN-ESS-AY. The solution therefore is to build exceptions into the code. We still use the sound dictionary for acronyms where possible (NASA happens to be in the sound dictionary). If an acronym is not in the sound dictionary then I assume its letters are spoken separately, as with NSA. If you find a common acronym that fails this test then please let me know. I’ve just realised that LOL is a weird one: should we say ‘Laugh Out Loud’, ELL-OH-ELL or LOL? I’ve heard all three in common usage. As you can see, there is no perfect solution.
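For acronyms that are spoken letter by letter, the article depends only on how the name of the first letter is pronounced. A short Python sketch, assuming the usual letter names (‘aitch’ for H, ‘you’ for U); the names `VOWEL_SOUND_LETTERS` and `article_for_spelled_acronym` are hypothetical:

```python
# Letters whose spoken NAME starts with a vowel sound:
# A (ay), E (ee), F (eff), H (aitch), I (eye), L (ell), M (em),
# N (en), O (oh), R (ar), S (ess), X (ex).
# U is excluded because its name starts with a 'y' sound ('you').
VOWEL_SOUND_LETTERS = set("AEFHILMNORSX")


def article_for_spelled_acronym(acronym: str) -> str:
    """Return 'a' or 'an' for an acronym read out one letter at a time."""
    first_letter = acronym[:1].upper()
    return "an" if first_letter in VOWEL_SOUND_LETTERS else "a"


for acronym in ["NSA", "FBI", "UFO", "CIA", "LOL"]:
    print(article_for_spelled_acronym(acronym), acronym)
```

This only covers the letter-by-letter reading. An acronym such as NASA, which is spoken as a word, still needs the sound dictionary or an explicit exception.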
How do we know if something is an acronym?
One suggestion in the discussion (I wasn’t involved in any part of it) was to check a word against a set of known acronyms, or to see whether the word was capitalised. Unfortunately capitalisation alone is not enough, as some people capitalise words that aren’t acronyms. The other issue is that new acronyms come into existence all the time. My preferred solution was therefore one of negation, i.e. I check whether a word is both capitalised AND NOT in the sound dictionary. If it IS in the sound dictionary then it is handled like any other word; if it isn’t, we simply guess that the acronym is spoken one letter at a time. If that guess is wrong for a particular acronym, we just need to add an exception for it, either to the sound dictionary or somewhere else in the code.
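Put together, the decision order described above might look like the Python sketch below. It is an illustration under stated assumptions, not the published code: ‘capitalised’ is interpreted as ‘written entirely in upper case’, the sample dictionary entries and the `exceptions` table are made up, and words that are neither in the dictionary nor upper case fall back to the first-letter test.

```python
from typing import Dict, List, Optional

VOWEL_LETTERS = set("aeiou")
# ARPAbet-style vowel phonemes (assumed notation, stress digits omitted).
VOWEL_SOUNDS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}
# Letters whose spoken name starts with a vowel sound (U = 'you' does not).
VOWEL_SOUND_LETTERS = set("AEFHILMNORSX")

# Hypothetical sample entries: word -> phonemes of its first pronunciation.
SOUND_DICT: Dict[str, List[str]] = {
    "nasa": ["N", "AE", "S", "AH"],
    "honest": ["AA", "N", "AH", "S", "T"],
}


def choose_article(
    word: str,
    sound_dict: Dict[str, List[str]] = SOUND_DICT,
    exceptions: Optional[Dict[str, str]] = None,
) -> str:
    """Return 'a' or 'an' for the given word."""
    key = word.lower()
    # 1. Hand-written exceptions override everything else.
    if exceptions and key in exceptions:
        return exceptions[key]
    # 2. Words (and acronyms such as NASA) found in the sound dictionary.
    phonemes = sound_dict.get(key)
    if phonemes:
        return "an" if phonemes[0] in VOWEL_SOUNDS else "a"
    # 3. Upper case AND NOT in the dictionary: guess it is spelled out.
    if word.isupper():
        return "an" if word[0] in VOWEL_SOUND_LETTERS else "a"
    # 4. Otherwise fall back to the first-letter test.
    return "an" if key[:1] in VOWEL_LETTERS else "a"


print(choose_article("NASA"), "NASA")  # found in the dictionary
print(choose_article("NSA"), "NSA")    # not found, so spelled out
print(choose_article("LOL", exceptions={"lol": "a"}), "LOL")  # forced
```

Checking the dictionary before the upper-case test is what lets NASA be treated as a word while NSA is spelled out, and the `exceptions` table gives a place to record cases such as LOL where neither guess matches common usage.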
Conclusion
I think I’ve got the accuracy up to around 99.9%, which is better than the 99.5% of the simple first-letter test. This will probably be more accurate than text written by humans, which can suffer from typos and/or poor grammar. If you do end up using this code then please adhere to the licensing terms and let me know what improvements can be made.
The Code (Available on GitHub)
You can get the code from the GitHub project called Anorak. Anorak stands for AN OR A Knowhow, i.e. the knowhow of knowing whether to use ‘an’ or ‘a’ before a word.
Alternative Solutions
I found https://github.com/Kaivosukeltaja/php-indefinite-article which is a port of the Lingua::EN::Inflect Perl module’s A() and AN(). It seems to cope with digit-based numbers, e.g. ‘an 18 foot van’, which my code currently doesn’t. It would be interesting to test that code against mine. It uses some shorthand rules to predict ‘a’ or ‘an’, whereas my code uses information from a sound dictionary.
Copyright Technology Wales 2016