Introduction

12dicts is a collection of English word lists. It differs in several
important ways from most of the other free word lists you can download.

  * The 12dicts lists are oriented towards common words. If you're
    looking for myriads of archaic, scientific or computer jargon words,
    you should look elsewhere.
  * The 12dicts lists have been rigorously checked for errors. (This is
    not to say that they are error-free, merely that enough care has
    been taken that errors are rather infrequent.)
  * 12dicts contains a variety of lists, of different sizes and
    characteristics. One size does not fit all. Because each list has
    different characteristics, I do not recommend combining them, except
    as noted below.

Originally, 12dicts was composed of lists derived from a specific set of
12 source dictionaries. In addition to these "classic" lists, 12dicts
now includes lists derived from other sources. It would perhaps be
appropriate to rename 12dicts to something more generic, such as BAWL
(Beale's Assorted Word Lists), but I have not done so in order to
preserve continuity.

A quick summary of the 12dicts lists and their characteristics is as
follows:

+---------------------------------------------------------+
|               |3esl |6of12|2of12|2of4brif|5desk|2of12inf|
|---------------+-----+-----+-----+--------+-----+--------|
|Size           |21877|32153|41236|60387   |61406|81520   |
|---------------+-----+-----+-----+--------+-----+--------|
|Abbreviations  |Y    |Y    |N    |N       |N    |N       |
|---------------+-----+-----+-----+--------+-----+--------|
|Acronyms       |Y    |Y    |N    |N       |Y    |N       |
|---------------+-----+-----+-----+--------+-----+--------|
|British English|N    |N    |N    |Y       |N    |N       |
|---------------+-----+-----+-----+--------+-----+--------|
|Hyphenations   |Y    |Y    |Y    |N       |N    |N       |
|---------------+-----+-----+-----+--------+-----+--------|
|Inflections    |N    |N    |N    |Y       |N    |Y       |
|---------------+-----+-----+-----+--------+-----+--------|
|Names          |Y    |Y    |N    |N       |Y    |N       |
|---------------+-----+-----+-----+--------+-----+--------|
|Phrases        |Y    |Y    |N    |N       |N    |N       |
+---------------------------------------------------------+

The remainder of this document is organized as follows:

  * This release
  * The classic 12dicts lists
      + The 6of12 and 2of12 lists
      + The 2of12inf list
  * The 3esl list
  * The 2of4brif list
  * The 5desk list
  * How 12dicts came to be
  * Conclusions

This release

This is release 4.0 of 12dicts, released Jan. 18, 2003. It differs from
previous versions by containing three additional lists which are not
derived from the "classic" 12dicts sources. Changes to the classic lists
are limited to error corrections.

The classic 12dicts lists

The 12dicts project began as the n-dicts project, n being a variable
whose value finally stabilized as 12. The purpose of the project was to
create a list of words approximating the common core of the vocabulary
of American English.

The methodology of the project was to record and correlate the words
listed in a number of small dictionaries. The number of dictionaries so
recorded is now 12, comprising 8 ESL (English as a Second Language)
dictionaries and 4 "desk dictionaries". The dictionaries chosen vary
widely by publisher, by style, by completeness and by depth. In this
version of 12dicts, all of them are dictionaries of American English
(three from British publishers). The smallest of them contains about
20,000 entries, and the largest 46,000. (All totaled, there are about
75,000 entries, many of which appear in only a single dictionary.) All
but two of them were published in the last seven years.

The 6of12 and 2of12 lists

I initially tried two different ways of winnowing the 12dicts data to
produce lists of common words. Both produced interesting results. One
list, the 6of12 list, contains all words and phrases listed in 6 of the
12 dictionaries. One way of describing this list is that it contains
those words and phrases which a (seeming) majority of lexicographers
believe are relevant to people learning English, and/or to everyday
usage. This list contains about 32,000 words and phrases. The other
list, the 2of12 list, is more inclusive in that it includes words listed
in as few as two of the source dictionaries, but less inclusive in that
it excludes items of various sorts, including multiword phrases, proper
names and abbreviations. This list contains about 41,000 words. It is
perhaps more suitable for use in areas like spell checking or word games
than the 6of12 list. (Honesty compels me to admit that neither of these
lists is, by itself, a good choice for spell checking, due to the
absence of inflections, proper names, Roman numerals, etc.)

A third list, 2of12inf.txt, developed later, is of a rather different
character, and is discussed in a later section.

A more precise description of the criteria by which the above lists were
composed is as follows:

6of12 list word selection

  * The 6of12 list contains all non-excluded words and phrases which
    appear in 6 or more of the source dictionaries.
  * Prefixes and suffixes are excluded. Abbreviations are included;
    however, if they are entirely lower-case and alphabetic, they are
    terminated with a colon (":") so they can be easily distinguished
    from regular words.
  * Inflections of included words are not themselves included unless
    they are separately defined or irregular.
  * It sometimes occurs that a word is listed in several forms (e.g.,
    with and without hyphenation) in 6 or more dictionaries, even though
    no single form is so listed. In this case, if one spelling is
    clearly more accepted, this spelling and this spelling only is
    listed. If all spellings seem equally accepted, one spelling has
    been selected arbitrarily for inclusion.
  * The 6of12 list contains a significant number of words which do not
    meet either criterion 1 or 4 above. These words, sometimes called
    "signature words", are discussed below. All of these words are
    listed in at least one of the source dictionaries.
  * In addition to the ":" suffix discussed above, other special suffix
    characters are used to mark words with certain characteristics, as
    discussed below.

2of12 list word selection

  * The 2of12 list contains all non-excluded words which appear in at
    least 2 of the source dictionaries.
  * This list excludes capitalized words, multiword phrases, and
    abbreviations, as well as prefixes and suffixes. It does not exclude
    hyphenated words or contractions. If a word occurs in both a
    hyphenated and an unhyphenated form, the unhyphenated form is
    listed, even if the hyphenated form is generally preferred.
  * The list excludes spellings which are considered (by a majority of
    the dictionaries listing it) to be non-American usage. It also
    excludes secondary spellings which are mentioned by fewer than four
    of the source dictionaries.
  * Inflections of included words are not themselves included unless
    they are separately defined, or irregular.
  * Several of the source dictionaries include listings for obscure
    currencies, such as ringgit, khoum and ngwee. I was unable to regard
    such words as part of the English "core vocabulary", and so I
    required citation in over a third of the dictionaries for inclusion
    of monetary units. A side effect was the elimination of the word
    lepton, which, in addition to its use in particle physics, is also
    0.01 Greek drachmas.
  * This list also includes a small number of signature words, as
    discussed below.
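
The counting rule that underlies both lists can be illustrated with a
short Python sketch. It shows only the "appears in at least N sources"
threshold; the exclusions, spelling decisions and signature words
described in this document are not modeled. The function name
select_by_citations and the three tiny sample "dictionaries" are
invented for illustration and have nothing to do with the real sources.

```python
from collections import Counter


def select_by_citations(dictionaries, minimum):
    """Return the sorted entries found in at least `minimum` dictionaries.

    `dictionaries` is a list of sets, one set of entries per source
    dictionary (a hypothetical in-memory representation).
    """
    counts = Counter()
    for entries in dictionaries:
        # Each dictionary contributes at most one citation per entry.
        counts.update(set(entries))
    return sorted(word for word, n in counts.items() if n >= minimum)


# Invented sample data: three tiny "dictionaries".
sources = [
    {"apple", "beer", "ringgit"},
    {"apple", "beer"},
    {"apple", "log"},
]
common = select_by_citations(sources, 2)
```

With the real data, a threshold of 6 would correspond to the first
criterion of the 6of12 list and a threshold of 2 to that of the 2of12
list.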

Signature words

As indicated, both lists have been augmented with words (and, in the
case of the 6of12 list, phrases) which fail to meet the formal
requirements for inclusion. In the case of the 6of12 list, 1024 words
were added (about 3 % of the total). These are all words which, in the
judgment of the compiler, are as familiar as many of the words which met
the criteria for inclusion. Examples of some of the sorts of words which
were added are:

  * Words of the same category as other included words. An example is
    the astrological sign Cancer, which alone of all the astrological
    signs fails to appear in 6 or more of the dictionaries. Similarly
    added were the omitted holidays Thanksgiving and Christmas Eve.
  * Vulgarities, sexual terms and insults. Some such words were already
    included, but most of the source dictionaries were quite squeamish
    about them. These words are very widely known indeed; I hold that
    any list of "common" words which does not include the infamous
    f-word is simply discredited thereby. Some may feel that it would
    have been better to leave some or all of these terms unmentioned.
    Nevertheless, the expression of blasphemy, unwarranted contempt and
    perverse lust, whether in words or in deeds, is a very human trait.
    Suppressing the evidence of these aspects of the human condition in
    our language makes no more sense than excluding leprosy, gangrene
    and dementia, no matter how unpleasant they may be to contemplate.
  * Conventional conversational phrases so common as to be practically
    invisible to native speakers. Examples are thank you, good night,
    uh-huh, of course and gesundheit.
  * Sports terminology, especially for football and baseball. (If I, who
    am practically sports-blind, noticed this deficiency, it must be of
    major proportions indeed.)

Note that the signature words in the 6of12 list can be identified via
the suffix character "+", and eliminated if desired.

A much smaller set of words (49) was added to the 2of12 list. These were
of two sorts:

  * Signature words from the 6of12 list which were not already present
    in the 2of12 list, and which are not excluded due to being
    abbreviations, phrases, etc.
  * Inflections of irregular verbs not explicitly mentioned in 2 source
    dictionaries, such as outfought and reheard.

Annotations

Some of the 6of12 list entries are annotated with a suffix character,
giving additional information about the associated word. The annotations
can be easily removed with an editor or script if they are unwanted.

These annotations are:

: The word is an otherwise unmarked abbreviation. This suffix may appear
  in combination with another suffix.                                   
& The word is primarily a non-American usage.                           
# The word is generally held to be a variant or less preferred form of  
  another word.                                                         
< This form of a word is held to be the primary form by fewer           
  dictionaries than some other form of the word.                        
^ This form of the word was selected arbitrarily from a set of variants,
  none of which was clearly preferred.                                  
= Roughly, this indicates a "second class" word, as described below.    
+ The word is a signature word.                                         

The reasons a word might be marked with the = annotation are:

  * The word is an inflection which was defined in the same entry as the
    base word.
  * The word is a derived word (-ly, -ness or -er/or) which was not
    defined in a separate entry.
  * The word appeared in a list of undefined words with a common prefix,
    such as un- or re-.

The words in the 2of12 list are not annotated.
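
For anyone who prefers a script to an editor, the following Python
sketch strips the annotation suffixes and can optionally drop the
signature words (those marked "+"). It rests on assumptions that should
be checked against the actual file: one entry per line, annotation
characters appearing only at the end of an entry, possibly several in a
row, and no legitimate entry ending in one of those characters. The
function names and the sample entries are invented for illustration.

```python
# The suffix characters described in the annotation table.
ANNOTATION_CHARS = ":&#<^=+"


def split_entry(line):
    """Split one line into (word, annotations), e.g. 'etc:' -> ('etc', ':')."""
    entry = line.strip()
    word = entry.rstrip(ANNOTATION_CHARS)
    return word, entry[len(word):]


def plain_words(lines, drop_signature=False):
    """Return entries without annotations, optionally omitting '+' entries."""
    words = []
    for line in lines:
        word, marks = split_entry(line)
        if not word:
            continue  # skip blank lines
        if drop_signature and "+" in marks:
            continue  # signature word, not wanted
        words.append(word)
    return words


# Invented sample entries, not copied from the real list.
sample = ["apple", "etc:", "colour&", "good night+", "quickly="]
cleaned = plain_words(sample)
no_signature = plain_words(sample, drop_signature=True)
```

To process a real file, the lines could be read with something like
open(path, encoding="ascii") and passed to plain_words; the file name
and encoding are for the user to confirm.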

The 2of12inf list

The 2of12inf list is of a rather different character from the two
original "classic" lists. Conceptually, it is simple. It consists of all
the words in the 2of12 list, plus their inflections, amounting to about
81,000 words. This list may be more useful than the other lists for
applications like word games. It was created to help Kevin Atkinson in
his Aspell and SCOWL projects. Unlike the
6of12 and 2of12 lists, this list is not based exclusively on the
contents of my 12 source dictionaries, and for this reason it has, I
feel, less authority than the other classic 12dicts lists. It also
probably has a significantly higher error rate than the other lists, for
reasons explained below.

The criteria defining the 2of12inf list are as follows:

  * The 2of12inf list contains all non-excluded words which appear in at
    least 2 of the source dictionaries.
  * This list excludes capitalized words, multiword phrases,
    abbreviations, contractions, hyphenated words and single-letter
    words, as well as prefixes and suffixes.
  * The list does not exclude secondary spellings, non-American usages
    or monetary units.
  * The list includes inflections of all included words. Any inflection
    mentioned or clearly implied by any of the source dictionaries is
    included (i.e., two citations are not required). Additionally, some
    inflections have been added from other sources.
  * Plurals of "uncountable" nouns were included, annotated with the "%"
    suffix character. See below for an extended discussion of the
    inclusion of these words.
  * Signature words from the other lists, plus their inflections, were
    added. No other signature words were added.

Though the 2of12inf list still consists mostly of very common words,
criteria 3 through 5 above cause the 2of12inf list to contain a greater
proportion of unfamiliar and unusual words than the other classic
12dicts lists.

The 2of12inf list was not derived directly from the 12 source
dictionaries. The starting point was a subset of Kevin Atkinson's AGID
list, a list of words, parts of speech and inflections derived from
public-domain sources, notably Moby Words and WordNet. (See the file
agid.txt in the 12dicts archive, which is a copy of the AGID "readme",
for more information on the antecedents of AGID.) 2of12inf was created
by a process of editing the AGID subset to remove spurious entries and
those which reflected a more esoteric English vocabulary than the other
12dicts lists, and to add inflections which AGID failed to identify.
This process required significantly less effort than would have been
needed to derive the list directly from the source dictionaries.
Unfortunately, a side effect of the process is that the result is likely
to be somewhat less reliable than the other 12dicts lists. In
particular, Moby Words is notoriously unreliable, and I find it unlikely
that I have successfully identified all the spurious inflections its use
has introduced. It is my hope in the future to release another edition
of 2of12inf which is not derived from AGID, and therefore not "infected"
by Moby Words.

After the first version of the 2of12inf list was released, I replaced
one of the source dictionaries, officially an international dictionary
but in actuality rather British in its orientation, with a more American
dictionary by the same publisher. It was not practical (nor necessarily
desirable) for me to go through the list removing inflections endorsed
only by the superseded dictionary. For this reason, the 2of12inf list
has a slightly more international character than the other 12dicts
lists. It is not altogether clear that this is a bad thing.

Selection of inflections

Ideally, the 2of12inf list would contain only inflections listed in one
of the 12dicts source dictionaries. This proved not to be practical. The
reason for this has to do with the nature of these sources, which are
mostly ESL dictionaries. An ESL dictionary might well list the word 
esophagus, but, because an English learner is unlikely to need to talk
about this organ in the plural, it will probably not bother to list the
plural form esophagi. For words of this sort, I therefore needed to
obtain their inflections from other sources. Obviously, the decisions on
when to include additional inflections were judgment calls, as were the
choices of which inflections to add.

Adjectival inflections (comparatives and superlatives) proved to be an
especially annoying problem. Only 2 of my 12 source dictionaries
provided remotely reliable information of this sort. In fact, such
information is sparse and inconsistent in most dictionaries of any size.
I relied on a small set of additional dictionaries for this information,
which was mostly disjoint from the sources for plurals and verb forms.
Several of these sources were Scrabble(r)-related, and therefore
inclined to include forms of little plausibility such as iller/illest or
fertiler/fertilest. Accordingly, I ended up rejecting some of the
documented inflections on grounds of implausibility. I have no doubt
that, in the process, I made a number of errors of both inclusion and
exclusion and, in any case, many of the forms listed have no connection
with any of the 12dicts source dictionaries.

One additional problem in the creation of the 2of12inf list was that of
"uncountable" nouns and their plurals. Some English dictionaries,
especially ESL dictionaries, as well as other linguistic sources attest
to the existence of nouns which cannot be counted, or used in the
plural. Examples of such nouns include mud, rayon, oregano, chess,
fairness, wisdom, aluminum, training, materialism and chickenpox. This
is an entirely commonsense notion, but a difficulty is the fact that the
boundary between the countable and the uncountable is extremely vague
and ill-defined. For example, the word coffee is ordinarily uncountable,
but not when ordering in a restaurant; likewise, the word symmetry is
ordinarily uncountable, except in physics or math. In general, it is
possible to contrive a context where use of the plural of any noun
whatsoever is reasonable.

An alternate position, therefore, is that in fact no nouns are
uncountable, and that any noun which is not already plural possesses a
plural. This position is especially useful in the context of word games,
where words such as zeals and anthraxes may produce large scores. For
this reason, the official Scrabble dictionaries list words such as 
thens, onces and mankinds, which most people find rather implausible.
The fact that the 2of12inf list might well be useful in gaming contexts,
together with the fact that the boundary between countable and
uncountable nouns is so ill-defined, served as a powerful argument for
inclusion of all plural forms, whether commonly used or not, while its
derivation from ESL sources argued for including only the plurals of
countable nouns, however distinguished.

In the end, I was unable to resolve this dilemma, and adopted a
compromise. The 2of12inf list includes all plurals, but with the plurals
of uncountable nouns marked, making it easy to remove them if they are
not wanted. That left the issue of how to establish countability. Six of
my source dictionaries included information on countability, which was
adequate to decide the status of most of the included nouns. As for the
rest, as usual, I used my best judgment. I will confess to occasionally
overriding the source dictionaries when I believed they were clearly
incorrect. (For instance, I chose not to mark the word hatreds as an
uncountable plural, in defiance of the opinion of all my sources, on the
grounds that it has been used in too many news stories from Bosnia to be
considered unusual.) It is interesting to note that most of the plurals
I added from auxiliary sources were of words considered uncountable.
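
A minimal Python sketch of the two obvious ways to treat the marked
plurals follows: discard them, or keep them with the marker removed. It
assumes one entry per line and that "%" appears only as the final
character of a marked entry. The function name and the sample entries
are invented for illustration.

```python
def handle_uncountable_plurals(lines, keep=False):
    """Drop '%'-marked entries, or keep them with the marker stripped."""
    result = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if entry.endswith("%"):
            if keep:
                result.append(entry[:-1])  # keep the plural, lose the mark
        else:
            result.append(entry)
    return result


# Invented sample entries, not copied from the real list.
sample = ["mud", "muds%", "note", "notes"]
countable_only = handle_uncountable_plurals(sample)
all_plurals = handle_uncountable_plurals(sample, keep=True)
```

The first form suits uses where implausible plurals are unwelcome; the
second suits word games, where every plural may be wanted.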

The difficulties listed above, and the fact that I was forced to
exercise personal judgment frequently in creating it, emphasize a
fundamental difference between this list and the other classic 12dicts
lists. I have tried to make the 6of12 and 2of12 lists reflect only the
source dictionaries, and to keep my own judgments and opinions out of
the picture (except for my addition of signature words). This has proved
impossible to achieve for the 2of12inf list, which accordingly
represents a less authoritative and more arbitrary collection.
Additionally, the 2of12inf list has undergone less proofreading and
validation than the other lists, and I suspect the error rate is
considerably higher than the idealistic goal of 0.02 % I advocate
elsewhere in this document. Nevertheless, I hope it may prove to be of
some use and interest.
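
To put the 0.02 % goal in concrete terms, the short Python calculation
below converts it into an expected number of erroneous entries for each
list, using the sizes from the summary table. It is plain arithmetic,
not a measurement of the actual error rate of any list; the names used
are invented for illustration.

```python
# List sizes as given in the summary table of this document.
LIST_SIZES = {
    "3esl": 21877,
    "6of12": 32153,
    "2of12": 41236,
    "2of4brif": 60387,
    "5desk": 61406,
    "2of12inf": 81520,
}

GOAL_RATE = 0.02 / 100  # 0.02 %, i.e. 2 errors per 10,000 words


def errors_at_goal(size, rate=GOAL_RATE):
    """Number of erroneous entries a list of `size` words has at `rate`."""
    return size * rate


expected = {name: errors_at_goal(size) for name, size in LIST_SIZES.items()}
```

At the goal rate, a list the size of 2of12inf would contain roughly 16
errors.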

I wish to offer my special thanks to Kevin Atkinson, for supplying me
with the AGID list, and for encouraging me to add the inflections. Of
course, any errors that remain in the 2of12inf list are my own
responsibility, and should not be blamed on Kevin, AGID, or even on
Moby.

The 3esl list

The 3esl list represents another attempt to produce an English "core
vocabulary" list. It is about 2/3 of the size of the 6of12 list, which
it resembles in terms of the sorts of words included.

The 3esl list is a far more subjective list than any of the classic
12dicts lists. It was compiled from 3 small ESL dictionaries, using the
same criteria for eligibility as the 6of12 list. I started with a list
composed of all words from the smallest of the 3 sources, plus all words
contained in both of the others. This list was then edited in the
following ways:

 1. I removed alternate spellings for included words, such as grey and 
    off-stage. I also removed very similar synonyms for the same
    concept, for instance, removing cable television as a duplicate of 
    cable TV.
 2. I added one form of each word which would have been included if the
    sources had agreed on spelling, such as shortchange and back seat.
 3. I removed some words which were present in the smallest of the
    sources but seemed too esoteric, such as the symbols of chemical
    elements. I did this only for words which were not present in the
    other sources.
 4. I added some words which were present in only one of the two larger
    sources, but which seemed appropriate to add. These words were
    frequently of the sort added to the 6of12 list as signature words,
    as well as some inflections that often function as words with
    meanings of their own, such as comforting and notes.

All of these changes were quite subjective in nature, and quite
numerous. Probably more than 10 % of the candidate words were added or
removed in this way. For this reason, it is pointless to speak of
signature words for this list; the composition of the list is too
arbitrary for the term to make any sense. (I will note that the list is
still not entirely arbitrary, as I added only words found in some form
in one of the sources, and removed no words present in two of the
sources other than duplicates. Thus, words like front page were not
added, no matter how familiar, and words such as lugubrious were not
removed, despite clearly not being part of any "core vocabulary".)

Like the 6of12 list, the 3esl list marks lower-case abbreviations with a
":" suffix, to prevent them from being mistaken for regular English
words.

One final note on this list. The 3esl list contains about 1500 words not
present in the 6of12 list. Because these two lists have the same rules
for the kinds of words included, one could easily combine the two to
produce a slightly larger list including a number of words whose
omission from 6of12 is rather surprising. Be warned that in a few cases,
the spelling chosen for words with multiple spellings is different in
the two lists, and I would recommend that the duplicates be removed.
(I'll be happy to provide a list of the duplicates if anyone wants one.)
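
As a starting point for such a merge, the Python sketch below combines
two lists and reports groups of entries that differ only in hyphenation
or spacing. This is a heuristic of my choosing for illustration, not a
description of how the duplicates were actually identified: it will
miss variants such as grey/gray, and it assumes the annotation suffixes
have already been removed from both lists. The function names and
sample entries are invented.

```python
from collections import defaultdict


def spelling_key(word):
    """Collapse hyphens and spaces so 'back seat' and 'backseat' match."""
    return word.replace("-", "").replace(" ", "")


def merge_with_variants(list_a, list_b):
    """Return (merged, variants).

    merged   -- sorted union of both lists
    variants -- groups of entries differing only in hyphenation/spacing
    """
    merged = sorted(set(list_a) | set(list_b))
    groups = defaultdict(list)
    for word in merged:
        groups[spelling_key(word)].append(word)
    variants = [group for group in groups.values() if len(group) > 1]
    return merged, variants


# Invented sample entries, not copied from the real lists.
esl = ["apple", "back seat", "beer"]
six = ["apple", "backseat", "log"]
merged, variants = merge_with_variants(esl, six)
```

Each reported group still needs a human decision about which spelling
to keep.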

The 2of4brif list

All of the classic 12dicts lists are unabashedly oriented towards
American English. I've received a few expressions of interest in a
British English list. The result is the 2of4brif list. This list was
compiled from 4 large "international" ESL dictionaries, published by
British publishers. To this American, they are more British than they
are international; quite possibly, they seem more American than
international to British readers. It is interesting to note that,
although there were only a third as many sources for this list as for
the 12dicts lists, these dictionaries resembled each other far more
closely than their American counterparts, which could mean that the
2of4brif list is as good an approximation of a "core" British English
vocabulary as the 6of12 list is for American English. (Or, alternately,
it may simply mean that my choice of sources was too narrow.)

The criteria for inclusion in this list were basically those of the
2of12inf list. In particular, inflections are included for all words,
but hyphenated words, contractions, phrases, proper names and
abbreviations are all excluded. One important difference between the two
is the way in which inflections were determined for inclusion. The
2of12inf list includes some inflections found in one (or even none) of
its sources. Further, as discussed in detail above, it includes plurals
for words which are not normally considered to have plurals. The
2of4brif list differs in both of these regards. It includes only
inflections endorsed by two or more of the sources, specifically
excluding any plural forms for nouns listed as uncountable.

The 2of4brif list includes no signature words as such. I made a small
number of adjustments for consistency, such as making sure that -ise and
-ize spellings were equally represented, and adding plurals for ordinal
numbers. (Why fourteenth would be defined as a fraction, but not 
seventeenth, I must simply regard as a mystery.) These edits were so
few, and so clearly harmless, that I have not marked them.

Prospective users of the 2of4brif list should realize that it was
compiled by an American. If my sources contained any glaring errors (and
most dictionaries have a few), I might well not have noticed, and
perpetuated them in the list. The fact that two citations were required
is some protection against such an event, but no guarantee.

As the 2of4brif list is very similar in makeup to the 2of12inf list, a
user who wants a larger, more international list than either could
reasonably merge the two. If you do this, you should remove the unusual
plurals (marked with a "%") from the 2of12inf list in the process, for
consistency.
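
The merge suggested above might look like the following Python sketch.
It assumes, without confirmation from the files themselves, that each
list holds one entry per line and that "%" is the only annotation
character in 2of12inf. The function name and the sample entries are
invented for illustration.

```python
def merge_inflected_lists(brif_lines, inf_lines):
    """Merge 2of4brif-style and 2of12inf-style entries into one sorted list.

    Entries from the second list that end in '%' (plurals of
    uncountable nouns) are left out, for consistency with the first.
    """
    merged = {line.strip() for line in brif_lines if line.strip()}
    for line in inf_lines:
        entry = line.strip()
        if entry and not entry.endswith("%"):
            merged.add(entry)
    return sorted(merged)


# Invented sample entries, not copied from the real lists.
brif = ["colour", "colours", "note"]
inf = ["color", "colors", "muds%", "note"]
combined = merge_inflected_lists(brif, inf)
```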

The 5desk list

I created the 5desk list in an attempt to do a better /usr/dict/words
(about which I offer many harsh criticisms elsewhere in this document).
The sorts of words admitted are the same sorts that /usr/dict/words
contains. Though somewhat larger in size than most versions of
/usr/dict/words, this is still a short word list, striving for inclusion
of words one is likely to encounter rather than the complete jargon of
every possible scientific, artistic or occult endeavor.

5desk was assembled primarily from five "desk dictionaries". It was
augmented by words from five minor sources, including a "vocabulary
builder" and a collection of proper names. The list excludes prefixes,
suffixes, phrases, hyphenated words, contractions and most abbreviations
and acronyms. There was no requirement for multiple listings; all
qualifying words from each of the sources were included. Inflections of
included words were not included themselves except when irregular, or
separately defined. Variant and non-American spellings were not
excluded, and no signature words were added.

Words commonly considered to be abbreviations/acronyms were included if
they contained at least one upper case character, and were defined with
an explicit part of speech. This excluded items like Mr and Feb, which
are abbreviations in the classic sense, but allowed words like DNA and 
ATM, which are used far more frequently than that which they abbreviate.
While there is a trend in modern dictionaries to list such words as
nouns (or occasionally verbs, adverbs, etc.), it is a trend in progress,
and rather inconsistently applied. For this reason, the set of such
words in the 5desk list is somewhat incoherent, including SPCA but not 
PETA, AIDS but not SIDS, KGB but not CIA, and PDQ but not ASAP.
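
The admission rule for abbreviations and acronyms can be written as a
small predicate. The Python sketch below is only an illustration of the
two stated conditions; the function name is invented, and the part of
speech attached to each sample word is an assumption made for the
example, not a record of what any source dictionary says.

```python
def admits_abbreviation(word, part_of_speech):
    """Apply the two conditions described above.

    `part_of_speech` is None when the dictionary defines the item
    only as an abbreviation, with no explicit part of speech.
    """
    has_upper = any(ch.isupper() for ch in word)
    return has_upper and part_of_speech is not None


# Illustrative inputs; the part-of-speech values are assumed.
samples = [("DNA", "noun"), ("ATM", "noun"), ("Mr", None), ("Feb", None)]
admitted = [word for word, pos in samples if admits_abbreviation(word, pos)]
```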

One class of commonly-used words is regrettably absent from the 5desk
list, because I was unable to find a satisfactory source for them. This
is the class of commercial names such as Exxon, Tylenol, Pepsi and
Chevy. This is probably forgivable, as this class of names is as
ephemeral and transitory as teenage slang. The one-time household words
Kool, Ovaltine, Philco and Ipana serve now only as answers to trivia
questions, with modern wonders like Starbucks, Google, Ritalin and TiVo
taking their place on the tongues of the trendy.

The 5desk list has clearly moved beyond any "core vocabulary" concept.
It includes quite esoteric words (ogee, pleonastic), very uncommon
spellings (thiamine, yuppy), and obscure geographical and historical
names (Paricutin, Nevelson). Like /usr/dict/words, it is frequently
inconsistent and arbitrary, but I hope at the least I have avoided
including spelling errors, and overlooking the stuff of everyday
conversation. Perhaps it will be useful as a compromise between basic
lists such as 3esl, and truly massive lists like Mendel Cooper's ENABLE.

How 12dicts came to be

It may have occurred to some to wonder about how something like the
n-dicts project came to be (though I assume that anyone who bothers to
download this archive must already have some idea that such a project
could be of interest).

Some years ago, there was a post to the sci.crypt Usenet newsgroup, on
the subject of creating PGP passphrases using randomly selected entries
from a supplied list of very short words. The word list, which was
extracted from /usr/dict/words on some UNIX system, seemed to
me ill-suited to its intended purpose. It included arcane acronyms
(bstj, fmc), misspellings (diety, ouvre) and words of amazing obscurity
(bhoy, kombu). I decided I could do better (and eventually did). This
caused me to start downloading English word lists, of which there are
many, from the Internet. I was not impressed by the overall quality of
these lists, and the few which were high-quality were all-inclusive,
burying the everyday words under a mountain of archaisms and esoterica.
The flaws of the vast majority of these lists are worth recounting:

  * Failure to proofread. Many of these lists are littered with
    misspellings and typos, sometimes approaching gibberish. (I presume,
    for instance, that the bizarre string nondploe, which was found in a
    purported Scrabble word list, is a typo for something more or less
    legitimate, but I have no idea what.) Working on my own lists has
    helped me understand that 100 % accuracy is a very demanding goal,
    seldom actually achieved, but I still feel it reasonable to expect
    no more than 1 or 2 errors per 10,000 words.
  * Acceptance of completely undocumented lazy spellings, such as 
    bullseye and courtmartial.
  * Failure to respect capitalization.
  * Failure to distinguish abbreviations from other entries.
  * Treating esoteric computer jargon, and especially UNIX jargon, as
    everyday English. (Beware any list which includes bitblt, emacs,
    inode or lvalue.)
  * Apparently random word selection. For instance, the most common
    version of /usr/dict/words contains a large set of apparently
    randomly chosen personal names (uncapitalized, of course, and
    missing wanda, marge, polly and sid).
  * Inconsistent inflection. Some lists include all inflections of their
    vocabulary, while others include only singulars and infinitives.
    Either policy is fine, and has its advantages. I am personally very
    annoyed when inflected forms appear at random. I find this generally
    happens when a compiler merges several lists with different
    characteristics, with no attempt to reconcile their divergent
    styles.
  * Omission of everyday words. I've seen a purported general-purpose
    list that includes bremsstrahlung, yet omits log and beer. Or that
    includes saxophone but not sax, and rhinoceros but not rhino. Of
    course, due to my original purpose in seeking out common short
    words, I found this especially annoying.

One result of my frustration with this situation was my working with
Mendel Cooper on ENABLE, which was close to unique in having an active
caretaker, one clearly concerned with quality, and in being oriented
towards American rather than British English. But ENABLE is an
all-encompassing list and, even if it had been complete at the time I
started my search for a list of common words, it would not have been
what I wanted for that reason.

I finally decided that only starting from scratch with a systematic
approach was likely to get me what I was looking for, and that
dictionaries intended for non-native speakers of English were the best
possible source for words that are in some cases so familiar that we
never think of them. This has led to the 12dicts lists, which I hope
have managed to avoid the flaws recited above.

(I should acknowledge one form of inconsistency exhibited by the 12dicts
lists, which is that sometimes related words are spelled inconsistently.
For instance, the 2of12 list contains both broadminded and 
broad-mindedness. This generally occurs as a result of the methodology
used to build the lists. In the case of broadminded, only one source
dictionary listed broadmindedness, which was therefore excluded. I felt
unequal to trying to correct these inconsistencies, some of which are
real and not mere artifacts of 12dicts, such as the contrast between 
self-conscious and unselfconscious.)

Conclusions

When I released the first version of 12dicts in 1999, I assumed I was
done with it. It hasn't worked out that way. Before I declare it
finished for a second time, there are a few more things I'd like to
accomplish.

  * As mentioned above, I would like to rework the 2of12inf list to
    remove the dependency on the Moby lists.
  * As may be seen by inspecting the table of file characteristics, the
    12dicts files now form a spectrum of word lists, with contents
    ranging from the extremely common to the mildly esoteric. I would
    like to extend the spectrum further by applying the 12dicts
    methodology to dictionaries of larger size. Whether I will ever get
    the time for a project this large remains to be seen. If it ever
    comes to pass, it will probably be released separately from 12dicts
    itself, as anything larger than the 5desk list will be too large to
    even pretend to represent a "core English" vocabulary. (Even the
    5desk list itself is too large for that purpose.)
  * It is possible that in the future the "n" of n-dicts will increase
    again, but, in fact, consideration of an additional dictionary now
    generally ends with the discovery that its vocabulary matches
    12dicts pretty closely. At the very least, this phenomenon gives me
    hope that the 12dicts lists have now fulfilled their basic purpose.

The 12dicts lists were compiled by Alan Beale. I explicitly release them
to the public domain, but request acknowledgment of their use.
(Actually, the dependency of the 2of12inf list on AGID prevents its
release into the public domain. However, I do not impose any additional
requirements on its use beyond those imposed by AGID and its sources, as
described in agid.txt.) Feel free to send comments, suggestions,
inquiries and/or large sums of money to me at biljir@pobox.com. If you
find 12dicts useful, I'd love to hear about it.
