The ANC will provide a massive body of language data in contemporary American English, similar to the British National Corpus (BNC) produced ten years ago. This corpus will enable dictionary makers, linguists, and developers of language understanding software to analyze the ways in which Americans typically use the English language, to represent that usage appropriately in dictionaries, other reference works, and academic studies of linguistic phenomena, and to handle American usage in web search engines, machine translation systems, and other language processing software.
To this end, the ANC invites contributions of language data, including published and unpublished written and spoken (i.e., transcribed) documents of all genres, including fiction, non-fiction, poetry, newspapers, magazines, journals, pamphlets, diaries, etc., as well as web-based language data such as blogs, web pages, and email, and other less common genres such as rap lyrics.
Note that the ANC project has not enjoyed the funding and contribution of language data that projects such as the BNC relied on for their completion. Instead, we depend for our success on contributions of individuals like you to provide us with enough data to construct a representative sample of English as written and spoken by Americans today. In turn, your contribution will help to define "American English" for decades to come.
The American National Corpus includes written and spoken (i.e., transcriptions) materials that fulfill the following requirements:
If you have any doubt or questions about the suitability of your contribution, please do not hesitate to contact us and we will get back to you right away.
We may choose not to include a document in the corpus at all if there is some doubt that the author is a native speaker of American English, or if we are unable, for technical reasons, to extract meaningful information from it (more on this below).
Copyright is the exclusive legal right of the author of a work to control the copying of that work. Most information about copyright on the web and elsewhere concerns whether your use of something (especially, these days, material from the web) violates the creator's copyright. Here, we are concerned with ensuring that we do not violate your copyright when you contribute data to the ANC.
This matters because in the USA, almost everything created privately and originally after March 1, 1989 (the date the Berne Convention took effect in the United States) is copyrighted to the creator and protected whether it has a copyright notice or not. Therefore, if you produce a document, web page, or any other work, you own the copyright unless you have either
Unless you have put your document in the public domain (in which case we can include it in the ANC with no problem), we therefore must have your permission to include it in the ANC and re-distribute it. By contributing a text through the ANC upload page, you agree to the license agreement at the bottom of this page. Agreeing to this license does not transfer copyright to the ANC.
If you have any qualms, please consult the Frequently Asked Questions page to learn why granting us the right to include your document does not put you in danger of others reproducing or "stealing" your work. If you still have qualms, note that documents may be contributed in part, for example, by extracting non-contiguous segments, such as chapters 1, 2, 4, 5, 8, and 9 from a book. However, to be useful for linguistic analysis, extracts included in the ANC must be relatively long and coherent--that is, we cannot use "every other sentence" in a text, but "every other chapter" is fine.
For more information about copyright, we refer you to Brad Templeton's A brief intro to copyright and 10 Big Myths about copyright explained.
We accept documents in almost any format. However, because of the massive amount of data we are processing, it is essential that we process documents automatically rather than by hand. In our case, "processing" means rendering the document in an XML format, where, ideally, titles, headings, words in italics, etc. are marked with specific tags identifying them as such. So, in addition to needing texts that are easy to process, we prefer texts in which things such as titles and italicized words are clearly identified. Any document produced with a word processor or marked up in HTML as a web page will usually contain this information (1) if the markup is, where possible, descriptive rather than presentational (i.e., tags that say what the content is rather than how it should look, as when you use <em> (emphasis) instead of <i> for italic); and (2) if markup is used consistently.
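To illustrate why descriptive markup matters for automatic processing, here is a minimal Python sketch. It is not the ANC's actual pipeline; the class name `DescriptiveMarkupCollector`, the function `collect`, and the set of tracked tags are assumptions made for this example. It uses the standard library's `html.parser` to pull out the text that a document explicitly labels as a heading or as emphasized.

```python
from html.parser import HTMLParser


class DescriptiveMarkupCollector(HTMLParser):
    """Collect text found inside descriptive tags such as <em> and <h1>.

    Hypothetical helper for illustration only; the tag set is an assumption.
    """

    TRACKED = {"em", "title", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of tracked tags that are currently open
        self.found = []      # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if self.open_tags and text:
            self.found.append((self.open_tags[-1], text))


def collect(html):
    """Return the (tag, text) pairs for all tracked descriptive tags."""
    parser = DescriptiveMarkupCollector()
    parser.feed(html)
    parser.close()
    return parser.found


sample = "<h1>My Essay</h1><p>This is <em>very</em> important, <i>really</i>.</p>"
print(collect(sample))
```

In this sample the word marked with <em> is recovered as emphasis, while the word marked with <i> is not: a processor that looks for meaning cannot tell whether <i> signals emphasis, a title, or a foreign word.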
The following are some rules of thumb concerning formats. Documents that are very difficult to process automatically will likely not be included in the ANC, so we ask that if you have a choice, please submit your document in a format as near the top of the following list as possible:
Your document(s) will be very easy to process if
- they are marked up with well-formed XML and use a "standard" vocabulary such as XCES, TEI, or DocBook.
- they are Word doc or docx files, or rtf files, and you have made consistent use of the styles defined by Word or that you defined yourself.
- they are marked up with well-formed XHTML and use the "strict" XHTML DTD.
- they are "plain text" with blank lines between titles, headings, and paragraphs. Note: only send plain text files as a last resort. Also double-check that all characters in your document display correctly when saved as a plain text file. Try to use UTF-8 or UTF-16 (if that is an option) when submitting text files. But note that a lot of information is lost when we get plain text format--it is harder for us to identify a title as a title if your document does not contain this information explicitly (as a word processor or HTML document would).
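For plain text submissions, the two checks mentioned above (valid encoding, and blank lines between blocks) can be tried before uploading. The following Python sketch is only an illustration under stated assumptions: the function name `read_paragraphs` is hypothetical, UTF-8 is assumed as the encoding, and this is not the code the ANC itself runs.

```python
import re


def read_paragraphs(raw_bytes):
    """Decode bytes as UTF-8 and split the text into blocks at blank lines.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    """
    text = raw_bytes.decode("utf-8")
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # One or more blank lines separate titles, headings, and paragraphs.
    blocks = re.split(r"\n\s*\n", text.strip())
    # Collapse line wrapping inside each block to single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]


sample = (
    "My Title\n\n"
    "First paragraph,\nwrapped over two lines.\n\n\n"
    "Second paragraph with a caf\u00e9.\n"
).encode("utf-8")

for block in read_paragraphs(sample):
    print(block)
```

Note that the output is just a list of blocks: nothing in the file says that the first block is a title, which is exactly the information that is lost in plain text.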
Your document(s) will be relatively easy to process if
- they are marked up with well-formed XML
- you (or someone else) marked them by hand in HTML--i.e., they were not produced by a web-page generating program such as Dreamweaver or FrontPage
Your document(s) will be harder to process if
- they were machine-generated in HTML by a program like FrontPage, Dreamweaver, etc.
- they are in PDF
Your document(s) will be virtually impossible for us to process if
- they are in Quark, InDesign, or some other "publishing" software format
- they are in double-column PDF
- they contain very non-standard fonts
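Several of the categories above depend on whether a document is well-formed XML. As a rough self-check before contributing, a sketch like the following can be used. It relies on Python's standard `xml.etree.ElementTree` parser; the function name `is_well_formed` is an assumption for this example. It tests well-formedness only, not validity against a DTD or a vocabulary such as TEI, and XHTML that uses named entities like `&nbsp;` may be reported as not well-formed because the parser does not load the XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


good = "<doc><title>A Story</title><p>Once upon a time.</p></doc>"
bad = "<doc><p>Once upon a time.</doc>"  # <p> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```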
If you are contributing multiple documents that contain the same kind of texts (for example, several essays, a group of stories, etc.) you can do so in a single upload as follows:
By default, the author of a contributed document is identified in the ANC header associated with the text. If you wish to contribute a document anonymously, you can enter "anonymous" in the author field on the upload page.
Once you have ascertained that your document(s) satisfy the criteria for inclusion in the ANC, do the following:
Grant of license
By contributing my document through the ANC web page, I hereby grant to the American National Corpus project a worldwide, perpetual, royalty-free license to use, reformat, reproduce, and distribute, in electronic form or any and all media hereinafter developed, my submission as part of a collection of American English-language material. I understand that the collection will be made available to others for the purposes of linguistic education, research, and development, including commercial development.
(Note that this license does not assign copyright to the ANC.)