
What's New

ANC.org

The American National Corpus now owns the anc.org domain name! Our web address is now www.anc.org.

We would like to thank the Animal News Center for transferring the domain to us. In gratitude, the American National Corpus project has made a donation to the Humane Society of the United States in the name of the "other" ANC.

ANC Tool Update

October 22, 2008: Version 1.2.5 of the ANC Tool is now available. The new version fixes a problem that prevented it from starting on Mac OS X.

July 24, 2008: Version 1.2.3 of the ANC Tool is now available. The new version includes better support for selecting the Unicode character encoding, a few bug fixes, and (experimental) NLTK output.

The Open ANC

The open portion of the ANC (approximately 15 million words of text, with annotations) is now available for download.

2nd Release Frequency Counts

Frequency counts for the second release are now available and can be downloaded here.

New Annotations Available

Both sets of annotations can be downloaded from our annotations page.

Manually Annotated Subcorpus

The ANC, in collaboration with the FrameNet project, WordNet, and Columbia University, has received a grant from the National Science Foundation to produce a balanced sub-corpus of the ANC that is manually annotated for WordNet senses and FrameNet frames, and validated for word and sentence boundaries, part of speech, noun chunks, and verb chunks.

ANC in UIMA

The ANC has been awarded an IBM UIMA Innovation Grant to port the ANC to UIMA and provide information with all ANC annotations that conform to UIMA Type Definitions.

The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. The ANC will provide the most comprehensive picture of American English ever created, and will serve as a resource for education, linguistic and lexicographic research, and technology development.

When completed, the ANC will contain a core corpus of at least 100 million words, comparable across genres to the British National Corpus (BNC). The corpus will also include an "opportunistic" component of potentially several hundreds of millions of words, chosen to provide both the broadest and largest selection of texts (and, where available, annotations) possible.

ANC Status

The ANC has so far released 22 million words of American English, which are available from the Linguistic Data Consortium; please consult the LDC Catalog entry. The ANC has also released an "Open" portion of the full ANC consisting of approximately 15 million words, which is freely available for download. All ANC and OANC data include annotations for word and sentence boundaries, part of speech (4 tagsets), and noun and verb chunks. Parts of the corpus are annotated for additional linguistic features.

Contribute Data and Annotations to the ANC

CONTRIBUTE TEXTS

The ANC is actively soliciting contributions of written texts and spoken transcripts in American English that were produced in or after 1990, to be included in the ANC and OANC.

Those who have developed corpora of post-1989 American English for any purpose are encouraged to contribute their unrestricted data to be included in the ANC. Authors can consult the frequently asked questions page to learn more about how the data will be used and why they should consider contributing their work to the ANC.

CONTRIBUTE ANNOTATIONS AND DERIVED DATA

We also seek annotations for linguistic features of any kind on all or part of the ANC/OANC and linguistic information (word lists, etc.) derived from it, for free distribution and use.

Coming Soon

ANC annotations in Linguistic Annotation Format (LAF/GrAF) developed by ISO TC37 SC4, and a version of the ANC Tool that handles data in this format.

New output options for the ANC Tool, including UIMA.

The first release of the Manually Annotated Sub-Corpus (MASC) is scheduled for the end of 2008. The corpus consists of approximately 120,000 words drawn from the OANC and from data that will be included in the next release of OANC data. The latter data include the publicly available portions of the Language Understanding Corpus that has been annotated by several projects and will be distributed by the LDC. As the annotations become available, about half of the corpus will also carry the following annotations from the work of the Unified Linguistic Annotation (ULA) project: Penn Treebank-style syntactic annotations, PropBank, NomBank, TimeML, and opinion annotations. All annotations, both in-house and contributed, will be in LAF/GrAF format and can therefore be merged or combined using the ANC Tool.

ANC in the News

The ANC has been written up in national newspapers.

Acknowledgements

The American National Corpus project has received support from the ANC Consortium, the TalkBank project, the Department of Chinese, Translation, and Linguistics at the City University of Hong Kong, and the National Science Foundation.

The ANC also acknowledges the following, who have provided software and/or support for ANC development:

GATE