MASC is a balanced subset of 500K words of written texts and transcribed speech drawn primarily from the Open American National Corpus (OANC). The OANC is a 15-million-word (and growing) corpus of American English produced since 1990, all of which is in the public domain or otherwise free of usage and redistribution restrictions.
Where licensing permits, data for inclusion in MASC is drawn from sources that have already been heavily annotated by others. MASC data includes a 50K-word subset of OANC data that has previously been annotated for PropBank predicate argument structures, Pittsburgh Opinion annotation (opinions, evaluations, sentiments, etc.), TimeML time and events, and several other linguistic phenomena. It also includes about 5K words from the 10K-word Language Understanding (LU) Corpus, which has been annotated by multiple groups for a wide variety of phenomena, including events and committed belief. Finally, it includes 5.5K words of Wall Street Journal texts that have been annotated by several projects, including Penn Treebank, PropBank, Penn Discourse Treebank, TimeML, and the Pittsburgh Opinion project.
Genre | No. files | No. words | Pct corpus |
Court transcript | 2 | 30052 | 6% |
Debate transcript | 2 | 32325 | 6% |
Email | 78 | 27642 | 6% |
Essay | 7 | 25590 | 5% |
Fiction | 5 | 31518 | 6% |
Gov’t documents | 5 | 24578 | 5% |
Journal | 10 | 25635 | 5% |
Letters | 40 | 23325 | 5% |
Newspaper | 41 | 23545 | 5% |
Non-fiction | 4 | 25182 | 5% |
Spoken | 11 | 25783 | 5% |
Technical | 8 | 27895 | 6% |
Travel guides | 7 | 26708 | 5% |
Twitter | 2 | 24180 | 5% |
Blog | 21 | 28199 | 6% |
Ficlets | 5 | 26299 | 5% |
Movie script | 2 | 28240 | 6% |
Spam | 110 | 23490 | 5% |
Jokes | 16 | 26582 | 5% |
TOTAL | 376 | 506768 |
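The figures in the table can be cross-checked with a few lines of Python. The sketch below is illustrative only: the names `ROWS` and `mismatched_rows` are hypothetical, and the rows are copied from the table in order as (files, words, listed percentage). It sums the file and word counts and recomputes each rounded percentage against a chosen denominator.

```python
# Each tuple is (number of files, number of words, listed percentage),
# copied from the genre table in row order.
ROWS = [
    (2, 30052, 6), (2, 32325, 6), (78, 27642, 6), (7, 25590, 5),
    (5, 31518, 6), (5, 24578, 5), (10, 25635, 5), (40, 23325, 5),
    (41, 23545, 5), (4, 25182, 5), (11, 25783, 5), (8, 27895, 6),
    (7, 26708, 5), (2, 24180, 5), (21, 28199, 6), (5, 26299, 5),
    (2, 28240, 6), (110, 23490, 5), (16, 26582, 5),
]

total_files = sum(files for files, _, _ in ROWS)
total_words = sum(words for _, words, _ in ROWS)


def mismatched_rows(rows, denominator):
    """Return rows whose listed percentage differs from the rounded share of `denominator`."""
    return [
        (files, words, pct)
        for files, words, pct in rows
        if round(100 * words / denominator) != pct
    ]


print(total_files)                          # 376
print(total_words)                          # 506768
print(mismatched_rows(ROWS, total_words))   # [(78, 27642, 6)]
print(mismatched_rows(ROWS, 500_000))       # []
```

The file and word totals agree with the TOTAL row. The listed percentages all match when computed against the nominal 500K figure; against the actual total of 506768 words, the 27642-word row works out to about 5.45%, which would round to 5% rather than the listed 6%. This suggests, though the source does not say so, that the percentages were calculated relative to the 500K target size.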
The entire corpus is annotated and manually validated for logical structure (headings, sections, paragraphs, etc.), sentence boundaries, three different tokenizations with associated part-of-speech tags, shallow parse (noun and verb chunks), named entities (person, location, organization, date and time), and Penn Treebank syntax. Portions of the corpus are also annotated for FrameNet frames, PropBank semantic roles, MPQA opinion, committed belief, events, and ISO-TimeML. Co-reference annotations and clause boundaries, with some discourse annotations, for the entire MASC corpus will be released by the end of 2013. Several additional types of annotation have either been contracted by the MASC project or contributed from other sources. The MASC distribution also includes WordNet sense annotations for all occurrences of 114 words, as well as FrameNet annotations for 50-100 occurrences of each of those words. The sentences with WordNet and FrameNet annotations are also distributed as part of the MASC Sentence Corpus; see the MASC Sentence Corpus page for a description of that data.