Datalinks Wiki
Web 1T 5-gram Version 1

Type: Dataset

Link: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13

Source: Ckan.net

This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. We expect this data will be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
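As a rough illustration of how n-gram counts feed a statistical language model, the sketch below estimates an unsmoothed (maximum-likelihood) conditional probability P(w2 | w1) = count(w1 w2) / count(w1). It assumes each record is one line of space-separated words, a tab, and an integer count; this layout is an assumption and should be checked against the documentation shipped with the corpus. The function names and the toy counts are made up for the example and are not taken from the data set.

```python
from collections import Counter


def parse_ngram_line(line):
    """Split one 'w1 w2 ... wn<TAB>count' line into (tuple_of_words, count).

    The tab-separated layout is an assumption, not confirmed by this page.
    """
    ngram, count = line.rstrip("\n").rsplit("\t", 1)
    return tuple(ngram.split(" ")), int(count)


def load_counts(lines):
    """Accumulate n-gram counts from an iterable of text lines."""
    counts = Counter()
    for line in lines:
        words, count = parse_ngram_line(line)
        counts[words] += count
    return counts


def mle_conditional(ngram_counts, context_counts, ngram):
    """Unsmoothed estimate of P(last word | preceding words)."""
    context = ngram[:-1]
    denom = context_counts.get(context, 0)
    if denom == 0:
        return 0.0  # unseen context: no estimate without smoothing
    return ngram_counts.get(ngram, 0) / denom


# Toy counts, invented for illustration only.
unigram_lines = ["new\t1000\n", "york\t400\n"]
bigram_lines = ["new york\t250\n", "new car\t50\n"]

unigrams = load_counts(unigram_lines)
bigrams = load_counts(bigram_lines)

p = mle_conditional(bigrams, unigrams, ("new", "york"))
print(p)
```

In practice the same functions could be fed lines from `gzip.open(path, "rt", encoding="utf-8")`, since the files are distributed gzip-compressed; a usable model would also need smoothing or backoff for n-grams that do not appear in the counts.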

File sizes: approx. 24 GB compressed (gzip'ed) text files

  Number of tokens:    1,024,908,267,229
  Number of sentences:    95,119,665,584
  Number of unigrams:         13,588,391
  Number of bigrams:         314,843,401
  Number of trigrams:        977,069,902
  Number of fourgrams:     1,313,818,354
  Number of fivegrams:     1,176,470,663
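
The figures above support a couple of quick sanity checks, sketched here: the average number of tokens per sentence and the total number of distinct n-gram entries across all five orders. The dictionary keys and variable names are chosen for this example only; the numbers are copied from the list above.

```python
# Corpus statistics as listed on this page.
tokens = 1_024_908_267_229
sentences = 95_119_665_584

ngram_entries = {
    "unigrams": 13_588_391,
    "bigrams": 314_843_401,
    "trigrams": 977_069_902,
    "fourgrams": 1_313_818_354,
    "fivegrams": 1_176_470_663,
}

# Average sentence length in tokens.
avg_tokens_per_sentence = tokens / sentences

# Total number of n-gram entries over all orders.
total_entries = sum(ngram_entries.values())

print(f"average tokens per sentence: {avg_tokens_per_sentence:.2f}")
print(f"total n-gram entries: {total_entries:,}")
```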