The Czech Language

Czech National Corpus

The Czech National Corpus (CNC) is a large repository of machine-readable texts being built at the Faculty of Arts, Charles University in Prague. Its present (1999) size of over 100 million words, which is constantly growing, makes it the largest resource of information about the language and, through the language, about most things reflected in it.

The CNC, which is accessible to the broad academic public at home and abroad, is run with a set of programs that let the user search for linguistic units (words, word forms, parts of words, or collocations) and for their frequency, grammatical, and other characteristics. The balanced, reasonably representative version of the CNC is due for release at the turn of 1999/2000, but provisional use has been open to anyone since 1996. Search results are returned in concordance format, which lets the user study the real contextual use of words and similar units. The concordances obtained can be further processed, sorted, classified, etc. This makes work with language fast and playful rather than the old-time drudgery.
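To make the idea of a concordance concrete, here is a minimal sketch in Python of a keyword-in-context listing. It is only an illustration and not the software the CNC runs on: the function name `concordance`, the `width` parameter, and the sample sentence are all assumptions invented for this example.

```python
import re

def concordance(text, keyword, width=3):
    """Return (left context, hit, right context) for every occurrence of keyword.

    Hypothetical helper for illustration; `width` is the number of
    words kept on each side of the hit.
    """
    tokens = re.findall(r"\w+", text)  # crude word tokenizer
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Invented sample text, not taken from the CNC.
sample = "The corpus is large. A corpus shows how words are used. Every corpus grows."
hits = concordance(sample, "corpus", width=2)

# Concordance lines can be processed further, e.g. sorted by right context.
for left, hit, right in sorted(hits, key=lambda h: h[2].lower()):
    print(f"{left:>15} [{hit}] {right}")
```

Each line shows one occurrence of the search word with a few words of context on either side, and the final loop shows one simple kind of further processing: sorting the lines by what follows the word.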

Alongside the contemporary CNC (100 million words, with more to come), two small corpora, one of Old Czech and one of Spoken Czech, are being built at the same time.

Public Internet access to a small part of the CNC (some 20 million words) is open to anyone.

For access to the full CNC, contact the administrator (http://ucnk.ff.cuni.cz) and ask for special permission, which is granted to anyone for non-commercial purposes.
