CORPUS LINGUISTICS AT THE PRESENT STAGE - Student Scientific Forum

XIII International Student Scientific Conference "Student Scientific Forum - 2021"

CORPUS LINGUISTICS AT THE PRESENT STAGE

Voilokova K.P. 1
1 Vladimir State University named after Alexander and Nikolay Stoletovs

The study of language corpora, which began in the middle of the twentieth century, led to the formation of corpus linguistics as a separate branch of language science. V. A. Plungyan, Academician of the Russian Academy of Sciences and Head of the Department of Theoretical and Applied Linguistics at Lomonosov Moscow State University, describes this branch as "rapid" and "super-modern" [3, p. 9]. Corpus linguistics has great research potential, but approaches to the use of corpora differ in Russia and abroad.

A survey of how corpus linguistics is treated in the Russian and foreign scientific online space, based on platforms such as Elibrary, ResearchGate and Google Scholar, shows that language corpora are considered:

as one of the areas of general linguistics, within corpus linguistics as a science;

as one of the elements of a strategy for teaching a professionally oriented foreign language;

as a basis for empirical research in lexicography and translation theory.

As this overview shows, corpus linguistics is applicable in various fields of knowledge. However, in terms of how widely it is used, Western researchers turn to corpus tools much more often than their Russian colleagues. For example, for the query "corpus linguistics" entered in Russian, the Elibrary platform returns only 2,323 articles. If the query is changed to its English-language equivalent, the number of articles rises to 8,517, that is, almost 4 times as many. Google Scholar shows a similar picture, differing only in the numbers: the Russian-language query yields 15,400 mentions, and the English-language query 69 times more, that is, more than 1,000,000 matches. Thus, Russian corpus studies are far fewer than foreign ones. In our country, there are still few dedicated studies of the use of the latest technologies in linguistics.
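The ratios quoted in this comparison can be checked with a few lines of Python. This is only a sketch that restates the counts reported in the text; nothing is re-queried from the platforms, and the variable names are illustrative.

```python
# Hit counts as reported in the text (not re-queried from the platforms).
elibrary_ru = 2_323     # Russian-language query on Elibrary
elibrary_en = 8_517     # English-language query on Elibrary
scholar_ru = 15_400     # Russian-language query on Google Scholar
scholar_factor = 69     # reported multiplier for the English-language query

# How many times more articles the English-language query returns on Elibrary.
elibrary_ratio = elibrary_en / elibrary_ru

# Number of Google Scholar matches implied by the reported multiplier.
scholar_en_estimate = scholar_ru * scholar_factor

print(f"Elibrary ratio: {elibrary_ratio:.2f}")
print(f"Implied Google Scholar count: {scholar_en_estimate:,}")
```

The Elibrary ratio comes out at roughly 3.7, which matches "almost 4 times", and the implied Google Scholar count is a little over one million.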

Such research is conducted, for example, by scientists at the Research Laboratory "Intelligent Text Management Technologies" of Kazan Federal University. Authors who have published on this topic include A. S. Kiselnikov, E. V. Kharkova, O. S. Safonkina and E. V. Varlamova.

V. P. Zakharov in his book "Corpus Linguistics" notes that the key element of this scientific discipline is the language corpus, which is understood as "a multi-faceted collection of natural cases of language use in the form of texts of different genre and stylistic orientation and stored in electronic format" [4, p. 8]. Its main purpose is to provide reliable information about word usage and to make it possible to find lexical units and grammatical structures through linguistic markup.

It should be said that corpus linguistics in the West began to develop 40 years earlier than in Russia. Its development in both cases follows similar principles and strategies, but there are some points of difference. The first is the practice of constructing linguistic corpora, in terms of their number and diversity.

Western linguists have a wealth of experience in creating language corpora. In particular, the first known electronic corpus, the "Brown Corpus" of H. Kučera and W. N. Francis, was created in the middle of the twentieth century. Later, the first corpus-based dictionaries began to be published, such as the well-known Collins COBUILD line (Collins Birmingham University International Language Database), which includes separate dictionaries dedicated to various language elements: phraseological units, metaphors, prepositions, homophones, quantifiers, etc. Progress in corpus technologies and the development of new classifications and parameters allowed linguists to create corpora of varieties of English, such as the Wellington corpus (New Zealand English), the Kolhapur corpus (Indian English), etc. [2, p. 46].

Russian linguistics lags somewhat behind its foreign counterparts in corpus design. The first, and to date the only completed, corpus of Russian-language texts is the Uppsala corpus. It was created at the end of the twentieth century and is currently little used. Because this corpus cannot fully meet modern requirements, owing to its limited size and lack of linguistic markup (morphological, syntactic, semantic markers), Russian linguists began work in 2003 on a new project called the "National Corpus of the Russian Language" (NCRL), which is still actively under way [5, p. 85].

It should be noted that most linguistic studies are still conducted on texts from English-language corpora. Perhaps the reason is that the heyday of British linguistics fell in the 1960s-1980s, i.e. at the time when the first corpora of English were created. In addition, the active development of information technologies in the UK and the USA may have acted as a catalyst for progress in creating electronic corpora [2, p. 44]. Among the best-known English-language corpora are:

The British National Corpus (BNC). The BNC is, in effect, a model for all modern national corpora. It includes more than 100 million words and is equipped with metatextual and morphological markup. The ratio between written and spoken language, represented by texts from the media, school students' papers, scientific articles and other texts of different genres, is heavily skewed: 90% to 10%. The BNC reflects the state of British English at the end of the twentieth century. Searches for lexical and grammatical constructions (phrases, word forms, etc.) are carried out using the XAIRA corpus manager, which also provides information about the sources of sample texts and data on the frequency of particular collocations. Online, only limited access to the corpus is possible (50 random examples in the results output), since the full version, distributed on DVD, is paid.

The American National Corpus (ANC) was created as an analogue of the BNC. To date, it includes 22 million words. As with most corpora, the full version of the ANC is paid; however, 68% of the words are freely available online. Unlike the BNC, this corpus does not have its own search interface, so searches are carried out with general-purpose corpus managers that are not tied to one specific corpus. The corpus is provided with metatextual, part-of-speech and partial syntactic markup. The ANC also has so-called named-entity markup, which covers proper names, names of organizations and geographical objects [5].
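What a corpus manager does when it returns examples of a word in context can be illustrated with a small keyword-in-context (KWIC) search. The sketch below is not XAIRA or any real corpus manager: the function names `tokenize` and `kwic`, the toy sentences and the window size are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# Toy "corpus" invented for this example.
sample = (
    "The corpus contains written texts. "
    "A spoken corpus is harder to collect. "
    "Every corpus needs markup."
)

tokens = tokenize(sample)
for left, word, right in kwic(tokens, "corpus"):
    print(f"{left:>20} | {word} | {right}")
```

For simplicity the context window here ignores sentence boundaries and works on raw word forms. A real corpus manager would typically search over annotated text, so that a query can also match a lemma or a part-of-speech tag.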

The structure of a language corpus is directly related to its functionality, as well as to the scope and purpose of its application. Therefore, in order to study and analyze a particular language subcategory (syntax, stylistic features, etc.), corpus creators should gather the largest possible collection of texts related to this subcategory. Thus, we can conditionally call the corpus a "representative reduced model of a language or sublanguage" [4, p. 18].

The representativeness of a corpus is understood as its ability to reflect all the properties of the subject domain and is expressed in a certain statistical assessment of their number. It is this characteristic that helps to determine the reliability of the facts obtained from the corpus. By the criterion of representativeness, and therefore by the type of structural content, corpora are divided into:

Type 1 corpora, which are all-encompassing and represent the entire variety of speech activity. At the moment, there are no type 1 corpora in pure form, since language is so multi-faceted a phenomenon that it is impossible to calculate absolutely all its properties and categories using mathematical methods. Even national general-language corpora cannot include all the many uses of the language, but at the design stage this category of corpora aims to be as representative as possible in comparison with other types. As an example, we can cite the "Brown Corpus", which by the standards of its time was quite representative. In its structure, this corpus had 15 style registers, each of which was represented by up to 80 text samples. The genres in this corpus include fiction, popular science, biographical and religious texts, government documents, media reports, etc.

Type 2 corpora, which are created for special purposes, most often refer to a certain type of discourse and reflect various linguistic and cultural phenomena in the process of communication. The criterion of representativeness in this case is the most objective possible representation of whatever phenomenon interests the users of the corpus. For example, a corpus of English-language proverbs that reflects how native speakers of a certain period and geographical region use them in speech will not be relevant to a study of English political metaphors.

The attitude of linguists to the corpus approach as an independent science is also important. In this regard, the views of Russian scientists and their foreign colleagues are similar. It is well known that when the first corpora were being created abroad, the generative approach was actively developing in linguistics. Its founder, N. Chomsky, argued that the basic information about the structure of syntax is embedded in the human mind from birth and can therefore be applied to the acquisition of absolutely any language. In his opinion, the main component of language learning is human intuition, and, accordingly, incorrect speech constructions do not exist a priori. This theory has been widely discussed and criticized. By the end of the twentieth century, Western linguists had concluded that a relevant dictionary and grammar can be created only on the basis of a representative collection of texts with many examples of actual language use. Russian scientists, in particular V. A. Plungyan, one of the leaders of the project to create the "National Corpus of the Russian Language", hold the same view. He says that the corpus is necessary for researchers engaged in systematizing facts about the language under analysis, as well as for teaching purposes, since it speeds up the mastering of language competencies.

Although corpus linguistics is not yet a fully explored field within Russian linguistics, the interest of the Russian scientific community in it grows every year, because it opens up promising prospects for linguistics. First, it is a new view of discourse as a real, not a fictitious, element of communication. The ideology of corpus linguistics rests on working not with artificially created texts but with examples of living language use.

Second, it is an emphasis on the quantitative analysis of language, namely, the study of the elements most often used in speech [6, p. 84].

Third, it is work with the language within both synchronic and diachronic approaches.
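The quantitative emphasis mentioned above can be illustrated with a frequency list, the most basic corpus statistic. This is a minimal sketch using only the Python standard library; the sample text and the helper name `frequency_list` are invented for the example, and real corpus tools usually count lemmas rather than raw word forms.

```python
import re
from collections import Counter


def frequency_list(text, top_n=3):
    """Count word forms and return the top_n as (word, count, share of all tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, n, n / total) for word, n in counts.most_common(top_n)]


# Toy text invented for this example.
sample = "the cat sat on the mat and the dog sat on the rug"

for word, n, share in frequency_list(sample):
    print(f"{word:<5} {n:>2} {share:.1%}")
```

Reporting the share of all tokens alongside the raw count is what makes figures from corpora of different sizes comparable.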

Language corpora can be used in various scientific fields, such as:

Lexicography. Corpora serve as the basis for a large number of dictionaries, not only on paper but also online, for example the Reverso Context dictionary, which is of particular value for translators and teachers of foreign languages. In this dictionary, the meaning of a lexical unit is established in context by comparing two texts in different languages. In contrast to a conventional dictionary, which records already fixed language norms, a corpus can provide information about the current status of a word and its functioning in speech;

Linguodidactics. In foreign language teaching, corpora are regarded as the most relevant sources of language material, which, moreover, can be constantly updated as the corpora themselves are modernized. The need to use corpora in language classes is also due to the fact that the information obtained from them (for example, the frequency of use of certain lexical and grammatical phenomena) helps in determining the content of learning;

Translation studies. In translation teaching, parallel corpora make it possible to see certain patterns and linguistic regularities in the original text and its translation. Comparable corpora, in turn, bring together texts in different languages that share the same communicative purpose;

Study of the lexical and grammatical structure of the language. Here corpora make it possible to track the appearance of neologisms, the compatibility of certain grammatical constructions, the processes of denomination, etc.

In most cases, the popularity of corpora is explained by the possibility of studying language patterns on the material of a large database of texts, processed and presented as an electronic platform. Moreover, a corpus, as a rule, is not an analogue of a standard electronic library catalog, as it allows the researcher to search for individual fragments of texts according to special parameters and criteria of their own choosing. The structure of the corpus makes it possible to consider the language from different sides, highlighting certain patterns and formulating new linguistic regularities. Corpus studies are also characterized by the evaluation of speech samples in the context of real language use. In addition, an array of language data built on the principles of corpus linguistics allows linguistic studies to be repeated and their results verified [11, p. 64].

The relevance of corpus linguistics as a new direction in science is not in doubt and is not disputed by modern scientists. It has emerged relatively recently and is the result of a synthesis of knowledge from various fields of linguistics. For example, within comparative historical linguistics, corpus linguistics uses technologies for the reconstruction of ancient languages for linguistic analysis. Text corpora can also be used as empirical and illustrative material for various lexical and grammatical phenomena. Sociolinguistics uses corpus criteria to create manuals for the study of language varieties. In addition, corpus linguistics makes it possible to find common ground between the humanities and the technical sciences.

Thus, the creation of corpora has become a revolution in the field of discourse analysis, making it possible to perform countless operations on text in seconds, such as splitting text into fragments according to the necessary criteria and tagging and annotating them. In this regard, the corpus allows the language to be examined objectively and quickly, as it really is, on actual, "live" examples.

REFERENCES

1. Vadyaev S. E. Electronic lexicography and corpus linguistics // Aspects of the formation and functioning of West Germanic languages. - Samara, 2003. - pp. 83-92.

2. Volosnova Yu. A. Corpus linguistics: problems and prospects // Bulletin of the Moscow State Forest University - Forest Bulletin. - Moscow, 2006. - No. 7. - pp. 43-49.

3. Grudeva E. V. How corpus linguistics changed our ideas about language // Materials of the XIII visiting school-seminar "Problems of speech generation and perception". - Cherepovets, 2015. - pp. 108-115.

4. Zakharov V. P., Bogdanova S. Yu. Corpus linguistics: a textbook for students of humanitarian universities. - Irkutsk: IGLU, 2011. - 161 p.

5. Izotov A. I. New directions of Slavic linguistics: corpus linguistics // Language, consciousness, communication. - Moscow, 2015. - pp. 82-93.

6. Karamnov A. S. Quantitative assessment of the repeatability and complexity of vocabulary in the corpus of the English textbook // Questions of theory and practice. - 2014. - No. 06 (36). - pp. 82-86.
