Introduction
Modern artificial intelligence (AI) technologies, and specifically neural network-based language models (large language models, LLMs), are becoming increasingly integral to the processing of textual information. Their deployment across various industries facilitates the automation of complex data analysis tasks, enhancing operational efficiency, accelerating workflows, and reducing the costs associated with information processing. Currently, language models are applied across a diverse array of domains, ranging from automated translation and the processing of legal documents to medical analytics and journalism. Despite the rapid development of this technology, several challenges remain concerning the accuracy, reliability, and objectivity of the generated outputs [1].
One of the primary issues when using neural network models is their vulnerability to errors, which stem both from limitations in training datasets and from the inherent architecture of the models themselves. Unlike traditional information retrieval algorithms, which rely on strictly structured databases, language models operate by making probabilistic predictions. This means that their responses are not always based on verifiable facts, and instead, they may represent generalized or even inaccurate interpretations. Moreover, these models are known to occasionally produce what are called "hallucinations": outputs that appear plausible but are factually unfounded. This phenomenon is particularly problematic in fields like law and medicine, where errors can lead to significant real-world consequences [2–3]. Equally critical is the challenge of contextual interpretation. While modern models are trained on vast corpora of text, they still struggle with understanding subtle nuances, ambiguous phrasing, and specialized terminology. For example, in medical texts, the same word may have different meanings depending on the context (e.g., "crisis" in endocrinology versus cardiology). In legal documents, it is crucial to account for the interrelationships between various legal norms, which requires deep semantic processing. Similarly, journalistic content often features intricate narrative structures, where models may lose the coherence of the story or omit key details.
Another significant issue is the bias embedded in training datasets. Most neural network models are trained on publicly available data, which makes them susceptible to the social, political, and cultural biases present in these datasets. This bias can manifest in several ways, such as misrepresenting information, reinforcing particular viewpoints, or even perpetuating stereotypes. In the medical domain, for instance, this bias may result in models being less effective in analyzing rare diseases because they were predominantly trained on data related to more common conditions. In the legal domain, biases in training data can cause models to favor particular outcomes, either supporting the prosecution or the defense, depending on the nature of the training data [4].
Given these challenges, the study of the reliability and accuracy of language models remains an essential and timely issue, especially with their increasing use in critical domains. The present paper seeks to analyze the performance accuracy of the neural network models GigaChat, DeepSeek, and ChatGPT across various thematic categories. Specifically, we will investigate the most common types of errors encountered in linguistic analysis, historical interpretation, journalistic content, medical recommendations, and legal conclusions. In addition, a comparative analysis will be conducted to assess the models’ resilience to hallucinations, information distortions, and incomplete queries [5].
The purpose of the study is to analyze the performance and accuracy of neural network models in various fields by evaluating three models (GigaChat, DeepSeek, and ChatGPT) across five thematic categories: linguistics and logic, history, journalism, medicine, and law.
Materials and methods of research
The study utilized a comprehensive methodology that integrated both quantitative and qualitative analyses to evaluate the performance of neural network models. The primary objective was to assess the accuracy of information interpretation, the models’ resilience to external factors such as noise and imprecise formulations, and their adaptability to complex or non-standard texts. To evaluate the accuracy of the models, a scoring system was employed, where each prediction was rated on a scale from 0 to 1. Experts assigned scores based on several key criteria: the factual accuracy of the response, ensuring it corresponded with verified data; the logical consistency of the model’s reasoning, ensuring there were no internal contradictions; the completeness of the answer, confirming that all aspects of the question were addressed; and the linguistic accuracy, which involved checking for grammatical and stylistic errors. The results were then averaged and presented as comparative graphs.
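As a minimal illustration of this scoring scheme, the sketch below (Python) shows how per-criterion expert ratings could be averaged into a single 0-to-1 score per answer and then into a per-category score. The criterion field names and the example values are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical criterion names mirroring the four evaluation criteria:
# factual accuracy, logical consistency, completeness, linguistic accuracy.
CRITERIA = ("factual", "logical", "completeness", "linguistic")

@dataclass
class ExpertRating:
    factual: float       # agreement with verified data, 0..1
    logical: float       # absence of internal contradictions, 0..1
    completeness: float  # all aspects of the question covered, 0..1
    linguistic: float    # grammatical and stylistic correctness, 0..1

    def score(self) -> float:
        """Unweighted mean of the four criteria: one 0..1 score per answer."""
        return mean(getattr(self, c) for c in CRITERIA)

def category_score(ratings: list[ExpertRating]) -> float:
    """Average the per-answer scores within one thematic category."""
    return mean(r.score() for r in ratings)

# Example: two expert-rated answers in one category (illustrative values).
ratings = [ExpertRating(1.0, 0.8, 1.0, 1.0), ExpertRating(0.6, 1.0, 0.8, 1.0)]
print(round(category_score(ratings), 2))  # 0.9
```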
The errors made by the models were classified into several types. Factual errors occurred when the model generated false information or provided responses that were inconsistent with real data. In some cases, the models distorted the context, losing logical coherence between sentences. The models were also found to be sensitive to noise, meaning their accuracy decreased when dealing with typos, low-quality scans, or complex formulations. Additionally, there were instances of misunderstanding terminology, which is especially problematic in fields such as medicine and law, where precise terms are crucial for accurate interpretation. To assess the models’ resilience to variable conditions, a "noise test" method was applied, where the same questions were presented with altered formulations to evaluate the models' ability to maintain accuracy under different conditions. This allowed for a more thorough examination of how the models respond to different types of input and how well they adapt to changes in the data presented.
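A minimal sketch of how such a noise test could be automated is given below (Python). The `ask` and `rate_answer` callables stand in for a model API call and the expert or automatic scoring step; they, the character-swap perturbation, and the perturbation rate are illustrative assumptions rather than the exact procedure used in the study.

```python
import random

def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Corrupt a question by swapping adjacent letters at the given rate,
    simulating typos or low-quality scans (one possible noise model)."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def noise_test(question: str, paraphrases: list[str], ask, rate_answer) -> dict[str, float]:
    """Pose the clean question, a typo-corrupted variant, and each paraphrase
    to a model (`ask`) and score every answer with `rate_answer` (0..1).
    The drop relative to the clean prompt indicates sensitivity to noise."""
    variants = {"clean": question, "typos": add_typos(question)}
    variants.update({f"paraphrase_{i}": p for i, p in enumerate(paraphrases, 1)})
    return {name: rate_answer(ask(prompt)) for name, prompt in variants.items()}
```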
Results of the research and discussions
Topic / Neural network | GigaChat | DeepSeek | ChatGPT
Linguistics & Logic    | 0.8      | 0.7      | 1.0
History                | 0.9      | 0.8      | 1.0
Journalism & News      | 0.6      | 0.8      | 0.9
Medicine               | 0.7      | 0.9      | 1.0
Law                    | 0.9      | 0.9      | 1.0
Overall Score          | 3.9      | 4.1      | 4.9
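For reference, the heatmap analyzed below can be rebuilt directly from the scores in the table; a minimal matplotlib sketch is given here (illustrative only: the colour map and layout are assumptions and may differ from the original figure).

```python
import matplotlib.pyplot as plt
import numpy as np

# Accuracy scores from the table above (rows: topics, columns: models).
topics = ["Linguistics & Logic", "History", "Journalism & News", "Medicine", "Law"]
models = ["GigaChat", "DeepSeek", "ChatGPT"]
scores = np.array([
    [0.8, 0.7, 1.0],
    [0.9, 0.8, 1.0],
    [0.6, 0.8, 0.9],
    [0.7, 0.9, 1.0],
    [0.9, 0.9, 1.0],
])

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(scores, cmap="Reds", vmin=0.0, vmax=1.0)  # brighter red = higher accuracy
ax.set_xticks(range(len(models)))
ax.set_xticklabels(models)
ax.set_yticks(range(len(topics)))
ax.set_yticklabels(topics)
for i in range(len(topics)):
    for j in range(len(models)):
        ax.text(j, i, f"{scores[i, j]:.2f}", ha="center", va="center")
fig.colorbar(im, label="Accuracy")
fig.tight_layout()
plt.show()

# Column sums reproduce the "Overall Score" row of the table.
overall = {m: float(s) for m, s in zip(models, scores.sum(axis=0).round(1))}
print(overall)  # {'GigaChat': 3.9, 'DeepSeek': 4.1, 'ChatGPT': 4.9}
```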
Analysis of the heatmap indicates that ChatGPT exhibits the highest performance across all considered topics, as reflected in the bright red shades indicating high accuracy in each category. Its success is particularly evident in the fields of medicine and law, where data processing accuracy reaches 1.00, highlighting the model's exceptional ability to accurately interpret and apply knowledge in these critically important domains. DeepSeek consistently occupies a middle position, showing good results, especially in more complex topics such as medicine and law, where its scores reach 0.90. This suggests that DeepSeek demonstrates strong, albeit not maximal, accuracy, making it suitable for various tasks but less effective than ChatGPT overall. GigaChat, on the other hand, shows the lowest accuracy scores, particularly in journalism and medicine, where its scores do not exceed 0.60–0.70. These results highlight its insufficient contextual robustness and difficulties with complex formulations, limiting its applicability in text analysis requiring a high degree of interpretation and contextual awareness. This also indicates specific constraints in processing non-standard formulations and intricate texts, making GigaChat less universal than the other neural networks examined.
For a deeper analysis of test results, it is essential to consider the overall accuracy of the models. The diagram displays the average scores of the models obtained from the testing process. ChatGPT outperforms competitors, particularly in complex topics requiring precise contextual understanding, whereas GigaChat exhibits the least stability (Figure 2).
The most significant performance gaps are found in journalism and news, where the models demonstrate varying levels of accuracy and adaptability to contextual shifts. GigaChat and DeepSeek have the lowest scores in this domain, 0.60 and 0.80 respectively, which may indicate these models' insufficient capability to accurately interpret highly complex and variable texts. This is particularly important for journalistic content, where texts may be ambiguous and contain slang, metaphors, and other elements requiring a high level of contextual understanding. In contrast, ChatGPT significantly outperforms its competitors in this area, achieving scores approaching 1.00, which suggests that ChatGPT has superior capabilities for analyzing journalistic texts, including handling polysemous and contextually rich expressions. The model likely employs advanced natural language processing techniques that allow it to consider various contexts and subtle nuances in information, making it more effective for analyzing dynamic and complex texts. In other fields, such as medicine and law, the performance gap between the models is smaller, with scores ranging between 0.70 and 1.00. These fields generally require stricter adherence to formalities and precision in data interpretation, which explains the more uniform performance across all models.
Overall, the graph demonstrates that models are not universal, and their efficiency strongly depends on the topic. This underscores the necessity of selecting the appropriate neural network based on the specific task at hand, as well as highlights the importance of improving algorithms for more effective processing of different data types.
For a more comprehensive evaluation of the neural networks, it is essential to assess their resilience to various data distortions. The graph illustrates the level of resistance of the models to typos, incorrect formulations, and noisy data. ChatGPT demonstrates the best results, while DeepSeek and GigaChat are more susceptible to disruptions (Figure 3).
Figure 4. Diagram displaying the average error rate of neural networks across different domains.
The error analysis diagram shows that the highest number of errors is observed in journalism and history, which is associated with the need for comprehensive source analysis and the ability to account for subtle contextual nuances.
Conclusion
The analysis revealed that the primary issue across all models is the loss of context, particularly in long and polysemous texts. This leads to incorrect conclusions, as the algorithms fail to properly connect information from different parts of the document. This limitation is especially critical in domains like law and medicine, where a single term or phrase can dramatically alter the meaning of a passage. Another significant challenge is the high sensitivity to noise: typos, low-quality scans, and complex sentence structures often result in misinterpretations. Additionally, the generation of plausible yet factually incorrect data emerged as a major issue, which is especially problematic in fields where accuracy is paramount.
Among the models analyzed, ChatGPT demonstrates the best accuracy across most of the studied topics, particularly in medical and legal texts, where precise formulations and terminology are essential. DeepSeek maintains a stable performance but does not reach maximum accuracy, while GigaChat proves to be less robust in handling context and non-standard formulations, resulting in a higher error rate.
Future research directions include the development of hybrid neural network systems, combining the strengths of different models, as well as enhancing explainability algorithms. Improving model adaptability, reducing sensitivity to noise, and advancing contextual processing remain central challenges for further progress in the field of automated document analysis.
References
1. Belozus R.O., Chernyatina Y.A. Application of Neural Networks in Text Data Processing Tasks // Scientific and Technical Bulletin of Information Technologies, Mechanics, and Optics. 2008. No. 46. P. 28–33.
2. Alpatov A.N. Reasoning Models in Artificial Intelligence // Scientific Revolutions as a Key Factor in the Development of Science and Technology: Proceedings of the National (All-Russian) Scientific and Practical Conference with International Participation. Ufa, 2024. P. 22–29.
3. Nazarov N.A., Tolcheev V.O. Analysis of WOS Samples Using Neural Network Classifiers // The 21st National Conference on Artificial Intelligence with International Participation (KII-2023): Conference Proceedings. In 2 Volumes. Smolensk, 2023. P. 201–210.
4. Surkov E.V. Comparative Analysis of Feedforward Neural Networks and Recurrent Neural Networks // Energy-efficient Engineering Systems, LNG Technologies: Abstracts of the XIII Congress of Young Scientists of ITMO University. St. Petersburg, 2024. P. 86.
5. Udilov N.S. Neural Networks: A Historical Overview of Their Development, Formation, and Characteristics Compared to Biological Networks // Trends in the Development of Science and Education. 2024. No. 106-11. P. 119–125.