Imagine that you have a collection of short comments. Is it possible to tell what emotion each one conveys: joy, surprise, anger? I say it is. There is a whole field of computational linguistics, sentiment analysis, devoted to studying opinions and emotions in text. In our task we will classify messages as positive or negative. There are many ways to solve this problem; in this article I will describe how I approached it.
Why is sentiment analysis important?
Nearly 80% of the world's digital data is unstructured, and data obtained from social media is no exception. Since the information is not organized in any predefined way, it is difficult to sort and analyze. Fortunately, thanks to developments in machine learning, it is now possible to create models that learn from examples and can be used to process and organize text data.
I will use supervised machine learning to analyze the sentiment of the text. Let's see how this is done.
We can define four main steps in this process:
1. Data collection
2. Data preparation
3. Selecting and building a sentiment analysis model
4. Testing the resulting model
Let's consider all the steps in order.
1. Data collection
To analyze the sentiment of messages I need a dataset on which to train a machine learning model (a classifier). I could have collected such a dataset myself, but I used a ready-made corpus built from Russian-language posts on the microblogging platform Twitter. The corpus consists of positive and negative entries.
2. Data preparation
Once you have the tweets you need for sentiment analysis, it's time to prepare the data. As mentioned earlier, social media data is unstructured: it is raw and noisy, and it needs to be cleaned before we can start working on the model. This step matters because cleaner data leads to more reliable results.
Preprocessing a dataset involves a number of tasks. The most important is removing unnecessary information: emoticons, special characters, extra spaces, words that carry no emotional coloring (such as "for example"), proper names, and some prepositions. Preprocessing may also include normalizing the text format and removing duplicate messages.
To implement this step, I wrote a program that processed the text and removed all links, hashtags, and usernames, as well as stop words.
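A minimal sketch of this kind of cleaning is shown below. The regular expressions, the tiny stop-word list, and the function name clean_tweet are assumptions made for illustration; the actual program may have used different rules.

```python
import re

# Illustrative patterns (assumptions, not the original program's rules)
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Anything that is not a Latin or Cyrillic lowercase letter or whitespace
NON_LETTER_RE = re.compile(r"[^a-zа-яё\s]")

# A tiny hypothetical stop-word list; a real one would be much longer
STOP_WORDS = {"и", "в", "на", "например"}


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip links, usernames, hashtags,
    punctuation, emoticons, extra spaces, and stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    # split() with no arguments also collapses runs of whitespace
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


cleaned = clean_tweet("@anna Отличный день в парке!!! :) https://t.co/abc #весна")
print(cleaned)  # отличный день парке
```

Links are removed before punctuation so that a URL is dropped as a whole instead of being broken into word-like fragments.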
Next, I split the dataset into two samples in a 4:1 ratio: the first for training the classifier and the second for testing it.
3. Selecting and building a sentiment analysis model
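A 4:1 split can be sketched in a few lines. The helper name split_4_to_1, the fixed seed, and the toy data are assumptions for illustration; scikit-learn's train_test_split with test_size=0.2 produces a split in the same proportion.

```python
import random


def split_4_to_1(samples, seed=42):
    """Shuffle the samples and return (train, test) in a 4:1 ratio."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


# Toy labeled data: (text, label) pairs, 1 = positive, 0 = negative
data = [(f"message {i}", i % 2) for i in range(10)]
train, test = split_4_to_1(data)
print(len(train), len(test))  # 8 2
```

Shuffling before cutting matters when the corpus is stored with all positive entries first and all negative entries after them; otherwise the test sample could contain only one class.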
This is the most important and most extensive step in the analysis. First, each document in the corpus has to be represented as a vector. Put simply, a computer cannot work with natural language directly, so we must convert each document into a numeric form it can process.
The most common way to represent a document in computational linguistics and information retrieval is as a bag of words or as a set of N-grams.
For example, the sentence "I study at the University" can be represented as a set of unigrams (I, study, at, the, University) or as a set of bigrams (I study, study at, at the, the University).
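The same example can be reproduced in code. This is a minimal sketch that splits on whitespace only; the function name ngrams is an assumption for illustration.

```python
def ngrams(text, n):
    """Return the list of word n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "I study at the University"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
print(unigrams)  # ['I', 'study', 'at', 'the', 'University']
print(bigrams)   # ['I study', 'study at', 'at the', 'the University']
```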
The next step in building the feature vector is to assign a weight to each feature. I used the TF-IDF method. TF is term frequency: the number of times term t occurs in document d. IDF is inverse document frequency: a measure of how much information the word provides, that is, whether it is common or rare across all documents. The weight of a term is the product of the two.
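To make the weighting concrete, here is a minimal sketch of one textbook variant of TF-IDF, with TF normalized by document length and IDF computed as log(N / df). The function name tf_idf and the toy documents are assumptions; scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each vector by default, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # TF (share of the document) times IDF (rarity across the corpus)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "good movie".split(),
    "bad movie".split(),
    "good good day".split(),
]
weights = tf_idf(docs)
# In the second document "bad" is rarer across the corpus than "movie",
# so it receives the larger weight.
print(weights[1])
```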
Once every word has a weight, we can train the classifier. I decided to try two classifiers and see which one suits my problem better. The first is Random Forest, a machine learning algorithm that uses an ensemble (committee) of decision trees. The second is the naive Bayes classifier, a simple probabilistic classifier based on Bayes' theorem with strong (naive) independence assumptions. The mathematical foundations of these algorithms are beyond the scope of this article.
To train the classifiers, I chose the scikit-learn library for Python. Recall that during preprocessing I split the dataset into two parts: a training set and a test set. At training time, the training set consists of the messages represented by their weighted words, together with a label indicating which class each message belongs to. We train the classifiers on this data.
4. Testing the resulting model
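A sketch of how such training might look in scikit-learn follows. The toy messages, the n-gram range, the number of trees, and the choice of MultinomialNB as the naive Bayes variant are all assumptions for illustration; the article does not state the exact settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical); 1 = positive, 0 = negative
train_texts = ["great day", "love this", "so happy",
               "awful day", "hate this", "so sad"]
train_labels = [1, 1, 1, 0, 0, 0]

# Each pipeline turns raw text into TF-IDF vectors (unigrams and bigrams)
# and passes them to a classifier.
models = {
    "random_forest": make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        RandomForestClassifier(n_estimators=100, random_state=42),
    ),
    "naive_bayes": make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        MultinomialNB(),
    ),
}

for model in models.values():
    model.fit(train_texts, train_labels)

# Predict the class of a new message with each trained model
predictions = {name: model.predict(["love this day"])
               for name, model in models.items()}
print(predictions)
```

Wrapping the vectorizer and the classifier in one pipeline ensures that the test messages are weighted with the vocabulary and IDF values learned from the training set only.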
The trained model should be tested, and for this we use the second sample. Unlike the training sample, its class labels are not shown to the classifier: it has to predict them, and the predictions are then compared with the true labels.
Testing produced the following results:
Random Forest proved to be much better than the naive Bayes classifier: the first reached an accuracy of 85.5% and the second 70%. Both are far from ideal, but these are decent results for such classifiers.
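Accuracy here is the share of test messages whose predicted class matches the true one. A minimal sketch of the calculation is given below; the labels are made up for illustration and are not the article's data. In scikit-learn the same value is returned by accuracy_score.

```python
def accuracy(true_labels, predicted_labels):
    """Share of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels: 1 = positive, 0 = negative
true_labels = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 0]
score = accuracy(true_labels, predicted)  # 4 of 5 predictions are correct
print(score)  # 0.8
```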
Conclusion
Text classification is one of the most common tasks in natural language processing. In this article we have seen which methods can be applied to the sentiment analysis of messages.
Sentiment analysis helps us better understand people on the Internet. It can be done manually, but with huge amounts of information that becomes unrealistic. Large companies use such methods to analyze how their brands perform on social networks, which opens up a wide range of opportunities for improvement.
Machine learning models can do pretty amazing things, even when it comes to human language.