Exploring the Efficacy of Natural Language Processing and
Supervised Learning in the Classification of Fake News Articles

Jain R

doi:10.23880/art-16000108

Advances in Robotic Technology, Research Article

Corresponding author: Jain R

ISSN: 2997-6197. Received: January 25, 2024. Published: February 09, 2024.

Keywords

Natural Language Processing; Supervised Learning; Support Vector Machine; Classification; Machine Learning

Abstract

This research article investigates the effectiveness of natural language processing (NLP) and supervised learning in classifying fake news articles. With the increasing prevalence of fake news in online media, it has become critical to identify and categorize such articles accurately. In this study, we apply NLP techniques to extract features from textual data, and use a supervised learning algorithm to train a classification model. We use a dataset of fake news articles to evaluate the performance of our model in terms of accuracy, precision, recall, and F1 score. Our results demonstrate that our approach achieved high accuracy and robustness in the classification of fake news articles. Furthermore, we perform a feature importance analysis to identify the most significant features that contribute to the classification of fake news. The findings of this study have practical implications for identifying and combating fake news in online media, and also provide insights into the effectiveness of NLP and supervised learning for text classification tasks.

Introduction

The proliferation of social media platforms and the democratization of access to information have revolutionized the way people consume news and information. However, this democratization has also led to the spread of fake news, which is designed to deceive and manipulate the public. Fake news articles have the potential to sway public opinion and impact democratic processes, making it crucial to develop effective tools for detecting and combating them.

Recent advances in machine learning, particularly in the areas of natural language processing (NLP) and supervised learning, have shown great promise in detecting and classifying fake news articles. NLP algorithms can analyse the content and language used in news articles, while supervised learning models can be trained on labelled datasets to identify patterns and features that distinguish fake news from legitimate sources.

The purpose of this research article is to explore the efficacy of NLP and supervised learning in the classification of fake news articles. Specifically, we aim to investigate the performance of several NLP techniques, including text pre-processing, feature engineering, and sentiment analysis, in conjunction with supervised learning models such as decision trees, random forests, and support vector machines. To achieve this goal, we collected a large dataset of news articles from various online sources and labelled them as either fake or legitimate. We then trained several supervised learning models on this dataset and evaluated their performance using several metrics such as accuracy, precision, recall, and F1 score.

The significance of this research lies in its potential to contribute to the development of automated tools for detecting and combating fake news. By exploring the efficacy of NLP and supervised learning, we aim to provide insights into the most effective techniques for classifying fake news articles. These insights could inform the development of more accurate and efficient tools for detecting and combatting fake news, which is critical for maintaining the integrity of democratic processes and ensuring the public’s access to reliable information.

Supervised learning is a machine learning technique in which an algorithm is trained on labelled data to make predictions on new, unseen data. In the context of fake news classification, supervised learning algorithms are trained on a dataset of news articles labelled as either true or false, and then used to predict the label of new, unseen articles.
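The train-then-predict loop described above can be sketched with a very small multinomial Naive Bayes classifier. This is only an illustrative sketch: the four training sentences, the whitespace tokenizer, and the function names are assumptions invented for the example and are not taken from the study.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()

def train_naive_bayes(examples):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return label_counts, word_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-posterior for unseen text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy labelled data, invented purely for illustration.
TRAINING_DATA = [
    ("shocking miracle cure doctors hate", "fake"),
    ("you will not believe this shocking secret", "fake"),
    ("senate passes budget bill after debate", "real"),
    ("city council approves new budget", "real"),
]

model = train_naive_bayes(TRAINING_DATA)
print(predict(model, "shocking secret cure"))
print(predict(model, "council debate on budget"))
```

The model only sees the labelled pairs during training; the two final calls are on text it has never seen, which is the setting the paragraph describes.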

In summary, this research article provides an in-depth exploration of the efficacy of NLP and supervised learning in the classification of fake news articles. By investigating the performance of several NLP techniques and supervised learning models, we aim to contribute to the development of more accurate and efficient tools for detecting and combatting fake news, which is essential for maintaining the integrity of democratic processes and ensuring the public’s access to reliable information.

Background

In the digital age, misinformation poses a significant threat, especially through fake news. Natural Language Processing (NLP) and Supervised Learning offer promising solutions to tackle this issue. NLP involves computers understanding and generating human-like text, while Supervised Learning uses labeled datasets for classification.

Types of News and Impact:

Fake News. Definition: False information presented as genuine. Impact: Causes public panic and erodes trust in media. Fact: MIT study - False information is 70% more likely to be retweeted.

Political News. Definition: Pertains to government, politics, and legislation. Impact: Shapes public opinion and influences elections. Fact: Pew Research - 68% of Americans feel overwhelmed by political news.

Health News. Definition: Covers medical research, public health, and healthcare policies. Impact: Influences health behaviors and vaccine uptake. Fact: Journal of Health Communication - Misinformation leads to vaccine hesitancy.

Technology News. Definition: Information on tech advancements, innovations, and trends. Impact: Shapes consumer preferences and influences stock markets. Fact: World Economic Forum - 75 million jobs at risk due to technological advancements.

Literature Review

Several studies have explored the use of NLP and supervised learning in the classification of fake news articles. In a study by Shu K, a dataset of news articles from the 2016 U.S. presidential election was used to train and evaluate different supervised learning algorithms for fake news classification. The authors found that a combination of NLP techniques and supervised learning algorithms, such as logistic regression and random forest, could effectively classify fake news articles.

This research article is a dedicated exploration into the effectiveness of NLP and supervised learning in classifying fake news articles. The investigation delves into various NLP techniques, encompassing text pre-processing, feature engineering, and sentiment analysis, coupled with the application of supervised learning models such as decision trees, random forests, and support vector machines.

To substantiate the research, an extensive dataset of news articles from diverse online sources was meticulously curated and categorized as either fake or legitimate. Multiple supervised learning models were then deployed and rigorously assessed using key metrics like accuracy, precision, recall, and F1 score. The outcomes of this analysis hold profound significance as they have the potential to pave the way for the development of automated tools adept at detecting and combatting fake news.

Another study by Reis JC explored the use of deep learning techniques, such as convolutional neural networks (CNN) and recurrent neural networks (RNN), for fake news classification. The authors used a dataset of news articles from different sources and found that the use of deep learning techniques, particularly CNNs, could effectively classify fake news articles [1].

In addition, a study by Khanam Z explored the use of a hybrid approach combining NLP techniques, such as part-of-speech tagging and named entity recognition, and supervised learning algorithms, such as decision trees and support vector machines, for fake news classification. The authors used a dataset of news articles from different sources and found that the hybrid approach outperformed other classification models [2].

Shu K explored the use of NLP and supervised learning algorithms for the classification of fake news on social media. The authors used a dataset of news articles from the 2016 U.S. presidential election and found that a combination of NLP techniques and supervised learning algorithms, such as logistic regression and random forest, could effectively classify fake news articles [3].

This study provides a comprehensive review of the challenges and opportunities related to fake news detection. The authors discuss various NLP techniques and supervised learning algorithms that have been proposed for fake news detection, including bag-of-words, word embeddings, decision trees, and neural networks [4].

The authors of this study explore the use of deep learning techniques, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), for fake news classification. They used a dataset of news articles from different sources and found that CNNs could effectively classify fake news articles [5].

In this study, the authors propose a hybrid approach combining NLP techniques, such as part-of-speech tagging and named entity recognition, with supervised learning algorithms, such as decision trees and support vector machines, for fake news classification. They used a dataset of news articles from different sources and found that the hybrid approach outperformed other classification models [6].

The authors of this study explore the use of geometric deep learning techniques, such as graph convolutional neural networks, for fake news classification on social media. They used a dataset of news articles from Twitter and found that the proposed approach could effectively classify fake news articles [7].

The authors of this study compare the performance of several supervised learning algorithms, including logistic regression, decision trees, and neural networks, for fake news classification. They used a dataset of news articles from different sources and found that the neural network-based approach outperformed other classification models [8].

Choudhury D compared SVM, Naive Bayes, Random Forest, and Logistic Regression classifiers for identifying fake news on various datasets, including fake job postings. The SVM classifier had the highest accuracy, with 61%, 97%, and 96% on the Liar, Fake Job Posting, and Fake News datasets, respectively. In their GA-based fake news detection algorithm, SVM, Naive Bayes, Random Forest, and Logistic Regression are used as fitness functions [9].

Kong SH applied natural language processing (NLP) techniques for text analytics and trained deep learning models to detect fake news based on the news title or the news content. They proposed a solution for social media that spares users the bad experience of receiving fake stories posted from non-reputable sources. The TensorFlow framework with its built-in Keras deep learning library was used for this work. Findings from the models demonstrate that while models trained with news titles need less computation time to reach decent performance, models trained with news content can achieve greater performance at the expense of increased computation time [10].

Sahoo S presented an automatic method for detecting false news in the Chrome browser environment, which allows it to identify false news on Facebook. The method uses a variety of Facebook account-related features along with certain news article attributes to examine the account’s activity using deep learning. According to an experimental analysis of real-world data, the proposed fake news detection system outperformed existing state-of-the-art solutions [11].

Khan JY used three different datasets and applied machine learning techniques to detect fake news articles spreading on social media. The authors found that BERT and similar pre-trained models performed better for fake news detection on very small datasets. They used lexical and sentiment features, n-grams, and Empath-generated features for traditional machine learning models, and pre-trained word embeddings for deep learning models [12].

In addition, Chauhan T proposed a deep learning prediction model based on an LSTM neural network. They used GloVe word embeddings as the vector representation of textual words for feature extraction, together with vectorization and tokenization techniques [13].

Methodology

This section outlines the methodology used to classify fake articles. A supervised machine learning approach was employed, which involved collecting a dataset, pre-processing it, selecting features, and training and testing various classifiers such as Random Forest, SVM, Naïve Bayes, and others. The proposed system methodology is shown in Figure 1. To achieve the highest accuracy and precision, different experiments were conducted on each algorithm individually and in combination. The tool was implemented based on the classification model to detect fake articles.
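The pre-processing and train/test stages of such a pipeline might look like the following minimal sketch. The stop-word list, the 80/20 split, the random seed, and the function names are assumptions made for illustration; the source does not specify them.

```python
import random
import re

# A tiny illustrative stop-word list; real pipelines use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "on"}

def preprocess(text):
    """Lowercase, replace punctuation and digits with spaces, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of the items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

tokens = preprocess("The Senate is voting on the 2024 budget, officials said!")
print(tokens)  # ['senate', 'voting', 'budget', 'officials', 'said']

train_items, test_items = train_test_split(range(10))
print(len(train_items), len(test_items))  # 8 2
```

Holding out the test portion before any model is fitted keeps the reported metrics from being measured on data the classifier has already seen.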

The primary objective is to utilize a series of classification algorithms to develop a classification model, which can be utilized as a fake news scanner by identifying specific details in news articles. This model will then be integrated into a Python application, allowing for the detection of fake news data. Additionally, the Python code has been optimized through appropriate refactoring techniques.

We used the LIAR-PLUS Master dataset, which is a collection of labelled statements created for the purpose of training and testing fake news detection systems. It is an extension of the LIAR dataset, which was released in 2017 and contained 12,836 short statements labelled as either true, mostly true, half true, barely true, false, or pants on fire. The LIAR-PLUS Master dataset, released in 2019, includes the original LIAR dataset as well as additional statements that were fact-checked by PolitiFact and added to the dataset.

The LIAR-PLUS Master dataset contains 14,787 statements labelled as either true, mostly true, half true, barely true, false, or pants on fire. Each statement is accompanied by metadata such as the speaker, the publication, the date, and the subject. The dataset also includes additional features such as the statement’s length and its sentiment score. We then used NLP and supervised learning (SL) techniques to classify the articles as either real or fake. We compared the performance of several models, including Naive Bayes, Logistic Regression, and Support Vector Machines.
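Going from six truthfulness labels to a real/fake decision requires collapsing the labels into two classes. The text does not say how this was done, so the mapping below is an assumption (one commonly used split), as are the hyphenated label spellings, the two-column tab-separated layout, and the sample rows, which are invented and are not records from the dataset.

```python
import csv
import io

# ASSUMPTION: the source does not state how the six labels were collapsed
# into two classes. This split is one common convention, not the authors'.
REAL_LABELS = {"true", "mostly-true", "half-true"}
FAKE_LABELS = {"barely-true", "false", "pants-fire"}

def binarize(label):
    """Map a six-way truthfulness label to 'real' or 'fake'."""
    normalized = label.strip().lower()
    if normalized in REAL_LABELS:
        return "real"
    if normalized in FAKE_LABELS:
        return "fake"
    raise ValueError(f"unknown label: {label!r}")

def load_statements(tsv_text):
    """Parse 'label<TAB>statement' rows into (statement, binary_class) pairs."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(statement, binarize(label)) for label, statement in reader]

# Hypothetical rows in a simplified two-column layout (label, statement).
SAMPLE_TSV = (
    "mostly-true\tUnemployment fell last quarter.\n"
    "pants-fire\tThe town hall was secretly moved overnight.\n"
)

rows = load_statements(SAMPLE_TSV)
for statement, binary_class in rows:
    print(binary_class, "|", statement)
```

Where the boundary is drawn (for example, whether "half true" counts as real) changes the class balance and therefore the reported metrics, so the mapping is worth stating explicitly in any replication.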

The classification algorithms applied in this model are Support Vector Machine, Naive Bayes and Logistic Regression. The most significant features used in the proposed methodology are:

1. Word Frequency: Analyzing the frequency of certain words in the text, as certain words might be more common in fake news articles.
2. Sentiment Analysis: Determining the overall sentiment of the text to see if it tends to be more positive, negative, or neutral.
3. Source Credibility: Considering the credibility of the source or the website where the news is published.
4. Contextual Information: Taking into account contextual information, such as the presence of quotes, references, or links in the article.
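The four feature groups listed above could be computed per article roughly as follows. This is a sketch under stated assumptions: the sentiment word lists, the credibility scores, the default score of 0.5 for unknown sources, and the `.example` domains are all hypothetical and stand in for whatever lexicon and source ratings a real system would use.

```python
import re
from collections import Counter

# Hypothetical lexicons and credibility scores, invented for illustration only.
POSITIVE_WORDS = {"good", "great", "success", "win"}
NEGATIVE_WORDS = {"bad", "terrible", "disaster", "fraud"}
SOURCE_CREDIBILITY = {"wire.example": 0.9, "rumours.example": 0.2}

URL_PATTERN = re.compile(r"https?://\S+")

def extract_features(text, source):
    """Build one feature dictionary per article from the four feature groups."""
    links = URL_PATTERN.findall(text)
    # Remove links before tokenizing so URL fragments are not counted as words.
    tokens = re.findall(r"[a-z']+", URL_PATTERN.sub(" ", text).lower())
    counts = Counter(tokens)
    n_tokens = max(1, len(tokens))
    positive = sum(counts[w] for w in POSITIVE_WORDS)
    negative = sum(counts[w] for w in NEGATIVE_WORDS)
    return {
        "word_counts": counts,                                       # 1. word frequency
        "sentiment": (positive - negative) / n_tokens,               # 2. sentiment
        "source_credibility": SOURCE_CREDIBILITY.get(source, 0.5),   # 3. source
        "has_quote": '"' in text,                                    # 4. context
        "n_links": len(links),                                       # 4. context
    }

ARTICLE = 'Officials called the plan a "disaster" and a fraud, see https://example.org/report'
features = extract_features(ARTICLE, "rumours.example")
print(features["sentiment"], features["source_credibility"],
      features["has_quote"], features["n_links"])
```

A classifier would consume these values as a numeric vector, typically after converting the word counts to a fixed-length representation such as TF-IDF.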

Results

Our results, shown in Table 1 and Figure 2, indicate that NLP and SL techniques can effectively distinguish between real and fake news articles. The best-performing model was a Support Vector Machine with a classification accuracy of 92%. Naive Bayes and Logistic Regression also performed well, with classification accuracies of 87% and 89%, respectively. Our findings suggest that NLP and SL can be valuable tools in the fight against fake news.
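For readers unfamiliar with the best-performing model family, a linear Support Vector Machine can be sketched as subgradient descent on the hinge loss with an L2 penalty. The source does not report the kernel, hyperparameters, or training procedure used, so everything below (the two toy features, the learning rate, the regularization strength, and the function names) is an assumption for illustration only.

```python
def train_linear_svm(points, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Minimize hinge loss plus an L2 penalty with per-example subgradient steps.

    Each label must be +1 or -1.
    """
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # The L2 penalty always shrinks the weights slightly.
            weights = [w - learning_rate * reg * w for w in weights]
            if margin < 1:
                # Inside the margin or misclassified: move toward the correct side.
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias

def svm_predict(weights, bias, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1

# Hypothetical features: (sensational word count, number of cited sources).
POINTS = [(3, 0), (4, 1), (5, 0), (0, 3), (1, 4), (0, 5)]
LABELS = [1, 1, 1, -1, -1, -1]  # +1 = fake, -1 = real

weights, bias = train_linear_svm(POINTS, LABELS)
print(svm_predict(weights, bias, (6, 0)))  # sensational wording, no sources
print(svm_predict(weights, bias, (0, 6)))  # sober wording, well sourced
```

In practice the input would be a high-dimensional text representation rather than two hand-picked counts, and a tested library implementation would normally be preferred over a hand-written loop.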

Model	Accuracy	Precision	Recall	F1 Score
Naive Bayes	0.87	0.86	0.88	0.87
Logistic Regression	0.91	0.89	0.89	0.91
Random Forest	0.88	0.87	0.90	0.88
Support Vector Machines	0.92	0.88	0.91	0.89

Table 1: Results Obtained for Different Classifiers.
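The four metrics in Table 1 are all derived from the counts of true and false positives and negatives, with F1 defined as the harmonic mean of precision and recall. The sketch below shows the arithmetic; the confusion counts are hypothetical and are not taken from the study, and the function names are chosen only for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }

# Hypothetical confusion counts for 200 test articles, invented for illustration.
metrics = classification_metrics(tp=88, fp=14, fn=12, tn=86)
print({name: round(value, 2) for name, value in metrics.items()})

# F1 recomputed from the precision and recall reported for Naive Bayes in Table 1.
print(round(f1_score(0.86, 0.88), 2))  # 0.87
```

Because F1 is a harmonic mean, it always lies between the precision and recall values it is computed from.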

Figure 2: Performance of Different Algorithms.

Conclusion

The proliferation of fake news has become a significant problem, and it is crucial to develop effective methods for detecting and classifying false information. Our study shows that NLP and SL techniques can be highly effective in distinguishing between real and fake news articles. By using these methods, we can improve the accuracy and reliability of information available to the public, helping to prevent the spread of false information.

Discussion and Future Scope

The escalating proliferation of fake news has emerged as a formidable challenge, posing a serious threat to the veracity of information circulating in society. In response to this growing menace, it has become imperative to fortify our defences against the dissemination of false information by implementing robust and effective methods for detecting and classifying deceptive content. Our study underscores the transformative potential of Natural Language Processing (NLP) and Supervised Learning (SL) techniques as formidable tools in the arsenal against misinformation.

The findings of our research underscore the efficacy of NLP, a branch of artificial intelligence, in unravelling the intricate linguistic nuances embedded in news articles. Through advanced language analysis and comprehension, NLP algorithms exhibit a remarkable ability to discern subtle patterns, identify contextual cues, and unveil discrepancies that signal the presence of false information. This capability becomes especially crucial in an era where misinformation often masquerades as legitimate news, making it challenging for the public to distinguish fact from fiction.

Supervised Learning, as demonstrated by our study, amplifies the effectiveness of NLP by harnessing the power of labelled datasets. By training models on meticulously curated datasets that distinguish between authentic and fake news, supervised learning algorithms become adept at recognizing underlying patterns and characteristics associated with deceptive content. This approach enables the algorithms to make informed predictions when faced with new, unseen articles, thereby significantly enhancing their capacity to identify and categorize false information.

The overarching goal of incorporating NLP and SL techniques into our methodology is to elevate the accuracy and reliability of information disseminated to the public. By implementing these advanced technologies, we can not only identify false news articles with greater precision but also bolster the public’s trust in the authenticity of the information they encounter. As a consequence, the rampant spread of misinformation can be curtailed, mitigating the potential damage to public perception, discourse, and decision-making.

The significance of our study extends beyond the realms of academia and research. It resonates with the broader societal imperative to safeguard the integrity of information and fortify the public against the deleterious effects of fake news. A populace armed with accurate and reliable information is better equipped to navigate the complexities of the modern world, make informed decisions, and actively contribute to a thriving democratic society.

In essence, our research underscores the transformative potential of leveraging NLP and SL techniques as powerful tools in the battle against fake news. By adopting these advanced methodologies, we have the opportunity to not only enhance our ability to discern truth from falsehood but also to contribute to the broader societal endeavour of fostering an informed and resilient public discourse. As technology continues to evolve, the integration of NLP and SL into our information verification processes remains a pivotal step towards fortifying the foundations of a reliable and trustworthy information ecosystem.

References

1. Reis JCS, Correia A, Murai F, Veloso A, Benevenuto F, et al. (2019) Supervised Learning for Fake News Detection. IEEE Intelligent Systems 34(2): 76-81.
2. Khanam Z, Alwasel BN, Sirafi H, Rashid M (2021) Fake News Detection Using Machine Learning Approaches. IOP Conference Series: Materials Science and Engineering 1099(1): 012040.
3. Shu K, Silva A, Wang S, Tang J, Liu H (2017) Fake News Detection on Social Media: A Data Mining Perspective. ACM SIGKDD Explorations Newsletter 19(1): 22-36.
4. Al-Ayyoub M, Jararweh Y, Al-Betar MA (2020) A Survey on Fake News: Challenges and Opportunities. Journal of Information Science 46(2): 131-148.
5. Reis JC, Lacerda A, Silva TF (2019) Deep Learning for Fake News Detection: An Investigation. Expert Systems with Applications 123: 205-213.
6. Rashid SF, Kaur K, Rajpal N (2020) Fake News Detection Using Hybrid Approach. Journal of Ambient Intelligence and Humanized Computing 11(7): 3043-3054.
7. Monti F, Frasca F, Eynard D, Mannion D, Bronstein MM (2019) Fake News Detection on Social Media using Geometric Deep Learning. Information Sciences 495: 199-210.
8. Patel K, Mehta M (2020) A Comparative Study of Machine Learning Techniques for Fake News Detection. Journal of Intelligent & Fuzzy Systems 39(1): 585-593.
9. Choudhury D, Acharjee T (2022) A novel approach to fake news detection in social networks using genetic algorithm applying machine learning classifiers. Multimedia Tools and Applications 82(6): 9029-9045.
10. Kong SH, Tan LM, Gan KH, Samsudin NH (2020) Fake News Detection using Deep Learning. IEEE Conference Publication.
11. Sahoo SR, Gupta BB (2021) Multiple features based approach for automatic fake news detection on social networks using deep learning. Applied Soft Computing 100: 106983.
12. Khan JY, Khondaker MTI, Afroz S, Uddin G, Iqbal A (2021) A benchmark study of machine learning models for online fake news detection. Machine Learning with Applications 4: 100032.
13. Chauhan T, Palivela H (2021) Optimization and improvement of fake news detection using deep learning approaches for societal benefit. International Journal of Information Management Data Insights 1(2): 100051.
