News Articles Classification

Web scraping and machine learning text classification

I scraped 22,000 category-labeled articles from multiple Romanian news sites, then explored the data and trained text classifiers on it.

The dataset contains news articles with columns: URL, source, date, category, title, content.

The dataset covers articles published from November 2016 to October 2021, though most are from 2020 and 2021.

The articles are evenly split among 5 categories (4,400 per class) to avoid class imbalance: politics, social, economy, science, sports.

The articles are collected from 5 news sites: Digi24, Mediafax, Libertatea, News.ro, RFI.

I scraped the sites with the requests and BeautifulSoup libraries in Python, pausing between requests so as not to overload the servers.
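A minimal sketch of that scraping loop follows. It is not the project's actual scraper: the `<h1>` and `<p>` selectors, the function names `parse_article` and `scrape`, and the one-second delay are all assumptions made for illustration, and each real site would need its own selectors.

```python
import time

import requests
from bs4 import BeautifulSoup


def parse_article(html: str) -> dict:
    """Extract title and body text from one article page.

    The <h1> and <p> selectors are assumptions; every site has its own markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h1")
    title = title_tag.get_text(strip=True) if title_tag else ""
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    content = " ".join(p for p in paragraphs if p)  # skip empty paragraphs
    return {"title": title, "content": content}


def scrape(urls, delay_seconds=1.0):
    """Download each URL in turn, sleeping between requests."""
    articles = []
    for url in urls:
        # timeout=10 only bounds how long we wait for a response;
        # the politeness pause is the time.sleep call below.
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        article = parse_article(response.text)
        article["url"] = url
        articles.append(article)
        time.sleep(delay_seconds)  # pause so the server is not flooded
    return articles
```

Keeping the parsing in its own function means it can be checked against saved HTML without touching the network.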

The most frequent words can be visualised with a word cloud.

We can also break the word frequencies down by category with bar charts.
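The counts behind such per-category charts can be sketched with the standard library alone. The function name `top_words_by_category`, the toy sentences, and the tiny stop-word set are hypothetical and only stand in for the real dataset.

```python
from collections import Counter, defaultdict


def top_words_by_category(articles, n=3, stop_words=frozenset()):
    """Return the n most frequent words for each category.

    `articles` is an iterable of (category, text) pairs.
    """
    counts = defaultdict(Counter)
    for category, text in articles:
        words = (w for w in text.lower().split() if w not in stop_words)
        counts[category].update(words)
    return {cat: counter.most_common(n) for cat, counter in counts.items()}


# Toy data, not taken from the real corpus.
sample = [
    ("sports", "echipa a castigat meciul iar echipa sarbatoreste"),
    ("economy", "inflatia creste si preturile cresc iar inflatia ramane"),
]
top = top_words_by_category(sample, n=1, stop_words={"a", "si", "iar"})
```

Each category's list of (word, count) pairs can then be passed to any plotting library to draw one bar chart per category.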

Training a machine learning model requires preprocessing the text first. I removed punctuation, special characters, and digits, lowercased the text, and lemmatized the words. Then I vectorized the text with TF-IDF.
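The cleaning and vectorizing steps might look roughly like the sketch below. It assumes scikit-learn's `TfidfVectorizer` with default settings and deliberately leaves out lemmatization, which needs a Romanian language model. The helper name `clean_text` and the two example sentences are made up for illustration.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer


def clean_text(text: str) -> str:
    """Lowercase the text and keep only letters (Romanian diacritics included)."""
    text = text.lower()
    # \W covers punctuation and whitespace; digits and underscores count as
    # word characters, so they are listed explicitly.
    text = re.sub(r"[\W\d_]+", " ", text)
    return text.strip()


# Toy documents, not taken from the real corpus.
docs = ["Guvernul a aprobat 3 măsuri!", "Echipa a câștigat meciul, 2-1."]
cleaned = [clean_text(d) for d in docs]

# Lemmatization is intentionally omitted in this sketch.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)  # sparse matrix: one row per article, one column per term
```

Note that the default tokenizer in `TfidfVectorizer` ignores one-letter tokens, so a word such as "a" does not get its own column.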

A dimensionality reduction algorithm such as t-SNE lets us visualize the news articles on a scatter plot. Points that lie close to each other should correspond to related articles. I recommend opening the scatter plot below in full screen view and hovering the cursor over articles for more information.
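As a rough sketch, the projection could be computed with scikit-learn's `TSNE`. The intermediate `TruncatedSVD` step is an assumption added here, a common way to turn the sparse TF-IDF matrix into a small dense one before t-SNE, and not necessarily what the project did. The function name `embed_2d`, the parameter values, and the toy corpus are also illustrative.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE


def embed_2d(texts, svd_components=5, perplexity=3.0, seed=0):
    """Project texts to 2-D points: TF-IDF, then SVD, then t-SNE."""
    tfidf = TfidfVectorizer().fit_transform(texts)
    # Compress the sparse TF-IDF matrix to a few dense components first.
    reduced = TruncatedSVD(n_components=svd_components, random_state=seed).fit_transform(tfidf)
    # perplexity must be smaller than the number of samples.
    tsne = TSNE(n_components=2, perplexity=perplexity, init="random", random_state=seed)
    return tsne.fit_transform(reduced)


# Toy corpus with three made-up topics, four documents each.
texts = [
    "guvern parlament vot lege",
    "parlament vot motiune guvern",
    "lege guvern ministru vot",
    "ministru parlament lege motiune",
    "meci echipa gol antrenor",
    "echipa gol victorie meci",
    "antrenor echipa meci victorie",
    "gol victorie antrenor echipa",
    "inflatie banca dobanda pret",
    "banca dobanda credit inflatie",
    "pret inflatie credit banca",
    "dobanda credit pret banca",
]
coords = embed_2d(texts)  # array of shape (number of texts, 2)
```

On a corpus of thousands of articles, larger values for `svd_components` and `perplexity` would be more typical. t-SNE preserves local neighbourhoods better than global distances, so the gaps between far-apart clusters should be read with caution.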

I tested multiple machine learning classifiers, tuning their hyperparameters with a grid search and 10-fold cross-validation. The best classifier was SGD (stochastic gradient descent) at 86.91% cross-validation accuracy. A voting ensemble of all the classifiers reached 86.90% accuracy, tied with Linear SVC. See the table below.

Classifier             Accuracy
SGD                    86.91%
Voting                 86.90%
Linear SVC             86.90%
Logistic Regression    86.61%
Multinomial NB         85.83%
Extra Trees            84.99%
K Neighbors            84.18%
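The tuning procedure can be sketched with scikit-learn's `GridSearchCV`, here for the SGD classifier only. The parameter grid, the pipeline step names, and the synthetic two-class corpus are assumptions chosen to keep the example small. They are not the grid or data behind the accuracies in the table.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in corpus: 20 short documents per class.
rng = random.Random(0)
pools = {
    "sports": ["meci", "echipa", "gol", "antrenor", "victorie", "stadion"],
    "economy": ["inflatie", "banca", "dobanda", "pret", "credit", "buget"],
}
texts, labels = [], []
for label, words in pools.items():
    for _ in range(20):
        texts.append(" ".join(rng.sample(words, 4)))
        labels.append(label)

# Vectorizer and classifier in one pipeline, so TF-IDF is refit inside each fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Illustrative grid; the values are assumptions.
param_grid = {
    "clf__alpha": [1e-4, 1e-3],
    "clf__loss": ["hinge", "modified_huber"],
}

# cv=10 gives stratified 10-fold cross-validation for a classifier.
search = GridSearchCV(pipeline, param_grid, cv=10, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

Putting the vectorizer inside the pipeline keeps the held-out fold from leaking into the TF-IDF statistics, so the cross-validation accuracy is not inflated.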

Source code is here.