News Multi-Class Classification using a State-of-the-Art HuggingFace BERT Model

Abhinandan Pise
10 min read · Oct 6, 2021

Table of Contents

1 Basic Preprocessing and Data Downloading
1.1 Data Downloading
1.2 Business Problem & Challenges
1.3 Importing Libraries
1.4 Data Loading Into Pandas Dataframe
2 EDA-Exploratory Data Analysis
2.1 Data Points in the dataset
2.2 Unique Number of Categories in the Dependent Variable
2.3 Percentage Distribution of Output Class
2.4 Null Value Check
2.5 Data Visualization
3 Data Preprocessing & Vectorization of Text Data
3.1 Text Preprocessing: Stopword Removal, Special Character Removal
3.2 Train Test Split 80 20
3.3 Dependent Variable y Label Encoding
3.4 BOW representation of Text Data
3.5 TFIDF Representation of Text Data
4 Machine Learning Model Building
4.1 Naive Bayes Model
4.2 Logistic Regression Model
5 Deep Learning Model Building, Transfer Learning using HuggingFace Transformers
5.1 Train Test Split 90 10
5.2 Dependent Variable One Hot Encoding
5.3 Tokenization of Text data using “bert-base-uncased”
5.4 Converting the Tokenized IDs and Masks into TensorFlow Tensors
5.5 Preparing Train and Test Datasets for TensorFlow Format
5.6 Making Mini Batches using TensorFlow Functions
5.7 Model Building Using Pretrained bert-base-uncased
6 Comparison of all the Models

1. Basic Preprocessing and Data Downloading

1.1. Data Downloading

Script to download data from Kaggle
  • wget is a shell command that downloads data from the web without requiring a login to the system
  • The dataset is downloaded from here
  • This command downloads the data directly onto the Google Colab server, without first downloading it to the local system; a sketch follows
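
A minimal sketch of the Colab cell, assuming the archive is fetched with wget; the URL below is a placeholder for the actual dataset link:

!wget -O bbc-news.zip "<dataset-download-url>"   # placeholder: paste the dataset link here
!unzip -o bbc-news.zip                           # extract directly on the Colab server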

1.2 Business Problem & Challenges

  • This particular problem uses a public BBC dataset with 2,225 articles divided into five categories: business, entertainment, politics, sport, and tech.
  • The dataset is divided into 1,490 training records and 735 testing records. The objective is to build a system that can correctly categorise previously unseen news stories.

1.3 Importing Libraries

Importing the ML, visualization, and other required libraries.
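
A sketch of the import cell; the exact set of libraries is an assumption based on what the rest of the post uses:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import tensorflow as tf
from transformers import BertTokenizerFast, TFBertModel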

1.4 Data Loading Into Pandas Dataframe

  • Pandas is built around tabular data; it can read CSV, TSV, and JSON file formats
  • Using pd.read_csv we can load the data into a table-like pandas dataframe, as sketched below
  • There are 2 independent-variable columns: ArticleId and Text
  • The target column is Category; in other words, it is the dependent variable
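
A minimal sketch, assuming the training file is named "BBC News Train.csv" and the dataframe is called data (both names are assumptions):

data = pd.read_csv('BBC News Train.csv')   # columns: ArticleId, Text, Category
data.head()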

2 EDA-Exploratory Data Analysis

2.1 Data Points in the dataset

  • data.shape[0] gives the first dimension of the dataframe, i.e. the number of rows
  • data.shape[1] gives the second dimension of the dataframe, i.e. the number of columns
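
The cell behind the output below is roughly:

print('Number of rows or datapoints in the dataset:', data.shape[0])
print('Number of columns in the dataset:', data.shape[1])
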
==================================================
Number of rows or datapoints in the dataset: 1490
--------------------------------------------------
Number of columns in the dataset: 3
==================================================

2.2 Unique Number of Categories in the Dependent Variable

============================================================
number of output classes in the dataset: 5
============================================================
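
The count above comes from a cell like:

print('Number of output classes in the dataset:', data['Category'].nunique())
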
  • Since there are 5 unique categories in the dependent variable, this particular problem is a multi-class classification problem
  • As the term indicates, this is a classification task with more than two categories. The key idea in multi-class classification is that every data point in the dataset belongs to exactly one of all the possible class labels.
  • Such events are known as “mutually exclusive events” in probability and statistics, which means the probability of two of them occurring at the same moment is ZERO.
Binary Classification vs Multi-Class Classification (source)

2.3 Percentage Distribution of Output Class

============================================================
% of output classes in the dataset:
------------------------------------------------------------
sport 23.221477
business 22.550336
politics 18.389262
entertainment 18.322148
tech 17.516779
Name: Category, dtype: float64
============================================================
  • All 5 categories have almost the same number of datapoints in the dataset.
  • We can therefore say that this particular news dataset is a balanced dataset
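
The distribution shown above comes from a cell like:

print('% of output classes in the dataset:')
print(data['Category'].value_counts(normalize=True) * 100)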

2.3.1 Evaluation Metric

  • As the analysis above shows that the dataset is balanced, we use accuracy as the evaluation metric for this particular business problem
Accuracy Formula
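
For reference, accuracy here is simply the fraction of articles assigned the correct category:

Accuracy = (number of correctly classified articles) / (total number of articles)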

2.4 Null Value Check

  • The pandas .isnull() function detects missing values in a particular column
  • Applying .sum() after .isnull() gives the total number of null values in the dataset (dataframe)
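
The check below is a one-liner:

print('Null values in the Dataset :', data.isnull().sum().sum())
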
==================================================
Null values in the Dataset : 0
==================================================

2.5 Data Visualization

2.5.1 Bar Plot of Output Class

  • This bar plot is drawn using the plotly library; here we plot the number of datapoints per category, as sketched below
  • On the x axis: the different categories; on the y axis: the number of datapoints
  • From the bar graph we can observe that the business category has 336 datapoints
  • The entertainment category has 273 datapoints
  • The politics category has 274 datapoints
  • The sport category has 346 datapoints
  • The tech category has 261 datapoints
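
A minimal plotly sketch of the bar plot described above:

counts = data['Category'].value_counts()
fig = go.Figure(go.Bar(x=counts.index, y=counts.values))
fig.update_layout(title='Number of datapoints per category',
                  xaxis_title='Category', yaxis_title='Number of datapoints')
fig.show()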

2.5.2 Wordcloud Representation of each class

  • A word cloud is a graphical representation of text in which the words that occur most frequently are given extra prominence.
  • Here a word cloud is plotted separately for the business, politics, tech, sport & entertainment categories
  • Different masks are used to shape the outline of each word cloud.
  • The word clouds are plotted using the wordcloud library; a sketch follows.
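
A sketch for a single class (the mask images used in the post to shape the clouds are not reproduced here):

business_text = ' '.join(data[data['Category'] == 'business']['Text'])
wc = WordCloud(width=800, height=400, background_color='white').generate(business_text)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title('business')
plt.show()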

2.5.3 Number of Words Distribution per Classes

  • Here we analyse the min, max & average number of words for each output class of the dependent variable
  • The pandas functions .mean(), .max() and .min() give the average, maximum and minimum values respectively.
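
The statistics below are computed with a cell roughly like this, counting words with a simple whitespace split:

for category in data['Category'].unique():
    word_counts = data[data['Category'] == category]['Text'].apply(lambda t: len(t.split()))
    print('Mean number of words in the category {}:: {}'.format(category, word_counts.mean()))
    print('Max number of words in the category {}:: {}'.format(category, word_counts.max()))
    print('Min number of words in the category {}:: {}'.format(category, word_counts.min()))
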
====================================================================
Mean number of words in the category business:: 501.8582375478927
Max number of words in the category business:: 1549
Min number of words in the category business:: 188
====================================================================
====================================================================
Mean number of words in the category tech:: 334.16964285714283
Max number of words in the category tech:: 902
Min number of words in the category tech:: 145
====================================================================
====================================================================
Mean number of words in the category politics:: 335.3468208092485
Max number of words in the category politics:: 1671
Min number of words in the category politics:: 116
====================================================================
====================================================================
Mean number of words in the category sport:: 333.9120879120879
Max number of words in the category sport:: 2448
Min number of words in the category sport:: 144
====================================================================
====================================================================
Mean number of words in the category entertainment:: 449.68978102189783
Max number of words in the category entertainment:: 3345
Min number of words in the category entertainment:: 90
====================================================================
  • We can observe that business news has the highest average word count in this dataset, followed by entertainment.
  • Sport news has the lowest average word count.

2.5.3.1 Total Numbers of words Distribution

  • The go.Histogram function of the plotly library gives a histogram of the number of words in each output class; a sketch follows.
  • The histograms for entertainment & politics are highly skewed compared to tech, business & sport.
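
A plotly sketch for one class (repeat per category):

sport_counts = data[data['Category'] == 'sport']['Text'].apply(lambda t: len(t.split()))
fig = go.Figure(go.Histogram(x=sport_counts))
fig.update_layout(title='Words per article: sport',
                  xaxis_title='Number of words', yaxis_title='Number of articles')
fig.show()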

3 Data Preprocessing & Vectorization of Text Data

3.1 Text Preprocessing: Stopword Removal, Special Character Removal

  • Contractions are words or groups of words that are shortened by dropping letters and replacing them with an apostrophe.
  • When the decontraction function is applied to text data, the following result is obtained:
Example (source):
Original text: I'll be there within 5 min. Shouldn't you be there too?I'd love to see u there my dear. It's awesome to meet new friends.We've been waiting for this day for so long.
Expanded_text: I will be there within 5 min. should not you be there too?I would love to see you there my dear. it is awesome to meet new friends.we have been waiting for this day for so long.
  • HTML tag removal, special character removal, and stopword removal are also applied; a sketch of the full cleaning pipeline follows.
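
A sketch of the cleaning pipeline, assuming a decontracted() helper along the lines of the example above and the NLTK English stopword list (the post does not show its exact list):

import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')                        # run once
stop_words = set(stopwords.words('english'))

def decontracted(phrase):
    # expand the most common English contractions
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

def clean_text(text):
    text = re.sub(r'<.*?>', ' ', text)            # HTML tag removal
    text = decontracted(text)                     # expand contractions
    text = re.sub(r'[^A-Za-z ]+', ' ', text)      # special character removal
    words = [w for w in text.lower().split() if w not in stop_words]
    return ' '.join(words)                        # stopword removal

data['clean_text'] = data['Text'].apply(clean_text)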

3.2 Train Test Split 80:20

  • The cleaned text is taken as the independent variable X & the Category column is taken as the dependent variable
  • The train-test split technique is used to evaluate how well ML algorithms make decisions and predictions on data that was not used to train the model.
  • The train_test_split function from the sklearn library is used to split the dataset
  • The dataset is divided into 80% X_train data & 20% X_test data, as sketched below
  • Stratify (source): the stratify parameter makes a split so that the proportion of values in the sample produced is the same as the proportion of values provided to the stratify parameter. For example, if variable y is a binary categorical variable with values 0 and 1 and there are 25% zeros and 75% ones, stratify=y will make sure that your random split has 25% 0's and 75% 1's.
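
A minimal sketch of the 80:20 split (random_state is an assumption):

X = data['clean_text']
y = data['Category']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(X_train.shape, X_test.shape)   # (1192,) (298,)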

3.3 Dependent Variable y Label Encoding

  • Label encoding is the process of translating labels into numerical form so that they can be read by machines. ML algorithms can then better decide how those labels should be used. In supervised learning, this is a crucial pre-processing step for a structured dataset.
  • LabelEncoder() transforms the ‘business’ label into 0; the remaining categories are encoded in the same way, as sketched below.
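
A sketch of the label encoding step; LabelEncoder assigns integers in alphabetical order of the class names:

label_encoder = LabelEncoder()
y_train_label_encoded = label_encoder.fit_transform(y_train)
y_test_label_encoded = label_encoder.transform(y_test)
print(dict(zip(label_encoder.classes_, range(len(label_encoder.classes_)))))
# {'business': 0, 'entertainment': 1, 'politics': 2, 'sport': 3, 'tech': 4}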

3.4 BOW representation of Text Data

  • The sklearn package in Python has a handy class called CountVectorizer(). It converts each document into a vector based on the frequency (count) of every word that appears in the document. This is useful when dealing with a large number of documents that need to be converted into vectors; a sketch follows.
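
A minimal sketch of the BOW step; the min_df value is an assumption (the exact settings that give the 4,274 features below are not shown):

bow_vectorizer = CountVectorizer(min_df=10)           # min_df is an assumption
X_train_bow = bow_vectorizer.fit_transform(X_train)   # fit only on the training split
X_test_bow = bow_vectorizer.transform(X_test)
print('X_train_bow shape:', X_train_bow.shape)
print('X_test_bow shape:', X_test_bow.shape)
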
====================================================================
Before vectorizations:
--------------------------------------------------------------------
(1192,) (1192,)
(298,) (298,)
====================================================================
After vectorizations:
--------------------------------------------------------------------
X_train_bow shape: (1192, 4274) ,
y_train_label_encoded shape: (1192,)
X_test_bow shape: (298, 4274) ,
y_test_label_encoded.shape: (298,)
====================================================================

3.5 TFIDF Representation of Text Data
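
The TF-IDF features are built the same way, with TfidfVectorizer in place of CountVectorizer; a minimal sketch:

tfidf_vectorizer = TfidfVectorizer(min_df=10)              # min_df is an assumption
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)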

====================================================================
Before vectorizations:
--------------------------------------------------------------------
(1192,) (1192,)
(298,) (298,)
====================================================================
After vectorizations:
--------------------------------------------------------------------
X_train_tfidf shape: (1192, 4274) ,
y_train_label_encoded shape: (1192,)
X_test_tfidf shape: (298, 4274) ,
y_test_label_encoded.shape: (298,)
====================================================================

4 Machine Learning Model Building

4.1 Naive Bayes Model

4.1.1 Naive Bayes Model on BOW text Data

4.1.1.1 Hyperparameter Tuning of Naive Bayes Model using GridSearchCV(BOW text feature)

===============================================
Best_hyperparameter_NB_BoW: {'alpha': 0.01}
===============================================
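
A sketch of the grid search behind the result above; the alpha grid is an assumption. The same cell is reused for the TF-IDF features in 4.1.2 by swapping in X_train_tfidf:

param_grid = {'alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(MultinomialNB(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train_bow, y_train_label_encoded)
print('Best_hyperparameter_NB_BoW:', grid.best_params_)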

4.1.1.2 Training of Naive Bayes Model on Best Hyperparameter & Confusion Matrix

4.1.1.3 Classification Report
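
A sketch covering 4.1.1.2 and 4.1.1.3: refit with the best alpha, then compute the confusion matrix and classification report (shown as figures in the original post, printed here):

best_nb = MultinomialNB(alpha=0.01)
best_nb.fit(X_train_bow, y_train_label_encoded)
y_pred = best_nb.predict(X_test_bow)

print('Test accuracy:', accuracy_score(y_test_label_encoded, y_pred))
print(confusion_matrix(y_test_label_encoded, y_pred))
print(classification_report(y_test_label_encoded, y_pred,
                            target_names=label_encoder.classes_))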

4.1.2 Naive Bayes Model on TFIDF text Data

4.1.2.1 Hyperparameter Tuning of Naive Bayes Model using GridSearchCV

====================================================================
Best_hyperparameter_NB_tfidf: {'alpha': 0.1}
====================================================================

4.1.2.2 Training of Naive Bayes Model on Best Hyperparameter & Confusion Matrix

4.1.2.3 Classification Report

4.2 Logistic Regression Model


4.2.1 Hyperparameter Tuning of Logistic Regression Model on Text Data (TFIDF) using GridSearchCV

===================================================================
Best_hyperparameter_LR_tfidf: {'C': 1000}
===================================================================
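
A sketch of the search behind the best C above; the grid and solver settings are assumptions. The fit/evaluation cell for 4.2.2 and 4.2.3 mirrors the Naive Bayes one in 4.1.1.2:

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}     # grid is an assumption
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=5, scoring='accuracy')
grid.fit(X_train_tfidf, y_train_label_encoded)
print('Best_hyperparameter_LR_tfidf:', grid.best_params_)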

4.2.2 Training of Logistic Regression Model on Best Hyperparameter & Confusion Matrix

4.2.3 Classification Report

5 Deep Learning Model Building, Transfer Learning using HuggingFace Transformers

Deep Learning True Story (source)

5.1 Train Test Split 90 10
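
A minimal sketch of the 90:10 split; whether the raw or the cleaned text is fed to BERT is not shown, so the raw Text column and random_state are assumptions:

X_train, X_test, y_train, y_test = train_test_split(
    data['Text'], data['Category'], test_size=0.1,
    stratify=data['Category'], random_state=42)
print(X_train.shape, X_test.shape)   # (1341,) (149,)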

5.2 Dependent Variable One Hot Encoding

==================================================
shape of y_train one-hot encoded : (1341, 5)
--------------------------------------------------
shape of y_test one-hot encoded : (149, 5)
==================================================
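
One way to produce the shapes above (pd.get_dummies; the post may use a different encoder):

y_train_ohe = pd.get_dummies(y_train).values   # one column per category, alphabetical order
y_test_ohe = pd.get_dummies(y_test).values
print('shape of y_train one-hot encoded :', y_train_ohe.shape)
print('shape of y_test one-hot encoded :', y_test_ohe.shape)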

5.3 Tokenization of Text data using “bert-base-uncased”

==================================================
shape of X_train_ids: (1341, 512)
--------------------------------------------------
shape of X_train_mask: (1341, 512)
==================================================
==================================================
shape of X_test_ids: (149, 512)
--------------------------------------------------
shape of X_test_mask: (149, 512)
==================================================
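
A sketch of the tokenization cell behind the shapes above, padding/truncating every article to 512 tokens:

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

def tokenize(texts, max_len=512):
    enc = tokenizer(list(texts), max_length=max_len, truncation=True,
                    padding='max_length', return_tensors='np')
    return enc['input_ids'], enc['attention_mask']

X_train_ids, X_train_mask = tokenize(X_train)
X_test_ids, X_test_mask = tokenize(X_test)
print('shape of X_train_ids:', X_train_ids.shape)   # (1341, 512)
print('shape of X_train_mask:', X_train_mask.shape)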

5.4 Converting the Tokenized IDs and Masks into TensorFlow Tensors

5.5 Preparing Train and Test Datasets for TensorFlow Format

5.6 Making Mini Batches using TensorFlow Functions
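
A sketch covering steps 5.4-5.6. The batch size of 16 is an assumption, consistent with the 84 steps per epoch over 1,341 training examples in the training log further down:

# 5.4: numpy arrays -> TensorFlow tensors
train_ids = tf.convert_to_tensor(X_train_ids)
train_mask = tf.convert_to_tensor(X_train_mask)
train_labels = tf.convert_to_tensor(y_train_ohe)

# 5.5: tf.data.Dataset in ({'input_ids', 'attention_mask'}, labels) format
train_ds = tf.data.Dataset.from_tensor_slices(
    ({'input_ids': train_ids, 'attention_mask': train_mask}, train_labels))

# 5.6: shuffle and mini-batch
train_ds = train_ds.shuffle(len(X_train_ids)).batch(16)

# the test set gets the same treatment (without shuffling)
test_ds = tf.data.Dataset.from_tensor_slices(
    ({'input_ids': tf.convert_to_tensor(X_test_ids),
      'attention_mask': tf.convert_to_tensor(X_test_mask)},
     tf.convert_to_tensor(y_test_ohe))).batch(16)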

5.7 Model Building Using Pretrained bert-base-uncased
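
The summary below is what you get from loading the pretrained encoder:

bert = TFBertModel.from_pretrained('bert-base-uncased')
bert.summary()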

Model: "tf_bert_model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bert (TFBertMainLayer) multiple 109482240
=================================================================
Total params: 109,482,240
Trainable params: 109,482,240
Non-trainable params: 0
_________________________________________________________________

5.7.1 Customizing the output layer for our Problem

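A sketch that reproduces the architecture summarised in 5.7.2. The Bi-LSTM width (768 units per direction) follows from the parameter counts; the dropout rate is an assumption:

bert.trainable = False   # freeze BERT, train only the new head (see the param counts below)

input_ids = tf.keras.layers.Input(shape=(512,), dtype=tf.int32, name='input_ids')
attention_mask = tf.keras.layers.Input(shape=(512,), dtype=tf.int32, name='attention_mask')

bert_output = bert(input_ids, attention_mask=attention_mask)[0]   # (batch, 512, 768)
x = tf.keras.layers.Dropout(0.2)(bert_output)                     # rate is an assumption
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(768))(x)   # -> (batch, 1536)
outputs = tf.keras.layers.Dense(5, activation='softmax', name='outputs')(x)

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
model.summary()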

5.7.2 Model Summary

Model: "model"
____________________________________________________________________
Layer (type) Output Shape Param # Connected to
====================================================================
input_ids (InputLayer) [(None, 512)] 0
____________________________________________________________________
attention_mask (InputLayer) [(None, 512)] 0
____________________________________________________________________
bert (TFBertMainLayer) TFBaseModelOutputWit 109482240 input_ids[0][0]
attention_mask[0][0]
____________________________________________________________________
dropout_37 (Dropout) (None, 512, 768) 0 bert[0][0]
____________________________________________________________________
bidirectional (Bidirectional) (None, 1536) 9443328 dropout_37[0][0]
____________________________________________________________________
outputs (Dense) (None, 5) 7685 bidirectional[0][0]
====================================================================
Total params: 118,933,253
Trainable params: 9,451,013
Non-trainable params: 109,482,240
____________________________________________________________________

5.7.3 Model Compilation & Model Training
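
A sketch of the compile-and-fit cell; the optimizer, learning rate, and checkpoint settings are assumptions consistent with the log below:

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),   # lr is an assumption
              loss='categorical_crossentropy',
              metrics=['accuracy'])

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    'best_model.hdf5', monitor='val_accuracy', save_best_only=True,
    save_weights_only=True, verbose=1)

history = model.fit(train_ds, validation_data=test_ds,
                    epochs=2, callbacks=[checkpoint])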

Epoch 1/2
84/84 [==============================] - 206s 2s/step - loss: 0.3731 - accuracy: 0.8680 - val_loss: 0.0541 - val_accuracy: 0.9799

Epoch 00001: val_accuracy improved from -inf to 0.97987, saving model to best_model.hdf5
Epoch 2/2
84/84 [==============================] - 184s 2s/step - loss: 0.0938 - accuracy: 0.9754 - val_loss: 0.0746 - val_accuracy: 0.9597

Epoch 00002: val_accuracy did not improve from 0.97987

5.7.4 Performance Plotting
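
A minimal sketch of the accuracy curves (loss can be plotted the same way):

plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='val accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()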

6 Comparison of all the Models

+-------------------+---------------------+---------------+
| Vectorizer | Model | Test Accuracy |
+-------------------+---------------------+---------------+
| BOW | Naive_Bayes | 0.94 |
| TFIDF | Naive_Bayes | 0.95 |
| TFIDF | Logistic_Regression | 0.96 |
| Transfer Learning | BERT Embedding | 0.979866 |
+-------------------+---------------------+---------------+
