News Multi-Class Classification using a State-of-the-Art HuggingFace BERT Model

Abhinandan Pise
10 min read · Oct 6, 2021

Table of Contents

1 Basic Preprocessing and Data Downloading
1.1 Data Downloading
1.2 Business Problem & Challenges
1.3 Importing Libraries
1.4 Data Loading Into Pandas Dataframe
2 EDA-Exploratory Data Analysis
2.1 Data Points in the dataset
2.2 Unique Number of Categories in the Dependent Variable
2.3 Percentage Distribution of Output Class
2.4 Null Value Check
2.5 Data Visualization
3 Data Preprocessing & Vectorization of Text Data
3.1 Text Preprocessing: Stopword Removal, Special Character Removal
3.2 Train Test Split 80 20
3.3 Dependent Variable y Label Encoding
3.4 BOW representation of Text Data
3.5 TFIDF Representation of Text Data
4 Machine Learning Model Building
4.1 Naive Bayes Model
4.2 Logistic Regression Model
5 Deep Learning Model Building, Transfer Learning using HuggingFace Transformers
5.1 Train Test Split 90 10
5.2 Dependent Variable One Hot Encoding
5.3 Tokenization of Text data using “bert-base-uncased”
5.4 Converting the Tokenized IDs and Masks into TensorFlow Tensors
5.5 Preparing Train and Test Datasets for TensorFlow Format
5.6 Making Mini Batches using TensorFlow Functions
5.7 Model Building Using Pretrained bert-base-uncased
6 Comparison of all the Models

1. Basic Preprocessing and Data Downloading

1.1. Data Downloading

Script to download data from Kaggle
  • wget is a shell command that downloads data from the web without requiring a login to the system
  • The dataset is downloaded from here
  • This command downloads the data directly onto the Google Colab server, without first downloading it to the local system; a sketch follows
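
A minimal sketch of the Colab cell, assuming the archive is fetched with wget; the URL below is a placeholder for the actual dataset link:

!wget -O bbc-news.zip "<dataset-download-url>"   # placeholder: paste the dataset link here
!unzip -o bbc-news.zip                           # extract directly on the Colab server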

1.2 Business Problem & Challenges

  • This particular problem uses a public BBC dataset with 2,225 articles divided into five categories: business, entertainment, politics, sport, and tech.
  • The dataset is divided into 1,490 training records and 735 testing records. The objective is to build a system that can correctly categorise previously unseen news stories.

1.3 Importing Libraries

Importing the ML, visualization, and other required libraries.
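
A sketch of the import cell; the exact set of libraries is an assumption based on what the rest of the post uses:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import tensorflow as tf
from transformers import BertTokenizerFast, TFBertModel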

1.4 Data Loading Into Pandas Dataframe

  • Pandas is built around tabular data; it can read CSV, TSV, and JSON file formats
  • Using pd.read_csv we can load the data into a table-like pandas dataframe, as sketched below
  • There are 2 independent-variable columns: ArticleId and Text
  • The target column is Category; in other words, it is the dependent variable
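
A minimal sketch, assuming the training file is named "BBC News Train.csv" and the dataframe is called data (both names are assumptions):

data = pd.read_csv('BBC News Train.csv')   # columns: ArticleId, Text, Category
data.head()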

2 EDA-Exploratory Data Analysis

2.1 Data Points in the dataset

  • data.shape[0] gives the first dimension of the dataframe, i.e. the number of rows
  • data.shape[1] gives the second dimension of the dataframe, i.e. the number of columns
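
The cell behind the output below is roughly:

print('Number of rows or datapoints in the dataset:', data.shape[0])
print('Number of columns in the dataset:', data.shape[1])
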
==================================================
Number of rows or datapoints in the dataset: 1490
--------------------------------------------------
Number of columns in the dataset: 3
==================================================

2.2 Unique Number of Categories in the Dependent Variable

============================================================
number of output classes in the dataset: 5
============================================================
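
The count above comes from a cell like:

print('Number of output classes in the dataset:', data['Category'].nunique())
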
  • Since there are 5 unique categories in the dependent variable, this particular problem is a multi-class classification problem
  • As the term indicates, this is a classification task with more than two categories. The key idea in multi-class classification is that every data point in the dataset belongs to exactly one of all the possible class labels.
  • Such events are known as “mutually exclusive events” in probability and statistics, which means the probability of two of them occurring at the same moment is ZERO.
Binary Classification vs Multi-Class Classification (source)

2.3 Percentage Distribution of Output Class

============================================================
% of output classes in the dataset:
------------------------------------------------------------
sport 23.221477
business 22.550336
politics 18.389262
entertainment 18.322148
tech 17.516779
Name: Category, dtype: float64
============================================================
  • All 5 categories have almost the same number of datapoints in the dataset.
  • We can therefore say that this particular news dataset is a balanced dataset
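
The distribution shown above comes from a cell like:

print('% of output classes in the dataset:')
print(data['Category'].value_counts(normalize=True) * 100)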

2.3.1 Evaluation Metric

  • As the analysis above shows that the dataset is balanced, we use accuracy as the evaluation metric for this particular business problem
Accuracy Formula
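
For reference, accuracy here is simply the fraction of articles assigned the correct category:

Accuracy = (number of correctly classified articles) / (total number of articles)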

2.4 Null Value Check

  • The pandas .isnull() function detects missing values in a particular column
  • Applying .sum() after .isnull() gives the total number of null values in the dataset (dataframe)
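
The check below is a one-liner:

print('Null values in the Dataset :', data.isnull().sum().sum())
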
==================================================
Null values in the Dataset : 0
==================================================

2.5 Data Visualization

2.5.1 Bar Plot of Output Class

  • This bar plot is drawn using the plotly library; here we plot the number of datapoints per category, as sketched below
  • On the x axis: the different categories; on the y axis: the number of datapoints
  • From the bar graph we can observe that the business category has 336 datapoints
  • The entertainment category has 273 datapoints
  • The politics category has 274 datapoints
  • The sport category has 346 datapoints
  • The tech category has 261 datapoints
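
A minimal plotly sketch of the bar plot described above:

counts = data['Category'].value_counts()
fig = go.Figure(go.Bar(x=counts.index, y=counts.values))
fig.update_layout(title='Number of datapoints per category',
                  xaxis_title='Category', yaxis_title='Number of datapoints')
fig.show()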

2.5.2 Wordcloud Representation of each class

  • A word cloud is a graphical representation of text in which the words that occur most frequently are given extra prominence.
  • Here a word cloud is plotted separately for the business, politics, tech, sport & entertainment categories
  • Different masks are used to shape the outline of each word cloud.
  • The word clouds are plotted using the wordcloud library; a sketch follows.
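
A sketch for a single class (the mask images used in the post to shape the clouds are not reproduced here):

business_text = ' '.join(data[data['Category'] == 'business']['Text'])
wc = WordCloud(width=800, height=400, background_color='white').generate(business_text)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title('business')
plt.show()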

2.5.3 Number of Words Distribution per Classes

  • Here we analyse the min, max & average number of words for each output class of the dependent variable
  • The pandas functions .mean(), .max() and .min() give the average, maximum and minimum values respectively.
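
The statistics below are computed with a cell roughly like this, counting words with a simple whitespace split:

for category in data['Category'].unique():
    word_counts = data[data['Category'] == category]['Text'].apply(lambda t: len(t.split()))
    print('Mean number of words in the category {}:: {}'.format(category, word_counts.mean()))
    print('Max number of words in the category {}:: {}'.format(category, word_counts.max()))
    print('Min number of words in the category {}:: {}'.format(category, word_counts.min()))
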
====================================================================
Mean number of words in the category business:: 501.8582375478927
Max number of words in the category business:: 1549
Min number of words in the category business:: 188
====================================================================
====================================================================
Mean number of words in the category tech:: 334.16964285714283
Max number of words in the category tech:: 902
Min number of words in the category tech:: 145
====================================================================
====================================================================
Mean number of words in the category politics:: 335.3468208092485
Max number of words in the category politics:: 1671
Min number of words in the category politics:: 116
====================================================================
====================================================================
Mean number of words in the category sport:: 333.9120879120879
Max number of words in the category sport:: 2448
Min number of words in the category sport:: 144
====================================================================
====================================================================
Mean number of words in the category entertainment:: 449.68978102189783
Max number of words in the category entertainment:: 3345
Min number of words in the category entertainment:: 90
====================================================================
  • We can observe that business news has the highest average word count in this dataset, followed by entertainment.
  • Sport news has the lowest average word count.

2.5.3.1 Total Numbers of words Distribution

  • The go.Histogram function of the plotly library gives a histogram of the number of words in each output class; a sketch follows.
  • The histograms for entertainment & politics are highly skewed compared to tech, business & sport.
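
A plotly sketch for one class (repeat per category):

sport_counts = data[data['Category'] == 'sport']['Text'].apply(lambda t: len(t.split()))
fig = go.Figure(go.Histogram(x=sport_counts))
fig.update_layout(title='Words per article: sport',
                  xaxis_title='Number of words', yaxis_title='Number of articles')
fig.show()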

3 Data Preprocessing & Vectorization of Text Data

3.1 Text Preprocessing: Stopword Removal, Special Character Removal

  • Contractions are words or groups of words that are shortened by dropping letters and replacing them with an apostrophe.
  • When the decontraction function is applied to text data, the following result is obtained:
Example (source):
Original text: I'll be there within 5 min. Shouldn't you be there too?I'd love to see u there my dear. It's awesome to meet new friends.We've been waiting for this day for so long.
Expanded_text: I will be there within 5 min. should not you be there too?I would love to see you there my dear. it is awesome to meet new friends.we have been waiting for this day for so long.
  • HTML tag removal, special character removal, and stopword removal are also applied; a sketch of the full cleaning pipeline follows.
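
A sketch of the cleaning pipeline, assuming a decontracted() helper along the lines of the example above and the NLTK English stopword list (the post does not show its exact list):

import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')                        # run once
stop_words = set(stopwords.words('english'))

def decontracted(phrase):
    # expand the most common English contractions
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

def clean_text(text):
    text = re.sub(r'<.*?>', ' ', text)            # HTML tag removal
    text = decontracted(text)                     # expand contractions
    text = re.sub(r'[^A-Za-z ]+', ' ', text)      # special character removal
    words = [w for w in text.lower().split() if w not in stop_words]
    return ' '.join(words)                        # stopword removal

data['clean_text'] = data['Text'].apply(clean_text)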

3.2 Train Test Split 80:20

  • The cleaned text is taken as the independent variable X & the Category column is taken as the dependent variable
  • The train-test split technique is used to evaluate how well ML algorithms make decisions and predictions on data that was not used to train the model.
  • The train_test_split function from the sklearn library is used to split the dataset
  • The dataset is divided into 80% X_train data & 20% X_test data, as sketched below
  • Stratify (source): the stratify parameter makes a split so that the proportion of values in the sample produced is the same as the proportion of values provided to the stratify parameter. For example, if variable y is a binary categorical variable with values 0 and 1 and there are 25% zeros and 75% ones, stratify=y will make sure that your random split has 25% 0's and 75% 1's.
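
A minimal sketch of the 80:20 split (random_state is an assumption):

X = data['clean_text']
y = data['Category']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(X_train.shape, X_test.shape)   # (1192,) (298,)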

3.3 Dependent Variable y Label Encoding

  • Label encoding is the process of translating labels into numerical form so that they can be read by machines. ML algorithms can then better decide how those labels should be used. In supervised learning, this is a crucial pre-processing step for a structured dataset.
  • LabelEncoder() transforms the ‘business’ label into 0; the remaining categories are encoded in the same way, as sketched below.
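
A sketch of the label encoding step; LabelEncoder assigns integers in alphabetical order of the class names:

label_encoder = LabelEncoder()
y_train_label_encoded = label_encoder.fit_transform(y_train)
y_test_label_encoded = label_encoder.transform(y_test)
print(dict(zip(label_encoder.classes_, range(len(label_encoder.classes_)))))
# {'business': 0, 'entertainment': 1, 'politics': 2, 'sport': 3, 'tech': 4}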

3.4 BOW representation of Text Data

  • The sklearn package in Python has a handy class called CountVectorizer(). It converts each document into a vector based on the frequency (count) of every word that appears in the document. This is useful when dealing with a large number of documents that need to be converted into vectors; a sketch follows.
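
A minimal sketch of the BOW step; the min_df value is an assumption (the exact settings that give the 4,274 features below are not shown):

bow_vectorizer = CountVectorizer(min_df=10)           # min_df is an assumption
X_train_bow = bow_vectorizer.fit_transform(X_train)   # fit only on the training split
X_test_bow = bow_vectorizer.transform(X_test)
print('X_train_bow shape:', X_train_bow.shape)
print('X_test_bow shape:', X_test_bow.shape)
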
====================================================================
Before vectorizations:
--------------------------------------------------------------------
(1192,) (1192,)
(298,) (298,)
====================================================================
After vectorizations:
--------------------------------------------------------------------
X_train_bow shape: (1192, 4274) ,
y_train_label_encoded shape: (1192,)
X_test_bow shape: (298, 4274) ,
y_test_label_encoded.shape: (298,)
====================================================================

3.5 TFIDF Representation of Text Data
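
The TF-IDF features are built the same way, with TfidfVectorizer in place of CountVectorizer; a minimal sketch:

tfidf_vectorizer = TfidfVectorizer(min_df=10)              # min_df is an assumption
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)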

====================================================================
Before vectorizations:
--------------------------------------------------------------------
(1192,) (1192,)
(298,) (298,)
====================================================================
After vectorizations:
--------------------------------------------------------------------
X_train_tfidf shape: (1192, 4274) ,
y_train_label_encoded shape: (1192,)
X_test_tfidf shape: (298, 4274) ,
y_test_label_encoded.shape: (298,)
====================================================================

4 Machine Learning Model Building

4.1 Naive Bayes Model

4.1.1 Naive Bayes Model on BOW text Data

4.1.1.1 Hyperparameter Tuning of Naive Bayes Model using GridSearchCV(BOW text feature)

===============================================
Best_hyperparameter_NB_BoW: {'alpha': 0.01}
===============================================
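
A sketch of the grid search behind the result above; the alpha grid is an assumption. The same cell is reused for the TF-IDF features in 4.1.2 by swapping in X_train_tfidf:

param_grid = {'alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(MultinomialNB(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train_bow, y_train_label_encoded)
print('Best_hyperparameter_NB_BoW:', grid.best_params_)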

4.1.1.2 Training of Naive Bayes Model on Best Hyperparameter & Confusion Matrix

4.1.1.3 Classification Report
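
A sketch covering 4.1.1.2 and 4.1.1.3: refit with the best alpha, then compute the confusion matrix and classification report (shown as figures in the original post, printed here):

best_nb = MultinomialNB(alpha=0.01)
best_nb.fit(X_train_bow, y_train_label_encoded)
y_pred = best_nb.predict(X_test_bow)

print('Test accuracy:', accuracy_score(y_test_label_encoded, y_pred))
print(confusion_matrix(y_test_label_encoded, y_pred))
print(classification_report(y_test_label_encoded, y_pred,
                            target_names=label_encoder.classes_))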

4.1.2 Naive Bayes Model on TFIDF text Data

4.1.2.1 Hyperparameter Tuning of Naive Bayes Model using GridSearchCV

====================================================================
Best_hyperparameter_NB_tfidf: {'alpha': 0.1}
====================================================================

4.1.2.2 Training of Naive Bayes Model on Best Hyperparameter & Confusion Matrix

4.1.2.3 Classification Report

4.2 Logistic Regression Model


4.2.1 Hyperparameter Tuning of Logistic Regression Model on Text Data (TFIDF) using GridSearchCV

===================================================================
Best_hyperparameter_LR_tfidf: {'C': 1000}
===================================================================
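
A sketch of the search behind the best C above; the grid and solver settings are assumptions. The fit/evaluation cell for 4.2.2 and 4.2.3 mirrors the Naive Bayes one in 4.1.1.2:

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}     # grid is an assumption
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=5, scoring='accuracy')
grid.fit(X_train_tfidf, y_train_label_encoded)
print('Best_hyperparameter_LR_tfidf:', grid.best_params_)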

4.2.2 Training of Logistic Regression Model on Best Hyperparameter & Confusion Matrix

4.2.3 Classification Report

5 Deep Learning Model Building, Transfer Learning using HuggingFace Transformers

Deep Learning True Story (source)

5.1 Train Test Split 90 10
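
A minimal sketch of the 90:10 split; whether the raw or the cleaned text is fed to BERT is not shown, so the raw Text column and random_state are assumptions:

X_train, X_test, y_train, y_test = train_test_split(
    data['Text'], data['Category'], test_size=0.1,
    stratify=data['Category'], random_state=42)
print(X_train.shape, X_test.shape)   # (1341,) (149,)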

5.2 Dependent Variable One Hot Encoding

==================================================
shape of y_train one-hot encoded : (1341, 5)
--------------------------------------------------
shape of y_test one-hot encoded : (149, 5)
==================================================
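
One way to produce the shapes above (pd.get_dummies; the post may use a different encoder):

y_train_ohe = pd.get_dummies(y_train).values   # one column per category, alphabetical order
y_test_ohe = pd.get_dummies(y_test).values
print('shape of y_train one-hot encoded :', y_train_ohe.shape)
print('shape of y_test one-hot encoded :', y_test_ohe.shape)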

5.3 Tokenization of Text data using “bert-base-uncased”

==================================================
shape of X_train_ids: (1341, 512)
--------------------------------------------------
shape of X_train_mask: (1341, 512)
==================================================
==================================================
shape of X_test_ids: (149, 512)
--------------------------------------------------
shape of X_test_mask: (149, 512)
==================================================
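
A sketch of the tokenization cell behind the shapes above, padding/truncating every article to 512 tokens:

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

def tokenize(texts, max_len=512):
    enc = tokenizer(list(texts), max_length=max_len, truncation=True,
                    padding='max_length', return_tensors='np')
    return enc['input_ids'], enc['attention_mask']

X_train_ids, X_train_mask = tokenize(X_train)
X_test_ids, X_test_mask = tokenize(X_test)
print('shape of X_train_ids:', X_train_ids.shape)   # (1341, 512)
print('shape of X_train_mask:', X_train_mask.shape)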

5.4 Converting the Tokenized IDs and Masks into TensorFlow Tensors

5.5 Preparing Train and Test Datasets for TensorFlow Format

5.6 Making Mini Batches using TensorFlow Functions
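
A sketch covering steps 5.4-5.6. The batch size of 16 is an assumption, consistent with the 84 steps per epoch over 1,341 training examples in the training log further down:

# 5.4: numpy arrays -> TensorFlow tensors
train_ids = tf.convert_to_tensor(X_train_ids)
train_mask = tf.convert_to_tensor(X_train_mask)
train_labels = tf.convert_to_tensor(y_train_ohe)

# 5.5: tf.data.Dataset in ({'input_ids', 'attention_mask'}, labels) format
train_ds = tf.data.Dataset.from_tensor_slices(
    ({'input_ids': train_ids, 'attention_mask': train_mask}, train_labels))

# 5.6: shuffle and mini-batch
train_ds = train_ds.shuffle(len(X_train_ids)).batch(16)

# the test set gets the same treatment (without shuffling)
test_ds = tf.data.Dataset.from_tensor_slices(
    ({'input_ids': tf.convert_to_tensor(X_test_ids),
      'attention_mask': tf.convert_to_tensor(X_test_mask)},
     tf.convert_to_tensor(y_test_ohe))).batch(16)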

5.7 Model Building Using Pretrained bert-base-uncased
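
The summary below is what you get from loading the pretrained encoder:

bert = TFBertModel.from_pretrained('bert-base-uncased')
bert.summary()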

Model: "tf_bert_model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bert (TFBertMainLayer) multiple 109482240
=================================================================
Total params: 109,482,240
Trainable params: 109,482,240
Non-trainable params: 0
_________________________________________________________________

5.7.1 Customizing the output layer for our Problem

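A sketch that reproduces the architecture summarised in 5.7.2. The Bi-LSTM width (768 units per direction) follows from the parameter counts; the dropout rate is an assumption:

bert.trainable = False   # freeze BERT, train only the new head (see the param counts below)

input_ids = tf.keras.layers.Input(shape=(512,), dtype=tf.int32, name='input_ids')
attention_mask = tf.keras.layers.Input(shape=(512,), dtype=tf.int32, name='attention_mask')

bert_output = bert(input_ids, attention_mask=attention_mask)[0]   # (batch, 512, 768)
x = tf.keras.layers.Dropout(0.2)(bert_output)                     # rate is an assumption
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(768))(x)   # -> (batch, 1536)
outputs = tf.keras.layers.Dense(5, activation='softmax', name='outputs')(x)

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
model.summary()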

5.7.2 Model Summary

Model: "model"
____________________________________________________________________
Layer (type) Output Shape Param # Connected to
====================================================================
input_ids (InputLayer) [(None, 512)] 0
____________________________________________________________________
attention_mask (InputLayer) [(None, 512)] 0
____________________________________________________________________
bert (TFBertMainLayer) TFBaseModelOutputWit 109482240 input_ids[0][0]
attention_mask[0][0]
____________________________________________________________________
dropout_37 (Dropout) (None, 512, 768) 0 bert[0][0]
____________________________________________________________________
bidirectional (Bidirectional) (None, 1536) 9443328 dropout_37[0][0]
____________________________________________________________________
outputs (Dense) (None, 5) 7685 bidirectional[0][0]
====================================================================
Total params: 118,933,253
Trainable params: 9,451,013
Non-trainable params: 109,482,240
____________________________________________________________________

5.7.3 Model Compilation & Model Training
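
A sketch of the compile-and-fit cell; the optimizer, learning rate, and checkpoint settings are assumptions consistent with the log below:

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),   # lr is an assumption
              loss='categorical_crossentropy',
              metrics=['accuracy'])

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    'best_model.hdf5', monitor='val_accuracy', save_best_only=True,
    save_weights_only=True, verbose=1)

history = model.fit(train_ds, validation_data=test_ds,
                    epochs=2, callbacks=[checkpoint])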

Epoch 1/2
84/84 [==============================] - 206s 2s/step - loss: 0.3731 - accuracy: 0.8680 - val_loss: 0.0541 - val_accuracy: 0.9799

Epoch 00001: val_accuracy improved from -inf to 0.97987, saving model to best_model.hdf5
Epoch 2/2
84/84 [==============================] - 184s 2s/step - loss: 0.0938 - accuracy: 0.9754 - val_loss: 0.0746 - val_accuracy: 0.9597

Epoch 00002: val_accuracy did not improve from 0.97987

5.7.4 Performance Plotting
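
A minimal sketch of the accuracy curves (loss can be plotted the same way):

plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='val accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()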

6 Comparison of all the Models

+-------------------+---------------------+---------------+
| Vectorizer | Model | Test Accuracy |
+-------------------+---------------------+---------------+
| BOW | Naive_Bayes | 0.94 |
| TFIDF | Naive_Bayes | 0.95 |
| TFIDF | Logistic_Regression | 0.96 |
| Transfer Learning | BERT Embedding | 0.979866 |
+-------------------+---------------------+---------------+
