Text Analysis: NLP Magic!

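This is a blog about NLP and a classification project. NLP is Natural Language Processing, which is in high demand right now. Let's see how to put NLP to use. We will work through the process and the concepts by coding. In this blog I will cover Bag of Words and n-grams along with the Naive Bayes classification model. What makes this blog unique (and also makes it long!) is that I show how to choose the right model based on the dataset we have at hand. So, let's get started.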

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import re
#the code below suppresses warnings (for example, deprecation notices raised as libraries evolve)
import warnings
warnings.filterwarnings('ignore')

#Reading the csv file into the dataframe
df = pd.read_csv('movie.csv')

#Let's look into the first ten records
df.head(10)  
#if no parameter is provided then the head() function will show 5 records, else it will show as many as you will provide

# to get the information about the dataset
df.info()

df['label'].value_counts()

sns.countplot(x = 'label', data = df)
plt.xlabel('Sentiments')

plt.show()

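Now we come to the fun part: processing text. I will first show you the Bag of Words approach. So what is it?
When we work with text data, we do not have the features we see in structured tabular data. So we need some way to derive features from the text. What if we could take each word from a sentence and obtain some kind of measure that tells us whether that word is present in another sentence, and how important it is? That is certainly possible through a process called the "Bag of Words" model. Every sentence in our movie review dataset is treated as a bag of words, and each sentence is therefore called a document. All the documents together make up a corpus.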

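If this sounds confusing, don't worry! This explanation will make things clearer. We first create a dictionary of all the unique words used in the corpus (that is, across all the documents, or reviews, in the dataset). When counting words we do not consider grammatical words such as the, an, is, and so on, because they carry little meaning for understanding the context of the text. We then convert every document (individual review) into a vector that represents the presence of the dictionary words in that particular document. There are three ways to identify the importance of words in a BoW model: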

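Count vector model: this model counts the number of times a word occurs in the whole sentence. It is easier to understand with an example, so suppose we have the following statements:
review1 = 'The movie is very very good'
review2 = 'The movie is disappointing'
In the count vector model, each review becomes a row of counts, with one column per dictionary word.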
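To make the count vectors concrete, here is a minimal sketch that uses only the Python standard library. The English wording of the two reviews and the naive whitespace tokenizer are assumptions for illustration, and stop words are deliberately kept so the columns stay easy to follow.

```python
from collections import Counter

# English renderings of the two example reviews (the exact wording is an assumption)
reviews = [
    "The movie is very very good",
    "The movie is disappointing",
]

# Naive tokenization: lowercase, then split on whitespace
tokenized = [review.lower().split() for review in reviews]

# Dictionary: every unique word in the corpus, sorted for a stable column order
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One count vector per review: how often each dictionary word occurs in it
count_vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    count_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)        # ['disappointing', 'good', 'is', 'movie', 'the', 'very']
for vector in count_vectors:
    print(vector)        # [0, 1, 1, 1, 1, 2] then [1, 0, 1, 1, 1, 0]
```

The word "very" gets a count of 2 in the first review and 0 in the second, which is exactly the information the count vector model keeps.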

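Term frequency model: in this model, the frequency of each word in each document (or sentence) is calculated relative to the total number of words observed in that document. It is calculated as:
TF = number of times the word occurs in the i-th document / total number of words in the i-th document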
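The TF formula can be sketched in a few lines of plain Python. The helper name `term_frequency` and the whitespace tokenizer are hypothetical choices for illustration, not part of any library.

```python
from collections import Counter

def term_frequency(document):
    """Return {word: occurrences / total words} for a single document."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

tf = term_frequency("The movie is very very good")

# "very" occurs 2 times out of 6 words; every other word occurs once
for word, value in tf.items():
    print(f"{word}: {value:.3f}")
```

Because every value is divided by the same document length, the term frequencies of one document always add up to 1.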

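Term frequency-inverse document frequency model: TF-IDF measures how important a word is in a particular sentence. The importance of a word increases in proportion to the number of times it occurs in the document and decreases with how frequently the same word occurs across the whole corpus. It is calculated as:
TF-IDF = TF x ln(1 + N/Ni), where N is the total number of documents in the corpus and Ni is the number of documents that contain word i.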
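Here is a small sketch of the TF-IDF formula exactly as written above, applied to the two example reviews. The review wording, the tokenizer, and the helper name `tf_idf` are assumptions for illustration. Note that scikit-learn's `TfidfVectorizer` uses a slightly different smoothed IDF by default and normalizes each row, so its numbers will not match these.

```python
import math
from collections import Counter

corpus = [
    "The movie is very very good",
    "The movie is disappointing",
]
tokenized = [doc.lower().split() for doc in corpus]
n_docs = len(tokenized)  # N in the formula

# Ni: the number of documents that contain each word
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """TF x ln(1 + N/Ni) for every word in one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        word: (count / total) * math.log(1 + n_docs / doc_freq[word])
        for word, count in counts.items()
    }

scores = [tf_idf(tokens) for tokens in tokenized]
for doc_scores in scores:
    print({word: round(value, 3) for word, value in doc_scores.items()})
```

In the first review, "very" scores higher than "movie": it occurs twice and appears in only one document, while "movie" appears in both.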

from sklearn.feature_extraction.text import CountVectorizer

#initializing the CountVectorizer
count_vector = CountVectorizer()

#creating dictionary of words from the corpus
features = count_vector.fit(df['text'])

#Let's see the feature names extracted by the CountVectorizer
feature_names = features.get_feature_names_out()
feature_names

print('Total Number of features extracted are - ',len(feature_names))

#Let's randomly pickup 10 feature names out of it
random.sample(list(feature_names), 10)  #random.sample needs a sequence; passing a set fails on Python 3.11+

feature_vector = count_vector.transform(df['text'])
feature_vector.shape
(8488, 48618)

feature_vector.getnnz()
1158500

# To get the non-zero value density in the document
feature_vector.getnnz()/(feature_vector.shape[0]*feature_vector.shape[1])
0.0028073307190965642

feature_vector.todense()
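A density of about 0.28% means almost every cell in the matrix is zero, so converting it to a dense array can be expensive. The sketch below is a rough back-of-the-envelope estimate using the shape and non-zero count reported above. The 8-byte count values and 4-byte index entries are assumptions about the storage layout, so treat the results as approximate.

```python
# Figures taken from the outputs shown above
n_rows, n_cols = 8488, 48618
n_nonzero = 1158500

# Assumptions: 8-byte integer counts and 4-byte index entries (CSR layout)
BYTES_PER_VALUE = 8
BYTES_PER_INDEX = 4

density = n_nonzero / (n_rows * n_cols)

# A dense array stores every cell, zero or not
dense_bytes = n_rows * n_cols * BYTES_PER_VALUE

# CSR stores one value and one column index per non-zero cell,
# plus one row pointer per row (and one extra at the end)
sparse_bytes = (n_nonzero * (BYTES_PER_VALUE + BYTES_PER_INDEX)
                + (n_rows + 1) * BYTES_PER_INDEX)

print(f"density: {density:.4%}")
print(f"dense:   {dense_bytes / 1024**3:.2f} GiB")
print(f"sparse:  {sparse_bytes / 1024**2:.2f} MiB")
```

Under these assumptions the dense form needs a few gigabytes while the sparse form needs a few megabytes, which is worth keeping in mind before calling `todense()` or `toarray()` on the full matrix.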

from nltk.corpus import stopwords

#since the reviews are in english, stopwords will be in english that we need to set as below -
all_stopwords = set(stopwords.words('english'))

#this is what the stop words look like -
list(all_stopwords)[:10]

["doesn't",
 "weren't",
 'each',
 "she's",
 'himself',
 'did',
 'about',
 'through',
 'the',
 'should']

count_vector2 = CountVectorizer(stop_words=list(all_stopwords))
feature_names2 = count_vector2.fit(df['text'])

feature_vector2 = count_vector2.transform(df['text'])
feature_vector2.shape
(8488, 48473)

feature_names = feature_names2.get_feature_names_out()
feature_counts = np.sum(feature_vector2.toarray(), axis = 0)

pd.DataFrame(dict(Features = feature_names, Count = feature_counts))

#we will use the re module to go through each document and replace every character that is not an English letter with a space
for word in df.text[:][:10]:
    review = re.sub('[^a-zA-Z]',' ',word)

sentences = []
for word in df.text:
    review = re.sub('[^a-zA-Z]',' ',word)
    review = review.lower()
    sentences.append(review)

count_vector3 = CountVectorizer(stop_words=list(all_stopwords))
feature_names3 = count_vector3.fit(sentences)
feature_vector3 = count_vector3.transform(sentences)
feature_vector3.shape
(8488, 47672)

feature_names = feature_names3.get_feature_names_out()
feature_counts = np.sum(feature_vector3.toarray(), axis = 0)
pd.DataFrame(dict(Features = feature_names, Count = feature_counts))

from nltk.stem.porter import PorterStemmer
#object for porterstemmer is needed
ps = PorterStemmer()

# the sentences are already lowercase; now we drop the stop words and stem each remaining word
# each review is rebuilt as a single string of stemmed words
sentences_stemmed = []
for texts in sentences:
    reviews = [ps.stem(word) for word in texts.split() if not word in all_stopwords]
    sentences_stemmed.append(' '.join(reviews))

#Let's call the Countvectorizer process now 
count_vector4 = CountVectorizer() 
feature_names4 = count_vector4.fit(sentences_stemmed) 
feature_vector4 = count_vector4.transform(sentences_stemmed)

feature_vector4.shape
(8488, 32342)

from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()
sentences_lemma = [] 
for texts in sentences: 
     reviews = [lemma.lemmatize(word) for word in texts.split() if not word in all_stopwords] 
     sentences_lemma.append(' '.join(reviews))

#Let's call the Countvectorizer process now 
count_vector5 = CountVectorizer() 
feature_names5 = count_vector5.fit(sentences_lemma) 
feature_vector5 = count_vector5.transform(sentences_lemma)

feature_vector5.shape
(8488, 42521)

def get_clean_text(df, col):
    sentence = []

    for word in df[col][:]:
        review = re.sub('[^a-zA-Z]',' ',word)
        review = review.lower()
        review = review.split()
        review = [ps.stem(word) for word in review if not word in all_stopwords]
        review = ' '.join(review)
        sentence.append(review)

    return sentence

df['clean_text'] = get_clean_text(df, 'text')
df.head(10)

#Now we need to vectorize it. We will do it in the same way, that is using countvectorizer -
cv = CountVectorizer()
features = cv.fit_transform(df['clean_text'])

#Splitting the dataset into train and test
x = features.toarray()
y = df['label']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.10, random_state = 42)

from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)

y_pred = classifier.predict(x_test)

from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[305 124]
 [194 226]]

round(accuracy_score(y_test, y_pred), 3)
0.625
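The accuracy above can be recomputed by hand from the confusion matrix, which also gives the precision and recall for class 1. This is a minimal sketch in plain Python; it relies on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions
cm = [[305, 124],
      [194, 226]]

true_0_pred_0, true_0_pred_1 = cm[0]
true_1_pred_0, true_1_pred_1 = cm[1]
total = sum(cm[0]) + sum(cm[1])

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy = (true_0_pred_0 + true_1_pred_1) / total

# Precision for class 1: of everything predicted as 1, how much really is 1
precision_1 = true_1_pred_1 / (true_0_pred_1 + true_1_pred_1)

# Recall for class 1: of everything that really is 1, how much was found
recall_1 = true_1_pred_1 / (true_1_pred_0 + true_1_pred_1)

print(total)                   # 849
print(round(accuracy, 3))      # 0.625
print(round(precision_1, 3))   # 0.646
print(round(recall_1, 3))      # 0.538
```

The low recall for class 1 shows where GaussianNB struggles here: it misses 194 of the 420 reviews labelled 1.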

y_pred_train = classifier.predict(x_train)
round(accuracy_score(y_train, y_pred_train), 3)
0.902

from sklearn.naive_bayes import BernoulliNB

classifier2 = BernoulliNB()
classifier2.fit(x_train, y_train)

y_pred2 = classifier2.predict(x_test)
cm = confusion_matrix(y_test, y_pred2)
sns.heatmap(cm, annot = True, fmt='.2f')

round(accuracy_score(y_test, y_pred2), 3)
0.81

y_pred_train2 = classifier2.predict(x_train)
round(accuracy_score(y_train, y_pred_train2), 3)
0.916

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred2))

              precision    recall  f1-score   support

           0       0.77      0.88      0.82       429
           1       0.86      0.74      0.79       420

    accuracy                           0.81       849
   macro avg       0.82      0.81      0.81       849
weighted avg       0.82      0.81      0.81       849

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer() 
features = tfidf.fit_transform(df['clean_text'])
x1 = features.toarray()
x_train, x_test, y_train, y_test = train_test_split(x1, y, test_size = 0.10, random_state = 42)

#Let's use GaussianNB first
classifier = GaussianNB() 
classifier.fit(x_train, y_train)

y_pred = classifier.predict(x_test) 
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True, fmt='.2f')

round(accuracy_score(y_test, y_pred), 3)
0.63

#Now using BernoulliNB
classifier2 = BernoulliNB()
classifier2.fit(x_train, y_train)

y_pred2 = classifier2.predict(x_test)
cm = confusion_matrix(y_test, y_pred2)
sns.heatmap(cm, annot = True, fmt='.2f')

round(accuracy_score(y_test, y_pred2), 3)
0.81

y_pred_train2 = classifier2.predict(x_train)
round(accuracy_score(y_train, y_pred_train2), 3)
0.916

print(classification_report(y_test, y_pred2))

              precision    recall  f1-score   support

           0       0.77      0.88      0.82       429
           1       0.86      0.74      0.79       420

    accuracy                           0.81       849
   macro avg       0.82      0.81      0.81       849
weighted avg       0.82      0.81      0.81       849

tfidf2 = TfidfVectorizer(ngram_range=(1,2), max_features=5000)
feature2 = tfidf2.fit_transform(df['clean_text'])

x = feature2.toarray()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.10, random_state = 42)
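The `ngram_range=(1,2)` setting used above asks the vectorizer for single words (unigrams) and pairs of adjacent words (bigrams). The sketch below shows the idea in plain Python. The helper `ngrams` and the sample sentence are made up for illustration, and it does not reproduce the vectorizer's own tokenization or the `max_features` cut-off.

```python
def ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams with lengths from n_min to n_max."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n words across the sentence
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

example_grams = ngrams("movie really good")
print(example_grams)
# ['movie', 'really', 'good', 'movie really', 'really good']
```

Bigrams keep a little word order that single words lose, at the cost of many more candidate features, which is why a cap such as `max_features` is useful.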

#using BernoulliNB since it has performed best so far

classifier = BernoulliNB()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
cm = confusion_matrix(y_test, y_pred) 
sns.heatmap(cm, annot = True, fmt='.2f')

accuracy_score(y_test, y_pred)
0.8244994110718492

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.84      0.83       429
           1       0.83      0.81      0.82       420

    accuracy                           0.82       849
   macro avg       0.82      0.82      0.82       849
weighted avg       0.82      0.82      0.82       849

#removing the words of 1 letter or 0 letter 
sentences_clean = [] 
for listed in df['clean_text'].str.split(' '): 
     review = [word for word in listed if len(word) != 1 and len(word) != 0] 
     review = ' '.join(review) 
     sentences_clean.append(review)

#removing words made of a single repeated letter (e.g. 'aaa')
def allCharactersSame(s) :
     n = len(s)
     for i in range(1, n) :
         if s[i] != s[0] :
             return False
     #only reached when no character differed from the first one
     return True

cleaned = []
for sentences in sentences_clean: 
     word_list = [] 
     for word in sentences.split(' '): 
         if allCharactersSame(word): 
             pass 
         else: 
             word_list.append(word) 
     word_list = ' '.join(word_list) 
     cleaned.append(word_list)
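The two clean-up passes above (dropping empty or one-letter words, then dropping words made of one repeated letter) can be folded into a single check. This is a compact alternative sketch, not the code used for the results in this post; the helper names `is_noise` and `drop_noise_words` are hypothetical.

```python
def is_noise(word):
    """True for empty strings, single letters, and words like 'aaa' or 'zz'."""
    # A word with at most one distinct character carries no useful signal here
    return len(set(word)) <= 1

def drop_noise_words(sentence):
    """Remove noise words from a space-separated sentence."""
    return " ".join(word for word in sentence.split(" ") if not is_noise(word))

cleaned_example = drop_noise_words("aaa movi  b great zz end")
print(cleaned_example)  # movi great end
```

Applied to the data, this would be `df['clean_text'].apply(drop_noise_words)`, assuming `clean_text` holds space-separated strings as above.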

df['clean_text'] = cleaned
df.head()

tfidf = TfidfVectorizer(ngram_range=(1,2), max_features=10000)
feature = tfidf.fit_transform(df['clean_text'])
x = feature.toarray()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.10, random_state = 42)

#using BernoulliNB since it has performed best so far
classifier = BernoulliNB()
classifier.fit(x_train, y_train)

y_pred = classifier.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True, fmt='.2f')

accuracy_score(y_test, y_pred)
0.823321554770318

y_pred1 = classifier.predict(x_train) 
#training accuracy calculation
accuracy_score(y_train, y_pred1)
0.8984160230396648

#This is something I have written for testing
reviews = [
    "I didn't liked the movie. It was so boring.",
    'I am not happy that the movie ended so badly',
    "The movie is terrific. It's a must watch for every one."
]

dataframe = pd.DataFrame({'Text':reviews})
dataframe

#cleaning up the data and applying stemming
test_sent = get_clean_text(dataframe, 'Text')

#converting into vectors using last trained n-grams model
x1 = tfidf.transform(test_sent).toarray()

#predicting unseen data using last trained classifier
y_pred_res = classifier.predict(x1)

dataframe['predictions'] = y_pred_res.tolist()
dataframe