Naive Bayesian for Text Classification (MLE, Gaussian Naive Bayesian)
Published: 2019-05-24


Naive Bayesian is a common baseline for text classification problems.

A spam email example: we count the frequency of the words that occur in spam and normal emails.

If an email contains words such as "advertisement", "purchase", or "link", we may consider it spam.

But sometimes the words mentioned above also appear in normal emails, so the problem is more complicated.

There are two steps for Naive Bayesian classification:

1) Training

Count each word in the vocabulary, and calculate the conditional probability of each word given the spam/normal class, e.g.

p(advertisement | spam)   p(advertisement | normal)

2) Prediction

 

Training:

Suppose the training set contains 24 normal emails and 12 spam emails, each with 10 words. Then:

p(purchase | normal) = 3 / (24 * 10) = 1 / 80

p(purchase | spam) = 7 / (12 * 10) = 7 / 120

p(item | normal) = 4 / 240 = 1 / 60

p(item | spam) = 4 / 120 = 1 / 30

p(not | normal) = 4 / 240 = 1 / 60

p(not | spam) = 3 / 120 = 1 / 40

p(advertisement | normal) = 5 / 240 = 1 / 48

p(advertisement | spam) = 4 / 120 = 1 / 30

p(this | normal) = 3 / 240 = 1 / 80

p(this | spam) = 0 / 120 = 0

 

Prior Probability

Probability of a normal email among all emails: 24 / 36 = 2 / 3

Probability of a spam email among all emails: 12 / 36 = 1 / 3

 

We need to calculate the conditional probability of spam/normal given the content of the email: P(spam | content) and P(normal | content).

Bayesian Theorem

P(Y | X) = P(X | Y) * P(Y) / P(X)

P(X | Y): likelihood

P(Y): prior

P(X): normalization

P(Y | X): posterior

Prediction:

Conditional independence (the naive assumption):  P(x, y | z) = P(x | z) * P(y | z)
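For example, for a hypothetical email containing only the words "purchase" and "advertisement", the conditional independence assumption together with the counts above gives:

P(spam | email) ∝ P(spam) * P(purchase | spam) * P(advertisement | spam) = 1/3 * 7/120 * 1/30 ≈ 6.5e-4

P(normal | email) ∝ P(normal) * P(purchase | normal) * P(advertisement | normal) = 2/3 * 1/80 * 1/48 ≈ 1.7e-4

Since 6.5e-4 > 1.7e-4, this email is classified as spam.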

 

But the result can be abnormal: P(this | spam) = 0, so any email containing "this" gets probability 0 for the spam class no matter what other words it contains.

We need to apply some smoothing.

Add-one smoothing:
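With add-one smoothing (here V denotes the vocabulary size, which the example above does not specify), each conditional probability becomes:

P(w | spam) = (count of w in spam emails + 1) / (total words in spam emails + V)

For example, P(this | spam) = (0 + 1) / (120 + V), which is small but no longer zero.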

 

A remaining problem: multiplying many small probabilities together can cause numerical underflow.

 

To avoid underflow, we can take the log:

log(p1 * p2 * p3) = log p1 + log p2 + log p3
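A minimal numpy sketch of the issue (the per-word probabilities are made-up values for illustration):

import numpy as np

probs = np.full(1000, 1e-5)   # hypothetical per-word probabilities
print(np.prod(probs))         # 0.0 -- the raw product underflows to zero
print(np.sum(np.log(probs)))  # about -11512.9 -- the log-sum stays usable for comparing classes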

 

Naive Bayesian sample in Python:

import pandas as pd
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt

# read spam.csv
df = pd.read_csv("spam.csv", encoding = 'latin')
df.head()

 

     v1                                                 v2  Unnamed: 2  Unnamed: 3  Unnamed: 4
0   ham  Go until jurong point, crazy.. Available only ...         NaN         NaN         NaN
1   ham                      Ok lar... Joking wif u oni...         NaN         NaN         NaN
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...         NaN         NaN         NaN
3   ham  U dun say so early hor... U c already then say...         NaN         NaN         NaN
4   ham  Nah I don't think he goes to usf, he lives aro...         NaN         NaN         NaN

 

Rename some columns:

# rename the columns v1 and v2
df.rename(columns = {'v1' : 'Label', 'v2' : 'Text'}, inplace = True)
df.head()
  Label                                               Text  Unnamed: 2  Unnamed: 3  Unnamed: 4
0   ham  Go until jurong point, crazy.. Available only ...         NaN         NaN         NaN
1   ham                      Ok lar... Joking wif u oni...         NaN         NaN         NaN
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...         NaN         NaN         NaN
3   ham  U dun say so early hor... U c already then say...         NaN         NaN         NaN
4   ham  Nah I don't think he goes to usf, he lives aro...         NaN         NaN         NaN

Map Label to number

# map 'ham' and 'spam' to 0 and 1
df['numLabel'] = df['Label'].map({'ham' : 0, 'spam' : 1})
df.head()

 

  Label                                               Text  Unnamed: 2  Unnamed: 3  Unnamed: 4  numLabel
0   ham  Go until jurong point, crazy.. Available only ...         NaN         NaN         NaN         0
1   ham                      Ok lar... Joking wif u oni...         NaN         NaN         NaN         0
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...         NaN         NaN         NaN         1
3   ham  U dun say so early hor... U c already then say...         NaN         NaN         NaN         0
4   ham  Nah I don't think he goes to usf, he lives aro...         NaN         NaN         NaN         0

Count the number of spam/ham emails

# count number of ham and spam
print ('# of ham : ', len(df[df.numLabel == 0]), ' # of spam: ', len(df[df.numLabel == 1]))
print ('# of total samples: ', len(df))
# of ham :  4825  # of spam:  747
# of total samples:  5572

Plot a histogram of the text lengths:

# count the length of each text, and plot a histogram
text_lengths = [len(df.loc[i, 'Text']) for i in range(len(df))]
plt.hist(text_lengths, 100, facecolor = 'blue', alpha = 0.5)
plt.xlim([0, 200])
plt.show()

 

 

# import English vocabulary
from sklearn.feature_extraction.text import CountVectorizer

# construct word vectors (based on the frequency of the words)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df.Text)
y = df.numLabel

# split the data into train and test data sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 100)
print('# of samples in the train data set: ', X_train.shape[0], '# of samples in test data set: ', X_test.shape[0])

Output:

# of samples in the train data set:  4457 # of samples in test data set: 1115
# use the Naive Bayesian for model training
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

clf = MultinomialNB(alpha = 1.0, fit_prior = True)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("accuracy on test data: ", accuracy_score(y_test, y_pred))
accuracy on test data:  0.97847533632287
# print confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred, labels = [0, 1])
array([[956,  14],
       [ 10, 135]])

 

Summary:

Maximum likelihood estimation for the parameters of Naive Bayesian:

Unconstrained optimization problem

Constrained optimization problem

 

Maximum likelihood estimation for Naive Bayesian

We introduce parameters θ and π for our objective function.

π is the vector of prior probabilities of each class, with dimension K x 1.

θ is the matrix that stores the word probabilities, with one row per word and one column per class, that is θij = p(wi | yj), i = 1, ..., V, j = 1, ..., K,

where V is the size of the vocabulary and K is the number of classes.

Construct the Lagrangian (with a Lagrange multiplier for the constraint) and solve for π:
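A sketch of the standard derivation, with N_j denoting the number of training documents in class j and N the total number of documents:

$$\max_{\pi}\ \sum_{j=1}^{K} N_j \log \pi_j \quad \text{s.t.} \quad \sum_{j=1}^{K} \pi_j = 1$$

$$L(\pi, \lambda) = \sum_{j=1}^{K} N_j \log \pi_j + \lambda \Big(1 - \sum_{j=1}^{K} \pi_j\Big), \qquad \frac{\partial L}{\partial \pi_j} = \frac{N_j}{\pi_j} - \lambda = 0 \;\Rightarrow\; \pi_j = \frac{N_j}{\lambda}$$

Summing over j and using the constraint gives λ = N, so π_j = N_j / N.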

 

Solve for θ:
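Applying the same Lagrangian argument per class gives the standard result (with n_ij denoting the number of times word w_i occurs in documents of class j):

$$\theta_{ij} = \frac{n_{ij}}{\sum_{i'=1}^{V} n_{i'j}}$$

With add-one smoothing this becomes θ_ij = (n_ij + 1) / (Σ_i' n_i'j + V).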

 

Gaussian Naive Bayesian for continuous random variables

We can use the Gaussian distribution to represent such a random variable.

The Gaussian distribution has convenient properties: the sum of two independent Gaussian random variables is Gaussian, and the product of two Gaussian densities is proportional to a Gaussian density;

the conditional distribution of jointly Gaussian random variables is also Gaussian.
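For example, for independent X ~ N(μ1, σ1²) and Y ~ N(μ2, σ2²):

$$X + Y \sim N(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2)$$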

Central limit theorem

In probability theory, the central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a "bell curve") even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

Example:

Procedure:

1) For each class c, we select all the samples xi that belong to c and fit a Gaussian distribution to them.

We fit an independent Gaussian distribution for each class (and for each feature).

2) Then for any new sample x, we can evaluate P(x | y = c) for each class c and predict the most probable class.

Example:

There are two features, age and income, which are continuous random variables.

We choose a Gaussian distribution to fit each of them per class, as in the sketch below.
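A minimal sketch using scikit-learn's GaussianNB (the age/income values and labels below are made up for illustration):

import numpy as np
from sklearn.naive_bayes import GaussianNB

# hypothetical training data: each row is [age, income]; 0 and 1 are two arbitrary classes
X = np.array([[25, 30000], [32, 45000], [47, 82000],
              [51, 90000], [38, 52000], [29, 36000]])
y = np.array([0, 0, 1, 1, 1, 0])

clf = GaussianNB()   # fits one Gaussian per feature per class
clf.fit(X, y)
print(clf.predict([[40, 60000]]))        # predicted class for a new sample
print(clf.predict_proba([[40, 60000]]))  # class posterior probabilities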

In the real world, if we have many continuous features, we usually do not choose the Naive Bayesian model; we use logistic regression, XGBoost, etc. instead.

But Naive Bayesian remains a baseline for text classification.

 

A Python implementation of Naive Bayesian

reference:

# author sesiria 2019
# a simple Naive Bayesian classifier implementation
import numpy as np

# ********************** definition of the Naive Bayesian ***************************
class NaiveBayesianClassifier:
    def __init__(self):
        pass

    # currently we only support numeric labels.
    def fit(self, data, label):
        classes = np.unique(label)
        nWords = data.shape[1]
        # matrix to store the probability of each word in each category.
        self.paramMatrix = np.zeros([nWords, len(classes)], dtype = np.float64)
        self.priorVector = np.zeros(len(classes), dtype = np.float64)
        self.labels = []  # class labels
        for i in range(len(classes)):
            c = classes[i]
            nCurrentSize = len(label[label == c])
            # build category table
            self.labels.append(c)
            # calculate the prior probability of class c
            self.priorVector[i] = nCurrentSize / len(label)
            # calculate the paramMatrix with add-one smoothing
            count = np.sum(data[label == c, :], axis = 0) + 1
            count = count / (nCurrentSize + nWords)
            self.paramMatrix[:, i] = count

    def predict(self, test):
        if len(test.shape) == 1:
            return self.getCategory(test)
        predictions = np.zeros(test.shape[0])
        for i in range(test.shape[0]):
            predictions[i] = self.getCategory(test[i, :])
        return predictions

    def getCategory(self, test):
        assert test.shape[0] == self.paramMatrix.shape[0]
        p = np.zeros(len(self.labels))
        for idx in range(len(self.labels)):
            # we use the log trick to avoid underflow;
            # the log prior is added so the class prior is taken into account
            p[idx] = np.log(self.priorVector[idx]) + np.sum(np.log(self.paramMatrix[:, idx]) * test)
        return self.labels[np.argmax(p)]

# ************************** unit test function ************************************
def sanity_check():
    X = np.array([[1, 1, 1, 0, 0, 0, 0, 0, 0],
                  [0, 0, 0, 1, 1, 0, 0, 0, 0],
                  [1, 0, 1, 0, 0, 1, 0, 0, 0],
                  [0, 0, 0, 0, 0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0, 0, 0, 1, 1],
                  [0, 0, 1, 1, 0, 0, 0, 0, 0]
                  ])
    Y = np.array([1, 1, 1, 0, 0, 0])
    X_test = np.array([[1, 0, 0, 1, 2, 0, 1, 0, 0],
                       [1, 0, 0, 0, 0, 0, 1, 1, 0]
                       ])
    clf = NaiveBayesianClassifier()
    clf.fit(X, Y)
    result = clf.predict(X_test)
    print(result)

if __name__ == '__main__':
    sanity_check()

 

Reprinted from: http://jwwci.baihongyu.com/
