Naive Bayes is a baseline for text classification problems.
Take spam filtering as an example. We count how often each word occurs in spam emails and in normal emails.
If words such as "ad", "purchase", or "link" appear, we could consider the email spam.
But these words sometimes appear in normal emails too, so the problem is not that simple.
Naive Bayes has two steps:
1) Training
Count each word in the vocabulary, and estimate the probability of each word given a spam email and given a normal email, for example:
p(advertisement | spam), p(advertisement | normal)
2) Prediction
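Training (the denominators correspond to 24 normal emails and 12 spam emails of 10 words each):
p(purchase | normal) = 3 / (24 * 10) = 1 / 80
p(purchase | spam) = 7 / (12 * 10) = 7 / 120
p(item | normal) = 4 / 240 = 1 / 60
p(item | spam) = 4 / 120 = 1 / 30
p(not | normal) = 4 / 240 = 1 / 60
p(not | spam) = 3 / 120 = 1 / 40
p(advertisement | normal) = 5 / 240 = 1 / 48
p(advertisement | spam) = 4 / 120 = 1 / 30
p(this | normal) = 3 / 240 = 1 / 80
p(this | spam) = 0 / 120 = 0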
Prior probability:
Probability that an email is normal: 24 / 36 = 2 / 3
Probability that an email is spam: 12 / 36 = 1 / 3
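The training step above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the word names are English renderings of the five example words (purchase, item, not, advertisement, this), and the figure of 10 words per email is an assumption read off the denominators 24 * 10 and 12 * 10.

```python
from fractions import Fraction

# Counts taken from the worked example; WORDS_PER_EMAIL is an assumption
# inferred from the denominators (24 * 10 and 12 * 10).
N_NORMAL_EMAILS, N_SPAM_EMAILS, WORDS_PER_EMAIL = 24, 12, 10

# word -> (occurrences in normal emails, occurrences in spam emails)
word_counts = {
    "purchase": (3, 7),
    "item": (4, 4),
    "not": (4, 3),
    "advertisement": (5, 4),
    "this": (3, 0),
}

normal_total = N_NORMAL_EMAILS * WORDS_PER_EMAIL  # 240 words in normal emails
spam_total = N_SPAM_EMAILS * WORDS_PER_EMAIL      # 120 words in spam emails

# Likelihoods p(word | class) as exact fractions
likelihood = {
    word: {"normal": Fraction(n, normal_total), "spam": Fraction(s, spam_total)}
    for word, (n, s) in word_counts.items()
}

# Priors p(class): share of each class among all emails
n_emails = N_NORMAL_EMAILS + N_SPAM_EMAILS
prior = {
    "normal": Fraction(N_NORMAL_EMAILS, n_emails),
    "spam": Fraction(N_SPAM_EMAILS, n_emails),
}

for word, p in likelihood.items():
    print(f"p({word} | normal) = {p['normal']},  p({word} | spam) = {p['spam']}")
print(f"p(normal) = {prior['normal']},  p(spam) = {prior['spam']}")
```

Using `Fraction` keeps the results exact, so they can be compared directly with the hand-computed values.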
We need to calculate the conditional probability of spam or normal given the content of the email: P(spam | content) and P(normal | content).
Bayes' Theorem
P(Y | X) = P(X | Y) * P(Y) / P(X)
P(X | Y): likelihood
P(Y): prior
P(X): normalization
P(Y | X): posterior
Prediction:
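Conditional independence: P(x, y | z) = P(x | z) * P(y | z)
However, the result breaks down because P(this | spam) = 0: a single zero factor drives the whole product for the spam class to zero, whatever the other words are.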
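The effect of the zero probability can be seen in a short sketch. It scores a hypothetical three-word email (not taken from the original example) by multiplying the prior with the per-word likelihoods, which is what the conditional independence assumption allows. The word names are English renderings of the example words.

```python
from fractions import Fraction
from math import prod

# Values copied from the training step of the worked example
prior = {"normal": Fraction(2, 3), "spam": Fraction(1, 3)}
likelihood = {
    "purchase": {"normal": Fraction(1, 80), "spam": Fraction(7, 120)},
    "advertisement": {"normal": Fraction(1, 48), "spam": Fraction(1, 30)},
    "this": {"normal": Fraction(1, 80), "spam": Fraction(0)},
}


def unnormalized_posterior(words, label):
    """Return p(label) * product of p(word | label) over the words."""
    return prior[label] * prod(likelihood[w][label] for w in words)


email = ["purchase", "this", "advertisement"]  # hypothetical test email
scores = {label: unnormalized_posterior(email, label) for label in prior}
print(scores)
```

Even though "purchase" and "advertisement" are both more likely under spam, the spam score is exactly zero because of the single factor p(this | spam) = 0.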
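We therefore need to smooth the estimates.
Add-one smoothing: add 1 to every word count, so that P(w | c) = (count(w, c) + 1) / (N_c + V), where N_c is the total number of words in class c and V is the size of the vocabulary.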
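A minimal sketch of add-one smoothing applied to the worked example follows. The helper name `add_one` is hypothetical, and the vocabulary size of 5 is an assumption (only the five example words are counted); a real vocabulary would be much larger.

```python
from fractions import Fraction


def add_one(count, class_total, vocab_size):
    """Add-one estimate: (count + 1) / (total words in class + vocabulary size)."""
    return Fraction(count + 1, class_total + vocab_size)


VOCAB_SIZE = 5    # assumption: the vocabulary is just the five example words
SPAM_TOTAL = 120  # total number of words in spam emails, from the example

p_this_spam = add_one(0, SPAM_TOTAL, VOCAB_SIZE)      # was 0 / 120 before smoothing
p_purchase_spam = add_one(7, SPAM_TOTAL, VOCAB_SIZE)  # was 7 / 120 before smoothing
print(p_this_spam, p_purchase_spam)
```

The unseen word now gets a small non-zero probability, so it no longer wipes out the product for the spam class.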
A remaining problem: multiplying many small probabilities can underflow.
To avoid underflow, we can work with logarithms:
log(p1 * p2 * p3) = log p1 + log p2 + log p3
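The underflow is easy to reproduce. The sketch below uses a made-up list of 100 likelihoods of 1e-5 each (hypothetical numbers, chosen only to force the issue) and compares the direct product with the sum of logarithms.

```python
import math

probs = [1e-5] * 100  # hypothetical: 100 words, each with likelihood 1e-5

# Direct product: the true value 1e-500 is below the smallest positive float64
product = 1.0
for p in probs:
    product *= p
print(product)

# Sum of logs: the same quantity on a log scale, comfortably representable
log_score = sum(math.log(p) for p in probs)
print(log_score)
```

Because the logarithm is monotonic, comparing log scores across classes selects the same class as comparing the products would.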
Naive Bayes sample in Python:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# read spam.csv
df = pd.read_csv("spam.csv", encoding='latin')
df.head()

     v1                                                 v2  Unnamed: 2  Unnamed: 3  Unnamed: 4
0   ham  Go until jurong point, crazy.. Available only ...         NaN         NaN         NaN
1   ham                      Ok lar... Joking wif u oni...         NaN         NaN         NaN
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...         NaN         NaN         NaN
3   ham  U dun say so early hor... U c already then say...         NaN         NaN         NaN
4   ham  Nah I don't think he goes to usf, he lives aro...         NaN         NaN         NaN
Rename some columns:
# rename the columns v1 and v2
df.rename(columns={'v1': 'Label', 'v2': 'Text'}, inplace=True)
df.head()

   Label                                               Text  Unnamed: 2  Unnamed: 3  Unnamed: 4
0    ham  Go until jurong point, crazy.. Available only ...         NaN         NaN         NaN
1    ham                      Ok lar... Joking wif u oni...         NaN         NaN         NaN
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...         NaN         NaN         NaN
3    ham  U dun say so early hor... U c already then say...         NaN         NaN         NaN
4    ham  Nah I don't think he goes to usf, he lives aro...         NaN         NaN         NaN
Map the labels to numbers:
# map 'ham' and 'spam' to 0 and 1
df['numLabel'] = df['Label'].map({'ham': 0, 'spam': 1})
df.head()

   Label                                               Text  Unnamed: 2  Unnamed: 3  Unnamed: 4  numLabel
0    ham  Go until jurong point, crazy.. Available only ...         NaN         NaN         NaN         0
1    ham                      Ok lar... Joking wif u oni...         NaN         NaN         NaN         0
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...         NaN         NaN         NaN         1
3    ham  U dun say so early hor... U c already then say...         NaN         NaN         NaN         0
4    ham  Nah I don't think he goes to usf, he lives aro...         NaN         NaN         NaN         0
Count the number of spam/ham emails:
# count the number of ham and spam samples
print('# of ham : ', len(df[df.numLabel == 0]), ' # of spam: ', len(df[df.numLabel == 1]))
print('# of total samples: ', len(df))

# of ham : 4825 # of spam: 747
# of total samples: 5572
Plot the histogram of text lengths:
# compute the length of each text, and plot a histogram
text_lengths = [len(df.loc[i, 'Text']) for i in range(len(df))]
plt.hist(text_lengths, 100, facecolor='blue', alpha=0.5)
plt.xlim([0, 200])
plt.show()
# import the vectorizer that builds the vocabulary
from sklearn.feature_extraction.text import CountVectorizer

# construct word vectors (based on the frequency of each word)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df.Text)
y = df.numLabel

# split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=100)
print('# of samples in the train data set: ', X_train.shape[0], '# of samples in test data set: ', X_test.shape[0])

Output:
# of samples in the train data set: 4457 # of samples in test data set: 1115
# use Naive Bayes for model training
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

clf = MultinomialNB(alpha=1.0, fit_prior=True)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("accuracy on test data: ", accuracy_score(y_test, y_pred))

accuracy on test data: 0.97847533632287
# print the confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred, labels=[0, 1])

array([[956,  14],
       [ 10, 135]])
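The confusion matrix carries more information than the accuracy alone. As a rough sketch, the numbers printed above can be turned into precision and recall for the spam class, using scikit-learn's convention that rows are true labels and columns are predicted labels. The variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix reported above, with labels = [0 (ham), 1 (spam)]
cm = np.array([[956, 14],
               [10, 135]])

# Row-major order for a 2 x 2 matrix: TN, FP, FN, TP (spam is the positive class)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of the emails flagged as spam, the share that is spam
recall = tp / (tp + fn)     # of the true spam emails, the share that was caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy recomputed this way matches the value printed by `accuracy_score`, which is a useful consistency check on the two outputs.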
Summary:
Maximum likelihood estimation for the parameters of Naive Bayes:
Unconstrained optimization problem
Constrained optimization problem
Maximum likelihood estimation for Naive Bayes
We introduce the parameters θ and π into the objective function.
π is the K x 1 vector of prior probabilities, one entry per class.
θ is the V x K matrix of word probabilities, with one row per word and one column per class: θij = p(wi | yj), i = 1,...,V, j = 1,...,K,
where V is the size of the vocabulary and K is the number of classes.
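To make the shapes of π and θ concrete, here is a small NumPy sketch on made-up data (4 documents, V = 3 words, K = 2 classes; all numbers are hypothetical). It uses the relative-frequency estimates, which are the standard closed-form maximum likelihood solution without smoothing.

```python
import numpy as np

# Hypothetical toy data: rows are documents, columns are word counts (V = 3)
X = np.array([[2, 1, 0],
              [1, 0, 1],
              [0, 3, 1],
              [0, 1, 2]])
y = np.array([0, 0, 1, 1])  # class label of each document (K = 2)

K, V = 2, X.shape[1]

# pi[j] = share of documents in class j, shape (K,)
pi = np.array([(y == j).mean() for j in range(K)])

# counts[i, j] = occurrences of word i in documents of class j, shape (V, K)
counts = np.stack([X[y == j].sum(axis=0) for j in range(K)], axis=1)

# theta[i, j] = p(w_i | y_j); each column is normalized to sum to 1
theta = counts / counts.sum(axis=0, keepdims=True)

print(pi)
print(theta)
```

Each column of `theta` is a probability distribution over the vocabulary, which is exactly the constraint the optimization has to respect.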
Construct the Lagrangian and solve for π
Solve for θ
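The original equations are not available here, so the following is one standard way to write the derivation, offered as a sketch. The notation is an assumption: N is the number of training documents, N_j the number of documents in class j, and c_ij the number of occurrences of word w_i in documents of class j; λ and μ_j are the Lagrange multipliers for the constraints on π and on each column of θ.

```latex
% Assumed notation: N documents, N_j in class j,
% c_{ij} = occurrences of word w_i in documents of class j.
\begin{aligned}
\ell(\theta,\pi) &= \sum_{j=1}^{K} N_j \log \pi_j
  + \sum_{j=1}^{K}\sum_{i=1}^{V} c_{ij}\log\theta_{ij},
  \qquad \text{s.t. } \sum_{j=1}^{K}\pi_j = 1,\quad
  \sum_{i=1}^{V}\theta_{ij} = 1 \ \ \forall j \\[4pt]
\mathcal{L} &= \ell(\theta,\pi)
  + \lambda\Big(\sum_{j=1}^{K}\pi_j - 1\Big)
  + \sum_{j=1}^{K}\mu_j\Big(\sum_{i=1}^{V}\theta_{ij} - 1\Big) \\[4pt]
\frac{\partial\mathcal{L}}{\partial\pi_j} &= \frac{N_j}{\pi_j} + \lambda = 0
  \ \Rightarrow\ \pi_j = -\frac{N_j}{\lambda},
  \qquad \sum_{j}\pi_j = 1 \ \Rightarrow\ \lambda = -N,
  \qquad \pi_j = \frac{N_j}{N} \\[4pt]
\frac{\partial\mathcal{L}}{\partial\theta_{ij}} &= \frac{c_{ij}}{\theta_{ij}} + \mu_j = 0
  \ \Rightarrow\ \theta_{ij} = -\frac{c_{ij}}{\mu_j},
  \qquad \sum_{i}\theta_{ij} = 1 \ \Rightarrow\ \mu_j = -\sum_{i'=1}^{V} c_{i'j},
  \qquad \theta_{ij} = \frac{c_{ij}}{\sum_{i'=1}^{V} c_{i'j}}
\end{aligned}
```

Both solutions are relative frequencies: π_j is the share of documents in class j, and θ_ij is the share of word w_i among all words seen in class j.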
Gaussian Naive Bayes for continuous random variables
We can use a Gaussian distribution to model a continuous random variable.
The Gaussian distribution has convenient properties: the sum of two independent Gaussian random variables is Gaussian, and the product of two Gaussian densities is proportional to a Gaussian density.
The conditional distribution of jointly Gaussian variables is also Gaussian.
Central limit theorem
In probability theory, the central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a "bell curve") even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.
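A quick simulation illustrates the statement. This sketch sums 30 independent Uniform(0, 1) draws many times and standardizes the sums; the number of terms, the number of trials, and the seed are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for repeatability
n_terms, n_trials = 30, 100_000

# Each row holds n_terms independent Uniform(0, 1) draws; sum across each row
sums = rng.uniform(0.0, 1.0, size=(n_trials, n_terms)).sum(axis=1)

# Standardize with the exact mean n/2 and variance n/12 of the sum
z = (sums - n_terms / 2) / np.sqrt(n_terms / 12)

# For a standard normal, about 68% of the mass lies within one standard deviation
within_one_sigma = np.mean(np.abs(z) < 1.0)
print(z.mean(), z.std(), within_one_sigma)
```

Although a single uniform draw looks nothing like a bell curve, the standardized sums have mean close to 0, standard deviation close to 1, and roughly the one-sigma coverage expected of a normal distribution.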
Example:
Procedure:
1) For each class c, select all of the samples xi that belong to c, then fit a Gaussian distribution to them.
We fit an independent Gaussian distribution for each class.
2) Then we can evaluate P(xi | y = c) for any xi.
Example:
There are two features, age and income, which are continuous random variables.
We fit a Gaussian distribution to each of them.
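The procedure can be sketched in NumPy for the age and income example. The data below are made up, the function names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical, and the small constant added to the variance is an extra safeguard against division by zero rather than part of the method described here.

```python
import numpy as np


def fit_gaussian_nb(X, y):
    """Return {class: (prior, per-feature mean, per-feature variance)}."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]  # all samples that belong to class c
        # 1e-9 is an assumed variance floor to avoid division by zero
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return params


def predict_gaussian_nb(params, x):
    """Pick the class with the largest log prior + sum of log Gaussian densities."""
    best_class, best_score = None, -np.inf
    for c, (prior, mean, var) in params.items():
        log_density = -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        score = np.log(prior) + log_density.sum()
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical toy data: columns are age (years) and income (thousands)
X = np.array([[25, 30], [30, 35], [28, 32],
              [50, 90], [55, 100], [48, 85]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_nb(X, y)
young_low = predict_gaussian_nb(params, np.array([27.0, 31.0]))
older_high = predict_gaussian_nb(params, np.array([52.0, 95.0]))
print(young_low, older_high)
```

Summing the per-feature log densities is where the naive independence assumption enters: age and income are treated as independent given the class.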
In practice, if we have many continuous features, we would not choose Naive Bayes; we would choose logistic regression, XGBoost, etc.
But Naive Bayes remains a baseline for text classification.
A Python implementation of Naive Bayes
reference:
# author sesiria 2019
# a simple Naive Bayesian classifier implementation
import numpy as np


# ********************** definition of the Naive Bayesian ***************************
class NaiveBayesianClassifier:
    def __init__(self):
        pass

    # currently only numeric labels are supported.
    def fit(self, data, label):
        classes = np.unique(label)
        nWords = data.shape[1]
        # matrix to store the probability of each word in each category.
        self.paramMatrix = np.zeros([nWords, len(classes)], dtype=np.float64)
        self.priorVector = np.zeros(len(classes), dtype=np.float64)
        self.labels = []  # class labels
        for i in range(len(classes)):
            c = classes[i]
            nCurrentSize = len(label[label == c])
            # remember which class column i belongs to
            self.labels.append(c)
            # prior: share of training samples that belong to class c
            self.priorVector[i] = nCurrentSize / len(label)
            # word likelihoods with add-one smoothing; the denominator is the
            # total word count in class c plus the vocabulary size
            count = np.sum(data[label == c, :], axis=0) + 1
            count = count / (np.sum(data[label == c, :]) + nWords)
            self.paramMatrix[:, i] = count

    def predict(self, test):
        if len(test.shape) == 1:
            return self.getCategory(test)
        predictions = np.zeros(test.shape[0])
        for i in range(test.shape[0]):
            predictions[i] = self.getCategory(test[i, :])
        return predictions

    def getCategory(self, test):
        assert test.shape[0] == self.paramMatrix.shape[0]
        p = np.zeros(len(self.labels))
        for idx in range(len(self.labels)):
            # log trick to avoid underflow: log prior + sum of count * log likelihood
            p[idx] = np.log(self.priorVector[idx]) + np.sum(np.log(self.paramMatrix[:, idx]) * test)
        return self.labels[np.argmax(p)]


# ************************** unit test function ************************************
def sanity_check():
    X = np.array([[1, 1, 1, 0, 0, 0, 0, 0, 0],
                  [0, 0, 0, 1, 1, 0, 0, 0, 0],
                  [1, 0, 1, 0, 0, 1, 0, 0, 0],
                  [0, 0, 0, 0, 0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0, 0, 0, 1, 1],
                  [0, 0, 1, 1, 0, 0, 0, 0, 0]])
    Y = np.array([1, 1, 1, 0, 0, 0])
    X_test = np.array([[1, 0, 0, 1, 2, 0, 1, 0, 0],
                       [1, 0, 0, 0, 0, 0, 1, 1, 0]])
    clf = NaiveBayesianClassifier()
    clf.fit(X, Y)
    result = clf.predict(X_test)
    print(result)


if __name__ == '__main__':
    sanity_check()
Reposted from: http://jwwci.baihongyu.com/