An Optimal Weighting Method in Supervised Learning of Linguistic Model for Text Classification

  • cc icon
  • ABSTRACT

    This paper discusses a new weighting method for text analyzing from the view point of supervised learning. The term frequency and inverse term frequency measure (tf-idf measure) is famous weighting method for information retrieval, and this method can be used for text analyzing either. However, it is an experimental weighting method for information retrieval whose effectiveness is not clarified from the theoretical viewpoints. Therefore, other effective weighting measure may be obtained for document classification problems. In this study, we propose the optimal weighting method for document classification problems from the view point of supervised learning. The proposed measure is more suitable for the text classification problem as used training data than the tf-idf measure. The effectiveness of our proposal is clarified by simulation experiments for the text classification problems of newspaper article and the customer review which is posted on the web site.


  • KEYWORD

    Text Classification , Weighting Method , Vector Space Model , Cosine Similarity

  • 1. INTRODUCTION

    Due to development of information technology, the effectiveness of knowledge discovery from enormous document data is suggested in much of the literatures on this subject (Hearst. 1999). There are many web sites where customers can post their free comments about merchandise that they bought and used. On the internet, the number of customer reviews is increasing day by day. Therefore it has been easy to get a large amount of document data and analyze it for several purposes. Customer reviews consist of not only free comments but customer information and the degree of satisfaction about items as metadata. The analysis using the metadata is more helpful for knowledge discovery than using only text data. The techniques for text mining are developed for the purpose of getting information. Various methods have been proposed in this research field, for example, vector space model (Manning et al., 2008), (Mikawa et al., 2012), probabilistic model (Hofmann. 1999), (Bishop, 2006) and so on.

    In this paper, a vector space model is the focus for document analysis. To construct a vector space model for document analysis, the documents are separated into terms or words (morphemes) by using the morphological analysis (Nagata. 1994). After that, each document is represented by a vector whose elements express the information of word frequency of appearance. Because the vector space is built by the information of word frequency, the characteristics of a document vector model should be remarkable: high dimension and sparseness. Generally speaking, untold thousands of words or more should be treated to represent a document vector using effective words appearing in all documents.

    As mentioned above, there are enormous words which are appeared in whole documents. In addition, term frequency of each word varies widely in length. Therefore, the performance of text analyzing depends on term frequency of words which is appeared each documents. That is, it depends on the length of documents. To avoid this, several weighting approach for each word has been proposed. For instance, tf-idf weighting (Salton et al., 1988), PWI (Probability-weighted amount of information) (Aizawa, 2000, 2003), mutual information (McCallum et al., 1998) and so on. And tf-idf weighting is one of the most famous method for weighting terms. However, it is proposed for information retrieval and the effectiveness is empirically shown. Therefore, the theoretical optimality is not proved. In addition, it doesn't use the metadata or side information for weighting each word. Nowadays, it can be easy to get or use those, and by using that information, it supposes to improve the performance of each analysis.

    From above discussion, the purpose of this study is to propose a new weighting method for each word from the view point of supervised learning. We show the way of estimating an optimal word weighting by solving maximization problems. The effectiveness of this method is clarified by case experiments of applications to the customer review which are posted on web sites and newspaper articles which are used as a bench mark data.

    In section 2, basic formulation of vector space model and weighting methods which have already proposed are explained. In section 3, the proposed method of weighting each word and the way of its estimation is explained. The illustration of simulation experiments in order to clarify the effectiveness of our proposal and the results acquired from the experiments are explained in section 4. Finally, the conclusion of this study is stated in section 5.

    2. BASIC INFORMATION FOR ANALYSIS OF TEXT DOCUMENTS

    In this paper, the vector space model is adopted to represent the document data. In this section, the premises and the notations of this research are defined and relative methods are explained.

       2.1 Similarity among Documents

    Let the set of documents be Δ={d1, d2 ,…, dD}, and the set of categories to which each document in training data belongs be C ={c1, c2 ,…, cN}. Here, D and N are the numbers of documents and categories respectively.

    All documents in Δ are separated into morphologies by the morphological analysis. Selecting valid words from the given morphologies by using frequencies, mutual information, or other methods, a word set is constructed to represent documents in the vector space. Let the word set from the all documents in Δ be Σ = {w1, w2 ,…, wW}

    Then each document can be expressed as a point in the vector space. Here, W is the number of different valid words which appear in Δ and it is equivalent to the dimension of the vector space. Here, let the frequency of a word wj in the document di be vij and it can be expressed by W-dimensional vector as follows:

    image

    where, T means transposition of a vector. That is, the document space can be constructed by regarding each component of vectors as the information about the frequency of each word.

    By expressing document as a vector, a similarity or distance metrics between documents can be measured in the vector space. Here, the distance metric of document vectors di and dj can be expressed by using the Euclid distance which is traditionally used as follows:

    image

    However, the Euclid distance sometimes cannot work effectively to treat the document data. For example, many elements in a document vector are almost 0 and this property is called “sparseness.” If there are only a few different words between two documents, these documents are judged as having high similarity.

    On the other hand, the cosine similarity measure between document vectors has asymptotically good performance in a sparse vector space (Goto et al., 2007), (Goto et al., 2008).

    image

       2.2 Weighting Method for Documents

    The way of weighting can be expressed the Hadamard product of document vector di and weighting vector f . Here, let weighting vector f be

    image

    Here, fk is weighting function for the word wk By using that, let the weighted document vector be d*i

    image

    where, let d*i = fdi is Hadamard product of f and di.

    There are several weighting methods for text data analyzing. As mentioned above, the most famous method is is tf-idf weighting. It can be calculated the product of tf(term frequency, that is vij and idf(inverse document frequency). Here, the tf is representing the frequencies of each term. And idf is a monotonically decrease function of an appearance rate of term in documents. Let df(wk) be a number of documents in which the word wk appears. Then the idf weighting for the word wk is given by

    image

    Here, let the weight function fk be idf(wk), tf-idf weighting can be expressed as follows:

    image

    3. THE METHOD OF SUPERVISED WEIGHTING FOR TEXT CLASSIFICATION

    As mentioned above, tf-idf weighting is not made use of side information or metadata when calculating each word weight. In the basic text classification problem, training data has information which category training data belongs to. Therefore, by making the most use of the information for weighting each word can be improved its performance.

    In this section, we propose the optimal weighting method and its estimation for text classification. And we derive the expression of optimal weight is given by the solution of maximization problem.

    To formulate the learning algorithm to estimate the optimal weighting from training data set, the centroid of each category is defined as equation (8). The centroid can be calculated by using the training data with known category. Let the centroid vector of category cn (n = 1, 2, …, N) be gn = (g1, g2, …, gnW)T. Then the centroid vector gn is given by

    image

    Here, the |cn| is a number of documents contained category cn. And the same as equation (5), weighted centroid of category cn is given by

    image

    Normally, training data should be allocated by near the centroid of its category because a new document data is classified by the distance from the centroids. Due to that, the optimization of the weight should maximize the similarity between training data and its centroid.

    From the above discussion, the estimation of optimal weighting vector f is to maximize cosine similarity between weighted document vector d*i and weighted centroid vector g*n. If the weighting vector can be obtained by the maximization of the similarity between data and its category’s centroid, the proposed weighting method is expected to have a good performance to classify the data into proper categories. Then, our optimal weighting vector f is given by

    image

    Here, f = (f1, f2, …, fW)T is weighting vector, and it satisfies

    image

    Equation (10) can be expressed by matrix operation by using the diagonal matrix which is constructed its elements by (f k)2. To use that, the optimal weight of each word can be estimated to maximize cosine similarity between training data di and each category’s centroid gn by using the matrix.

    In order to solve the above problem, we introduce the diagonal metric matrix M = [mk] (k = 1, 2, …, W) with the elements (f k)2 that is

    image

    Here, M = [mk] is satisfying |M| = 1, and |M| is determinant of M.

    From above discussion, the equation (10) can be rewritten as follows:

    image

    Supject to |M| = 1

    From the above formulation, the following theorem is obtained.

    Theorem 1: The metric matrix

    image

    which is satisfies the equation (11) is given by

    image

    (For the proof, see APPENDIX.)

    From above discussion, the similarity between document vector and centroid can be calculated by using the metric matrix

    image

    given by the equation (12) as follows:

    image

    After calculating similarity among test data and all centroids by equation (13), test data is classified to the most similar category.

    4. EXPERIMENTS

       4.1 Experimental Condition

    In this section, simulation experiments are conducted to verify the effectiveness of our method by using Japanese document data in practice. The experiments are performed for the two data sets, i.e, customer reviews which are posted on a web site and newspaper articles with pre-assigned categories. The suitable weight is estimated by learning through these data. And the effectiveness is confirmed by the performance of classification of test data into the correct categories.

    The basic process of experiments is as follows.

    Step1: Calculation of all centroids from training data.

    Step2: From equation (12), learning metric matrix

    image

    from training data.

    Step3: From equation (13), to calculate similarity among test data and all category’s centroid. And it classifies that with most similar category.

    As mentioned above, two different types of data are used on this experiment. The first one is articles from the Mainichi newspaper in 2005 which are used as a benchmark data for document classification. The second one is customer review that is posted on a web site in order to recognize the performance of our method for real data. News articles in the Mainichi newspaper consist of several categories (economics, sports, politics and so on) and this time, three and five categories are extracted at random. Customer review consists of not only text data but degree of customer’s satisfaction as mentioned above. In this experiment, we use two categories of that which are highest degree of satisfaction and lowest one.

    A condition of experiments is shown in Table 1.

    For the sake of comparison, the experiment of the common cosine measure with only term frequency (tf measure) and tf-idf measure between different two data points was also performed. The criteria of evaluation are classification accuracy rate. It’s the ratio of the number of documents which are classified into correct categories to the total number of documents.

       4.2 Result of Experiments

    The results of experiments are shown in Figure 1, Figure 2 and Figure 3. Figure 1 and Figure 2 show the cases of the newspaper article. The former is the case of three categories and the latter is that of five.

    In addition, Figure 3 shows the case of customer review. It shows supervised weighting (proposed method), tf-idf weighting and only use term frequency (tf weighting).

    From the Figure 1 and Figure 2, the proposed method is basically superior to tf-idf weighting and tf weighting methods. However, from the Figure 3, the proposed method and the other two methods are almost the same performance. That is, the performance of proposed method is more suitable when newspaper article is used.

       4.3 Discussion

    From the result, the proposed method is more suitable than conventional methods when newspaper article is used. In the proposed method, category information of training data as weighting parameter is taken into consideration. This property can work well for text classification. On the other hand, tf-idf weighting is not used each category’s characteristics. As the result, the performance is improved.

    However, the performance of the proposed method which can be solved regarded as the optimal solution is not improved drastically in the customer review classification. Because customer write their comment freely for merchandise which they bought and used. Due to the fact, it may appear many invalid words and the tendency of words is not different between each category. The customer reviews are too high of freedom of text data for the proposed method to estimate such complex statistical structure. By these reason, the proposed method doesn’t work effectively. It is necessary to improve the method as getting better performance for customer review (real data) as the future work.

    5. CONCLUSION

    In this paper, the suitable weighting method by estimating optimal weight from a training data set is proposed. Our proposal is based on supervised learning with optimal metric matrix

    image

    which can be calculated by a training data set.

    From the simulation experiment, the proposed method is superior than conventional method when newspaper article is used.

    A future work is to calculate the contribution ratio in each word which is weighted by the proposed method for the text classification in order to improve performance.

    APPENDIX: Proof of Theorem 1

    We show the proof of Theorem 1 as follows.

    To solve for M under the |M| = 1 is equivalent to solve following maximization:

    image

    In the following, to simplify the calculation, document d*i and centroid g*n is normalized by its norm. In other words, it can be said ||d*i|| = ||g*n|| = 1. 1) So equation (14) becomes

    image

    Here, M is diagonal matrix and |M| =1. Therefore,

    image

    Therefore,

    image

    is derived.

    To maximize the equation (15) under the equation (16), Lagrange multipliers method is used. Here, let Lagrange multiplier be λ, and Lagrange function can be defined

    image

    Here, L is partially differentiated with respect to mk. And put them 0,

    image

    is derived. To solve the equation (18) for mk,

    image

    is derived.

    Here, let M?1 be

    image

    Then form the equation (19),

    image

    is

    image

    Here, let set A be A = [akl] and set ak be

    image

    Then A is

    image

    Therefore,

    image

    is derived. From the equation (22), A = λM?1, is acquired. By the characteristics of M,

    image

    is derived. Therefore, λ becomes

    image

    Accordingly, form equations (19) and (25),

    image

    becomes

    image

    Here, because A is diagonal matrix, |A| is given by

    image

    Consequently,

    image

    is derived.

    Accordingly, from equations (26) and (28), each

    image

    becomes

    image

    The proof is complete. □

    In case of ||d*i|| ≠ 1, ||g*n|| ≠ 1, it can be easy to extend.

  • 1. Aizawa A. (2000) The Feature Quantity: An Information Theoretic Perspective of Tfidf-like Measures [Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval] P.104-111 google
  • 2. Aizawa A. (2003) An Information-theoric perspective tf-idf Measure [Information Processing and Management] Vol.39 P.45-65 google doi
  • 3. Bishop C. M. (2006) Pattern Recognition and Machine Learning google
  • 4. Goto M., Ishida T., Hirasawa S. (2007) Statistical Evaluation of Measure and Distance on Document Classification Problems in Text Mining [IEEE International Conference on Computer and Information Technology] P.674-679 google
  • 5. Goto M., Ishida T., Suzuki M., Hirasawa S. (2008) Asymptotic Evaluation of Distance Measure on High Dimensional Vector Space in Text Mining [International Symposium on Information Theory and its Applications] google
  • 6. Hearst M. A. (1999) Untangling text data mining [ACL ’99 Proceedings] P.3-10 google
  • 7. Hofmann T. (1999) Probabilistic Latent Semantic Indexing [Proceeding of the 22nd International Conference on Research and Development in Information Retrieval] P.50-57 google
  • 8. Manning C. D., Raghavan P., Schuetze H. (2008) Introduction to Information Retrieval google
  • 9. McCallum A., Nigam K. (1998) A Comparison of Event Models for Naive Bayes Text Classification [Proceeding of AAAI-98 Workshop on Learning for Text Categorization] P.41-48 google
  • 10. Mikawa K., Ishida T., Goto M. (2012) A Proposal of Extended Cosine Measure for Distance Metric Learning in Text Classification, Proceeding of 2011 IEEE International Conference on the Systems [Cybernetics (SMC)] P.1741-1746 google
  • 11. Nagata M. (1994) A Stochastic Japanese morphological analyzer using a forward-DP backward-A* best search algorithm [Proceeding of the 15th International Conference on Computational Linguistics] P.201-207 google
  • 12. Salton G., Buckley C. (1988) Term-Weighting Approaches in Automatic Text Retrieval [Information Processing and Management] Vol.24 P.513-523 google doi
  • [Table 1.] A Condition of Experiments.
    A Condition of Experiments.
  • [Figure 1.] Result for Classification Accuracy Using Newspaper (3 Categories).
    Result for Classification Accuracy Using Newspaper (3 Categories).
  • [Figure 2.] Result for Classification Accuracy Using Newspaper (5 Categories).
    Result for Classification Accuracy Using Newspaper (5 Categories).
  • [Figure 3.] Result for Classification Accuracy Using Customer Review.
    Result for Classification Accuracy Using Customer Review.