TELKOMNIKA, Vol.15, No.1, March 2017, pp. 383~390
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/TELKOMNIKA.v15i1.4303  383
Received November 15, 2016; Revised January 10, 2017; Accepted January 30, 2017
A Novel Text Classification Method Using
Comprehensive Feature Weight
Jian Liu*1, Weisheng Wu2
1 Information and Computer Engineering, Pingxiang University, Pingxiang, Jiangxi 337055, P. R. China
2 Modern Educational Technology Center, Pingxiang University, Pingxiang, Jiangxi 337055, P. R. China
*Corresponding author, e-mail: liujianpxu@163.com
Abstract
Currently, since the categorical distribution of short text corpora is often unbalanced, it is difficult to
obtain accurate classification results with methods designed for long texts. To solve this problem, this
paper proposes a novel short text classification method using comprehensive feature weights. The
method takes into account the distribution of samples in the positive and negative categories, as well as
the category correlation of words, in order to improve the existing feature weight calculation and obtain
a new comprehensive feature weight. Experimental results show that the proposed method achieves
significantly higher micro- and macro-average values than other feature weight methods, which indicates
that it can greatly improve the precision and recall of short text classification.
Keywords: text classification, text categorization, comprehensive feature weight, feature selection
Copyright © 2017 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
In recent years, with the development of instant communication technology and the
popularity of internet-based applications, short text data from QQ chat, Facebook, micro-blogs, news
comments and other sources are growing exponentially and have become an important channel of
information transmission. The emergence of a large number of short texts has both advantages and
disadvantages. On the one hand, short text classification has important commercial value and
applications in question-answering communities, advertising classification and so on. On the other
hand, short texts are also an important carrier of harmful network information such as bad public
opinion and spam. Therefore, short text classification, short text clustering, and topic detection and
tracking of short texts have become a focus in the fields of data mining and information
security [1-3].
A short text generally has only about 100 words: the text is short, the noise is high, and the
useful information it contains is very limited, so little information can be extracted [4, 5]. This puts
higher requirements on research into short text classification. Although the text classification field
has made great achievements, existing methods are difficult to apply directly to short texts. Short
text classification faces two major problems: high vector dimensionality and sparse features [6-9],
which result in an unbalanced categorical distribution of the short text corpus.
So far, there are two ways to improve the performance of short text classification. One
is to use an external data source or knowledge base (such as WordNet, Wikipedia, etc.) to
expand the features of the short text [10-12]. This approach is simple and intuitive, can alleviate
the sparse feature problem, and can achieve good results, but its classification accuracy still needs
improvement. The other approach [13-17] is to improve precision and recall by improving the
classifier itself; however, very few articles address this problem from the point of view of short text
feature weight calculation. Sriram et al [18] improved the classification precision of short text by
combining an improved short text feature weight formula with a feature expansion algorithm based
on the collinear relationship of features. Other works [19-21] use latent semantic analysis to reduce
the dimensionality and noise of short texts.
Then the independent component analysis method is used to extract the most
expressive features. These methods are combined with other methods to improve the
effectiveness of short text classification, so they cannot demonstrate the effectiveness of the
feature extraction method on its own. In this paper, we take into account the distribution of samples
in the positive and negative categories, as well as the category correlation of words, in order to
improve the existing feature weight calculation method and obtain a new method of calculating
the comprehensive feature weight.
2. Feature Weight of Text
In order to convert text into a form a computer can process, a text is
represented as a vector to be analyzed and calculated. Vectorization is the basis of text processing.
Each word in the text is given a weight that reflects the importance of the word to the
characterization of the text: the greater the weight, the more important the word is to the text.
In the information retrieval field, term weight calculation methods are divided into two
categories: unsupervised methods such as tf (term frequency) and tf-idf (term frequency-inverse
document frequency), and supervised methods such as IG (information gain) and OR (odds
ratio). Term weight calculation mainly draws on feature selection algorithms: as in feature
selection, the terms are given different values that measure their contribution to text classification.
However, these term weight calculation methods are mainly designed for long texts and cannot
be directly applied to short texts.
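As a small illustration of the unsupervised schemes just mentioned, a minimal tf-idf computation over a toy corpus might look as follows (the corpus, function name and log base are ours, for illustration only):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term by term frequency times inverse document frequency."""
    n = len(docs)
    # df: number of documents in which each term occurs at least once
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

docs = [["cheap", "ticket", "sale"],
        ["match", "ticket"],
        ["traffic", "jam"]]
weights = tf_idf(docs)
# "ticket" occurs in 2 of 3 documents, so its idf is log(3/2);
# "cheap" occurs in only 1, so it gets the larger idf log(3).
```

Note that this weighting uses no category labels at all, which is exactly why it cannot favor the small positive category discussed next.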
This is due to a prominent feature of the short text corpus: the sample distribution is very
uneven, that is, the number of samples in some categories is much larger than in others, so that
small-category texts are submerged in a large number of documents of other categories
and are difficult to identify. Yet in the massive text data to be processed, the data the
system really cares about is sometimes only a small part; for example, in network public opinion
analysis and in hot or sensitive topic detection and tracking, the valuable data in the
practical environment is a small proportion. We refer to the category with few samples as the
positive category and the categories with more samples as the negative category.
However, the existing feature weight calculation methods treat all categories in the
same way. In fact, when applying the traditional long text feature weight calculation
to short texts, the classification result is biased toward the negative category while
the positive category is ignored. This problem is particularly evident in short text classification
performance, leading to low precision and recall, which cannot
meet the needs of practical application.
3. Comprehensive Category Feature Selection Method
In this paper, we regard the current category of the short text dataset as the PC (Positive
Category); all other categories are regarded as the NC (Negative
Category). The related data elements are shown in Table 1.
Table 1. Data Element Table
Feature   Samples containing / not containing s in PC   Samples containing / not containing s in NC
s         sP / s̄P                                        sN / s̄N

s and s̄ represent the cases where feature term s appears or does not appear in a sample,
respectively. sP is the number of samples in PC that contain feature s; s̄P is the number of
samples in PC that do not contain feature s. sN is the number of samples in NC that contain
feature s; s̄N is the number of samples in NC that do not contain feature s.
Based on the above ideas, this paper makes three important observations: 1) For a
given term si, if it occurs frequently in the text, that is, when its document
frequency is high, it has strong expressive ability; 2) when the frequency of si in the positive
category is higher than that in the negative category, si has good discriminative
ability, which is captured by the inverse category frequency; 3) when the ratio between the
number of samples not containing si in the negative category and the frequency of si in the
positive category is high, the relevance frequency between si and the text category is high.
According to the above analysis, this paper proposes a feature selection method for
short text. Since the method takes into account the distribution of samples in the positive and
negative categories, it is named IC (Integrate Category). The calculation formula is:

IC(si, Cj) = df × rf × icf = log(sP + 1) × (s̄N / (sP + 1)) × log(|C| / cf + 1)   (1)

where df is the document frequency: for a given term si, df = log(sP + 1), representing the
document frequency of term si in the positive category, sP being the number of samples containing
si in the positive category. rf is the relevance frequency, rf = s̄N / (sP + 1); it is proportional to the
number of samples in the negative category that do not contain s and inversely proportional to the
number of samples in the positive category that contain s. The bigger the rf value, the stronger
the correlation between the term s and the category. icf is the inverse category frequency,
icf = log(|C| / cf + 1), where |C| is the total number of categories and cf is the number of
categories containing term si.
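A direct transcription of formula (1) can clarify how the three factors interact. The sketch below assumes natural logarithms (the paper does not state the log base) and uses our own function and argument names:

```python
import math

def ic_weight(sP, sN_bar, cf, num_categories):
    """IC = df * rf * icf, following formula (1).

    sP             -- samples in the positive category containing the term
    sN_bar         -- samples in the negative category NOT containing the term
    cf             -- number of categories in which the term occurs
    num_categories -- |C|, the total number of categories
    """
    df = math.log(sP + 1)                    # document frequency factor
    rf = sN_bar / (sP + 1)                   # relevance frequency
    icf = math.log(num_categories / cf + 1)  # inverse category frequency
    return df * rf * icf

# A term frequent in PC and absent from most NC samples scores much higher
# than one spread evenly over many categories:
strong = ic_weight(sP=20, sN_bar=180, cf=2, num_categories=20)
weak = ic_weight(sP=20, sN_bar=20, cf=10, num_categories=20)
```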
For a given number of positive samples, the sample distribution of a dataset
can be roughly divided into three conditions: 1) the number of negative samples is
larger than the positive and is evenly distributed across categories; 2) the number of
negative samples is smaller than the positive and is evenly distributed across categories;
3) the number of negative samples is smaller than the positive and the distribution is unbalanced,
that is, the negative samples are concentrated in a small number of categories. In order to
illustrate these relations more visually, we simulate the distribution of positive and negative
samples in a short text dataset of 200 short texts over 20 categories, using 3 simple example
terms, see Table 2.
Table 2. Comparison of the Relationship between PC and NC Terms
Term   PC C1   NC C2   NC C3   NC C4   cf   df·icf   IC
s1     20      100     0       20      3    8.538    39.294
s2     20      4       4       20      4    7.636    68.478
s3     20      80      80      20      4    7.636    15.526
As can be seen from Table 2, the distributions over the positive and negative categories of the
samples containing s2 and of those containing s3 are very different: samples containing s3
are mostly in the negative class, while the proportion of samples containing s2 in the negative
class is small, yet the weights of the two before considering relevance frequency are the same.
The value of s2 increases significantly once the relevance frequency of s in the positive and
negative categories is taken into consideration, while the increase for s3 is smaller, so the gap
between s2 and s3 widens. The frequency of s1 in the positive category is similar to that of s2
and s3, but its distribution range in the negative category is small, being concentrated in one category,
so the weight of s1 is slightly larger than that of s3, and its final value lies
between those of s2 and s3; this is consistent with the previous analysis.
The global feature weight of feature s for the entire corpus is shown in formula (2):

ICmax(s) = max{1 ≤ j ≤ m} IC(s, Cj)   (2)
The advantage of this method is that it takes into account both the distribution of terms over
individual samples and the category information of the text. Evaluating the relevance of terms
in the positive and negative categories makes the feature words more discriminative on
imbalanced datasets. At the same time, selecting the maximum value of the global feature yields
the features with the highest category discrimination. The algorithm is as follows:
1) Pre-process the documents in the training corpus: word segmentation and stop word
removal;
2) Calculate IC(s, Cj) for each feature term and category;
3) Calculate ICmax(s) over all categories based on the results of step 2;
4) Sort the IC values in descending order and take the first M terms as the retained feature
words, where M is the dimension of the feature space. In this paper, M is 820.
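The four steps above can be sketched end-to-end; the per-term statistics layout and the names below are our own assumptions, not from the paper:

```python
import math

def ic(sP, sN_bar, cf, C):
    """Formula (1): IC = log(sP+1) * (sN_bar/(sP+1)) * log(C/cf + 1)."""
    return math.log(sP + 1) * (sN_bar / (sP + 1)) * math.log(C / cf + 1)

def select_features(stats, C, M):
    """stats: {term: {category: (sP, sN_bar, cf)}} built from a pre-processed
    corpus (step 1). Computes IC per category (step 2), takes the maximum over
    categories (step 3), and keeps the M highest-scoring terms (step 4)."""
    ic_max = {term: max(ic(sP, sN_bar, cf, C)
                        for sP, sN_bar, cf in per_cat.values())
              for term, per_cat in stats.items()}
    return sorted(ic_max, key=ic_max.get, reverse=True)[:M]

stats = {
    "goal": {"sports": (30, 150, 2)},   # frequent in PC, rare in NC
    "the":  {"sports": (40, 5, 20)},    # frequent everywhere
}
select_features(stats, C=20, M=2)  # ranks "goal" before "the"
```

In the paper's setting, M would be 820 and the statistics would come from the segmented micro-blog corpus.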
4. Results and Analysis
In order to verify the performance of the short text feature weight algorithm, the K-Nearest
Neighbor (KNN) classification algorithm is adopted to classify the short texts. KNN is a traditional
pattern recognition algorithm: it is simple and intuitive, its classification accuracy is
high, and new training texts do not require retraining, which reduces training time, so it
is widely used in automatic text classification. K represents the number of nearest neighbor
samples considered. We test different odd values of K in the range [3, 35] and use the results
of the optimal K value for comparison.
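The odd-K search described above can be sketched with a minimal KNN; the distance function, toy points and function names below are our own stand-ins for the real text vectors:

```python
from collections import Counter

def knn_predict(train, labels, x, k):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    order = sorted(range(len(train)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], x)))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

def best_odd_k(train, labels, val, val_labels, k_values=range(3, 36, 2)):
    """Return the odd K in [3, 35] with the highest validation accuracy."""
    def accuracy(k):
        hits = sum(knn_predict(train, labels, x, k) == y
                   for x, y in zip(val, val_labels))
        return hits / len(val)
    return max(k_values, key=accuracy)

train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
best_odd_k(train, labels, val=[(0.5, 0.5), (5.5, 5.5)], val_labels=[0, 1])
```

Odd K values avoid ties in the binary (positive vs. negative) vote.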
4.1. Experimental Design
A multi-class classification problem can be decomposed into binary classification problems, so
the focus of this paper is to improve the classification performance on the positive and negative
categories. At present, there is no standard short text dataset, so this paper uses the API
of Sina micro-blog to crawl micro-blog data as the text corpus. We select 8478 micro-blogs;
each micro-blog has more than 6 words and the average text length is 42. We take about 100
micro-blogs about the 2015 Chinese Super League as the positive category, and 4236 micro-blogs
about traffic jams as the negative category.
Since the samples of the dataset are very unevenly distributed, we use a 5-fold cross-
validation method to compare classification results. That is, we split the dataset into
5 folds, train on 4 folds and evaluate on the 5th, iterating so that each fold serves once as the
validation fold, and finally average the performance. This avoids chance effects in the
experimental results and ensures that the training data and testing data have no intersection, so
the results thus obtained can be considered credible.
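The 5-fold protocol can be written down directly; the index-splitting scheme below is a generic sketch (the paper does not specify how the folds were drawn):

```python
def k_fold_splits(n, folds=5):
    """Yield (train_idx, test_idx) pairs: each index appears in exactly one
    test fold, and train and test never intersect."""
    chunks = [list(range(start, n, folds)) for start in range(folds)]
    for f in range(folds):
        test = chunks[f]
        train = [i for g, chunk in enumerate(chunks) if g != f for i in chunk]
        yield train, test

def cross_validate(score, n, folds=5):
    """Average a fold-level score(train_idx, test_idx) over all folds."""
    results = [score(train, test) for train, test in k_fold_splits(n, folds)]
    return sum(results) / len(results)
```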
4.2. Evaluation Method
There are three currently used classification performance evaluation index: precision,
recall, and average test value. For short text classification with uneven distribution of samples,
precision and recall ration will ignore the influence of small category PC. Micro-average and
macro-average are two methods for global evaluation of classification results: micro-average is
the arithmetic average of the performance indicator for each instance document; macro-average
is first to calculate the classification results of each category then calculate the average value of
all categories. Specific definitions are as follows:
(3)
ICmax(s) = max
1£j£m
{IC(s,Cj )}
TELKOMNIKA ISSN: 1693-6930 
A novel text classification method using comprehensive feature weight (Jian Liu)
387
(4)
where MicroP and MicroR represent the precision and recall rate of micro-average respectively;
MacroP and MacroR represent the precision and recall rate of macro-average respectively.
Before calculating the evaluation index, first we introduce the data elements that are related to
the evaluation index, as shown in Table 3.
Table 3. Corresponding Table of Data Element
Retrieval Relevant Irrelevant
Yes RY NR
No RN NN
Suppose the total number of text is 1, then:
(5)
(6)
(7)
(8)
Where
.
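The micro and macro averages can be computed directly from per-category RY, NR, RN counts in the notation of Table 3; the sketch below uses our own function name and a made-up confusion summary:

```python
def micro_macro(per_category):
    """per_category: list of (RY, NR, RN) tuples, one per category, in the
    paper's notation (retrieved-relevant, retrieved-irrelevant,
    missed-relevant). Returns (MicroP, MicroR, MacroP, MacroR)."""
    RY = sum(c[0] for c in per_category)
    NR = sum(c[1] for c in per_category)
    RN = sum(c[2] for c in per_category)
    micro_p = RY / (RY + NR)  # pool counts first, then compute precision
    micro_r = RY / (RY + RN)
    # macro: compute per-category precision/recall, then average
    macro_p = sum(c[0] / (c[0] + c[1]) for c in per_category) / len(per_category)
    macro_r = sum(c[0] / (c[0] + c[2]) for c in per_category) / len(per_category)
    return micro_p, micro_r, macro_p, macro_r

# One large easy category and one small hard one pull the two averages apart:
mp, mr, Mp, Mr = micro_macro([(90, 10, 10), (5, 5, 15)])
```

The small category contributes equally to the macro average but is nearly invisible in the pooled micro counts, which is why the paper reports both.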
4.3. Experimental Results and Analysis
In practical application, the proportion of positive category text is very small. In order to
compare the effect of negative category text data size on different feature weight method, let the
initial positive and negative micro-blog text have 100 each, gradually increase the number of
negative text, until the total text number reached 4300. Under the KNN classifier, the MicroA
values and MacroA values of 6 different feature weight calculation methods are shown in
Figure 1 and Figure 2 respectively. As can be seen, with the increase of data size, the
performance value of all the feature weight method is on the rise. The reason may be that in a
certain number of positive categories, when the data size is small, the text of each category is
not sufficient to characterize this category, resulting in lower classification accuracy. This
situation improves with the increase of data size.
The experiments in Figure 1 and Figure 2 show that, since the MacroA is
the average over categories and is greatly influenced by small categories, the MacroA values
of the different feature weight calculation methods are generally greater than the MicroA values.
Figure 1. MicroA Value of Different Features Weight Calculation Method
Figure 2. MacroA Value of Different Features Weight Calculation Method
Figure 3 shows the average of the highest classification results over 5 repeated
trials of the 6 feature weight calculation methods. From the figure, we can see that
supervised term weight calculation methods are not always better than unsupervised ones
such as tf and tf-idf, and some methods that are very popular in the field of text classification
are not so excellent in this experiment. The IC method, however, measures the relationship
between terms and categories from the point of view of whether the term appears in the
positive and negative categories or not, and maintains good classification performance at
different data scales. This is because, in traditional text classification, training sets generally
have the following characteristics: the category distribution is balanced, each document of a
category represents that category well, and the documents of a category are concentrated in
the feature space. In practical applications, however, real text corpora and usage environments
often do not satisfy these characteristics, so the existing feature weight calculation methods
are not effective on short texts.
Figure 3. Comparison of the MicroA and MacroA Value
5. Conclusion
In the present study of short text classification, the feature weight calculation is still used
in the traditional long text method, but the distribution of short text categories is often not so
balanced, the traditional method can’t get good classification results. Aiming at this problem, this
paper makes an improvement on the existing feature weight calculation method, takes the
correlation of terms into consideration when calculating the feature weight, and tests the
performance of the method in the real corpus environment. Experimental results show that this
method can improve the classification effect to a certain extent, and it is found that the
classification performance of short text has not been improved greatly. The reason is the
collection of the real text from the network data, including a large number of network languages
and not standardized expression, resulting in the system can’t accurately identify. The next step
will be to introduce the method of semantic analysis to reduce the effect of the variety of words
on the classification system.
ACKNOWLEDGEMENTS
This work was supported by National Natural Science Foundation of China under grant.
References
[1] Berry MW. Survey of text mining. Computing Reviews. 2004; 45 (9): 548.
[2] Aggarwal C C, Zhai CX. A survey of text classification algorithms. In Mining text data. 2012; 163-222.
[3] Rao Y, Xie H, Li J, et al. Social emotion classification of short text via topic-level maximum entropy
model. Information & Management. 2016.
[4] Sriram B, Fuhry D, Demir E, et al. Short text classification in twitter to improve information filtering. In
Proceedings of the 33rd international ACM SIGIR conference on Research and development in
information retrieval. 2010; 841-842.
[5] Wang F, Wang Z, Li Z, et al. Concept-based short text classification and ranking. Proceedings of the
23rd ACM International Conference on Information and Knowledge Management. ACM. 2014:
1069-1078.
[6] Forsyth RS, Holmes DI. Feature-finding for text classification. Literary and Linguistic Computing.
1996; 11 (4): 163-174.
[7] Forman G. An extensive empirical study of feature selection metrics for text classification. The
Journal of machine learning research. 2003; 3: 1289-1305.
[8] Qi X, Davison BD. Web page classification: Features and algorithms. ACM Computing Surveys
(CSUR). 2009; 41 (2): 12.
[9] Wang P, Xu B, Xu J, et al. Semantic expansion using word embedding clustering and convolutional
neural network for improving short text classification. Neurocomputing. 2016; 174: 806-814.
[10] Rogati M, Yang Y. High-performing feature selection for text classification. In Proceedings of the
eleventh international conference on Information and knowledge management. 2002; 659-661.
[11] Sriram B, Fuhry D, Demir E, et al. Short text classification in twitter to improve information filtering. In
Proceedings of the 33rd international ACM SIGIR conference on Research and development in
information retrieval. 2010; 841-842.
[12] Chen M, Jin X, Shen D. Short text classification improved by learning multi-granularity topics. In IJCAI.
2011; 1776-1781.
[13] Bobicev V, Sokolova M. An Effective and Robust Method for Short Text Classification. In AAAI. 2008;
1444-1445.
[14] Sahami M, Heilman TD. A web-based kernel function for measuring the similarity of short text
snippets. In Proceedings of the 15th international conference on World Wide Web. 2006; 377-386.
[15] Cavnar WB, Trenkle JM. N-gram-based text categorization. Ann Arbor MI. 1994; 48113 (2): 161-175.
[16] Li-guo D, Peng D, Ai-ping L. A New Naive Bayes Text Classification Algorithm. Indonesian Journal of
Electrical Engineering and Computer Science. 2014; 12(2): 947-952.
[17] Zhang PY. A HowNet-based semantic relatedness kernel for text classification. Indonesian Journal of
Electrical Engineering and Computer Science. 2013; 11(4): 1909-1915.
[18] Sun A. Short text classification using very few words. In Proceedings of the 35th international ACM
SIGIR conference on Research and development in information retrieval. 2012; 1145-1146.
[19] Mihalcea R, Corley C, Strapparava C. Corpus-based and knowledge-based measures of text
semantic similarity. In AAAI. 2006; 6: 775-780.
[20] Phan XH, Nguyen LM, Horiguchi S. Learning to classify short and sparse text and web with hidden
topics from large-scale data collections. In Proceedings of the 17th international conference on World
Wide Web. 2008; 91-100.
[21] Zhou Q, Zhao M, Hu M. Study on feature selection in Chinese text categorization. Journal of Chinese
information processing. 2004; 18(3): 17-23.
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
ijdmtaiir
 
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSA COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
gerogepatton
 
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSA COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
gerogepatton
 
IRJET-Semantic Based Document Clustering Using Lexical Chains
IRJET-Semantic Based Document Clustering Using Lexical ChainsIRJET-Semantic Based Document Clustering Using Lexical Chains
IRJET-Semantic Based Document Clustering Using Lexical Chains
IRJET Journal
 
Semantic Based Document Clustering Using Lexical Chains
Semantic Based Document Clustering Using Lexical ChainsSemantic Based Document Clustering Using Lexical Chains
Semantic Based Document Clustering Using Lexical Chains
IRJET Journal
 
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Editor IJCATR
 
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
IJDKP
 
A comparative study on term weighting methods for automated telugu text categ...
A comparative study on term weighting methods for automated telugu text categ...A comparative study on term weighting methods for automated telugu text categ...
A comparative study on term weighting methods for automated telugu text categ...
IJDKP
 
An in-depth review on News Classification through NLP
An in-depth review on News Classification through NLPAn in-depth review on News Classification through NLP
An in-depth review on News Classification through NLP
IRJET Journal
 
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHTEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
IJDKP
 
N045038690
N045038690N045038690
N045038690
IJERA Editor
 
Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative Analysis
Editor IJMTER
 
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATIONEXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION
IJNSA Journal
 
Text Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmText Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor Algorithm
IJTET Journal
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text Clustering
IJRES Journal
 
Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach
IJCSIS Research Publications
 
A simplified classification computational model of opinion mining using deep ...
A simplified classification computational model of opinion mining using deep ...A simplified classification computational model of opinion mining using deep ...
A simplified classification computational model of opinion mining using deep ...
IJECEIAES
 
A hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzerA hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzer
IAESIJAI
 
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
ijdmtaiir
 
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSA COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
gerogepatton
 
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSA COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
gerogepatton
 
IRJET-Semantic Based Document Clustering Using Lexical Chains
IRJET-Semantic Based Document Clustering Using Lexical ChainsIRJET-Semantic Based Document Clustering Using Lexical Chains
IRJET-Semantic Based Document Clustering Using Lexical Chains
IRJET Journal
 
Semantic Based Document Clustering Using Lexical Chains
Semantic Based Document Clustering Using Lexical ChainsSemantic Based Document Clustering Using Lexical Chains
Semantic Based Document Clustering Using Lexical Chains
IRJET Journal
 
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Editor IJCATR
 
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
IJDKP
 
A comparative study on term weighting methods for automated telugu text categ...
A comparative study on term weighting methods for automated telugu text categ...A comparative study on term weighting methods for automated telugu text categ...
A comparative study on term weighting methods for automated telugu text categ...
IJDKP
 
An in-depth review on News Classification through NLP
An in-depth review on News Classification through NLPAn in-depth review on News Classification through NLP
An in-depth review on News Classification through NLP
IRJET Journal
 
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHTEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
IJDKP
 
Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative Analysis
Editor IJMTER
 
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATIONEXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION
IJNSA Journal
 
Text Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmText Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor Algorithm
IJTET Journal
 
Ad

More from TELKOMNIKA JOURNAL (20)

Amazon products reviews classification based on machine learning, deep learni...
Amazon products reviews classification based on machine learning, deep learni...Amazon products reviews classification based on machine learning, deep learni...
Amazon products reviews classification based on machine learning, deep learni...
TELKOMNIKA JOURNAL
 
Design, simulation, and analysis of microstrip patch antenna for wireless app...
Design, simulation, and analysis of microstrip patch antenna for wireless app...Design, simulation, and analysis of microstrip patch antenna for wireless app...
Design, simulation, and analysis of microstrip patch antenna for wireless app...
TELKOMNIKA JOURNAL
 
Design and simulation an optimal enhanced PI controller for congestion avoida...
Design and simulation an optimal enhanced PI controller for congestion avoida...Design and simulation an optimal enhanced PI controller for congestion avoida...
Design and simulation an optimal enhanced PI controller for congestion avoida...
TELKOMNIKA JOURNAL
 
Improving the detection of intrusion in vehicular ad-hoc networks with modifi...
Improving the detection of intrusion in vehicular ad-hoc networks with modifi...Improving the detection of intrusion in vehicular ad-hoc networks with modifi...
Improving the detection of intrusion in vehicular ad-hoc networks with modifi...
TELKOMNIKA JOURNAL
 
Conceptual model of internet banking adoption with perceived risk and trust f...
Conceptual model of internet banking adoption with perceived risk and trust f...Conceptual model of internet banking adoption with perceived risk and trust f...
Conceptual model of internet banking adoption with perceived risk and trust f...
TELKOMNIKA JOURNAL
 
Efficient combined fuzzy logic and LMS algorithm for smart antenna
Efficient combined fuzzy logic and LMS algorithm for smart antennaEfficient combined fuzzy logic and LMS algorithm for smart antenna
Efficient combined fuzzy logic and LMS algorithm for smart antenna
TELKOMNIKA JOURNAL
 
Design and implementation of a LoRa-based system for warning of forest fire
Design and implementation of a LoRa-based system for warning of forest fireDesign and implementation of a LoRa-based system for warning of forest fire
Design and implementation of a LoRa-based system for warning of forest fire
TELKOMNIKA JOURNAL
 
Wavelet-based sensing technique in cognitive radio network
Wavelet-based sensing technique in cognitive radio networkWavelet-based sensing technique in cognitive radio network
Wavelet-based sensing technique in cognitive radio network
TELKOMNIKA JOURNAL
 
A novel compact dual-band bandstop filter with enhanced rejection bands
A novel compact dual-band bandstop filter with enhanced rejection bandsA novel compact dual-band bandstop filter with enhanced rejection bands
A novel compact dual-band bandstop filter with enhanced rejection bands
TELKOMNIKA JOURNAL
 
Deep learning approach to DDoS attack with imbalanced data at the application...
Deep learning approach to DDoS attack with imbalanced data at the application...Deep learning approach to DDoS attack with imbalanced data at the application...
Deep learning approach to DDoS attack with imbalanced data at the application...
TELKOMNIKA JOURNAL
 
Brief note on match and miss-match uncertainties
Brief note on match and miss-match uncertaintiesBrief note on match and miss-match uncertainties
Brief note on match and miss-match uncertainties
TELKOMNIKA JOURNAL
 
Implementation of FinFET technology based low power 4×4 Wallace tree multipli...
Implementation of FinFET technology based low power 4×4 Wallace tree multipli...Implementation of FinFET technology based low power 4×4 Wallace tree multipli...
Implementation of FinFET technology based low power 4×4 Wallace tree multipli...
TELKOMNIKA JOURNAL
 
Evaluation of the weighted-overlap add model with massive MIMO in a 5G system
Evaluation of the weighted-overlap add model with massive MIMO in a 5G systemEvaluation of the weighted-overlap add model with massive MIMO in a 5G system
Evaluation of the weighted-overlap add model with massive MIMO in a 5G system
TELKOMNIKA JOURNAL
 
Reflector antenna design in different frequencies using frequency selective s...
Reflector antenna design in different frequencies using frequency selective s...Reflector antenna design in different frequencies using frequency selective s...
Reflector antenna design in different frequencies using frequency selective s...
TELKOMNIKA JOURNAL
 
Reagentless iron detection in water based on unclad fiber optical sensor
Reagentless iron detection in water based on unclad fiber optical sensorReagentless iron detection in water based on unclad fiber optical sensor
Reagentless iron detection in water based on unclad fiber optical sensor
TELKOMNIKA JOURNAL
 
Impact of CuS counter electrode calcination temperature on quantum dot sensit...
Impact of CuS counter electrode calcination temperature on quantum dot sensit...Impact of CuS counter electrode calcination temperature on quantum dot sensit...
Impact of CuS counter electrode calcination temperature on quantum dot sensit...
TELKOMNIKA JOURNAL
 
A progressive learning for structural tolerance online sequential extreme lea...
A progressive learning for structural tolerance online sequential extreme lea...A progressive learning for structural tolerance online sequential extreme lea...
A progressive learning for structural tolerance online sequential extreme lea...
TELKOMNIKA JOURNAL
 
Electroencephalography-based brain-computer interface using neural networks
Electroencephalography-based brain-computer interface using neural networksElectroencephalography-based brain-computer interface using neural networks
Electroencephalography-based brain-computer interface using neural networks
TELKOMNIKA JOURNAL
 
Adaptive segmentation algorithm based on level set model in medical imaging
Adaptive segmentation algorithm based on level set model in medical imagingAdaptive segmentation algorithm based on level set model in medical imaging
Adaptive segmentation algorithm based on level set model in medical imaging
TELKOMNIKA JOURNAL
 
Automatic channel selection using shuffled frog leaping algorithm for EEG bas...
Automatic channel selection using shuffled frog leaping algorithm for EEG bas...Automatic channel selection using shuffled frog leaping algorithm for EEG bas...
Automatic channel selection using shuffled frog leaping algorithm for EEG bas...
TELKOMNIKA JOURNAL
 
Amazon products reviews classification based on machine learning, deep learni...
Amazon products reviews classification based on machine learning, deep learni...Amazon products reviews classification based on machine learning, deep learni...
Amazon products reviews classification based on machine learning, deep learni...
TELKOMNIKA JOURNAL
 
Design, simulation, and analysis of microstrip patch antenna for wireless app...
Design, simulation, and analysis of microstrip patch antenna for wireless app...Design, simulation, and analysis of microstrip patch antenna for wireless app...
Design, simulation, and analysis of microstrip patch antenna for wireless app...
TELKOMNIKA JOURNAL
 
Design and simulation an optimal enhanced PI controller for congestion avoida...
Design and simulation an optimal enhanced PI controller for congestion avoida...Design and simulation an optimal enhanced PI controller for congestion avoida...
Design and simulation an optimal enhanced PI controller for congestion avoida...
TELKOMNIKA JOURNAL
 
Improving the detection of intrusion in vehicular ad-hoc networks with modifi...
Improving the detection of intrusion in vehicular ad-hoc networks with modifi...Improving the detection of intrusion in vehicular ad-hoc networks with modifi...
Improving the detection of intrusion in vehicular ad-hoc networks with modifi...
TELKOMNIKA JOURNAL
 
Conceptual model of internet banking adoption with perceived risk and trust f...
Conceptual model of internet banking adoption with perceived risk and trust f...Conceptual model of internet banking adoption with perceived risk and trust f...
Conceptual model of internet banking adoption with perceived risk and trust f...
TELKOMNIKA JOURNAL
 
Efficient combined fuzzy logic and LMS algorithm for smart antenna
Efficient combined fuzzy logic and LMS algorithm for smart antennaEfficient combined fuzzy logic and LMS algorithm for smart antenna
Efficient combined fuzzy logic and LMS algorithm for smart antenna
TELKOMNIKA JOURNAL
 
Design and implementation of a LoRa-based system for warning of forest fire
Design and implementation of a LoRa-based system for warning of forest fireDesign and implementation of a LoRa-based system for warning of forest fire
Design and implementation of a LoRa-based system for warning of forest fire
TELKOMNIKA JOURNAL
 
Wavelet-based sensing technique in cognitive radio network
Wavelet-based sensing technique in cognitive radio networkWavelet-based sensing technique in cognitive radio network
Wavelet-based sensing technique in cognitive radio network
TELKOMNIKA JOURNAL
 
A novel compact dual-band bandstop filter with enhanced rejection bands
A novel compact dual-band bandstop filter with enhanced rejection bandsA novel compact dual-band bandstop filter with enhanced rejection bands
A novel compact dual-band bandstop filter with enhanced rejection bands
TELKOMNIKA JOURNAL
 
Deep learning approach to DDoS attack with imbalanced data at the application...
Deep learning approach to DDoS attack with imbalanced data at the application...Deep learning approach to DDoS attack with imbalanced data at the application...
Deep learning approach to DDoS attack with imbalanced data at the application...
TELKOMNIKA JOURNAL
 
Brief note on match and miss-match uncertainties
Brief note on match and miss-match uncertaintiesBrief note on match and miss-match uncertainties
Brief note on match and miss-match uncertainties
TELKOMNIKA JOURNAL
 
Implementation of FinFET technology based low power 4×4 Wallace tree multipli...
Implementation of FinFET technology based low power 4×4 Wallace tree multipli...Implementation of FinFET technology based low power 4×4 Wallace tree multipli...
Implementation of FinFET technology based low power 4×4 Wallace tree multipli...
TELKOMNIKA JOURNAL
 
Evaluation of the weighted-overlap add model with massive MIMO in a 5G system
Evaluation of the weighted-overlap add model with massive MIMO in a 5G systemEvaluation of the weighted-overlap add model with massive MIMO in a 5G system
Evaluation of the weighted-overlap add model with massive MIMO in a 5G system
TELKOMNIKA JOURNAL
 
Reflector antenna design in different frequencies using frequency selective s...
Reflector antenna design in different frequencies using frequency selective s...Reflector antenna design in different frequencies using frequency selective s...
Reflector antenna design in different frequencies using frequency selective s...
TELKOMNIKA JOURNAL
 
Reagentless iron detection in water based on unclad fiber optical sensor
Reagentless iron detection in water based on unclad fiber optical sensorReagentless iron detection in water based on unclad fiber optical sensor
Reagentless iron detection in water based on unclad fiber optical sensor
TELKOMNIKA JOURNAL
 
Impact of CuS counter electrode calcination temperature on quantum dot sensit...
Impact of CuS counter electrode calcination temperature on quantum dot sensit...Impact of CuS counter electrode calcination temperature on quantum dot sensit...
Impact of CuS counter electrode calcination temperature on quantum dot sensit...
TELKOMNIKA JOURNAL
 
A progressive learning for structural tolerance online sequential extreme lea...
A progressive learning for structural tolerance online sequential extreme lea...A progressive learning for structural tolerance online sequential extreme lea...
A progressive learning for structural tolerance online sequential extreme lea...
TELKOMNIKA JOURNAL
 
Electroencephalography-based brain-computer interface using neural networks
Electroencephalography-based brain-computer interface using neural networksElectroencephalography-based brain-computer interface using neural networks
Electroencephalography-based brain-computer interface using neural networks
TELKOMNIKA JOURNAL
 
Adaptive segmentation algorithm based on level set model in medical imaging
Adaptive segmentation algorithm based on level set model in medical imagingAdaptive segmentation algorithm based on level set model in medical imaging
Adaptive segmentation algorithm based on level set model in medical imaging
TELKOMNIKA JOURNAL
 
Automatic channel selection using shuffled frog leaping algorithm for EEG bas...
Automatic channel selection using shuffled frog leaping algorithm for EEG bas...Automatic channel selection using shuffled frog leaping algorithm for EEG bas...
Automatic channel selection using shuffled frog leaping algorithm for EEG bas...
TELKOMNIKA JOURNAL
 
Ad

Recently uploaded (20)

twin tower attack 2001 new york city
twin  tower  attack  2001 new  york citytwin  tower  attack  2001 new  york city
twin tower attack 2001 new york city
harishreemavs
 
Slide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptxSlide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptx
vvsasane
 
AI Chatbots & Software Development Teams
AI Chatbots & Software Development TeamsAI Chatbots & Software Development Teams
AI Chatbots & Software Development Teams
Joe Krall
 
Machine foundation notes for civil engineering students
Machine foundation notes for civil engineering studentsMachine foundation notes for civil engineering students
Machine foundation notes for civil engineering students
DYPCET
 
Design of Variable Depth Single-Span Post.pdf
Design of Variable Depth Single-Span Post.pdfDesign of Variable Depth Single-Span Post.pdf
Design of Variable Depth Single-Span Post.pdf
Kamel Farid
 
2.3 Genetically Modified Organisms (1).ppt
2.3 Genetically Modified Organisms (1).ppt2.3 Genetically Modified Organisms (1).ppt
2.3 Genetically Modified Organisms (1).ppt
rakshaiya16
 
Frontend Architecture Diagram/Guide For Frontend Engineers
Frontend Architecture Diagram/Guide For Frontend EngineersFrontend Architecture Diagram/Guide For Frontend Engineers
Frontend Architecture Diagram/Guide For Frontend Engineers
Michael Hertzberg
 
Using the Artificial Neural Network to Predict the Axial Strength and Strain ...
Using the Artificial Neural Network to Predict the Axial Strength and Strain ...Using the Artificial Neural Network to Predict the Axial Strength and Strain ...
Using the Artificial Neural Network to Predict the Axial Strength and Strain ...
Journal of Soft Computing in Civil Engineering
 
acid base ppt and their specific application in food
acid base ppt and their specific application in foodacid base ppt and their specific application in food
acid base ppt and their specific application in food
Fatehatun Noor
 
Automatic Quality Assessment for Speech and Beyond
Automatic Quality Assessment for Speech and BeyondAutomatic Quality Assessment for Speech and Beyond
Automatic Quality Assessment for Speech and Beyond
NU_I_TODALAB
 
Mode-Wise Corridor Level Travel-Time Estimation Using Machine Learning Models
Mode-Wise Corridor Level Travel-Time Estimation Using Machine Learning ModelsMode-Wise Corridor Level Travel-Time Estimation Using Machine Learning Models
Mode-Wise Corridor Level Travel-Time Estimation Using Machine Learning Models
Journal of Soft Computing in Civil Engineering
 
01.คุณลักษณะเฉพาะของอุปกรณ์_pagenumber.pdf
01.คุณลักษณะเฉพาะของอุปกรณ์_pagenumber.pdf01.คุณลักษณะเฉพาะของอุปกรณ์_pagenumber.pdf
01.คุณลักษณะเฉพาะของอุปกรณ์_pagenumber.pdf
PawachMetharattanara
 
Physical and Physic-Chemical Based Optimization Methods: A Review
Physical and Physic-Chemical Based Optimization Methods: A ReviewPhysical and Physic-Chemical Based Optimization Methods: A Review
Physical and Physic-Chemical Based Optimization Methods: A Review
Journal of Soft Computing in Civil Engineering
 
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
ijdmsjournal
 
Agents chapter of Artificial intelligence
Agents chapter of Artificial intelligenceAgents chapter of Artificial intelligence
Agents chapter of Artificial intelligence
DebdeepMukherjee9
 
Personal Protective Efsgfgsffquipment.ppt
Personal Protective Efsgfgsffquipment.pptPersonal Protective Efsgfgsffquipment.ppt
Personal Protective Efsgfgsffquipment.ppt
ganjangbegu579
 
Transport modelling at SBB, presentation at EPFL in 2025
Transport modelling at SBB, presentation at EPFL in 2025Transport modelling at SBB, presentation at EPFL in 2025
Transport modelling at SBB, presentation at EPFL in 2025
Antonin Danalet
 
Optimizing Reinforced Concrete Cantilever Retaining Walls Using Gases Brownia...
Optimizing Reinforced Concrete Cantilever Retaining Walls Using Gases Brownia...Optimizing Reinforced Concrete Cantilever Retaining Walls Using Gases Brownia...
Optimizing Reinforced Concrete Cantilever Retaining Walls Using Gases Brownia...
Journal of Soft Computing in Civil Engineering
 
sss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptx
sss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptx
sss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptx
ajayrm685
 
How to Build a Desktop Weather Station Using ESP32 and E-ink Display
How to Build a Desktop Weather Station Using ESP32 and E-ink DisplayHow to Build a Desktop Weather Station Using ESP32 and E-ink Display
How to Build a Desktop Weather Station Using ESP32 and E-ink Display
CircuitDigest
 
twin tower attack 2001 new york city
twin  tower  attack  2001 new  york citytwin  tower  attack  2001 new  york city
twin tower attack 2001 new york city
harishreemavs
 
Slide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptxSlide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptx
vvsasane
 
AI Chatbots & Software Development Teams
AI Chatbots & Software Development TeamsAI Chatbots & Software Development Teams
AI Chatbots & Software Development Teams
Joe Krall
 
Machine foundation notes for civil engineering students
Machine foundation notes for civil engineering studentsMachine foundation notes for civil engineering students
Machine foundation notes for civil engineering students
DYPCET
 
Design of Variable Depth Single-Span Post.pdf
Design of Variable Depth Single-Span Post.pdfDesign of Variable Depth Single-Span Post.pdf
Design of Variable Depth Single-Span Post.pdf
Kamel Farid
 
2.3 Genetically Modified Organisms (1).ppt
2.3 Genetically Modified Organisms (1).ppt2.3 Genetically Modified Organisms (1).ppt
2.3 Genetically Modified Organisms (1).ppt
rakshaiya16
 
Frontend Architecture Diagram/Guide For Frontend Engineers
Frontend Architecture Diagram/Guide For Frontend EngineersFrontend Architecture Diagram/Guide For Frontend Engineers
Frontend Architecture Diagram/Guide For Frontend Engineers
Michael Hertzberg
 
acid base ppt and their specific application in food
acid base ppt and their specific application in foodacid base ppt and their specific application in food
acid base ppt and their specific application in food
Fatehatun Noor
 
Automatic Quality Assessment for Speech and Beyond
Automatic Quality Assessment for Speech and BeyondAutomatic Quality Assessment for Speech and Beyond
Automatic Quality Assessment for Speech and Beyond
NU_I_TODALAB
 
01.คุณลักษณะเฉพาะของอุปกรณ์_pagenumber.pdf
01.คุณลักษณะเฉพาะของอุปกรณ์_pagenumber.pdf01.คุณลักษณะเฉพาะของอุปกรณ์_pagenumber.pdf
01.คุณลักษณะเฉพาะของอุปกรณ์_pagenumber.pdf
PawachMetharattanara
 
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
ijdmsjournal
 
Agents chapter of Artificial intelligence
Agents chapter of Artificial intelligenceAgents chapter of Artificial intelligence
Agents chapter of Artificial intelligence
DebdeepMukherjee9
 
Personal Protective Efsgfgsffquipment.ppt
Personal Protective Efsgfgsffquipment.pptPersonal Protective Efsgfgsffquipment.ppt
Personal Protective Efsgfgsffquipment.ppt
ganjangbegu579
 
Transport modelling at SBB, presentation at EPFL in 2025
Transport modelling at SBB, presentation at EPFL in 2025Transport modelling at SBB, presentation at EPFL in 2025
Transport modelling at SBB, presentation at EPFL in 2025
Antonin Danalet
 
sss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptx
sss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptx
sss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptx
ajayrm685
 
How to Build a Desktop Weather Station Using ESP32 and E-ink Display
How to Build a Desktop Weather Station Using ESP32 and E-ink DisplayHow to Build a Desktop Weather Station Using ESP32 and E-ink Display
How to Build a Desktop Weather Station Using ESP32 and E-ink Display
CircuitDigest
 

disadvantages.
On the one hand, short text classification has important commercial value and application background in question-answering communities, advertising classification and so on. On the other hand, short texts are also an important carrier of harmful network information such as bad public opinion and spam. Therefore, short text classification, short text clustering, and topic detection and tracking of short texts have become a focus in the fields of data mining and information security [1-3].

A short text generally contains only about 100 words: the text is short, the noise is considerable, and the useful information it carries is very limited, so little information can be extracted [4, 5]. This places higher requirements on short text classification research. Although the text classification field has made great achievements, existing methods are difficult to apply directly to short text. Short text classification faces two major problems: high vector dimensionality and sparse features [6-9], which result in an unbalanced categorical distribution of the short text corpus.

So far, there are two ways to improve the performance of short text classification. One is to use an external data source or knowledge base (such as WordNet, Wikipedia, etc.) to expand the features of the short text [10-12]. This approach is simple and intuitive, can alleviate the sparse-feature problem, and achieves good results, but its classification accuracy still needs improvement. The other approach [13-17] is to improve classification precision and recall by improving the performance of the classifier itself; however, very few articles address the problem from the point of view of short text feature weight calculation.
Sriram et al. [18] improved the classification precision of short text by combining an improved short text feature weight formula with a feature extension algorithm based on the collinear relationship of features. Other articles [19-21] use latent semantic analysis to reduce the dimensionality and noise of short text, and then apply independent component analysis to extract the most expressive features. These methods are combined with other techniques to improve the effectiveness of short text classification, so they cannot demonstrate the effectiveness of the feature extraction method alone. In this paper, we instead take into account the distribution of samples in the positive and negative categories, as well as the category correlation of words, in order to improve the existing feature weight calculation method and obtain a new comprehensive feature weight.

2. Feature Weight of Text
In order to convert text into a form that a computer can process, each text is represented as a vector to be analyzed and calculated; vectorization is the basis of text processing. Each word in the text is given a weight that reflects its importance to the characterization of the text: the greater the weight, the more important the word is to the text. In the information retrieval field, term weight calculation methods fall into two categories: unsupervised methods such as tf (term frequency) and tf-idf (term frequency-inverse document frequency), and supervised methods such as information gain and odds ratio. Term weight calculation mainly draws on feature selection algorithms: terms are given different values to measure their contribution to text classification. However, these term weight calculation methods are designed mainly for long text and cannot be applied directly to short text.
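As a concrete reference point for the unsupervised weights mentioned above, here is a minimal tf-idf sketch. The function name and toy corpus are ours, introduced only for illustration; the paper does not prescribe this implementation.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    tf(t, d) = raw count of term t in document d
    idf(t)   = log(N / df(t)), where df(t) is the number of
               documents containing t and N is the corpus size.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [["short", "text", "weight"],
        ["short", "text", "classification"],
        ["feature", "weight"]]
w = tf_idf(docs)
# "short" appears in 2 of the 3 documents, so its idf is log(3/2);
# "feature" appears in only 1, so its idf is log(3).
```

Note that tf-idf uses no category information at all, which is exactly the limitation the supervised and comprehensive weights discussed in this paper try to address.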
This is due to a prominent feature of short text: the sample distribution is very uneven, that is, the number of samples in some categories of the dataset is much larger than in others, so that small-category texts are submerged in a large number of documents of other types and are difficult to identify. Yet in the massive text data to be processed, the data the system really cares about is often only a small part; for example, in network public opinion analysis and in hot or sensitive topic detection and tracking, the valuable data form only a small proportion of the practical environment. We refer to the category with few samples as the positive category and the category with more samples as the negative category. The existing feature weight calculation methods, however, treat all categories in the same way; when the traditional long text feature weight calculation is applied to short text, the classification result is biased toward the negative category while the positive category is ignored. This problem is particularly evident in short text classification, leading to low precision and recall that cannot meet the needs of practical application.

3. Comprehensive Category Feature Selection Method
In this paper, we regard the current category of the short text dataset as the PC (Positive Category); every category other than the current one is regarded as the NC (Negative Category). The related element information is shown in Table 1.

Table 1. Data Element Table
Feature      Number of samples in PC   Number of samples in NC
s present    sP                        sN
s absent     s̄P                        s̄N

Here s and s̄ denote the cases where feature term s appears and does not appear in a sample, respectively. sP is the number of samples in PC that contain feature s; s̄P is the number of samples in PC that do not contain s; sN is the number of samples in NC that contain s; s̄N is the number of samples in NC that do not contain s.

Based on this representation, this paper makes three important observations:
1) For a given term si, if it occurs frequently in the text, that is, when its document frequency is high, it has strong expressive ability;
2) When the frequency of si in the positive category is higher than that in the negative category, si has good discriminating ability; this is captured by the inverse category frequency;
3) When the ratio between the number of negative-category samples not containing si and the number of positive-category samples containing si is high, the relevance frequency between si and the text category is high.

According to the above analysis, this paper proposes a feature selection method for short text. Since the method takes into account the sample distribution in both the positive and negative categories, it is named IC (Integrate Category). The calculation formula is:

IC(s, Cj) = df · rf · icf = log(sP + 1) · (s̄N / (sP + 1)) · log(|C| / (cf + 1))    (1)

where df = log(sP + 1) is the document frequency term for si in the positive category; rf = s̄N / (sP + 1) is the relevance frequency, which is proportional to the number of negative-category samples that do not contain s; the bigger the rf value, the stronger the correlation between term s and the category. icf = log(|C| / (cf + 1)) is the inverse category frequency, where |C| is the total number of categories and cf is the number of categories that contain term si.

For a given number of positive samples, the sample distribution of a dataset can be roughly divided into three conditions: 1) the number of negative samples is larger than the positive and is fairly evenly distributed across categories; 2) the number of negative samples is smaller than the positive and is fairly evenly distributed; 3) the number of negative samples is smaller than the positive and the distribution is unbalanced, that is, concentrated in a small number of categories. To express these relations more visually, we simulate the distribution of positive and negative samples in a short text dataset of 20 categories and 200 short texts using three simple example terms; see Table 2.

Table 2. Comparison of the Relationship between PC and NC
Term   sP (PC C1)   NC C2   NC C3   |C|   cf   df·icf   IC
s1     20           100     0       20    3    8.538    39.294
s2     20           4       4       20    4    7.636    68.478
s3     20           80      80      20    4    7.636    15.526

As can be seen from Table 2, the distributions of the samples containing s2 and those containing s3 over the positive and negative categories are very different: samples containing s3 lie mostly in the negative category, while the proportion of samples containing s2 in the negative category is small, yet the df·icf values of the two are the same. After the relevance frequency of s between the positive and negative categories is taken into consideration, the value of s2 increases significantly while the increase for s3 is smaller, so the gap between s2 and s3 widens. The frequency of s1 in the positive category is similar to that of s2 and s3, but its distribution range in the negative category is small (it is concentrated in one category), so the df·icf value of s1 is slightly larger than that of s3, and its final value lies between those of s2 and s3. This is consistent with the previous analysis.
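The IC weight of Equation (1) can be sketched as a small function. The function name and the sample counts below are ours; the counts echo the spirit of Table 2 but, since the paper does not state the logarithm base or the full sample totals, the computed numbers are illustrative and will not match the published figures exactly.

```python
import math

def ic_weight(sP, sN_bar, cf, n_categories):
    """Comprehensive feature weight IC(s, Cj) = df * rf * icf (Eq. 1).

    sP           -- number of positive-category samples containing s
    sN_bar       -- number of negative-category samples NOT containing s
    cf           -- number of categories containing s
    n_categories -- total number of categories |C|
    """
    df = math.log(sP + 1)                    # document frequency term
    rf = sN_bar / (sP + 1)                   # relevance frequency term
    icf = math.log(n_categories / (cf + 1))  # inverse category frequency
    return df * rf * icf

# Assume 160 negative samples in total. A term that appears in only 8 of
# them (s̄N = 152) scores higher than one appearing in all of them
# (s̄N = 0), even when sP and cf are identical:
ic_rare_in_nc = ic_weight(sP=20, sN_bar=152, cf=4, n_categories=20)
ic_everywhere = ic_weight(sP=20, sN_bar=0, cf=4, n_categories=20)
```

This mirrors the s2-versus-s3 comparison in Table 2: the rf factor rewards terms that are concentrated in the positive category and rare in the negative one.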
The global feature weight of feature s for the entire corpus is shown in formula (2):

ICmax(s) = max{1 ≤ j ≤ m} IC(s, Cj)    (2)

The advantage of this method is that it takes into account both the distribution of terms within individual samples and the category information of the text. Evaluating the relevance of terms in the positive and negative categories makes the feature words more discriminative on imbalanced datasets, and selecting the maximum value of the global feature yields the features with the highest degree of category discrimination. The algorithm is described as follows:
1) Pre-process the documents in the training corpus: word segmentation, stop-word removal;
2) Calculate IC(s, Cj) for each feature term and category;
3) Calculate ICmax(s) over all categories based on the results of step 2;
4) Sort the IC values in descending order and keep the first M terms as feature words, where M is the dimension of the feature space. In this paper, M = 820.

4. Results and Analysis
To verify the performance of the short text feature weight algorithm, the K-Nearest Neighbor (KNN) classification algorithm is adopted to classify the short texts. KNN is a traditional pattern recognition algorithm: it is simple and intuitive, its classification accuracy is high, and new training texts do not need retraining, which reduces training time, so it is widely used in automatic text classification. K is the number of nearest neighbor samples considered for a class. We test odd values of K in the range [3, 35] and report the results for the optimal K.

4.1. Experimental Design
A multi-class classification problem can be decomposed into binary classification problems, so the focus of this paper is improving classification performance for the positive and negative categories. At present, there is no general-purpose short text dataset, so this paper uses the Sina micro-blog API to crawl micro-blog data as the text corpus. We select 8478 micro-blogs, each with more than 6 words and an average text length of 42. We take about 100 micro-blogs about the 2015 Chinese Super League as the positive category and 4236 micro-blogs about traffic jams as the negative category.

Since the samples of the dataset are very unevenly distributed, we use 5-fold cross-validation to compare classification results: the dataset is split into 5 folds; we train on 4 folds and evaluate on the fifth, rotating through each fold as the validation fold, and finally average the performance. This avoids contingency in the experimental results and ensures the training and testing data have no intersection, so the results obtained can be considered credible.

4.2. Evaluation Method
Three classification performance evaluation indexes are commonly used: precision, recall, and an average test value. For short text classification with unevenly distributed samples, precision and recall alone ignore the influence of the small category PC. Micro-average and macro-average are two methods for global evaluation of classification results: the micro-average is the arithmetic average of the performance indicator over all instance documents, while the macro-average first computes the classification result of each category and then averages over all categories. The specific definitions are as follows:

MicroA = 2 · MicroP · MicroR / (MicroP + MicroR)    (3)

MacroA = 2 · MacroP · MacroR / (MacroP + MacroR)    (4)

where MicroP and MicroR are the precision and recall of the micro-average, and MacroP and MacroR are the precision and recall of the macro-average. Before calculating the evaluation indexes, we first introduce the related data elements, shown in Table 3.

Table 3. Corresponding Table of Data Elements
Retrieved   Relevant   Irrelevant
Yes         RY         NR
No          RN         NN

Suppose the total number of categories is m; then:

MicroP = Σl RYl / Σl (RYl + NRl)    (5)

MicroR = Σl RYl / Σl (RYl + RNl)    (6)

MacroP = (1/m) Σl Pl    (7)

MacroR = (1/m) Σl Rl    (8)

where Pl = RYl / (RYl + NRl) and Rl = RYl / (RYl + RNl).

4.3. Experimental Results and Analysis
In practical application, the proportion of positive-category text is very small. To compare the effect of the negative-category data size on different feature weight methods, the initial positive and negative micro-blog sets contain 100 texts each, and the number of negative texts is gradually increased until the total reaches 4300. Under the KNN classifier, the MicroA and MacroA values of 6 different feature weight calculation methods are shown in Figure 1 and Figure 2, respectively. As can be seen, with the increase of data size, the performance of all the feature weight methods rises. The reason may be that, for a fixed number of positive samples, when the data size is small the texts of each category are not sufficient to characterize that category, resulting in lower classification accuracy; this situation improves as the data size increases.
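The micro- and macro-averaged measures of Section 4.2 can be sketched as a minimal implementation; the function name and the per-category counts are ours, and the equations follow the standard micro/macro definitions reconstructed in (3)-(8).

```python
def micro_macro(counts):
    """Micro/macro precision, recall and their averaged F-values
    (Eqs. 3-8). `counts` is a list of (RY, NR, RN) triples per
    category: retrieved-relevant, retrieved-irrelevant, missed-relevant.
    """
    tot_ry = sum(ry for ry, nr, rn in counts)
    tot_retrieved = sum(ry + nr for ry, nr, rn in counts)
    tot_relevant = sum(ry + rn for ry, nr, rn in counts)
    micro_p = tot_ry / tot_retrieved          # Eq. (5)
    micro_r = tot_ry / tot_relevant           # Eq. (6)
    per_p = [ry / (ry + nr) for ry, nr, rn in counts]
    per_r = [ry / (ry + rn) for ry, nr, rn in counts]
    macro_p = sum(per_p) / len(counts)        # Eq. (7)
    macro_r = sum(per_r) / len(counts)        # Eq. (8)
    micro_a = 2 * micro_p * micro_r / (micro_p + micro_r)   # Eq. (3)
    macro_a = 2 * macro_p * macro_r / (macro_p + macro_r)   # Eq. (4)
    return micro_a, macro_a

# A large category classified well (90/100 correct) and a small one
# classified poorly (5/10 correct): the two averages diverge because
# the macro-average gives the small category equal weight.
micro_a, macro_a = micro_macro([(90, 10, 10), (5, 5, 5)])
```

The divergence between the two averages is exactly why both are reported for imbalanced short text corpora.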
The results in Figure 1 and Figure 2 also show that, since MacroA is the average over each category and is greatly influenced by small categories, the MacroA values of the different feature weight calculation methods are generally greater than the MicroA values.

Figure 1. MicroA value of different feature weight calculation methods
Figure 2. MacroA value of different feature weight calculation methods

Figure 3 shows the average of the highest classification result values over 5 repeated trials for the 6 feature weight calculation methods. From it we can see that supervised term weight calculation is not always better than unsupervised calculation: tf and some other methods perform less well, and even the weighting method most popular in the field of text classification is not outstanding in this experiment. The proposed method, however, measures the relationship between terms and categories in terms of whether a term appears in the positive and negative categories, and can maintain good classification performance at different data scales. This is because in traditional text classification the training sets generally have the following characteristics: the category distribution is balanced, each document of a category represents that category well, and the documents of a category are concentrated in feature space. In practical applications, real text corpora and usage environments often do not satisfy these characteristics; therefore, the existing feature weight calculation methods cannot be effective on short text.
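The experimental protocol of Section 4.1 (KNN with 5-fold cross-validation) can be sketched as follows. The helper names and the linearly separable toy data are ours, used only to illustrate the protocol, not the micro-blog corpus or the paper's exact classifier configuration.

```python
import random
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest
    neighbours (squared Euclidean distance over feature vectors)."""
    nearest = sorted(train, key=lambda xy: sum((a - b) ** 2
                                               for a, b in zip(xy[0], query)))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

def five_fold_accuracy(data, k=3, seed=0):
    """5-fold cross-validation: train on 4 folds, evaluate on the 5th,
    rotate through every fold, and average the accuracy."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::5] for i in range(5)]
    accs = []
    for i in range(5):
        test = folds[i]
        train = [xy for j, f in enumerate(folds) if j != i for xy in f]
        hits = sum(knn_predict(train, x, k) == y for x, y in test)
        accs.append(hits / len(test))
    return sum(accs) / len(accs)

# Illustrative, well-separated toy data standing in for PC/NC texts:
pos = [((1.0 + 0.1 * i, 1.0), "PC") for i in range(10)]
neg = [((0.0 - 0.1 * i, 0.0), "NC") for i in range(10)]
acc = five_fold_accuracy(pos + neg, k=3)
```

Rotating the validation fold, as in the paper, guarantees that the training and testing data never intersect.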
Figure 3. Comparison of the MicroA and MacroA values

5. Conclusion
In current short text classification research, feature weight calculation still follows the traditional long text methods, but the distribution of short text categories is often unbalanced, so the traditional methods cannot obtain good classification results. Aiming at this problem, this paper improves the existing feature weight calculation method by taking the category correlation of terms into consideration when calculating the feature weight, and tests the performance of the method in a real corpus environment. Experimental results show that the method can improve the classification effect to a certain extent, although the classification performance of short text is still not improved greatly. The reason is that the real texts collected from network data include a large number of internet expressions and non-standard phrasings, which the system cannot accurately identify. The next step will be to introduce semantic analysis methods to reduce the effect of word variety on the classification system.

ACKNOWLEDGEMENTS
This work was supported by National Natural Science Foundation of China under grant.

References
[1] Berry MW. Survey of text mining. Computing Reviews. 2004; 45(9): 548.
[2] Aggarwal CC, Zhai CX. A survey of text classification algorithms. In: Mining Text Data. 2012: 163-222.
[3] Rao Y, Xie H, Li J, et al. Social emotion classification of short text via topic-level maximum entropy model. Information & Management. 2016.
[4] Sriram B, Fuhry D, Demir E, et al. Short text classification in twitter to improve information filtering. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2010: 841-842.
[5] Wang F, Wang Z, Li Z, et al. Concept-based short text classification and ranking. In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. 2014: 1069-1078.
[6] Forsyth RS, Holmes DI. Feature-finding for text classification. Literary and Linguistic Computing. 1996; 11(4): 163-174.
[7] Forman G. An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research. 2003; 3: 1289-1305.
[8] Qi X, Davison BD. Web page classification: Features and algorithms. ACM Computing Surveys (CSUR). 2009; 41(2): 12.
[9] Wang P, Xu B, Xu J, et al. Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification. Neurocomputing. 2016; 174: 806-814.
[10] Rogati M, Yang Y. High-performing feature selection for text classification. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management. 2002: 659-661.
[11] Sriram B, Fuhry D, Demir E, et al. Short text classification in twitter to improve information filtering. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2010: 841-842.
[12] Chen M, Jin X, Shen D. Short text classification improved by learning multi-granularity topics. In: IJCAI. 2011: 1776-1781.
[13] Bobicev V, Sokolova M. An effective and robust method for short text classification. In: AAAI. 2008: 1444-1445.
[14] Sahami M, Heilman TD. A web-based kernel function for measuring the similarity of short text snippets. In: Proceedings of the 15th International Conference on World Wide Web. 2006: 377-386.
[15] Cavnar WB, Trenkle JM. N-gram-based text categorization. Ann Arbor MI. 1994; 48113(2): 161-175.
[16] Li-guo D, Peng D, Ai-ping L. A new Naive Bayes text classification algorithm. Indonesian Journal of Electrical Engineering and Computer Science. 2014; 12(2): 947-952.
[17] Zhang PY. A HowNet-based semantic relatedness kernel for text classification. Indonesian Journal of Electrical Engineering and Computer Science. 2013; 11(4): 1909-1915.
[18] Sun A. Short text classification using very few words. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2012: 1145-1146.
[19] Mihalcea R, Corley C, Strapparava C. Corpus-based and knowledge-based measures of text semantic similarity. In: AAAI. 2006; 6: 775-780.
[20] Phan XH, Nguyen LM, Horiguchi S. Learning to classify short and sparse text and web with hidden topics from large-scale data collections. In: Proceedings of the 17th International Conference on World Wide Web. 2008: 91-100.
[21] Zhou Q, Zhao M, Hu M. Study on feature selection in Chinese text categorization. Journal of Chinese Information Processing. 2004; 18(3): 17-23.