An empirical evaluation of text classification and feature selection methods

Muazzam Ahmed Siddiqui

Abstract


An extensive empirical evaluation of classifiers and feature selection methods for text categorization is presented. More than 500 models were trained and tested using different combinations of corpora, term weighting schemes, number of features, feature selection methods and classifiers. The performance measures used were micro-averaged F measure and classifier training time. The experiments used five benchmark corpora, three term weighting schemes, three feature selection methods and four classifiers. Results indicated only slight performance improvement with all the features over only 20% features selected using Information Gain and Chi Square. More importantly, this performance improvement was not deemed statistically significant. Support Vector Machine with linear kernel reigned supreme for text categorization tasks producing highest F measures and low training times even in the presence of high class skew. We found statistically significant difference between the performance of Support Vector Machine and other classifiers on text categorization problems.

Full Text:

PDF


DOI: https://doi.org/10.5430/air.v5n2p70

Refbacks

  • There are currently no refbacks.


Artificial Intelligence Research

ISSN 1927-6974 (Print)   ISSN 1927-6982 (Online)

Copyright © Sciedu Press 
To make sure that you can receive messages from us, please add the 'Sciedupress.com' domain to your e-mail 'safe list'. If you do not receive e-mail in your 'inbox', check your 'bulk mail' or 'junk mail' folders.