Feature selection: a key technique for data mining

Clip production: Mirjam Paales, 30.10.2013. Category: Computer Science

Jean-Charles Lamirel

Since the 1990s, advances in computing power and storage capacity have
made it possible to manipulate very large data sets. Whether in
bioinformatics or in text mining, it is not uncommon for core mining
algorithms such as classifiers to work with a data description space of
several thousand or even tens of thousands of features. One might think
that such algorithms should perform better when a large number of
features is available. However, the first problem that arises is the
increase in computation time. Moreover, a significant share of the
features is typically redundant or irrelevant to the classification
task, which significantly perturbs the operation of these algorithms.
Just as selecting relevant information is crucial in human learning,
integrating a feature selection process is a main concern in the
classification of high-dimensional data.

To present the existing methods and their limitations in a highly
multidimensional context, the course will use a central example drawn
from a complex, “real-life” text mining task.

The first part of the presentation will introduce the main principle of
classification and provide some examples of classifiers and of their
application in the data mining domain. It will also illustrate the
effect of high-dimensional data on classifier results. Further common
problems related to the handling of rare or imbalanced data and of
highly similar classes will also be discussed in this part.
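
The imbalanced-data problem mentioned above can be illustrated with a
minimal sketch. The label names and the 95/5 split are hypothetical and
are not taken from the course material; the point is only that a
classifier which always predicts the majority class can reach a high
overall accuracy while never recognizing the rare class.

```python
from collections import Counter


def majority_classifier(train_labels):
    """Return a classifier that ignores its input and always predicts
    the most frequent training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: majority_label


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_by_class(y_true, y_pred):
    """Fraction of each class's examples that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical imbalanced label set: 95 "common" vs. 5 "rare" documents.
labels = ["common"] * 95 + ["rare"] * 5
predict = majority_classifier(labels)
predictions = [predict(None) for _ in labels]

overall = accuracy(labels, predictions)
per_class = recall_by_class(labels, predictions)
# Macro-averaged recall weights every class equally, whatever its size.
macro_recall = sum(per_class.values()) / len(per_class)
print(overall, per_class, macro_recall)
```

Reporting per-class or macro-averaged measures next to accuracy is one
simple way to make this kind of failure visible.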

The second part of the presentation will describe the principle of
feature selection and the main categories of feature selection methods.
The pros and cons of the usual methods will be discussed, and the effect
of applying them in combination with classifiers to highly
multidimensional data will be highlighted. The additional use of
resampling techniques for handling imbalanced data will also be
investigated in this part.
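
As an illustration of one usual category, a filter method scores each
feature independently of the classifier and keeps the best-ranked ones.
The sketch below assumes binary term-presence features, two classes, and
the chi-square statistic as the scoring criterion; the abstract does not
say which filters the course covers, and the toy terms and the helper
names `chi2_score` and `select_top_k` are hypothetical.

```python
from typing import Dict, List, Sequence


def chi2_score(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Chi-square statistic of a binary feature against a binary label,
    computed from the 2x2 contingency table."""
    pairs = list(zip(feature, labels))
    a = sum(1 for f, y in pairs if f == 1 and y == 1)
    b = sum(1 for f, y in pairs if f == 1 and y == 0)
    c = sum(1 for f, y in pairs if f == 0 and y == 1)
    d = sum(1 for f, y in pairs if f == 0 and y == 0)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A constant feature (or a single class) carries no information.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_top_k(columns: Dict[str, List[int]],
                 labels: Sequence[int], k: int) -> List[str]:
    """Keep the k best-scoring features; ties are broken by name."""
    scores = {name: chi2_score(col, labels) for name, col in columns.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]


# Toy document-term data: one column per term, one entry per document.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "gene": [1, 1, 1, 0, 0, 0],  # perfectly aligned with the class
    "cell": [1, 1, 0, 0, 0, 1],  # weakly related to the class
    "the":  [1, 1, 1, 1, 1, 1],  # constant, hence irrelevant
    "dna":  [1, 1, 1, 0, 0, 0],  # redundant copy of "gene"
}
selected = select_top_k(columns, labels, 2)
print(selected)
```

On this toy data the irrelevant term is discarded, but both copies of
the same signal are kept. Scoring features one at a time removes
irrelevant features without detecting redundant ones, which is a known
limitation of univariate filters.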

The last part of the presentation will focus on one of our recent lines
of research in the domain of feature selection. The principle of this
promising new approach, based on the original theory of feature
maximization and its associated metric, will be explained. The behavior
of the method will be compared with that of the usual methods in the
above-mentioned context. Finally, additional advantages will be
discussed: specific class labeling and graph visualization
capabilities, as well as intrinsic properties of the method such as
incrementality and non-parametric behavior.
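
The abstract does not define the metric. As a rough sketch only, the
code below assumes the feature F-measure formulation commonly given in
descriptions of feature maximization: for a class c and a feature f,
feature recall is the weight of f in c divided by the weight of f over
all classes, feature precision is the weight of f in c divided by the
total weight of all features in c, and the feature F-measure is their
harmonic mean. The function name, data layout and toy weights are
hypothetical, and the selection rule applied to these scores is not
shown.

```python
from collections import defaultdict


def feature_f_measures(documents, labels):
    """documents: list of {feature: weight} dicts, one per document.
    labels: the class of each document.
    Returns {(class, feature): (recall, precision, f_measure)}.
    Assumption: only strictly positive weights are taken into account."""
    # Sum of the weights of each feature within each class.
    class_feature = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(documents, labels):
        for feature, weight in doc.items():
            if weight > 0:
                class_feature[label][feature] += weight

    # Sum of the weights of each feature over all classes.
    feature_total = defaultdict(float)
    for per_feature in class_feature.values():
        for feature, weight in per_feature.items():
            feature_total[feature] += weight

    scores = {}
    for label, per_feature in class_feature.items():
        class_total = sum(per_feature.values())
        for feature, weight in per_feature.items():
            recall = weight / feature_total[feature]
            precision = weight / class_total
            f_measure = 2 * recall * precision / (recall + precision)
            scores[(label, feature)] = (recall, precision, f_measure)
    return scores


# Toy weighted documents (for example term frequencies) in two classes.
documents = [
    {"gene": 2, "cell": 1},
    {"gene": 1},
    {"cell": 1, "market": 3},
]
labels = ["A", "A", "B"]
scores = feature_f_measures(documents, labels)
for key in sorted(scores):
    print(key, scores[key])
```

Under these assumptions the scores need no tuning parameter and can be
recomputed from running sums as documents arrive, which is in line with
the non-parametric and incremental properties mentioned above.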