Ensemble Algorithms for Feature Selection
Learning ensembles have achieved much success in machine learning by combining multiple weak learners to improve generalisation ability. This technique is shown to work well with base learners that tend to make their errors in different parts of the data. Because decision trees are unstable, they are well suited to this purpose and are the chosen learner here. Random input selection is used to promote further diversity by training each learner on a random subset of the features. Because this technique explores many feature subsets, it lends itself well to feature selection. For ensemble feature selection, the features must be selected following a slightly modified criterion: they should be predictive of the target and preferably uncorrelated with each other, but should also aid the diversity of the ensemble. The approach taken here is to maintain diversity by keeping all of the features available and updating the feature sampling distribution after the construction of each learner. A new method is introduced to evaluate feature importance as a weighted average of the information gain values achieved during the construction of the trees. The weighting uses a node complexity measure, based on ideas from information theory, which reflects the reliability of the information gain achieved in terms of where in the tree the node was split. This measure is shown to produce a more accurate estimate of the feature importance. Care must be taken to control the rate at which the feature sampling distribution is altered, to prevent some of the features from being overweighted early on. The method developed here addresses this problem by constructing confidence intervals over the estimates of feature importance. Initially a uniform sampling distribution is chosen, and the distribution is then kept as uniform as possible whilst remaining inside the confidence intervals.
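The final step above, keeping the sampling distribution as uniform as possible inside the confidence intervals, can be illustrated with a minimal sketch. It is not the procedure used in this work, and it rests on several assumptions: that each confidence interval has already been expressed as a lower and upper bound on a feature's sampling probability, that "as uniform as possible" means closest to uniform in Euclidean distance, and that the function name `most_uniform_within` and its parameters are hypothetical. Under those assumptions the solution clips a single common level into every interval, and the level can be found by bisection.

```python
def most_uniform_within(lower, upper, iterations=200):
    """Return probabilities p with lower[i] <= p[i] <= upper[i] and sum(p) ~= 1
    that are as close to uniform as the intervals allow.

    Assumption (not from the source): the intervals are bounds on sampling
    probabilities, and closeness to uniform is measured by Euclidean distance.
    """
    if len(lower) == 0 or len(lower) != len(upper):
        raise ValueError("lower and upper must be non-empty and of equal length")
    if any(lo > hi for lo, hi in zip(lower, upper)):
        raise ValueError("each lower bound must not exceed its upper bound")
    eps = 1e-9
    if sum(lower) > 1 + eps or sum(upper) < 1 - eps:
        raise ValueError("no probability distribution fits inside the intervals")

    def clipped(level):
        # Every feature gets the same level unless its interval forbids it.
        return [min(max(level, lo), hi) for lo, hi in zip(lower, upper)]

    # The clipped total is non-decreasing in the level, so bisection finds
    # a level at which the probabilities sum to one.
    low_level, high_level = min(lower), max(upper)
    for _ in range(iterations):
        mid = (low_level + high_level) / 2
        if sum(clipped(mid)) < 1:
            low_level = mid
        else:
            high_level = mid
    return clipped((low_level + high_level) / 2)
```

With wide intervals the result is exactly uniform; as an interval tightens around a high or low importance estimate, only that feature's probability is pulled away from the common level, which limits early overweighting.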
A two-stage method is also introduced, whereby the feature importance is evaluated as above, but on a single CART tree in which every feature is considered at each split. This provides information gain values for each feature at every split in the tree, which can be used to fix the feature sampling distribution prior to ensemble construction. This scheme can be considered a hybrid method, as the ensemble is composed of CART-based trees. A single CART tree does not incur the computational expense of the wrapper approach but, unlike a filter, does carry the bias of the learning algorithm. This technique is fast to execute and is shown to achieve good results.
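As a rough illustration of the two-stage idea, the sketch below turns per-node information gains into a fixed feature sampling distribution and then draws a feature subset from it. It is a sketch under stated assumptions rather than the method itself: the node weights are supplied by the caller as a stand-in for the node complexity measure, which is not specified here, and the function names `weighted_importance`, `to_distribution` and `sample_features` are hypothetical.

```python
import random


def weighted_importance(node_gains, node_weights):
    """Weighted average of information gain per feature.

    node_gains[n][j] is the gain of feature j at node n (every feature is
    evaluated at every split). node_weights[n] is a placeholder for the
    node complexity weighting, which this sketch does not define.
    """
    total_weight = sum(node_weights)
    n_features = len(node_gains[0])
    return [
        sum(w * gains[j] for gains, w in zip(node_gains, node_weights)) / total_weight
        for j in range(n_features)
    ]


def to_distribution(importance):
    """Normalise importances to probabilities; fall back to uniform if all are zero."""
    total = sum(importance)
    if total <= 0:
        return [1.0 / len(importance)] * len(importance)
    return [value / total for value in importance]


def sample_features(distribution, k, rng):
    """Draw k distinct feature indices, each draw weighted by the distribution.

    Assumes at least k features have non-zero probability.
    """
    remaining = list(range(len(distribution)))
    chosen = []
    for _ in range(k):
        weights = [distribution[j] for j in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


# Illustrative values only: two nodes, three features.
gains = [[0.4, 0.1, 0.0], [0.2, 0.2, 0.0]]
weights = [1.0, 0.5]
distribution = to_distribution(weighted_importance(gains, weights))
subset = sample_features(distribution, k=2, rng=random.Random(0))
```

Because the distribution is computed once from the single tree, each ensemble member only has to call `sample_features`, which is the source of the speed advantage over updating the distribution after every learner.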