When we want to fit a Machine Learning (ML) model to a big dataset, it is often recommended to pre-process the input data to obtain better results carefully. Although the fact that more data leads to better outcomes has a wide acceptance, this is not necessarily true when referred to the number of variables in our data. Some variables may be noisy, redundant, and not useful.
The process of removing useless variables is known as Feature Selection (FS) and is a ubiquitous task that precedes the application of any data analysis technique. More formally, it is the process of selecting a subset of the features of a dataset that is optimal according to some criterion, to
There are roughly three kinds of FS methods:
Focusing on the wrapper methods, we can imagine candidate solutions as vectors like (0,0,1,1,0) meaning that only features 3 and 4 of our data have been considered relevant and the rest of columns are discarded. When the number of variables (i.e., the length of the binary vector, N) is high, the number of possible combinations is 2^N which makes an exhaustive search computationally unfeasible. Metaheuristics constitute a broad class of search algorithms that can deal with substantial search spaces in optimization problems. They explore the solution space to find reasonable solutions within a reasonable time. Although finding the global optimum is not guaranteed, they work well in practice.
A favorite class of metaheuristics that are famous for their excellent performance while being easy to implement are population-based metaheuristics. They build a group of candidate solutions (called chromosomes, i.e., binary vectors like in the previous example) that evolve together as in natural evolution (as proposed by Charles Darwin’s The origin of species in 1859). The most well-known subclass of population-based metaheuristics is Genetic Algorithms, developed in 1975.
At the initial generation, a set of M (typically between 50 and 200) candidate solutions (in our case, M binary vectors of length N) get generated randomly (the generation mechanism may vary). From them, and using some of the strategies we will comment later, M solutions are selected (with replacement, hence some may appear more than once) to form the parent’s population. They get mated in random pairs, and each pair will cross with a particular probability Pc (which usually is quite high, between 0.7 and 0.9) to generate two offspring (when the parents do not intersect, the offspring is assumed to be the parents themselves). Then, each of the M resulting offspring may suffer a spontaneous mutation with a probability Pm (which is usually low, between 0.1 and 0.3). The resulting individuals then get evaluated according to a specific fitness function, which in our Feature Selection problem is the performance of a k-NN classifier on the input data with the selected feature subset, as will be explained in the next section. Finally, the new population replaces the old one completely (this is the case for a Generational GA), and the loop repeats. The stopping criterion consists in calling the fitness function a maximum number of times. Most often, the algorithm incorporates elitism, meaning that the best solution from the old population is always kept and replaces the worst solution from the newly generated population.
Many components are problem-specific and have to be carefully thought out for each problem, such as the crossover and mutation operators to ensure they do not produce unfeasible solutions in our problem domain. Moreover, the mechanism to assess the individuals is also problem-specific. In our case, to measure how good a subset of a feature is, we employ a classifier trained and evaluated only with the subset of elements selected according to the solution assessed at the time (i.e., those features having a 1 in the solution). Note the classifier will be trained many times, one per chromosome being evaluated, so ideally it should be a fast classifier just able to estimate the fitness of candidate solutions. K-fold cross-validation has to be used here to have a reasonable estimation of the performance of the classifier on a solution, and one must use the exact same partitions when evaluating any solution. Therefore the dataset must be partitioned just once to determine the folds at the beginning of the algorithm.
CHC is a particular type of Genetic Algorithm introduced in a paper first presented at the First Workshop on Foundations of Genetic Algorithms in 1990 (see [1]). It is a binary-coding genetic algorithm with a high selective pressure and elitism selection strategy, with additional components that introduce diversity. The inventor of CHC, Larry Eshelman, defines it as a Genetic Algorithm with some essential modifications:
A proposal for the fitness function classifier is to use the performance of a k-NN classifier on the dataset with the selected features and Leave-One-Out (LVO) to classify each example. Since it is a lazy learner, training does not actually occur – the prediction phase is the costly step. Furthermore, since the datasets given as input may be imbalanced, it is advisable to use the Area Under the ROC curve (AUC) instead of accuracy. Finally, because the aim is to select features, the fitness function should be a weighted sum of the AUC and the feature reduction rate:
Fitness = 0.5 x AUC + 0.5 x Reduction rate = 0.5 x AUC + 0.5 x (1 – (#Fselected/#Ftotal))
The approach discussed here has exhibited good results in research studies and is not usually readily available in most statistical packages – so why not go and try implementing it yourself? As always, the importance resides in what you’ll learn during the process, not the results.
[1] The CHC Adaptive Search Algorithm: How to Have Safe Search When Engaging in Nontraditional Genetic Recombination. In: G. J. E. Rawlins (Ed.): Foundations of Genetic Algorithms, vol I, pp. 265 – 283. California, Morgan Kauffman Publishers Inc. (1991)
_______________________________________________________________________
This post is a roundup of articles originally published on the Stratio blog. You can read the originals here and here.
If you’re eager for more information about the Data Science Bootcamp, get your own copy of our information booklet here. If you’re all set on joining our next edition, get started on your application.