Feature Selection - The Multiverse Dilemma


Some domains are characterized by having very few records with a great many features. For example, in comparative genomics bioinformatics or multi-omics biomarker discovery, a typical data cohort contains several hundred patient records with several thousand different features.

As a rule of thumb, a meaningful algorithm, one that truly learns, requires that the number of features used to build the model be significantly lower than the number of records used for learning. The common guideline is 10 records for each feature, but we can go as low as 3 to 5. For a cohort of 300 patients, that means roughly 30 features at 10:1, or 60 to 100 at the looser ratios.


(Figure: significant learning need)


The implication is that we need to find a proper, good-enough combination of features out of a huge number of possibilities. With, say, 5,000 candidate features there are 2^5000 possible subsets, so obviously we can check only a tiny fraction of this vast multi-dimensional space of combinations. This is the challenge.

So how can we find the optimal combination of features that we will use afterward to train our “main” model?


The first step is to remove features that meet any of the following criteria:

1.1 Has many missing values

1.2 Has many outlier values

1.3 Is not informative enough

1.4 Is highly correlated with other features

If a specific feature has many missing values (Option 1.1), it is clearly not informative, and we are unlikely to gain insights from it. Sometimes a feature's values are almost identical across records (Option 1.3) or strongly correlated with another feature's (Option 1.4); in both cases the feature adds little information and will not be useful. Removing outlier-ridden features (Option 1.2) is intended to help the models gain insights more easily.

Performing these steps doesn't require us to know the anticipated records' outcomes (called the Ground Truth, or GT) and can be done as a standalone, separate process. We can say that these are filter methods using an unsupervised learning approach.
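As an illustration, here is a minimal sketch of this filtering step using pandas, assuming a purely numeric DataFrame X; the function name filter_features and all thresholds are hypothetical choices that should be tuned per dataset.

import numpy as np
import pandas as pd

def filter_features(X: pd.DataFrame,
                    max_missing=0.3,    # 1.1: drop if >30% of values are missing
                    max_outliers=0.05,  # 1.2: drop if >5% of values are extreme
                    min_variance=1e-4,  # 1.3: drop near-constant features
                    max_corr=0.95):     # 1.4: drop one of each highly correlated pair
    # 1.1 Features with too many missing values
    X = X.loc[:, X.isna().mean() <= max_missing]
    # 1.2 Features dominated by outliers (values beyond 3 standard deviations)
    z = (X - X.mean()) / X.std()
    X = X.loc[:, (z.abs() > 3).mean() <= max_outliers]
    # 1.3 Features whose values are almost identical (near-zero variance)
    X = X.loc[:, X.var() > min_variance]
    # 1.4 One feature out of each highly correlated pair
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > max_corr).any()]
    return X.drop(columns=to_drop)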

The second step is one of the following:

2.1 Run a model with a specific selection of features, get the results, and compare them to those of a second model run with a different selection of features.

2.2 Run a model with a specific selection of features, get the results, and check the relative significance/contribution (sometimes called weight) of each feature. Compare these to previous model runs with different selections of features.

Option 2.1 can be implemented using scikit-learn's SelectFromModel with an appropriate model and selection criterion.
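For instance, a minimal sketch, assuming a binary classification task; the RandomForest estimator and the "median" importance threshold are illustrative choices, not prescriptions:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Fit a model on all features and keep those whose importance exceeds
# the median importance; X, y are the training data and GT labels.
selector = SelectFromModel(
    estimator=RandomForestClassifier(n_estimators=200, random_state=0),
    threshold="median",
)
X_reduced = selector.fit_transform(X, y)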

Another possibility is scikit-learn's SequentialFeatureSelector. As the name implies, Sequential Feature Selection scans different models' outcomes, adding (or removing) one feature at a time, and selects the version whose results best match the GT.

This approach requires us to know the GT, and it is an iterative process that wraps the model, so we can say that these are wrapper methods using a supervised learning approach.
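A minimal sketch of such a wrapper, assuming a LogisticRegression base model; the target subset size, scoring metric, and cross-validation setup are illustrative:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Greedily add one feature at a time, keeping at each step the feature
# whose addition yields the best cross-validated agreement with the GT.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=20,
    direction="forward",
    scoring="roc_auc",
    cv=5,
)
X_selected = sfs.fit_transform(X, y)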

Option 2.2 can be implemented using scikit-learn's SelectKBest. We need to define the selection criterion, which can be classification, regression, or statistical in nature, but it must be univariate, i.e., return a score for each and every feature separately; this is a technical constraint of SelectKBest.

As can be seen, this approach requires knowing the GT and is an iterative process driven from within the model. These are embedded methods using a supervised learning approach.
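A minimal sketch with SelectKBest, assuming a classification task; the univariate criterion (here the ANOVA F-test) and the value of k are illustrative:

from sklearn.feature_selection import SelectKBest, f_classif

# f_classif scores each feature separately against the GT labels,
# satisfying the univariate constraint; keep the 50 top-scoring features.
skb = SelectKBest(score_func=f_classif, k=50)
X_best = skb.fit_transform(X, y)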

Obviously, Option 2.1 is a "direct to the target" approach (we want the best model results), but it requires a long run time, while Option 2.2 relies on indirect clues (feature importance) which, if right, will lead us faster to our desired outcome: better model results.

The aforementioned steps and options are iterative by nature and require a fine balance between the risk of missing relevant features (False Negatives) and adding non-relevant features (False Positives). In order to iterate and converge toward a preferable and informative subset of features, we need to define a policy (sketched after the list below). This policy will need to answer the following questions:

a. How to define a concrete results score?

b. How to define significant improvement?

c. How to decide when & if to stop?
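A hypothetical sketch of such a policy, assuming an evaluate(features) helper that returns a concrete cross-validated score for a candidate subset; the improvement threshold and patience values are illustrative:

def converge(candidates, evaluate, min_improvement=0.005, patience=3):
    best_score, best_features = float("-inf"), None
    stale = 0
    for features in candidates:
        score = evaluate(features)                 # (a) a concrete results score
        if score - best_score > min_improvement:   # (b) significant improvement?
            best_score, best_features = score, features
            stale = 0
        else:
            stale += 1
        if stale >= patience:                      # (c) stop after several stale rounds
            break
    return best_features, best_score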

Python offers us several tools such as GridSearchCV, RandomizedSearchCV, and Optuna, but there are other approaches that involve multi-dimensional search and parameter optimization using, for example, Stochastic Gradient Descent (SGD). For a deeper dive, see the relevant (technical) article [1].
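For example, a minimal sketch that treats the number of selected features as a hyper-parameter and tunes it with GridSearchCV inside a Pipeline; the candidate grid and scoring metric are illustrative assumptions:

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(max_iter=1000)),
])
# Cross-validate every candidate subset size and keep the best one.
search = GridSearchCV(
    pipe,
    param_grid={"select__k": [10, 25, 50, 100]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)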


Last but not least, we examined various steps and strategies related to the pre-processing stage; in fact, some of these methods are also relevant to the main processing step, for example when using Deep Neural Networks, but that will be discussed in a separate post.
