Photo by Markus Spiske on Unsplash

Stacked Generalisation theory and implementation.

Pushkar Pushp
5 min read · Feb 13, 2022


Introduction

A classifier is a system that takes instances from a dataset and assigns a class or category to each of them. The individual classifier must have some knowledge. A classifier can be generated using various forms of learning (e.g., deduction, analogy, or memorization), but the most common way of acquiring this knowledge is to infer it from a set of previously classified instances. This kind of learning is called supervised learning.

A number of models have been proposed, but the construction of a perfect classifier for any given task remains unattainable. Furthermore, no single approach can claim to be superior to all others. Thus, combining different classification models is considered a viable alternative for obtaining more accurate classification systems.

The strategy in ensemble systems is to create a set of classifiers and combine their outputs such that the combination outperforms all of the single classifiers.

  • The individual classifiers should be accurate and diverse.
  • The output combination amplifies the correct decisions and cancels out the incorrect decisions.

The most common way of building an ensemble is to apply a single learning algorithm repeatedly and combine the resulting outputs with a mathematical function. In contrast,

Stacking generates the members of the Stacking ensemble using several learning algorithms. Subsequently, it uses another algorithm to learn how to combine their outputs. ~ Sesmero et al. 2015

Ensembles of classifiers

So formally, we can say an ensemble of classifiers is a set of classifiers whose individual decisions are combined to obtain a system that hopefully outperforms all of its members.

For example, it is common to ask the opinion of several doctors before undergoing surgery, or to read reviews before buying a product. In other words, a decision is more reliable if it is based on the opinion of different experts.

Extrapolation of this proposition to the field of machine learning leads to the development of systems composed of several classifiers, in which the final decision is made collectively. This line of research in the machine learning field is known as the study of ensembles of classifiers.

The strategy in ensemble systems is to create a set of accurate and diverse classifiers and combine their outputs such that the combination outperforms all the single classifiers. Therefore, classifier ensembles are built in two phases: generation and combination.

  • In the generation phase, the individual components of the ensemble, known as base classifiers, are generated.
  • In the combination phase, the decisions made by the members of the ensemble are combined to obtain one decision.

What kind of base classifiers are feasible? The base classifiers should be accurate and diverse. Why they must be accurate is fairly obvious: a set of poor predictions cannot be combined into an accurate one.

Suppose we have two classifiers, say A and B; we call them diverse if they make errors on different instances. Let's shed some light on this with a binary example (see the sketch below).

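A minimal sketch of the idea, extended to three classifiers so that a majority vote can be taken; the predictions below are made up purely for illustration. Each classifier is wrong on a different instance, so the vote is correct everywhere.

```python
import numpy as np

# Hypothetical ground truth and predictions for 5 instances (binary labels).
y_true = np.array([1, 0, 1, 1, 0])

# Each classifier is ~80% accurate but errs on a *different* instance (diverse).
pred_A = np.array([0, 0, 1, 1, 0])  # wrong on instance 1
pred_B = np.array([1, 1, 1, 1, 0])  # wrong on instance 2
pred_C = np.array([1, 0, 0, 1, 0])  # wrong on instance 3

# Majority vote: correct wherever at least 2 of the 3 classifiers are correct.
votes = pred_A + pred_B + pred_C
ensemble = (votes >= 2).astype(int)

print((ensemble == y_true).mean())  # 1.0 — the ensemble outperforms every member
```

If all three classifiers made their mistakes on the same instances, the vote could not repair them; that is why diversity matters as much as accuracy.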

The techniques used to generate diverse classifiers are based on the idea that the hypothesis produced by a classifier depends on both the learning algorithm and the subset of data used to train it.

I. Using different training subsets

  • Resampling the training examples: e.g., Bagging and Boosting
  • Manipulating the input features: i.e., by modifying the set of attributes used to describe the instances.
  • Manipulating the output target: having each classifier solve a different classification problem. (A short sketch of the first two strategies follows this list.)
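As a rough sketch of the first two strategies, assuming scikit-learn is available (the dataset and parameters are purely illustrative): resampling the training examples as in bagging, and manipulating the input features by training each member on a random subset of attributes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Resampling the training examples: each tree sees a bootstrap sample of the rows.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                            max_samples=0.8, bootstrap=True, random_state=0)

# Manipulating the input features: each tree sees a random subset of the columns.
feature_subsets = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                                    max_features=0.5, bootstrap=False, random_state=0)

bagging.fit(X, y)
feature_subsets.fit(X, y)
```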

II. Using varying learning algorithms

  • Use different versions of the same learning algorithm. Kolen and Pollack demonstrated that a pool of artificial neural networks, started from different initial weights, can be trained to generate diverse classifiers and thus a good ensemble.
  • Heterogeneous ensembles: Wolpert showed that by using different base learners (such as KNN, Naive Bayes, SVM), we can exploit the different biases of each algorithm (a short sketch follows this list).
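A minimal sketch of such a heterogeneous pool, again assuming scikit-learn and an illustrative dataset; KNN, Naive Bayes and SVM bring instance-based, probabilistic and margin-based biases respectively.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Three base learners with different biases: instance-based, probabilistic, margin-based.
learners = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "naive_bayes": GaussianNB(),
    "svm": SVC(kernel="rbf", gamma="scale"),
}

for name, clf in learners.items():
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```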

How to integrate these results?

There are primarily two ways of combining base learners: fusion and selection.

  • Selection: Classifier selection is based on the assumption that each classifier is an expert in some local region of the feature space. When an instance is submitted, the ensemble decision coincides with the decision of the classifier responsible for the region of the space to which the instance belongs.
  • Fusion: Classifier fusion algorithms include combining rules, such as the average, majority vote, and weighted majority vote, as well as more complex integration models, such as meta-classifiers (a short sketch follows below).
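A minimal sketch of fusion by combining rules, using made-up class probabilities purely for illustration: simple averaging and a weighted majority vote, where the weights might come from each member's validation accuracy (assumed values here).

```python
import numpy as np

# Predicted probabilities for the positive class from three base classifiers
# on four instances (made-up numbers, purely illustrative).
probs = np.array([
    [0.9, 0.4, 0.6],   # instance 1
    [0.2, 0.3, 0.1],   # instance 2
    [0.6, 0.7, 0.4],   # instance 3
    [0.3, 0.8, 0.9],   # instance 4
])

# Fusion by simple averaging of the probabilities.
avg = probs.mean(axis=1)

# Fusion by weighted majority vote; weights are assumed validation accuracies.
weights = np.array([0.85, 0.80, 0.70])
hard_votes = (probs >= 0.5).astype(int)
weighted = (hard_votes @ weights) / weights.sum()

print((avg >= 0.5).astype(int))        # averaged-probability decision
print((weighted >= 0.5).astype(int))   # weighted-majority decision
```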

Stacking

Stacking is short for Stacked Generalization. Unlike algorithms such as Bagging or Boosting, which generate an ensemble of classifiers using the same learning algorithm (homogeneous ensembles), Stacking generates an ensemble composed of heterogeneous classifiers. Since each learning algorithm represents knowledge differently and carries a different learning bias, the hypothesis space is explored from different perspectives, producing a pool of diverse classifiers. Therefore, when their predictions are combined, the resulting model is expected to be more accurate than each of its members.

The individual outputs of the ensemble members are combined using a meta-classifier (meta-learner). The meta-classifier, or level-1 model, is generated by applying a learning algorithm in a cross-validation-like process.

Stacking is an ensemble of classifiers in which

  • The base learners are trained using different training parameters (generally different learning algorithms)
  • The outputs of the base learners are combined by using a meta-classifier.

Choosing the Stacking configuration (the set of base-level classifiers and the meta-classifier) can be an issue in itself.

Definition of stacking.

Suppose we have a dataset S. Split it into J subsets of roughly equal size, say S1, S2, …, SJ. For each j, hold out Sj and use S(−j) = S − Sj to train the K level-0 classifiers. Once the level-0 classifiers are generated, Sj is used to produce level-1 instances. These level-1 instances serve as the level-1 training data; each of them has K features, one from the prediction of each of the k = 1, 2, …, K level-0 classifiers. Any learning algorithm can then be applied to this data to generate the level-1 model. For completeness, the level-0 models are re-trained on the entire dataset S.

For any new instance, the level-0 models produce a vector of predictions that is the input to the level-1 model, which in turn predicts the class.
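The procedure above translates almost line by line into code. Below is a minimal sketch, assuming scikit-learn and NumPy; the choice of base learners, the meta-learner, the dataset, and J = 5 folds are illustrative, and the level-0 models pass hard class predictions to level 1 (probabilities are a common alternative).

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# K level-0 learning algorithms (illustrative choice).
level0 = [KNeighborsClassifier(), GaussianNB(), SVC()]
# Level-1 (meta) learning algorithm.
meta = LogisticRegression(max_iter=1000)

J = 5
kf = KFold(n_splits=J, shuffle=True, random_state=0)

# Build the level-1 training data: one row per instance, one column per level-0 model.
Z = np.zeros((len(X), len(level0)))
for train_idx, test_idx in kf.split(X):                       # S(-j) / Sj split
    for k, clf in enumerate(level0):
        model = clone(clf).fit(X[train_idx], y[train_idx])    # train on S(-j)
        Z[test_idx, k] = model.predict(X[test_idx])           # predict on Sj

# Train the level-1 model on the out-of-fold predictions.
meta.fit(Z, y)

# For completeness, re-train the level-0 models on the whole dataset S.
level0 = [clone(clf).fit(X, y) for clf in level0]

# Predict a new instance: level-0 predictions form the input to the level-1 model.
def predict(x_new):
    z = np.array([[m.predict(x_new.reshape(1, -1))[0] for m in level0]])
    return meta.predict(z)[0]

print(predict(X[0]), y[0])
```

Note that the level-1 training data Z contains only out-of-fold predictions; this is what keeps the meta-classifier from simply learning to trust whichever base model overfits the training set.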

For a better understanding, let me draw the flow diagram and explain it.

Step I: Base learner generation.

Stacking Flow Diagram: Part I

CI, CII, …, CK: the classifiers trained on S(−j), one per learning algorithm.
pI, pII, …, pK: the corresponding predictions on Sj.

K: the number of classifiers (learning algorithms).

Step II: Meta learner training.

For each instance in Sj, we obtain a vector (pI, pII, …, pK). Collecting these vectors across all J folds gives the metadata: a matrix with one row per instance of S and K columns, one per level-0 classifier.

Meta Classifier training based on level-0 predictions.
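For reference, scikit-learn (version 0.22 and later) ships this procedure as StackingClassifier; a minimal sketch with an illustrative choice of base learners, meta-learner, and dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()),
                ("nb", GaussianNB()),
                ("svm", SVC())],
    final_estimator=LogisticRegression(max_iter=1000),  # level-1 / meta-classifier
    cv=5,  # cross-validation-like generation of the level-1 training data
)

print(cross_val_score(stack, X, y, cv=5).mean())
```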

In the next part, we will see a Python implementation of Stacking.
