Walkthrough

In this chapter, we will explore how to customize the Fedora Framework to perfectly fit your data science and machine learning projects. Whether you are working on unique data types or advanced techniques, this chapter will guide you in maximizing the potential of the Fedora Framework for your specific needs.

We will use the Car Evaluation Dataset in this walkthrough.

Original dataset

First, we need to download the dataset. The UCI Machine Learning Repository Python package allows us to do so with ease. To install it, run:

pip3 install ucimlrepo

Create a main.py file and load the data:

from ucimlrepo import fetch_ucirepo
# Fetch the Car Evaluation dataset (UCI repository id 19)
dataset = fetch_ucirepo(id=19)

Then combine the target and the features into a single pandas DataFrame:

import pandas as pd
features = dataset.data.features
targets = dataset.data.targets["class"]
# Put the class label first, followed by the 6 categorical features
df = pd.concat([targets, features], axis=1)

At this point, the resulting dataset is the following:

print(df)
      class buying  maint  doors persons lug_boot safety
0     unacc  vhigh  vhigh      2       2    small    low
1     unacc  vhigh  vhigh      2       2    small    med
2     unacc  vhigh  vhigh      2       2    small   high
3     unacc  vhigh  vhigh      2       2      med    low
4     unacc  vhigh  vhigh      2       2      med    med
...     ...    ...    ...    ...     ...      ...    ...
1723   good    low    low  5more    more      med    med
1724  vgood    low    low  5more    more      med   high
1725  unacc    low    low  5more    more      big    low
1726   good    low    low  5more    more      big    med
1727  vgood    low    low  5more    more      big   high

[1728 rows x 7 columns]

Next, let's inspect the dataset's metadata.
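
A quick way to do this, assuming the standard attributes exposed by the ucimlrepo package (a sketch, not the only option):

# Dataset-level information (name, number of instances, missing values, ...)
print(dataset.metadata)
# Per-variable information (roles, types, number of missing values)
print(dataset.variables)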

We can conclude that:

  • There are 4 highly unbalanced classes: "unacc", "acc", "good" and "vgood".

  • All 6 features are categorical. Therefore, they might require further preprocessing.

  • There are no missing values in the 1728 available entries. If there were, one would need to fix or drop those entries (a quick check is sketched below).
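
As a quick sanity check (a sketch, not part of the original walkthrough), the absence of missing values can be confirmed directly with pandas:

# Count missing values per column; every count is zero for this dataset
print(df.isna().sum())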

Representation and Operators

In this step, one must figure out how to represent the problem and which operators to use.

An operator can be virtually anything that combines or simply transforms features:

  • Integers and Floats: sum(a,b), subtraction(a,b), division(a,b), multiplication(a,b), absoluteValue(a), maximum(a,b), noise(a)

  • Boolean: AND(a,b), OR(a,b), NOT(a)

  • Strings: length(a)

For an operator to be compatible with the framework, it must meet the following requirements (a sketch illustrating them follows this list):

  • Mandatory: Return a numeric type (boolean, integer, float)

  • Mandatory: Handle pandas.Series as arguments

  • Optional: Avoid returning NaN values, as they will be replaced with 1

  • Optional: Ensure returned values fit within the 32-bit floating-point range, since values outside it will be clipped to np.float32's negative or positive extremes
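
As an illustration only, a custom operator that respects these requirements could look like the sketch below (the name safeDivision and the exact handling choices are assumptions, not part of the framework):

import numpy as np
import pandas as pd

def safeDivision(a: pd.Series, b: pd.Series) -> pd.Series:
    # Element-wise division that accepts pandas.Series arguments
    result = a / b.replace(0, np.nan)
    # Avoid returning NaN values (the framework would replace them with 1 anyway)
    result = result.fillna(1)
    # Keep results within the 32-bit floating-point range
    finfo = np.finfo(np.float32)
    return result.clip(finfo.min, finfo.max)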

Having established the definition of an operator, the following question is fundamental: applied directly to the current features, would these operators produce results that are meaningful for the problem?

If the answer is yes, then jump to the next section.

If not, then we will have to think about how we can represent these features in a way that the result of the operator is interpretable:

Related example: suppose the entries A, B and C each take some combination of values for the categorical features.

If one selects an operator "MyOpt" that counts the occurrences of the letter "i" in the strings of the features "buying" and "maint", the transformed dataset is reduced to integer counts.
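
As a sketch, such an operator is a one-liner over a pandas.Series (the name MyOpt is purely illustrative):

import pandas as pd

def MyOpt(a: pd.Series) -> pd.Series:
    # Count occurrences of the letter "i" in each string value
    return a.str.count("i")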

The resulting features are rather meaningless or at least not easily interpretable for the problem at hand.

However, consider changing the dataset representation by one-hot encoding the features (only the buying-vhigh and maint-med codes are considered for simplicity) and then selecting the logical AND operator to merge them.

The resulting feature "buying-vhigh AND maint-med" literally means "the buying price is very high and the maintenance price is medium".

As such, this feature and others like it have the potential to be much more informative, and we might even end up with fewer but better features.

Hence, for the dataset at hand, we could one-hot encode all 6 features and then apply boolean operators (AND, OR, NOT) between them.

One-hot encoding the features (and encoding the label) transforms the dataset as sketched below.
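
A minimal sketch of this preprocessing, assuming pandas' get_dummies and scikit-learn's LabelEncoder (the walkthrough's exact encoding code may differ):

from sklearn.preprocessing import LabelEncoder

# One-hot encode the 6 categorical features into 21 boolean columns
X = pd.get_dummies(features)
# Encode the class label ("unacc", "acc", "good", "vgood") as integers
y = pd.Series(LabelEncoder().fit_transform(targets), name="class")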

We were left with an encoded label and 21 boolean features.

To create your own custom operators, please check this section.

Defining the features

After choosing the logical operators (AND, OR, and NOT) and preparing the initial dataset, the next step is to establish the construction rules for the new features. This task can be accomplished by specifying a Context-Free Grammar (CFG) that will be used as input for the framework.

To begin, determine the maximum number of features for the transformed dataset. To ensure an unbiased process, set this number equal to the count of features in the prepared dataset, which is 21. Subsequently, generate the grammar as follows:

This straightforward function serves as an initial setup for your grammar, and you're encouraged to make manual adjustments as needed. The grammar, presented in Backus-Naur Form (BNF), is provided below:

Note: If you wish to utilize certain reserved symbols in the grammar, kindly include their aliases within the grammar instead:

Symbol    Alias
<=        \le
>=        \ge
<         \l
>         \g
|         \eb
,         None

The comma is reserved by the framework for separating features within the phenotype string, so its use is prohibited in the grammar.

Tuning the Evolution Process

The Fedora Framework uses Structured Grammatical Evolution (SGE) as the evolutionary algorithm. The figure below shows the topology of the framework:

[Figure: Fedora framework topology]

In line with typical evolutionary algorithms, SGE requires specific parameters such as population size, number of generations, etc. These parameters are specified in a YAML file as outlined below:

Note: There are more parameters available for SGE (grammar, seed, run) that we will define indirectly, for simplicity. The parameters mentioned above are the only ones that must be included in the YAML file.

Running the Framework

First, let's start by splitting the data:
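
A minimal sketch of such a split, using scikit-learn's train_test_split on the encoded X and y from the earlier sketch (the stratification and helper shown here are assumptions; the framework may provide its own splitting utility):

from sklearn.model_selection import train_test_split

# 80/20 split into train_val and test partitions, stratified by class
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
train_val = {"X": X_train_val, "y": y_train_val}
test = {"X": X_test, "y": y_test}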

This returns two dictionaries (train_val (80%) and test (20%)) with keys X and y for the features and labels, respectively.

Finally, to assess fitness in SGE, one only requires a machine learning model and an error metric (since SGE aims to minimize the objective function). A seed is also required for initializing the random number generators of the machine learning model, SGE, and the dataset partitioning (training and validation).

In this context, we have selected:

  • To run the algorithm with the seed 42

  • A Decision Tree Classifier as the machine learning model, due to its interpretability

  • Balanced Accuracy as the performance metric, since the classes are unbalanced (the error is simply 1 - BAcc); these ingredients are sketched below.
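
As a sketch of these ingredients using scikit-learn names only (how they are actually passed to the framework is left to its API and not shown here):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import balanced_accuracy_score

SEED = 42
model = DecisionTreeClassifier(random_state=SEED)

def error_metric(y_true, y_pred):
    # SGE minimizes the objective, so the error is 1 - Balanced Accuracy
    return 1 - balanced_accuracy_score(y_true, y_pred)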

To run the framework simply do:

Note: The X parameter must always be a pandas.DataFrame.

From this point, the algorithm will randomly split the train_val dataset into 2 subsets, train (50%) and validation (50%), and use them in the feature engineering process. The train set is used to fit the machine learning model, while the validation set assesses its generalization capabilities.

Once finished, the results are logged, generating the following generic folder structure:

If results have been logged for a particular seed, running the framework with that exact seed will bypass the evolutionary process, fetching the best phenotype directly from memory.

Applying the best individual to the train_val and test sets:

Results

To assess the acquired results against the baseline, we can simply run the entire procedure with two scikit-learn pipelines:
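
A sketch for the baseline side only: since the sketched split already operates on the one-hot encoded features, the baseline pipeline reduces to the Decision Tree itself; the second pipeline, which applies the Fedora transformation before the same classifier, depends on the framework's API and is not reproduced here:

from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Baseline: the Decision Tree applied to the one-hot encoded features as-is
baseline = Pipeline([("model", DecisionTreeClassifier(random_state=42))])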

Now just fit them with the train_val set and report their test set scores:
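
For the baseline, fitting and scoring might look as follows (a sketch; scoring the pipeline built on the Fedora-engineered features is analogous):

from sklearn.metrics import accuracy_score, balanced_accuracy_score

baseline.fit(train_val["X"], train_val["y"])
predictions = baseline.predict(test["X"])
print("Accuracy:", accuracy_score(test["y"], predictions))
print("Balanced Accuracy:", balanced_accuracy_score(test["y"], predictions))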

Fedora was able to:

  • Improve the Accuracy metric! (from 95.66% to 96.24%)

  • Improve the Balanced Accuracy (Recall Macro Average) metric! (from 90% to 94.31%)

Therefore, we have shown the framework's potential for effective, hassle-free feature engineering!

A summarized version of the entire procedure can be located here.
