In this chapter, we will explore how to customize the Fedora Framework to perfectly fit your data science and machine learning projects. Whether you are working on unique data types or advanced techniques, this chapter will guide you in maximizing the potential of the Fedora Framework for your specific needs.
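This walkthrough picks up from an already-loaded Car Evaluation dataset. For reference, one way to reproduce it with plain pandas is sketched below; the URL and column order follow the UCI repository layout, and the original chapter may have loaded the data differently:

import pandas as pd

# Fetch the UCI Car Evaluation data (assumed URL and column order)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
columns = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
raw = pd.read_csv(url, header=None, names=columns)

# Split into features and target, and reorder so the label comes first
features, targets = raw.drop(columns="class"), raw["class"]
df = pd.concat([targets, features], axis=1)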
At this point, the resulting dataset is the following:
print(df)
      class buying  maint  doors persons lug_boot safety
0     unacc  vhigh  vhigh      2       2    small    low
1     unacc  vhigh  vhigh      2       2    small    med
2     unacc  vhigh  vhigh      2       2    small   high
3     unacc  vhigh  vhigh      2       2      med    low
4     unacc  vhigh  vhigh      2       2      med    med
...     ...    ...    ...    ...     ...      ...    ...
1723   good    low    low  5more    more      med    med
1724  vgood    low    low  5more    more      med   high
1725  unacc    low    low  5more    more      big    low
1726   good    low    low  5more    more      big    med
1727  vgood    low    low  5more    more      big   high

[1728 rows x 7 columns]
Regarding the metadata of the dataset:

Features: 6
Types: ['Categorical']
Entries: 1728
Missing Values: no

Summary:
The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990). The model evaluates cars according to the following concept structure:

CAR                car acceptability
. PRICE            overall price
. . buying         buying price
. . maint          price of the maintenance
. TECH             technical characteristics
. . COMFORT        comfort
. . . doors        number of doors
. . . persons      capacity in terms of persons to carry
. . . lug_boot     the size of luggage boot
. . safety         estimated safety of the car

Input attributes are printed in lowercase. Besides the target concept (CAR), the model includes three intermediate concepts: PRICE, TECH, COMFORT. In the original model, every concept is related to its lower-level descendants by a set of examples (for these example sets, see http://www-ai.ijs.si/BlazZupan/car.html).

The Car Evaluation Database contains examples with the structural information removed, i.e., it directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, safety.

Because of its known underlying concept structure, this database may be particularly useful for testing constructive induction and structure-discovery methods.

Info:
buying:   vhigh, high, med, low
maint:    vhigh, high, med, low
doors:    2, 3, 4, 5more
persons:  2, 4, more
lug_boot: small, med, big
safety:   low, med, high

class
unacc    1210
acc       384
good       69
vgood      65
Name: count, dtype: int64
We can conclude that:
There are 4 highly unbalanced classes: "unacc", "acc", "good" and "vgood".
All 6 features are categorical. Therefore, they might require further preprocessing.
There are no missing values in the 1728 available entries. If there were, one should fix or delete such entries.
Representation and Operators
In this step, one must figure out how to represent the problem and which operators to use.
An operator can be virtually anything that combines or simply transforms features:
Integers and Floats: sum(a,b), subtraction(a,b), division(a,b), multiplication(a,b), absoluteValue(a), maximum(a,b), noise(a)
Boolean: AND(a,b), OR(a,b), NOT(a)
Strings: length(a)
For an operator to be compatible with the framework, the following rules apply:
Mandatory: Return a numeric type (boolean, integer, float)
Mandatory: Handle pandas.Series as arguments
Optional: Avoid returning NaN values, as they will be replaced with 1
Optional: Ensure returned values fall within the 32-bit floating-point range, as values outside it will be clipped to np.float32's negative or positive extremes
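For illustration, here is a minimal sketch of an operator that satisfies these rules; the protected_division name and its NaN/clipping policy are choices of this example, not part of the framework:

import numpy as np
import pandas as pd

def protected_division(a: pd.Series, b: pd.Series) -> pd.Series:
    # Accepts pandas.Series arguments and returns a numeric Series
    result = a / b
    # Proactively replace NaN/inf values (the framework would map NaN to 1 anyway)
    result = result.replace([np.inf, -np.inf], np.nan).fillna(1)
    # Keep the values inside the 32-bit floating-point range
    finfo = np.finfo(np.float32)
    return result.clip(finfo.min, finfo.max)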
Having established the definition of an operator, the following question is fundamental:

Am I able to think of operators that can combine the current features in a way that the resulting feature makes sense?
If the answer is yes, then jump to the next section.
If not, then we will have to think about how we can represent these features in a way that the result of the operator is interpretable:
Related example: let's say that the entries A, B and C have the following feature values:

Entry  buying   maint  ...
A      "vhigh"  "med"  ...
B      "high"   "med"  ...
C      "vhigh"  "low"  ...
If one selects an operator "MyOpt" that counts the occurrences of "i" in a string and applies it to the features "buying" and "maint", the transformed dataset might become:

Entry  buying   maint  MyOpt(buying)  MyOpt(maint)  ...
A      "vhigh"  "med"  1              0             ...
B      "high"   "med"  1              0             ...
C      "vhigh"  "low"  1              0             ...
The resulting features are rather meaningless or at least not easily interpretable for the problem at hand.
However, if one changes the dataset representation by one-hot encoding the features (only the buying-vhigh and maint-med codes are displayed for simplicity) and then selects the logical AND operator to merge them:

Entry  buying-vhigh  ...  maint-med  ...  buying-vhigh AND maint-med  ...
A      True          ...  True       ...  True                        ...
B      False         ...  True       ...  False                       ...
C      True          ...  False      ...  False                       ...
The resulting feature "buying-vhigh AND maint-med" literally means:

Is the buying price of the car very high and the price of maintenance average (medium)?
As such, this feature, and all features alike, have the potential to be much more informative, and we might even end up with fewer but better features.
Hence, for the dataset at hand, we could one-hot encode all 6 features and then apply boolean operators (AND, OR, NOT) between them.
One-hot encoding the features leaves us with the following dataset:

# One-hot Encoding Features
oe_features = pd.get_dummies(features)
# Label encoding for numerical compatibility with ML models
labels, names = pd.factorize(targets)
df = pd.concat([pd.Series(labels, name="class"), oe_features], axis=1)
print(df)
We were left with an encoded label and 21 boolean features.
To create your own custom operators, please check this section.
Defining the Features
After choosing the logical operators (AND, OR, and NOT) and preparing the initial dataset, the next step is to establish the construction rules for the new features. This task can be accomplished by specifying a Context-Free Grammar (CFG) that will be used as input for the framework.
To begin, determine the maximum number of features for the transformed dataset. To ensure an unbiased process, set this number equal to the count of features in the prepared dataset, which is 21. Subsequently, generate the grammar as follows:
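The framework's own grammar-generation helper is not reproduced here; a sketch of what such a helper could look like follows. The function name, the x[i] variable notation, and the use of pandas' &, | and ~ boolean operators are assumptions of this illustration:

def generate_boolean_grammar(n_features: int) -> str:
    # Illustrative sketch: build a BNF grammar that combines boolean
    # features with AND (&), OR (| escaped as \eb) and NOT (~)
    variables = " | ".join(f"x[{i}]" for i in range(n_features))
    return "\n".join([
        "<start> ::= <expr>",
        "<expr> ::= (<expr>)&(<expr>) | (<expr>)\\eb(<expr>) | ~(<expr>) | <var>",
        f"<var> ::= {variables}",
    ])

# 21 features in the prepared dataset
print(generate_boolean_grammar(21))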
This straightforward function serves as an initial setup for your grammar, and you're encouraged to make manual adjustments as needed. The grammar, presented in Backus-Naur Form (BNF), is provided below:
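Under the assumptions of the sketch above, the grammar would read as follows (illustrative, shortened with an ellipsis; the framework's actual output may differ):

<start> ::= <expr>
<expr> ::= (<expr>)&(<expr>) | (<expr>)\eb(<expr>) | ~(<expr>) | <var>
<var> ::= x[0] | x[1] | x[2] | ... | x[20]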
Note: If you wish to use certain reserved symbols in the grammar, include their aliases instead:

Symbol  Alias
<=      \le
>=      \ge
<       \l
>       \g
|       \eb
,       None
The comma is reserved by the framework for distinguishing features within the phenotype string, so its usage is prohibited in the grammar.
In line with typical evolutionary algorithms, SGE requires specific parameters such as population size, generations, etc. These parameters are specified in a YAML file as outlined below:

# car-evaluation.yml
POPSIZE: 2000            # Population Size
GENERATIONS: 100         # Generations
ELITISM: 200             # Elitism (10%)
PROB_CROSSOVER: 0.9      # Crossover: Probability
PROB_MUTATION: 0.1       # Mutation: Probability
TSIZE: 3                 # Selection: Tournament Size
EXPERIMENT_NAME: 'car-evaluation-results/'  # Results folder name
INCLUDE_GENOTYPE: False  # Option to include the genotype in the logs
SAVE_STEP: 200           # Logging Frequency (every N generations)
VERBOSE: True            # Print statistics of each generation while running the algorithm
MIN_TREE_DEPTH: 3        # Genetic Programming Tree Minimum Depth
MAX_TREE_DEPTH: 10       # Genetic Programming Tree Maximum Depth
Note: There are more parameters available for SGE (grammar, seed, run) that we will define indirectly, for simplicity. The parameters above are the only ones that must be included in the YAML file.
Running the Framework
First, let's start by splitting the data:

from fedora.core.utilities.lib import split_data

# Hold out 20% of the data for the final test set
SEED = 42
train_val, test = split_data(oe_features, labels, SEED, test_size=0.2)
This returns two dictionaries (train_val (80%) and test (20%)) with keys X and y for the features and labels, respectively.
Finally, to assess fitness, SGE only requires a machine learning model and an error metric (as SGE aims to minimize the objective function). A seed is also required to initialize the random number generators of the machine learning model, SGE, and the dataset partitioning (training and validation).
In this context, we have selected:
The seed 42 to run the algorithm
A Decision Tree Classifier as the machine learning model, due to its interpretability
Balanced Accuracy as the performance metric (the error is simply 1 - BAcc), since the classes are unbalanced.
To run the framework simply do:
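The concrete entry-point call is framework-specific and its exact signature is not reproduced in this excerpt; as a sketch, the ingredients it needs (the seeded model and the minimized error metric) can be set up with standard scikit-learn. The error_metric name is a choice of this example:

from sklearn.metrics import balanced_accuracy_score
from sklearn.tree import DecisionTreeClassifier

# SGE minimizes the objective, so the error is 1 - Balanced Accuracy
def error_metric(y_true, y_pred) -> float:
    return 1 - balanced_accuracy_score(y_true, y_pred)

# An interpretable model, seeded for reproducibility (SEED = 42, as above)
model = DecisionTreeClassifier(random_state=SEED)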
Note: The X parameter must be a pandas.DataFrame, without exception.
From this point, the algorithm randomly splits the train_val dataset into two subsets, train (50%) and validation (50%), and uses them in the feature engineering process. The train set is used to fit the machine learning model, while the validation set assesses its generalization capabilities.
Once the run ends, the results are logged under the folder specified by EXPERIMENT_NAME (here, car-evaluation-results/).
If results have already been logged for a particular seed, running the framework with that exact seed will bypass the evolutionary process and fetch the best phenotype directly from the stored results.
Applying the best individual to the train_val and test sets:
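The framework call for this step is not shown in this excerpt. Conceptually, each comma-separated expression of the best phenotype is evaluated over the one-hot columns to build the transformed dataset; a minimal sketch of that idea, in which the helper, the example phenotype, and the variable naming are illustrative assumptions:

import pandas as pd

def apply_phenotype(phenotype: str, X: pd.DataFrame) -> pd.DataFrame:
    # x[i] refers to the i-th column of the one-hot-encoded DataFrame,
    # matching the <var> symbols of the grammar sketched earlier
    x = [X.iloc[:, i] for i in range(X.shape[1])]
    new_features = {
        f"feature_{k}": eval(expr, {"x": x})  # illustrative only
        for k, expr in enumerate(phenotype.split(","))
    }
    return pd.DataFrame(new_features)

# Hypothetical usage with the best phenotype found by the run:
# best_phenotype = "(x[0])&(x[8]),~(x[14]),(x[3])|(x[20])"
# train_val_new = apply_phenotype(best_phenotype, train_val["X"])
# test_new = apply_phenotype(best_phenotype, test["X"])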
Results
To assess the acquired results against the baseline, we can reproduce the entire procedure with two scikit-learn pipelines:
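A sketch of the two pipelines, reusing the hypothetical apply_phenotype helper above to stand in for the framework's transformation (the step and variable names are assumptions):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeClassifier

# Baseline: the one-hot-encoded features go straight into the model
baseline = Pipeline([
    ("model", DecisionTreeClassifier(random_state=SEED)),
])

# Fedora: engineer the features with the best phenotype, then fit the model
fedora_pipe = Pipeline([
    ("features", FunctionTransformer(lambda X: apply_phenotype(best_phenotype, X))),
    ("model", DecisionTreeClassifier(random_state=SEED)),
])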
Now just fit them with the train_val set and report their test set scores:
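Fitting and scoring could then look like this (again a sketch; accuracy_score and balanced_accuracy_score are standard scikit-learn metrics):

from sklearn.metrics import accuracy_score, balanced_accuracy_score

for name, pipe in [("Baseline", baseline), ("Fedora", fedora_pipe)]:
    pipe.fit(train_val["X"], train_val["y"])
    predictions = pipe.predict(test["X"])
    print(name,
          "| Accuracy:", accuracy_score(test["y"], predictions),
          "| Balanced Accuracy:", balanced_accuracy_score(test["y"], predictions))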
Fedora was able to:
Improve the Accuracy metric! (baseline 95.66% vs Fedora 96.24%)
Improve the Balanced Accuracy (Recall Macro Average) metric! (baseline 90.00% vs Fedora 94.31%)
Therefore, we have demonstrated the framework's potential for effective, hassle-free feature engineering!
A summarized version of the entire procedure can be located here.