This project was built to learn the core concepts behind machine learning and decision trees. Contributions and suggestions for improvement are welcome.
The project has been implemented and tested on the Iris dataset, where the decision tree reaches an accuracy of 93% on the test set.

To build and run the project:
```sh
make clean
make
./decision_tree
```
Example output:

```
rm -rf build decision_tree
gcc -Wall -Wextra -std=c99 -g -c src/dataset.c -o build/dataset.o
gcc -Wall -Wextra -std=c99 -g -c src/main.c -o build/main.o
gcc -Wall -Wextra -std=c99 -g -c src/tree.c -o build/tree.o
gcc -Wall -Wextra -std=c99 -g -c src/utils.c -o build/utils.o
gcc -Wall -Wextra -std=c99 -g -o decision_tree build/dataset.o build/main.o build/tree.o build/utils.o -lcsv
[Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species, ]
0 [1.000000, 5.100000, 3.500000, 1.400000, 0.200000, Iris-setosa, ]
1 [2.000000, 4.900000, 3.000000, 1.400000, 0.200000, Iris-setosa, ]
2 [3.000000, 4.700000, 3.200000, 1.300000, 0.200000, Iris-setosa, ]
Split the Data Set into Training (105) and Test (45)
Tree has been built using the training set
Predictions have been made
Calculating the accuracy...
Accuracy: 0.93%
```
Keep in mind that this is a learning project and is not intended for production use.
The project is currently very basic and lacks most of the features a full decision tree implementation would have. Some of its current limitations:
- The target feature is assumed to be the last feature in the dataset
- The project is only for classification and does not support regression
- The project only supports the Gini index as the splitting criterion
- The project does not have any pruning or stopping conditions
- The project expects the data to be floats and the target to be a string
The project is divided into 3 main parts:
- Main: The main file that runs the project
- Decision Tree: The decision tree implementation
- Data: The data used for training and testing
For this project, the main structure used to store data is called `DataSet`. This structure holds the raw data read from the CSV file. The data is later converted into a `Table` structure that is better suited to the decision tree. All operations on the data are performed on the `Table` structure in order to keep the original data intact.
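As an illustration of this design, here is a minimal sketch of the pointer-view idea: the table only stores pointers into the original storage, so taking a subset of rows never copies or mutates the underlying values. The types below are simplified stand-ins, not the project's actual structures.

```c
#include <stdio.h>

/* Simplified stand-ins for the real DataSet/Table structures, used only
 * to illustrate the pointer-view idea described above. */
typedef struct { float value; } Cell;

int main(void)
{
    Cell storage[4] = { {5.1f}, {4.9f}, {4.7f}, {4.6f} }; /* plays the role of the DataSet */
    Cell *view[2]   = { &storage[1], &storage[3] };       /* plays the role of a Table subset */

    /* All reads go through the pointers; the original storage is untouched. */
    for (int i = 0; i < 2; i++)
        printf("%.1f\n", view[i]->value);
    return 0;
}
```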
The tree is implemented as a binary tree. Each node uses its `Table` to find the optimal split according to the Gini index; the chosen split is represented by a `Split` structure. Using this split, the left and right child nodes are generated, and the process repeats recursively until a stopping condition is reached.
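For reference, the Gini impurity of a partition with class proportions p_c is 1 - Σ p_c², and a candidate split is scored by the size-weighted impurity of the two resulting partitions. Below is a minimal, self-contained sketch of that computation, assuming integer class labels; it mirrors the criterion described above, but the function names are illustrative and not the project's actual API.

```c
#include <stdio.h>
#include <stddef.h>

#define N_CLASSES 3 /* e.g. the three Iris species */

/* Gini impurity of one partition: 1 - sum over classes of p_c^2 */
static float gini_impurity(const int *labels, size_t n)
{
    if (n == 0)
        return 0.0f;

    size_t counts[N_CLASSES] = {0};
    for (size_t i = 0; i < n; i++)
        counts[labels[i]]++;

    float impurity = 1.0f;
    for (int c = 0; c < N_CLASSES; c++) {
        float p = (float)counts[c] / (float)n;
        impurity -= p * p;
    }
    return impurity;
}

/* Score of a candidate split: size-weighted impurity of both children.
 * The split with the lowest score is chosen. */
static float split_gini(const int *left, size_t nl, const int *right, size_t nr)
{
    size_t n = nl + nr;
    if (n == 0)
        return 0.0f;
    return ((float)nl / (float)n) * gini_impurity(left, nl)
         + ((float)nr / (float)n) * gini_impurity(right, nr);
}

int main(void)
{
    int left[]  = {0, 0, 0, 1};    /* mostly class 0 */
    int right[] = {1, 1, 2, 2, 2}; /* mix of classes 1 and 2 */
    printf("weighted gini: %f\n", split_gini(left, 4, right, 5));
    return 0;
}
```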
The data management part of the project is implemented in `dataset.c` and `dataset.h`. It represents the data using the following four structures:
- The `Value` structure, which holds the value of an attribute and allows for multiple types of data (a short usage sketch follows this list):

  ```c
  typedef union Data {
      float f;
      char *s;
  } Data;

  typedef enum DataType {
      FLOAT,
      STRING
  } DataType;

  typedef struct Value {
      DataType type;
      Data data;
      int sampleID;
  } Value;
  ```
- The `DataSet` structure, which holds the data and the labels:

  ```c
  // Holds the target feature
  typedef struct Target {
      int id;
      int unique;
      char **classes;
  } Target;

  typedef struct DataSet {
      char *features[MAX_FEATURES]; // Feature names
      int height;                   // Number of samples
      int width;                    // Number of features
      int index;                    // Current index for parsing
      struct Target *target;        // Target feature
      struct Sample *tail;          // End of the list
      struct Sample *sample;        // Start of the list
      struct Sample **memory;       // Memory for the samples
  } DataSet;
  ```
- The `Table` structure, which holds the data in a format better suited to the decision tree:

  ```c
  // Subset of the entire dataset
  typedef struct Table {
      // Data matrix: the first index is the feature, the second index is the sample.
      // Each entry is a pointer to the corresponding value in the dataset.
      Value ***data;
      char *features[MAX_FEATURES];
      struct Target *target;
      int height; // Number of samples
      int width;  // Number of features
  } Table;
  ```
- The `Sample` structure, which holds the data for a single sample:

  ```c
  typedef struct Sample {
      struct Sample *next;
      Value features[MAX_FEATURES];
      int id;
  } Sample;
  ```
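As a small illustration of how the tagged union in `Value` is meant to be used, `type` is checked before reading `data`. The typedefs below are copied from the definitions above; `print_value` and the values in `main` are only hypothetical examples, not part of the project.

```c
#include <stdio.h>

typedef union Data { float f; char *s; } Data;
typedef enum DataType { FLOAT, STRING } DataType;
typedef struct Value { DataType type; Data data; int sampleID; } Value;

/* Check the tag before touching the union member. */
static void print_value(const Value *v)
{
    if (v->type == FLOAT)
        printf("sample %d: %f\n", v->sampleID, v->data.f);
    else
        printf("sample %d: %s\n", v->sampleID, v->data.s);
}

int main(void)
{
    Value petal_length = { FLOAT, { .f = 1.4f }, 0 };         /* a numeric feature */
    Value species      = { STRING, { .s = "Iris-setosa" }, 0 }; /* the string target */
    print_value(&petal_length);
    print_value(&species);
    return 0;
}
```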
In the implementation of the decision tree I used the following data structures to represent the tree (a small prediction sketch follows the list):
- The `TreeNode` structure, which holds a node of the tree:

  ```c
  typedef struct TreeNode {
      struct TreeNode *left;  // Left child
      struct TreeNode *right; // Right child
      int target;             // Target class for this node
      Split *split;           // Split used at this node
      Table *table;           // Table of values in this node
  } TreeNode;
  ```
- The `Split` structure, which holds the split information:

  ```c
  typedef struct Split {
      int feature;
      float threshold;
      float gini;
  } Split;
  ```
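To show how these two structures fit together at prediction time, here is a hedged sketch of a tree traversal: at each internal node the sample's value for the split feature is compared against the threshold, going left if it is smaller and right otherwise, until a leaf is reached. The `predict` function and the plain float array of features are illustrative only; the real project works with `Table`/`Value`, so `Table` is just forward-declared here and the struct definitions are copied from above.

```c
#include <stdio.h>

typedef struct Table Table; /* not needed for the traversal itself */

typedef struct Split {
    int feature;
    float threshold;
    float gini;
} Split;

typedef struct TreeNode {
    struct TreeNode *left;
    struct TreeNode *right;
    int target;
    Split *split;
    Table *table;
} TreeNode;

/* Walk the tree until a leaf (a node without a split) is reached and
 * return that leaf's class index. */
static int predict(const TreeNode *node, const float *features)
{
    while (node->split != NULL && node->left != NULL && node->right != NULL) {
        if (features[node->split->feature] < node->split->threshold)
            node = node->left;
        else
            node = node->right;
    }
    return node->target;
}

int main(void)
{
    /* Tiny hand-built tree: one split, two leaves (all values are made up). */
    Split s = { 0, 2.5f, 0.0f };                        /* feature 0, threshold 2.5 */
    TreeNode left_leaf  = { NULL, NULL, 0, NULL, NULL };
    TreeNode right_leaf = { NULL, NULL, 1, NULL, NULL };
    TreeNode root = { &left_leaf, &right_leaf, -1, &s, NULL };

    float sample[] = { 1.4f };                          /* below the threshold -> class 0 */
    printf("predicted class index: %d\n", predict(&root, sample));
    return 0;
}
```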