This project was built to learn the core concepts behind machine learning and decision trees. Contributions and suggestions for improvement are welcome.
The project has been implemented and tested on the Iris dataset, where the decision tree reaches an accuracy of 93% on the test set.

To build and run the project:
```sh
make clean
make
./decision_tree
```
Example output:

```
rm -rf build decision_tree
gcc -Wall -Wextra -std=c99 -g -c src/dataset.c -o build/dataset.o
gcc -Wall -Wextra -std=c99 -g -c src/main.c -o build/main.o
gcc -Wall -Wextra -std=c99 -g -c src/tree.c -o build/tree.o
gcc -Wall -Wextra -std=c99 -g -c src/utils.c -o build/utils.o
gcc -Wall -Wextra -std=c99 -g -o decision_tree build/dataset.o build/main.o build/tree.o build/utils.o -lcsv
[Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species, ]
0 [1.000000, 5.100000, 3.500000, 1.400000, 0.200000, Iris-setosa, ]
1 [2.000000, 4.900000, 3.000000, 1.400000, 0.200000, Iris-setosa, ]
2 [3.000000, 4.700000, 3.200000, 1.300000, 0.200000, Iris-setosa, ]
Split the Data Set into Training (105) and Test (45)
Tree has been built using the training set
Predictions have been made
Calculating the accuracy...
Accuracy: 0.93%
```
Keep in mind that this is a learning project and is not intended for production use.
The project is currently very basic and lacks most of the features a full decision tree implementation would have. Some of its current limitations:
- The target feature is assumed to be the last feature in the dataset
- The project is only for classification and does not support regression
- The project only supports the Gini index as the splitting criterion
- The project does not have any pruning or stopping conditions
- The project expects the data to be floats and the target to be a string
The project is divided into 3 main parts:
- Main: The main file that runs the project
- Decision Tree: The decision tree implementation
- Data: The data used for training and testing
For this project, the main structure used to store data is called `DataSet`. This structure holds the raw data read from the CSV file. The data is later converted into a `Table` structure that is better suited to the decision tree. All operations on the data are performed on the `Table` structure in order to keep the original data intact.
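As an illustration of this design, here is a minimal sketch of the pointer-view idea: the table only stores pointers into the original storage, so taking a subset of rows never copies or mutates the underlying values. The types below are simplified stand-ins, not the project's actual structures.

```c
#include <stdio.h>

/* Simplified stand-ins for the real DataSet/Table structures, used only
 * to illustrate the pointer-view idea described above. */
typedef struct { float value; } Cell;

int main(void)
{
    Cell storage[4] = { {5.1f}, {4.9f}, {4.7f}, {4.6f} }; /* plays the role of the DataSet */
    Cell *view[2]   = { &storage[1], &storage[3] };       /* plays the role of a Table subset */

    /* All reads go through the pointers; the original storage is untouched. */
    for (int i = 0; i < 2; i++)
        printf("%.1f\n", view[i]->value);
    return 0;
}
```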
The tree is implemented as a binary tree. Each node uses its `Table` to find the optimal split according to the Gini index; the chosen split is represented by a `Split` structure. Using this split, the left and right child nodes are generated, and the process repeats recursively until a stopping condition is reached.
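For reference, the Gini impurity of a partition with class proportions p_c is 1 - Σ p_c², and a candidate split is scored by the size-weighted impurity of the two resulting partitions. Below is a minimal, self-contained sketch of that computation, assuming integer class labels; it mirrors the criterion described above, but the function names are illustrative and not the project's actual API.

```c
#include <stdio.h>
#include <stddef.h>

#define N_CLASSES 3 /* e.g. the three Iris species */

/* Gini impurity of one partition: 1 - sum over classes of p_c^2 */
static float gini_impurity(const int *labels, size_t n)
{
    if (n == 0)
        return 0.0f;

    size_t counts[N_CLASSES] = {0};
    for (size_t i = 0; i < n; i++)
        counts[labels[i]]++;

    float impurity = 1.0f;
    for (int c = 0; c < N_CLASSES; c++) {
        float p = (float)counts[c] / (float)n;
        impurity -= p * p;
    }
    return impurity;
}

/* Score of a candidate split: size-weighted impurity of both children.
 * The split with the lowest score is chosen. */
static float split_gini(const int *left, size_t nl, const int *right, size_t nr)
{
    size_t n = nl + nr;
    if (n == 0)
        return 0.0f;
    return ((float)nl / (float)n) * gini_impurity(left, nl)
         + ((float)nr / (float)n) * gini_impurity(right, nr);
}

int main(void)
{
    int left[]  = {0, 0, 0, 1};    /* mostly class 0 */
    int right[] = {1, 1, 2, 2, 2}; /* mix of classes 1 and 2 */
    printf("weighted gini: %f\n", split_gini(left, 4, right, 5));
    return 0;
}
```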
The data management part of the project is implemented in `dataset.c` and `dataset.h`. It represents the data using the following four structures:
- The `Value` structure, which holds the value of an attribute and allows for multiple types of data (a short usage sketch follows this list):

  ```c
  typedef union Data {
      float f;
      char *s;
  } Data;

  typedef enum DataType {
      FLOAT,
      STRING
  } DataType;

  typedef struct Value {
      DataType type;
      Data data;
      int sampleID;
  } Value;
  ```
- The `DataSet` structure, which holds the data and the labels:

  ```c
  // Holds the target feature
  typedef struct Target {
      int id;
      int unique;
      char **classes;
  } Target;

  typedef struct DataSet {
      char *features[MAX_FEATURES]; // Feature names
      int height;                   // Number of samples
      int width;                    // Number of features
      int index;                    // Current index for parsing
      struct Target *target;        // Target feature
      struct Sample *tail;          // End of the list
      struct Sample *sample;        // Start of the list
      struct Sample **memory;       // Memory for the samples
  } DataSet;
  ```
- The `Table` structure, which holds the data in a format better suited to the decision tree:

  ```c
  // Subset of the entire dataset
  typedef struct Table {
      // Data matrix: the first index is the feature, the second index is the sample.
      // Each entry is a pointer to the corresponding value in the dataset.
      Value ***data;
      char *features[MAX_FEATURES];
      struct Target *target;
      int height; // Number of samples
      int width;  // Number of features
  } Table;
  ```
- The `Sample` structure, which holds the data for a single sample:

  ```c
  typedef struct Sample {
      struct Sample *next;
      Value features[MAX_FEATURES];
      int id;
  } Sample;
  ```
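As a small illustration of how the tagged union in `Value` is meant to be used, `type` is checked before reading `data`. The typedefs below are copied from the definitions above; `print_value` and the values in `main` are only hypothetical examples, not part of the project.

```c
#include <stdio.h>

typedef union Data { float f; char *s; } Data;
typedef enum DataType { FLOAT, STRING } DataType;
typedef struct Value { DataType type; Data data; int sampleID; } Value;

/* Check the tag before touching the union member. */
static void print_value(const Value *v)
{
    if (v->type == FLOAT)
        printf("sample %d: %f\n", v->sampleID, v->data.f);
    else
        printf("sample %d: %s\n", v->sampleID, v->data.s);
}

int main(void)
{
    Value petal_length = { FLOAT, { .f = 1.4f }, 0 };         /* a numeric feature */
    Value species      = { STRING, { .s = "Iris-setosa" }, 0 }; /* the string target */
    print_value(&petal_length);
    print_value(&species);
    return 0;
}
```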
In the implementation of the decision tree I used the following data structures to represent the tree (a small prediction sketch follows the list):
- The `TreeNode` structure, which holds a node of the tree:

  ```c
  typedef struct TreeNode {
      struct TreeNode *left;  // Left child
      struct TreeNode *right; // Right child
      int target;             // Target class for this node
      Split *split;           // Split used at this node
      Table *table;           // Table of values in this node
  } TreeNode;
  ```
- The `Split` structure, which holds the split information:

  ```c
  typedef struct Split {
      int feature;
      float threshold;
      float gini;
  } Split;
  ```
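To show how these two structures fit together at prediction time, here is a hedged sketch of a tree traversal: at each internal node the sample's value for the split feature is compared against the threshold, going left if it is smaller and right otherwise, until a leaf is reached. The `predict` function and the plain float array of features are illustrative only; the real project works with `Table`/`Value`, so `Table` is just forward-declared here and the struct definitions are copied from above.

```c
#include <stdio.h>

typedef struct Table Table; /* not needed for the traversal itself */

typedef struct Split {
    int feature;
    float threshold;
    float gini;
} Split;

typedef struct TreeNode {
    struct TreeNode *left;
    struct TreeNode *right;
    int target;
    Split *split;
    Table *table;
} TreeNode;

/* Walk the tree until a leaf (a node without a split) is reached and
 * return that leaf's class index. */
static int predict(const TreeNode *node, const float *features)
{
    while (node->split != NULL && node->left != NULL && node->right != NULL) {
        if (features[node->split->feature] < node->split->threshold)
            node = node->left;
        else
            node = node->right;
    }
    return node->target;
}

int main(void)
{
    /* Tiny hand-built tree: one split, two leaves (all values are made up). */
    Split s = { 0, 2.5f, 0.0f };                        /* feature 0, threshold 2.5 */
    TreeNode left_leaf  = { NULL, NULL, 0, NULL, NULL };
    TreeNode right_leaf = { NULL, NULL, 1, NULL, NULL };
    TreeNode root = { &left_leaf, &right_leaf, -1, &s, NULL };

    float sample[] = { 1.4f };                          /* below the threshold -> class 0 */
    printf("predicted class index: %d\n", predict(&root, sample));
    return 0;
}
```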