Decision Sprout

A decision tree algorithm in C for fun and certainly not profit

This project is a hands-on way to learn the core topics of machine learning and decision trees. Any contributions or improvements are welcome.

Results

The project has been implemented and tested on the Iris dataset. The decision tree reaches 93% accuracy on the test set.

Running the Project

$ make clean
rm -rf build decision_tree
$ make
gcc -Wall -Wextra -std=c99 -g  -c src/dataset.c -o build/dataset.o
gcc -Wall -Wextra -std=c99 -g  -c src/main.c -o build/main.o
gcc -Wall -Wextra -std=c99 -g  -c src/tree.c -o build/tree.o
gcc -Wall -Wextra -std=c99 -g  -c src/utils.c -o build/utils.o
gcc -Wall -Wextra -std=c99 -g  -o decision_tree  build/dataset.o  build/main.o  build/tree.o  build/utils.o -lcsv
$ ./decision_tree
[Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species, ]
0 [1.000000, 5.100000, 3.500000, 1.400000, 0.200000, Iris-setosa, ]
1 [2.000000, 4.900000, 3.000000, 1.400000, 0.200000, Iris-setosa, ]
2 [3.000000, 4.700000, 3.200000, 1.300000, 0.200000, Iris-setosa, ]
Split the Data Set into Training (105) and Test (45)
Tree has been built using the training set
Predictions have been made
Calculating the accuracy... 
Accuracy: 0.93%

Caveats

Keep in mind this is a learning project and is not intended for production use.

Currently the project is very basic and lacks most of the features a real decision tree implementation would have. Some of its limitations:

  • The target feature is assumed to be the last feature in the dataset
  • The project is only for classification and does not support regression
  • The project can only use the GINI index for splitting
  • The project does not have any pruning or stopping conditions
  • The project expects the data to be floats and the target to be a string

Project Structure and Implementation

The project is divided into 3 main parts:

  • Main: The main file that runs the project
  • Decision Tree: The decision tree implementation
  • Data: The data used for training and testing

For this project we have a main structure to store data called DataSet. This structure holds the raw data read from the CSV. We then convert this data into a Table structure that is better suited to the decision tree. All operations on the data are performed on the Table structure in order to keep the original data intact.

The tree is implemented using a binary tree structure. This tree structure uses the Table to generate the optimal split using the GINI index. This split is represented as a Split structure. Using this split we can generate the left and right child nodes of the tree. This process is repeated recursively until we reach a stopping condition.

Data

The data management part of the project is implemented in dataset.c and dataset.h, which represent the data in four main structures:

  • The Value structure that holds the value of an attribute allowing for multiple types of data:
    typedef union Data {
      float f;
      char *s;
    } Data;
    
    typedef enum DataType {
      FLOAT,
      STRING
    } DataType;
    
    typedef struct Value {
      DataType type;
      Data data;
      int sampleID;
    } Value;
        
  • The DataSet structure that holds the data and the labels
    // Holds The target feature
    typedef struct Target {
      int id;
      int unique;
      char **classes;
    } Target;
    
    typedef struct DataSet {
      char *features[MAX_FEATURES]; // Features names
      int height; // Number of samples
      int width; // Number of features
      int index; // Current index for parsing
      struct Target *target; // Target feature
      struct Sample *tail; // end of the list
      struct Sample *sample; // start of the list
      struct Sample **memory; // Memory for the samples
    } DataSet;
        
  • The Table structure that holds data in a format more suitable for the decision tree
    // Subset of the entire dataset
    typedef struct Table {
      // Data matrix first index is the feature, second index is the sample
      // The value is a pointer to the value in the dataset
      Value ***data; 
    
      char *features[MAX_FEATURES];
      struct Target *target;
      int height; // Number of samples
      int width; // Number of features
    } Table;
        
  • The Sample structure that holds the data for a single sample
    typedef struct Sample {
      struct Sample *next;
      Value features[MAX_FEATURES];
      int id;
    } Sample;
        

Decision Tree

In the implementation of the decision tree I used the following data structures to represent the tree:

  • The TreeNode structure that holds the node of the tree
    typedef struct TreeNode {
      struct TreeNode *left; // Left child
      struct TreeNode *right; // Right child
      int target; // target for this node
      Split *split; // split used for this node
      Table *table; // table of values in this node
    } TreeNode;
        
  • The Split structure that holds the split information
    typedef struct Split {
      int feature;
      float threshold;
      float gini;
    } Split;
        
