# Data Science Clustering

## Contents

- Overview
- 1. Data Preparation
- 2. Data Classification (Decision Tree)
- 3. Data Clustering

## Overview

This project demonstrates a method for classifying data and highlights the importance of key parameters in a decision tree. In addition, the number of clusters is determined from an elbow chart. This README explains the key lines of code and the accompanying charts. The project is intended to provide basic knowledge of data classification and clustering.

## 1. Data Preparation

In the data preparation step, the given data was cleaned. A critical problem was that some columns contained non-numeric values. To fix this:

### Converting to numeric values

```python
# Convert a column to a numeric dtype; unparseable values become NaN
data['name of the column'] = pd.to_numeric(data['name of the column'], errors='coerce')
```

This line converts the data type of the given column to a numeric one; any value that cannot be parsed is replaced with `NaN`.
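
As a minimal illustration (the column and values here are hypothetical, not from the project's dataset), `errors='coerce'` turns non-numeric entries into `NaN`, which can then be dropped or imputed:

```python
import pandas as pd

# Hypothetical example: a column with a stray non-numeric entry
df = pd.DataFrame({'alcohol': ['9.4', '10.2', 'unknown', '11.1']})
df['alcohol'] = pd.to_numeric(df['alcohol'], errors='coerce')
print(df)
# 'unknown' becomes NaN and can be removed with df.dropna()
```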

After cleaning the data, a random sample was taken. To create the random sample, the code below was used:

```python
import random

num_rows = len(data)                                  # total number of rows in the cleaned data
random_indices = random.sample(range(num_rows), 600)  # pick 600 distinct row indices
random_sample = data.iloc[random_indices]
random_sample = random_sample.dropna()                # drop rows that still contain NaN
```
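
Note that pandas offers an equivalent one-liner; this is a minimal alternative sketch, assuming the same `data` DataFrame (the `random_state` value is an arbitrary choice for reproducibility):

```python
# Equivalent sampling using pandas' built-in method
random_sample = data.sample(n=600, random_state=14).dropna()
```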

The distribution of the random sample is shown in the charts below.

*(Figure: distribution of the sampled dataset)*
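
The README does not show the plotting code for these charts; a minimal sketch of one way to produce them with pandas' plotting helpers, assuming the `random_sample` DataFrame from the previous step:

```python
import matplotlib.pyplot as plt

# Histogram of every numeric column in the sample
random_sample.hist(figsize=(12, 10), bins=20)
plt.tight_layout()
plt.show()
```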

## 2. Data Classification (Decision Tree)

Task 2 aims to classify the dataset. Using all columns except the 'quality' column as features, the decision tree predicts the quality level for each data point. The predictions are then compared with the actual quality levels, and the accuracy of the predictions is reported through metrics.

The decision tree was generated several times. First, a tree was created without any justification of the key parameters; it was then refined gradually with appropriate key parameter values.

### Without key parameters

The decision tree is generated as follows:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# 70/30 train/test split with a fixed seed for reproducibility
X_train, X_test, Y_train, Y_test = train_test_split(
    input, target, train_size=0.7, random_state=14)

clf = DecisionTreeClassifier()
fit = clf.fit(X_train, Y_train)

# Visualise the fitted tree
plt.figure(figsize=(10, 10))
plot_tree(clf,
          feature_names=input.columns.values,
          class_names=list(map(str, target.unique())),
          rounded=True,
          filled=True)
plt.show()
```
*(Figure: the unconstrained decision tree)*

Moreover, to check the decision tree's accuracy, a classification report is generated:

```python
from sklearn.metrics import classification_report

# Precision, recall, and F1 score for each quality level
Y_pre = fit.predict(X_test)
print(classification_report(Y_test, Y_pre))
```
*(Figure: classification report for the unconstrained tree)*

### With appropriate key parameters

The key parameters examined for their effect on prediction accuracy were `max_depth`, `max_leaf_nodes`, `ccp_alpha`, and `max_features`. Justifying these key parameters helps avoid overfitting and improves prediction accuracy, as sketched below.

*(Figure: decision tree and metrics with tuned parameters)*
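
The README does not show the tuning code itself; a minimal sketch of one way to search these four parameters using scikit-learn's `GridSearchCV`, assuming the same `X_train`/`Y_train` split as above (the candidate values are illustrative placeholders, not the project's actual grid):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Candidate values are placeholders for illustration
param_grid = {
    'max_depth': [3, 5, 10, None],
    'max_leaf_nodes': [10, 20, 50, None],
    'ccp_alpha': [0.0, 0.001, 0.01],
    'max_features': [None, 'sqrt', 'log2'],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=14),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X_train, Y_train)

print(search.best_params_)   # best parameter combination found
print(search.best_score_)    # its cross-validated accuracy
```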

## 3. Data Clustering

### Clustering elbow chart

With the final decision tree in place, the data were clustered. During clustering, the number of clusters plays an important role. To determine an appropriate number, an elbow chart was used; in this case, the number of clusters was set to 9.

*(Figure: elbow chart of distortion versus number of clusters)*

The elbow chart is created as follows:

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

distortions = []          # inertia (distortion) for each cluster count
target_distortion = 3000  # threshold used to pick the cluster count
selected_clusters = 1

for i in range(1, 100):
    model = KMeans(n_clusters=i, init='k-means++', n_init=20, random_state=43)
    model.fit(Copy_Data_x_scaled)
    distortion = model.inertia_
    distortions.append(distortion)

    # Distortion decreases as i grows; keep the largest i still above the threshold
    if target_distortion <= distortion:
        selected_clusters = i

plt.plot(range(1, 100), distortions, marker='o')
plt.tight_layout()
plt.show()

print(f"When distortion >= {target_distortion}, the number of clusters: {selected_clusters}")
```

This code creates the elbow chart for cluster counts from 1 to 99.
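
Once the elbow chart suggests 9 clusters, the final clustering can be run; a minimal sketch, assuming the same scaled data `Copy_Data_x_scaled` and the same KMeans settings as above:

```python
from sklearn.cluster import KMeans

# Fit the final model with the cluster count chosen from the elbow chart
final_model = KMeans(n_clusters=9, init='k-means++', n_init=20, random_state=43)
labels = final_model.fit_predict(Copy_Data_x_scaled)  # cluster label for each row

print(labels[:10])                    # first ten cluster assignments
print(final_model.cluster_centers_)   # centroid of each cluster (in scaled space)
```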
