This project includes a method for classifying data and highlights the importance of key parameters in the decision tree. Additionally, the number of clusters is determined from an elbow chart. The README provides explanations for the key lines of code and charts. This project is expected to provide basic knowledge of data classification and clustering.
In the data preparation stage, the given data was cleaned. A critical problem was that some columns contained non-numeric values. To fix this, the following line was used:
data['name of the column'] = pd.to_numeric(data['name of the column'], errors='coerce')
This line converts the values of the given column to a numeric type; with errors='coerce', any value that cannot be parsed becomes NaN.
After cleaning the data, a random sample dataset was taken. To create the random sample, the code below was used.
import random

num_rows = len(data)
random_indices = random.sample(range(num_rows), 600)  # 600 distinct row positions
random_sample = data.iloc[random_indices]
random_sample = random_sample.dropna()  # drop rows with NaN (including coerced values)
The distribution of the random sample dataset is shown in the charts below.
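A minimal sketch of how such distribution charts could be produced with pandas histograms; the column names and values below are placeholders, not the project's actual data.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption for headless runs
import matplotlib.pyplot as plt

# Hypothetical stand-in for the cleaned random sample
random_sample = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 6.9, 8.1],
    "alcohol": [9.4, 9.8, 10.2, 9.1],
    "quality": [5, 5, 6, 5],
})

# One histogram per numeric column shows the sample's distribution
axes = random_sample.hist(figsize=(8, 6), bins=10)
plt.tight_layout()
plt.show()
```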

Task 2 aims to classify the dataset. Based on the columns, excluding the 'quality' column, the decision tree will predict the quality level for each data point. After the prediction, the predicted quality levels are compared with the actual quality levels. Finally, the accuracy of the predicted quality levels is presented through metrics.
During the creation of the decision tree, the trees were generated several times. Firstly, the decision tree was created without any justification for key parameters. Subsequently, the decision tree was refined gradually with appropriate key parameter values.
The decision tree is generated as follows:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

X_train, X_test, Y_train, Y_test = train_test_split(input, target, train_size=0.7, random_state=14)
clf = DecisionTreeClassifier()
fit = clf.fit(X_train, Y_train)
fig = plt.figure(figsize=(10, 10))
plot_tree(clf,
          feature_names=input.columns.values,
          class_names=list(map(str, target.unique())),
          rounded=True,
          filled=True)
plt.show()

Moreover, to create a metrics report that checks the decision tree's accuracy:
from sklearn.metrics import classification_report
Y_pre = fit.predict(X_test)
print(classification_report(Y_test, Y_pre))

The key parameters checked for their effect on prediction accuracy were max_depth, max_leaf_nodes, ccp_alpha, and max_features.
Justifying the key parameters helps avoid overfitting and improves prediction accuracy.
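A sketch of how these key parameters can be set on the classifier. The parameter values and the `load_wine` dataset below are illustrative assumptions, not the project's final choices.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative data; the project uses its own dataset
X, y = load_wine(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, train_size=0.7, random_state=14)

# Constrained tree: each value here is an example, to be justified
# against the metrics report rather than taken as final
clf = DecisionTreeClassifier(
    max_depth=5,          # limits how deep the tree can grow
    max_leaf_nodes=20,    # caps the total number of leaves
    ccp_alpha=0.01,       # cost-complexity pruning strength
    max_features=None,    # features considered at each split
    random_state=14,
)
clf.fit(X_train, Y_train)
print(f"test accuracy: {accuracy_score(Y_test, clf.predict(X_test)):.3f}")
```

Comparing the test accuracy across several such settings is one way to justify the final parameter values.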
After finalising the decision tree, the data were clustered. During the data clustering process, the number of clusters plays an important role. To determine the appropriate number of clusters, an elbow chart was used. In this case, the number of clusters was decided to be 9.
The elbow chart is created as follows:
from sklearn.cluster import KMeans

distortions = []
target_distortion = 3000
for i in range(1, 100):
    model = KMeans(n_clusters=i, init='k-means++', n_init=20, random_state=43)
    model.fit(Copy_Data_x_scaled)
    distortion = model.inertia_
    distortions.append(distortion)
    if target_distortion <= distortion:
        selected_clusters = i  # last cluster count whose distortion is still >= 3000
plt.plot(range(1, 100), distortions, marker='o')
plt.tight_layout()
plt.show()
print(f"When the distortion is >= 3000, the number of clusters is: {selected_clusters}")
This code creates the elbow chart for cluster counts from 1 to 99.
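Once the elbow chart suggests 9 clusters, the final clustering can be fitted with that count. This is a minimal sketch: the random data below stands in for the project's scaled data (`Copy_Data_x_scaled`).

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the scaled feature matrix
rng = np.random.default_rng(43)
X = rng.normal(size=(300, 4))

# Fit K-means with the cluster count chosen from the elbow chart
model = KMeans(n_clusters=9, init='k-means++', n_init=20, random_state=43)
labels = model.fit_predict(X)  # one cluster label per row
```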