Data-Science-Hackathon-Ecommerce-product-categorization-project

Data-Science-Hackathon-by-upGrad

Ecommerce Product Categorization

Project Overview

This project is part of a Data Science Hackathon focused on categorizing ecommerce products based on their descriptions. Using Logistic Regression and TF-IDF vectorization, the model predicts the product category from textual descriptions. The main goal of this project is to classify products into predefined categories using machine learning techniques.

Objectives:


Dataset

The dataset used for this project is a collection of ecommerce product descriptions with their associated product categories.

Features:

Preprocessing Steps:

  1. Handle Missing Data: Drop the brand column due to a high number of missing values and remove rows with any missing data.
  2. Label Encoding: Convert the target variable (product_category_tree) to numerical labels using LabelEncoder.
  3. TF-IDF Vectorization: Convert the product descriptions into TF-IDF vectors for numerical representation of text data.

Dependencies

The following Python libraries are required:

You can install these packages using:

pip install pandas scikit-learn seaborn matplotlib

Model Pipeline

1. Data Preprocessing

dataset.drop(columns=["brand"], inplace=True)
dataset.dropna(inplace=True)

2. Label Encoding

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
dataset["en_product_category_tree"] = le.fit_transform(dataset["product_category_tree"])

3. TF-IDF Vectorization

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(dataset["description"])

4. Train-Test Split

from sklearn.model_selection import train_test_split
X = dataset["description"]
y = dataset["product_category_tree"]
x_train, x_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

5. Model Selection: Logistic Regression

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train, y_train)

6. Hyperparameter Tuning: GridSearchCV

from sklearn.model_selection import GridSearchCV
parameters = {'C': [0.1, 1, 10, 100], 'solver': ['newton-cg', 'lbfgs', 'liblinear']}
grid_search = GridSearchCV(LogisticRegression(), parameters, cv=5)
grid_search.fit(x_train, y_train)

7. Evaluation

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predictions on test data
y_pred = best_model.predict(x_test)

# Test accuracy
test_accuracy = accuracy_score(y_test, y_pred)

# Classification report and confusion matrix
class_report = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

8. Save Predictions

test_data['predictions'] = predictions
test_data.to_csv('test_data_with_predictions.csv', index=False)

Results

Test Accuracy

The Logistic Regression model achieved a Test Accuracy of 98.29% on the test dataset.

Classification Report

The classification report shows precision, recall, and F1-score for each product category. Here’s the breakdown of the report:

Category Precision Recall F1-score Support
Automotive 0.98 0.99 0.99 202
Baby Care 0.94 0.89 0.91 36
Bags, Wallets & Belts 1.00 0.95 0.97 41
Clothing 1.00 1.00 1.00 1051
Computers 0.96 0.97 0.96 95
Footwear 1.00 1.00 1.00 206
Home Decor & Festive Needs 0.98 0.97 0.97 126
Jewellery 1.00 1.00 1.00 668
Kitchen & Dining 0.95 0.99 0.97 112
Mobiles & Accessories 0.98 0.98 0.98 150
Pens & Stationery 0.91 0.71 0.79 68
Tools & Hardware 1.00 0.98 0.99 66
Toys & School Supplies 0.70 0.85 0.76 46
Watches 1.00 1.00 1.00 120
Overall accuracy     0.98 2987

Confusion Matrix

The confusion matrix provides a visual breakdown of how well the model performs across different categories. It shows the number of true positive, false positive, true negative, and false negative predictions for each category.

Here is a heatmap visualization of the confusion matrix:

Confusion Matrix Heatmap

In this confusion matrix:


Conclusion

The model performed exceptionally well, achieving a high accuracy and excellent precision, recall, and F1-scores across most categories. Categories like Clothing, Jewellery, and Watches saw perfect classification results, while categories like Toys & School Supplies had slightly lower performance.


Future Work

Possible enhancements include: