Dataset of products from different online fashion retailers
Below are the tasks performed on this data:
Handling null values.
Discarding records with invalid IDs.
Extracting features from the columns 'price', 'brand' and 'media'.
Pre-processing the newly obtained features.
Removing products with an offer price greater than the regular price.
Checking for outliers and dropping them.
Removing records with invalid URLs.
Data visualization.
NLP on product descriptions.
Unsupervised clustering.
Importing required libraries
import ast  # used below to parse the dict-literal columns safely
import numpy as np
import pandas as pd
from pandas.io.json import json_normalize
import matplotlib.pyplot as plt
import seaborn as sns
import urllib
from urllib.parse import urlparse
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
from IPython.display import Image
from IPython.core.display import HTML
from sklearn import preprocessing
from sklearn.cluster import DBSCAN
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
from google.colab import drive
drive.mount('/content/gdrive')
path = '/content/gdrive/My Drive/'
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
data = pd.read_csv(path + 'data_engineering_assignment.txt', sep='|', error_bad_lines=False, warn_bad_lines=False)
data.head()
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py:2718: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
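Aside: 'error_bad_lines'/'warn_bad_lines' were deprecated in pandas 1.3 and removed in 2.0. On a newer pandas the equivalent call (a sketch, not run in this notebook) would be:

# on_bad_lines='skip' replaces error_bad_lines=False / warn_bad_lines=False;
# low_memory=False also addresses the mixed-type DtypeWarning shown above
data = pd.read_csv(path + 'data_engineering_assignment.txt', sep='|',
                   on_bad_lines='skip', low_memory=False)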
   _id                       name                                  price                                               website_id                sku     url                                           brand                                      media                                               description_text
0  5d0b8aca0db7220b86cb4035  Joules Top Dog Underwear Three Pack   {'offer_price': {'currency': 'GBP', 'value': 3...  5cff5e7fe40f4900046735fa  312838  www.next.co.uk/style/st355408#312838          {'name': 'joules', 'sub_brand': ''}        {'standard': [{'order': 1, 'url': 'https://xcd...  95% Cotton, 5% Elastane.
1  5d0b8aca0db7220b86cb4036  Figleaves Cheetah Satin Pyjama Set    {'offer_price': {'currency': 'GBP', 'value': 2...  5cff5e7fe40f4900046735fa  319571  https://www.next.co.uk/style/st324987#319571  {'name': 'figleaves', 'sub_brand': ''}     {'standard': [{'order': 1, 'url': 'https://xcd...  100% Polyester.
2  5d0b8aca0db7220b86cb4037  Nike Solid 4" Swim Short              {'offer_price': {'currency': 'GBP', 'value': 1...  5cff5e7fe40f4900046735fa  335026  https://www.next.co.uk/style/st400645#335026  {'name': 'nike', 'sub_brand': ''}          {'standard': [{'order': 1, 'url': 'https://xcd...  Nike Swim Boys' Solid Lap 4 Volley Short is a...
3  5d0b8aca0db7220b86cb4038  Collection Luxe Orchid                {'offer_price': {'currency': 'GBP', 'value': 1...  5cff5e7fe40f4900046735fa  552266  https://www.next.co.uk/style/st262195#552266  {'name': 'next', 'sub_brand': ''}          {'standard': [{'order': 1, 'url': 'https://xcd...  Height 85cm
4  5d0b8aca0db7220b86cb4039  River Island White Sleeveless Blazer  {'offer_price': {'currency': 'GBP', 'value': 5...  5cff5e7fe40f4900046735fa  680971  https://www.next.co.uk/style/st440132#680971  {'name': 'river island', 'sub_brand': ''}  {'standard': [{'order': 1, 'url': 'https://xcd...  100% Polyester.
Checking null values per column
data.isna().sum()
_id                    0
name                5163
price               5163
website_id          5163
sku                 5576
url                 5163
brand               5163
media               5163
description_text    5631
dtype: int64
Rows with invalid IDs: stray description text has spilled into _id, and every other feature is missing
data[data.website_id.isna()].head()
    _id                                                 name  price  website_id  sku  url  brand  media  description_text
16  Canvas upper with TOMS toe-stitch, and elastic...   NaN   NaN    NaN         NaN  NaN  NaN    NaN    NaN
17  TOMS classic suede insole with cushion for com...   NaN   NaN    NaN         NaN  NaN  NaN    NaN    NaN
18  Latex arch insert for added support                  NaN   NaN    NaN         NaN  NaN  NaN    NaN    NaN
19  One-piece outsole for flexibility and durability     NaN   NaN    NaN         NaN  NaN  NaN    NaN    NaN
20  With every pair of shoes you purchase, TOMS wi...    NaN   NaN    NaN         NaN  NaN  NaN    NaN    NaN
Dropping rows with invalid IDs
data.drop(data[data.website_id.isna()].index, axis=0, inplace=True)
data = data.reset_index(drop=True)
data.isna().sum()
_id 0
name 0
price 0
website_id 0
sku 413
url 0
brand 0
media 0
description_text 468
dtype: int64
Null values for SKU codes and product descriptions
We still see a few null values for SKU codes and product descriptions.
We can replace them with 'No SKU Code' and 'No description' respectively, and retain the products' other features for analysis.
data.loc[data.sku.isna(), 'sku'] = 'No SKU Code'
data.loc[data.description_text.isna(), 'description_text'] = 'No description'
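The same replacement can be written more idiomatically with fillna; a minimal equivalent sketch:

data['sku'] = data['sku'].fillna('No SKU Code')
data['description_text'] = data['description_text'].fillna('No description')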
data.isna().sum()
_id 0
name 0
price 0
website_id 0
sku 0
url 0
brand 0
media 0
description_text 0
dtype: int64
A glance at the shape of the data we have
print('Rows :', data.shape[0], 'Columns :', data.shape[1])
Rows : 318278 Columns : 9
Feature extraction from the columns 'price', 'brand' and 'media'
The data in these columns are stored as JSON-like objects (Python dict literals),
so we can extract those key-value pairs and add them as new features, as illustrated below.
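Note that the stored values use single quotes, so they are Python dict literals rather than strict JSON; json.loads would reject them, while ast.literal_eval parses them safely (unlike eval, which would execute arbitrary code). A minimal sketch on one sample value from the table above:

sample = "{'offer_price': {'currency': 'GBP', 'value': 35.0}, 'regular_price': {'currency': 'GBP', 'value': 35.0}}"
parsed = ast.literal_eval(sample)  # safe: accepts only Python literals
print(json_normalize(parsed))      # one row with columns offer_price.currency, offer_price.value, ...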
Extracting from column 'price'
data['price'] = data['price'].apply(ast.literal_eval)  # literal_eval parses dict literals without executing code
price_data = json_normalize(data['price'])
data.drop('price', axis=1, inplace=True)
price_data.head()
   offer_price.currency  offer_price.value  regular_price.currency  regular_price.value  basket_price.value  basket_price.currency
0  GBP                   35.0               GBP                     35.0                 NaN                 NaN
1  GBP                   25.0               GBP                     25.0                 NaN                 NaN
2  GBP                   15.0               GBP                     15.0                 NaN                 NaN
3  GBP                   150.0              GBP                     150.0                NaN                 NaN
4  GBP                   55.0               GBP                     55.0                 NaN                 NaN
price_data['offer_price.currency'].unique()
array(['GBP'], dtype=object)
The currency is either GBP or NaN, so we drop the currency columns from price_data
drop = ['offer_price.currency', 'regular_price.currency', 'basket_price.currency']
price_data.drop(columns=drop, inplace=True)
price_data.head()
   offer_price.value  regular_price.value  basket_price.value
0  35.0               35.0                 NaN
1  25.0               25.0                 NaN
2  15.0               15.0                 NaN
3  150.0              150.0                NaN
4  55.0               55.0                 NaN
Extracting from column 'brand'
data['brand'] = data['brand'].apply(ast.literal_eval)
brand_data = json_normalize(data['brand'])
data.drop('brand', axis=1, inplace=True)
brand_data.head()
   name          sub_brand
0  joules
1  figleaves
2  nike
3  next
4  river island
We have some missing values in the form of empty strings.
Replacing them with 'No brand name' for name and 'No sub-brand name' for sub_brand.
brand_data.loc[brand_data.name.values == '', 'name'] = 'No brand name'
brand_data.loc[brand_data.sub_brand.values == '', 'sub_brand'] = 'No sub-brand name'
brand_data.head()
   name          sub_brand
0  joules        No sub-brand name
1  figleaves     No sub-brand name
2  nike          No sub-brand name
3  next          No sub-brand name
4  river island  No sub-brand name
We already have a column named 'name' for the product name,
so we rename the brand columns to 'brand_name' and 'sub_brand_name'.
brand_data.columns = ['brand_name', 'sub_brand_name']
Extracting from column 'media'
This column contains three keys: standard, thumbnail and large.
We extract the first URL under each key and create a column in data for it.
data['media'] = data['media'].apply(ast.literal_eval)
standard_url = []
thumbnail_url = []
large_url = []
for val in data['media']:
    try:
        standard_url.append(val['standard'][0]['url'])
    except IndexError:
        standard_url.append('No media URL')
    try:
        thumbnail_url.append(val['thumbnail'][0]['url'])
    except (IndexError, KeyError):
        thumbnail_url.append('No media URL')
    try:
        large_url.append(val['large'][0]['url'])
    except (IndexError, KeyError):
        large_url.append('No media URL')
data['standard_media_url'] = standard_url
data['thumbnail_media_url'] = thumbnail_url
data['large_media_url'] = large_url
data.drop('media', axis=1, inplace=True)
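The three near-identical try/except blocks could be folded into one helper; a sketch of the same logic (first_url is a name introduced here, not from the notebook):

def first_url(media, key):
    # return the first URL stored under media[key], or a placeholder
    # when the key is missing or its list is empty
    try:
        return media[key][0]['url']
    except (KeyError, IndexError):
        return 'No media URL'

# usage against the parsed 'media' dicts, before that column is dropped:
# data['standard_media_url'] = data['media'].apply(lambda m: first_url(m, 'standard'))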
Again, a check on the shapes before concatenating them with the original data
data.shape, price_data.shape, brand_data.shape
((318278, 9), (318278, 3), (318278, 2))
Concatenating all obtained features with the original data
Also a quick glance at our data with the new features
data = pd.concat([data, price_data, brand_data], axis=1)
data.head()
'_id' : product IDs
'name' : product names
'website_id' : IDs of the company websites
'sku' : SKU codes for particular products
'url' : URLs of the products listed on the websites
'description_text' : product descriptions
'media' : URLs for standard, large and thumbnail media
'price' : product price in GBP (offer and regular prices)
'brand' : brand name of the product
Data pre-processing on the newly obtained features
Again a null-value check on the new columns, this time as percentages
data.isna().sum() / data.shape[0] * 100
_id 0.000000
name 0.000000
website_id 0.000000
sku 0.000000
url 0.000000
description_text 0.000000
standard_media_url 0.000000
thumbnail_media_url 0.000000
large_media_url 0.000000
offer_price.value 0.104311
regular_price.value 0.104311
basket_price.value 65.510340
brand_name 0.000000
sub_brand_name 0.000000
dtype: float64
65.51% of 'basket_price.value' values are null, so we drop that column.
data.drop('basket_price.value', axis=1, inplace=True)
'offer_price.value' and 'regular_price.value' are only ~0.10% null, so we can drop those particular rows.
data.drop(data[data['regular_price.value'].isna()].index, axis=0, inplace=True)
data = data.reset_index(drop=True)
data.isna().sum()
_id 0
name 0
website_id 0
sku 0
url 0
description_text 0
standard_media_url 0
thumbnail_media_url 0
large_media_url 0
offer_price.value 0
regular_price.value 0
brand_name 0
sub_brand_name 0
dtype: int64
data.describe(include='all')
Summary statistics (transposed for readability):

                     count   unique  top                                                 freq
_id                  317946  317946  5d49a619a3ecbbd90d63ff98                            1
name                 317946  257399  logo print T-shirt                                  450
website_id           317946  4       5d0cc7b68a66a100014acdb0                            109773
sku                  317946  299797  No SKU Code                                         412
url                  317946  317946  https://www.farfetch.com/uk/shopping/men/eleve...  1
description_text     317946  275543  100% Cotton.                                        2213
standard_media_url   317946  311110  No media URL                                        58
thumbnail_media_url  317946  238074  No media URL                                        78689
large_media_url      317946  108855  No media URL                                        208173
brand_name           317946  4776    next                                                25154
sub_brand_name       317946  1       No sub-brand name                                   317946

                     count   mean    std      min  25%   50%   75%    max
offer_price.value    317946  223.95  2580.08  0.0  15.0  36.0  193.0  1026484.0
regular_price.value  317946  264.35  3272.17  0.0  20.0  45.0  240.0  1466406.0
'sub_brand_name' has only 1 unique value, so we drop that column
data.drop('sub_brand_name', axis=1, inplace=True)
'website_id' has 4 unique IDs, so we can extract the website name from the associated URLs and create a column for analysis.
data['website_name'] = data['url'].apply(lambda link: urlparse(link).netloc)
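A quick sanity check on how urlparse behaves, using URLs from the sample rows above: when the scheme is missing, the whole string lands in path and netloc comes back empty. This is also why such rows fail the URL validation further below.

print(urlparse('https://www.next.co.uk/style/st324987#319571').netloc)  # 'www.next.co.uk'
print(urlparse('www.next.co.uk/style/st355408#312838').netloc)          # '' - no scheme, so netloc is empty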
Checking for rows with an offer price greater than the regular price
print('Total records with offer price greater than regular price: ', data.loc[data['offer_price.value'] > data['regular_price.value']].shape[0])
Total records with offer price greater than regular price: 14
data.loc[data['offer_price.value'] > data['regular_price.value']].head()
Removing records with offer price value greater than regular price value
data.drop(data.loc[data['offer_price.value'] > data['regular_price.value']].index, axis=0, inplace=True)
data = data.reset_index(drop=True)
print('Total records with offer price greater than regular price: ', data.loc[data['offer_price.value'] > data['regular_price.value']].shape[0])
data.shape
Total records with offer price greater than regular price: 0
(317932, 13)
def boxPlot_single(numColumn):
    plt.rcParams['figure.figsize'] = (10.0, 5.0)
    plt.title('Box plot for ' + numColumn)
    sns.set(style="whitegrid")
    ax = sns.boxplot(data[numColumn])

def boxPlot_multi(numColumn, catColumn):
    plt.rcParams['figure.figsize'] = (20.0, 10.0)
    plt.title('Box plot for ' + numColumn + ' per ' + catColumn)
    sns.set(style="whitegrid")
    ax = sns.boxplot(x=catColumn, y=numColumn, data=data)
    # restyle each box: outline it in its own colour instead of filling it
    for i, artist in enumerate(ax.artists):
        col = artist.get_facecolor()
        artist.set_edgecolor(col)
        artist.set_facecolor('None')
        # each box owns 6 associated Line2D objects (whiskers, caps, median)
        for j in range(i * 6, i * 6 + 6):
            line = ax.lines[j]
            line.set_color(col)
            line.set_mfc(col)
            line.set_mec(col)
Checking for outliers in price
boxPlot_single('offer_price.value')
def removeOutliers(price, data):
    print('Total records with offer price greater than ', price, ': ', data.loc[data['offer_price.value'] > price].shape[0])
    print('Removing records...')
    data.drop(data.loc[data['offer_price.value'] > price].index, axis=0, inplace=True)
    # note: this rebinding only affects the local name `data`, so the caller's
    # DataFrame keeps its gapped index (it is reset explicitly later on)
    data = data.reset_index(drop=True)
    print('Total records with offer price greater than ', price, ': ', data.loc[data['offer_price.value'] > price].shape[0])

removeOutliers(5000, data)
Total records with offer price greater than 5000 : 706
Removing records...
Total records with offer price greater than 5000 : 0
boxPlot_single('offer_price.value')
removeOutliers(200, data)
Total records with offer price greater than 200 : 76756
Removing records...
Total records with offer price greater than 200 : 0
boxPlot_single('offer_price.value')
print('Records left after removing outliers (offer price greater than 200 GBP): ', data.shape[0])
print('Records dropped in percentage :', round((1 - data.shape[0] / 317932) * 100, 2))
Records left after removing outliers (offer price greater than 200 GBP):  240470
Records dropped in percentage : 24.36
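The 5000 and 200 GBP cut-offs are eyeballed from the box plots. For reference, a data-driven threshold could use the classic 1.5 x IQR rule instead; a minimal sketch (not what this notebook does):

q1, q3 = data['offer_price.value'].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)   # standard box-plot whisker limit
data = data[data['offer_price.value'] <= upper_fence].reset_index(drop=True)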
Even after removing most of the outliers, the distribution is still right-skewed: most offer prices are small, with a long tail toward 200 GBP.
In total, 24.36% of the records were dropped.
plt.title('Offer Price Distribution')
sns.distplot(data['offer_price.value'])
<matplotlib.axes._subplots.AxesSubplot at 0x7fe919bd56a0>
Box plot for offer price by each website
boxPlot_multi('offer_price.value', 'website_name')
Dropping records with invalid URLs
def validate_url(url):
    try:
        result = urlparse(url)
        # a URL counts as valid only if both a scheme and a host are present
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

data['valid_url'] = data['url'].apply(validate_url)
print('Total records with invalid URLs : ', data[data['valid_url'] == False].shape[0])
data[data['valid_url'] == False].index
Total records with invalid URLs : 16
Int64Index([ 0, 9809, 19609, 29455, 39327, 49116, 58949, 68781,
138077, 148065, 158049, 178027, 217981, 257936, 267936, 307932],
dtype='int64')
data.drop(data[data['valid_url'] == False].index, axis=0, inplace=True)
data = data.reset_index(drop=True)
print('Total records with invalid URLs : ', data[data['valid_url'] == False].shape[0])
data.drop('valid_url', axis=1, inplace=True)
Total records with invalid URLs : 0
Bar graphs for top categories
def barGraph_count(catColumn, top=5):
    temp = data[catColumn].value_counts().reset_index()
    temp.columns = [catColumn, 'count']
    temp = temp.head(top)
    ax = sns.barplot(x=catColumn, y='count', data=temp)
    del temp
Top 10 product names
'logo print T-shirt' and 'logo T-shirt' top the chart with around 375 and 195 products respectively
barGraph_count('name', top=10)
Top 10 brand names
'next' is on top with around 22,500 products
barGraph_count('brand_name', top=10)
barGraph_count('website_name')
Bar graphs for top discounts
def barGraph_discount(catColumn, top=10, percent=50):
    # note: the `percent` parameter is currently unused
    temp = data.groupby(catColumn)[['offer_price.value', 'regular_price.value']].mean()
    temp = temp.reset_index()
    temp['discount_percent'] = round((1 - temp['offer_price.value'] / temp['regular_price.value']) * 100, 2)
    temp.sort_values(by=['discount_percent'], inplace=True, ascending=False)
    temp = temp.loc[:, [catColumn, 'discount_percent']].head(top)
    ax = sns.barplot(x=catColumn, y='discount_percent', data=temp, palette="Blues_d")
    del temp
Top discounts by brand name
Top 10: projectt, seeker, kowtow, parka london and others, with discounts around 77-84%
barGraph_discount('brand_name', top=10, percent=50)
barGraph_discount('website_name', top=6, percent=0)
Text pre-processing on 'description_text'
Functions to remove punctuation and stop words, and to lemmatize
def remove_punctuation(text):
    # keep only purely alphabetic tokens
    return [word for word in text if word.isalpha()]

def remove_punctuation_from_word(text):
    # strip a single trailing punctuation character from each token
    token = []
    for word in text:
        if word[-1].isalpha():
            token.append(word)
        else:
            token.append(word[:-1])
    return token

stop_words = stopwords.words('english')

def remove_stopword(text):
    return [w for w in text if w not in stop_words]

def lemmatizing(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in text]
Here we process 'description_text' in five steps:
convert to lowercase -> tokenize -> remove punctuation -> remove stop words -> lemmatize
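To make the chain concrete, here is the pipeline applied to a single description ('100% Polyester.' appears in the sample rows above and in the processed table below):

sample = '100% Polyester.'
tokens = word_tokenize(sample.strip().lower())   # ['100', '%', 'polyester', '.']
tokens = remove_punctuation(tokens)              # ['polyester']
tokens = remove_punctuation_from_word(tokens)    # unchanged here
tokens = remove_stopword(tokens)                 # unchanged here
print(lemmatizing(tokens))                       # ['polyester']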
data['processed_text'] = data['description_text']
print('Converting to lower case...')
data['processed_text'] = [text.strip().lower() for text in data['processed_text']]
print('Done')
print('Tokenizing...')
data['processed_text'] = [word_tokenize(text) for text in data['processed_text']]
print('Done')
print('Removing punctuation...')
data['processed_text'] = [remove_punctuation(text) for text in data['processed_text']]
data['processed_text'] = [remove_punctuation_from_word(text) for text in data['processed_text']]
print('Done')
print('Removing Stop words...')
data['processed_text'] = [remove_stopword(text) for text in data['processed_text']]
print('Done')
print('Lemmatizing...')
data['processed_text'] = [lemmatizing(text) for text in data['processed_text']]
print('Text processing Done')
Converting to lower case...
Done
Tokenizing...
Done
Removing punctuation...
Done
Removing Stop words...
Done
Lemmatizing...
Text processing Done
Quick look at processed text
data.loc[:, ['description_text', 'processed_text']].head(10)
   description_text                                    processed_text
0  100% Polyester.                                     [polyester]
1  Nike Swim Boys' Solid Lap 4 Volley Short is a ...   [nike, swim, boy, solid, lap, volley, short, d...
2  Height 85cm                                         [height]
3  100% Polyester.                                     [polyester]
4  In a leopard print this tie up heel.                [leopard, print, tie, heel]
5  70% Viscose, 26% Nylon, 4% Elastane.                [viscose, nylon, elastane]
6  Wish your loved one a Hoppy Birthday with this...   [wish, loved, one, hoppy, birthday, beer, hawk...
7  With adjustable strap fastening.                    [adjustable, strap, fastening]
8  18 carat gold-plated sterling silver. Comes in...   [carat, sterling, silver, come, gift, box]
9  With blackout lining.                               [blackout, lining]
Word Cloud with processed text and Standard media URL
def img_with_wordcloud(row):
    word_cloud = WordCloud(max_words=5, background_color='white').generate(' '.join(data['processed_text'][row]))
    plt.figure(figsize=(5, 5), facecolor='white', edgecolor='blue')
    plt.imshow(word_cloud)
    plt.axis('off')
    plt.tight_layout(pad=0)
    plt.show()
    display(Image(url=data['standard_media_url'][row]))
Creating tags for products from the processed text
As we can see, the processed text can almost serve as tags for the product images.
So we create tag columns with the top 3 words from each processed text.
def get_tag(text):
    tag = {}
    try:
        # WordCloud.words_ is a dict ordered by normalised frequency, highest first
        tagDict = WordCloud(max_words=3).generate(' '.join(text)).words_
        tag['tag_1'] = list(tagDict.keys())[0]
        try:
            tag['tag_2'] = list(tagDict.keys())[1]
        except IndexError:
            tag['tag_2'] = ''
        try:
            tag['tag_3'] = list(tagDict.keys())[2]
        except IndexError:
            tag['tag_3'] = ''
        return tag
    except ValueError:
        # generate() raises ValueError when the processed text is empty
        return {}
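Building a WordCloud per row is heavyweight for what is essentially a frequency count. A lighter sketch with the same intent (get_tag_counter is a hypothetical alternative; it ignores WordCloud extras such as collocations):

from collections import Counter

def get_tag_counter(tokens):
    top = [w for w, _ in Counter(tokens).most_common(3)]
    top += [''] * (3 - len(top))  # pad so every product gets three tag slots
    return {'tag_1': top[0], 'tag_2': top[1], 'tag_3': top[2]}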
# tags = [get_tag(x) for x in data['processed_text']]
# tags = pd.DataFrame(tags)
# tags.to_csv(path + 'tags.csv', index=False)
tags = pd.read_csv(path + 'tags.csv', header=0)
tags.head()
   tag_1      tag_2  tag_3
0  polyester  NaN    NaN
1  lap        short  swimming
2  height     NaN    NaN
3  polyester  NaN    NaN
4  leopard    print  tie
def barGraph_tags(top=5):
    # note: the `top` parameter is currently unused; the plot always shows the 10 most common tags
    temp = pd.concat([tags['tag_1'], tags['tag_2'], tags['tag_3']]).value_counts()
    temp = pd.DataFrame(temp)
    temp = temp.reset_index()
    temp = temp.head(10)
    ax = sns.barplot(x='index', y=0, data=temp)
'Cotton' is the most used descriptive word, followed by 'look', 'design' and 'black'
Adding columns tag_1, tag_2 and tag_3 to data
Dropping column 'processed_text'
data.drop('processed_text', axis=1, inplace=True)
tags = tags.replace(np.nan, '')
data = pd.concat([data, tags], axis=1)
Let's check whether we can cluster products based on the brand_name, website_name and tag columns
Taking a subset of the data with those columns as cluster_data
cluster_data = data.loc[:, ['brand_name', 'website_name', 'tag_1', 'tag_2', 'tag_3']]
cluster_data.head()
# label-encode each categorical column into integer codes
le = preprocessing.LabelEncoder()
for c in cluster_data.columns:
    cluster_data[c] = le.fit_transform(cluster_data[c])
cluster_data.head()
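One caveat: label encoding maps categories to arbitrary integers, so the Euclidean distances DBSCAN computes between these codes carry no intrinsic meaning. A one-hot encoding would avoid the artificial ordering at the cost of far more dimensions; a sketch of that alternative:

# hypothetical alternative, applied to the string columns before any encoding
onehot = pd.get_dummies(data.loc[:, ['brand_name', 'website_name', 'tag_1', 'tag_2', 'tag_3']])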
   brand_name  website_name  tag_1  tag_2  tag_3
0        1399             4   4028      0      0
1        2561             4   2903   5761   7308
2        2550             4   2407      0      0
3        3011             4   4028      0      0
4        1350             4   2973   4992   7529
Applying DBSCAN, since we don't know in advance how many product categories there are
db = DBSCAN(eps=1, min_samples=100).fit(cluster_data)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Number of clusters: ', n_clusters_)
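A quick way to inspect the outcome is to count points per label; in DBSCAN, the label -1 marks noise:

from collections import Counter
print(Counter(labels))  # size of each cluster; key -1 = noise points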
Creating a cluster-number column for each product
data['clusters'] = labels
And a check on a few sample clusters via their standard media URLs
def checkClusters(clusterNumber):
    temp = data[data['clusters'] == clusterNumber].head()
    for url in temp['standard_media_url'].values:
        display(Image(url=url))
Cluster with adult footwear
Cluster with womenswear of one specific dress type
Cluster with bedsheets and pillow covers
Cluster with kids' tops
Cluster with kids' footwear
Cluster with home decor, specifically flowers
We can say that clustering on 'brand_name', 'website_name' and the tags generated from product descriptions performed well.