With easy access to the internet and a growing interest in cooking favourite dishes at home,
food blogs have become highly relevant today. From one-pot meals to innovative five-star
entrees, from easy snacks to traditional regional dishes, people now search for recipes by
name, under specific conditions, or by the ingredients they have readily available. Retrieving
the most relevant recipes by ingredients, rating and cooking time has therefore gained
prominence.
After the extraction, cleaning, filtering and transformation phases, the data is ready for
further processing. For ranking and querying, we used three different approaches.
- Cosine Similarity
The classic cosine similarity approach is used to find the matching strings (titles) and output the top 10.
- Enumerated Index-based String Matching
Enumerated string lists are taken and index-based string matching is done to display the results.
- Fuzzy String Matching
Using the fuzzywuzzy library, matching is done by approximation. Levenshtein distance is the basic metric, and each match is scored as a ratio between two strings. Four different scorers were used here:
a) W Ratio
b) Partial Ratio
c) Token Sort Ratio
d) Token Set Ratio
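The cosine similarity approach listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes titles are compared as simple bag-of-words term counts, and the names `tokenize`, `cosine_similarity` and `top_k_titles` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a title and split it on whitespace (assumed tokenisation)."""
    return text.lower().split()


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two term-count vectors stored as Counters."""
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(value * value for value in counts_a.values()))
    norm_b = math.sqrt(sum(value * value for value in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty title matches nothing
    return dot / (norm_a * norm_b)


def top_k_titles(query, titles, k=10):
    """Return the k (score, title) pairs with the highest similarity to the query."""
    query_counts = Counter(tokenize(query))
    scored = [
        (cosine_similarity(query_counts, Counter(tokenize(title))), title)
        for title in titles
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical titles, used only to exercise the function.
sample_titles = [
    "Chicken Curry",
    "Paneer Butter Masala",
    "Butter Chicken",
    "Chocolate Cake",
]
for score, title in top_k_titles("butter chicken", sample_titles, k=2):
    print(f"{score:.2f}  {title}")
```

A TF-IDF weighting could be swapped in for the raw counts without changing the ranking logic.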
In phase 3, the preprocessed data is used to cluster the recipes and recommend similar ones.
The recommendations were evaluated using the ‘precision at k’ metric.
The clustering technique was inspired by the paper ‘Hierarchical Clustering for Collaborative
Filtering Recommender Systems’ by Chalco et al.
For the agglomerative clustering, the data was vectorised and clustered with the number of
clusters ranging from 2 to 10.
The optimal number of clusters was found to be 7, based on the average silhouette score of the
clusters. As shown above, the silhouette score reaches its peak only when the number of clusters
is 7. A dendrogram of the data was also plotted to cross-check the optimal number of clusters.
The number of clusters was therefore fixed at 7 for the recommendation work that follows.
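The cluster-count selection described above can be sketched with scikit-learn as follows. This is an illustration under assumptions: the helper name `best_cluster_count` is hypothetical, the default linkage is used because the linkage is not stated here, and a small toy array stands in for the vectorised recipe data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score


def best_cluster_count(features, k_values=range(2, 11)):
    """Try each cluster count and return (best_k, scores).

    scores maps each k to the average silhouette score of the
    agglomerative clustering with k clusters.
    """
    scores = {}
    for k in k_values:
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy stand-in for the vectorised recipes: three tight, well-separated groups.
# Real text vectors (e.g. a sparse TF-IDF matrix) would need converting to a
# dense array first, since AgglomerativeClustering expects dense input.
toy_features = np.array(
    [
        [0, 0], [0, 1], [1, 0], [1, 1],
        [10, 0], [10, 1], [11, 0], [11, 1],
        [0, 10], [0, 11], [1, 10], [1, 11],
    ],
    dtype=float,
)

best_k, scores = best_cluster_count(toy_features)
for k in sorted(scores):
    print(k, round(scores[k], 3))
print("best k:", best_k)
```

A dendrogram for the same data could be drawn with `scipy.cluster.hierarchy.linkage` and `scipy.cluster.hierarchy.dendrogram` as a visual cross-check.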
When the user queries for a particular recipe, that recipe is searched for and retrieved.
Along with it, a set of recommended dishes similar to the queried one is also retrieved.
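One plausible way to produce such recommendations is to return other recipes from the same cluster as the retrieved recipe. The sketch below assumes exactly that; the selection rule is not spelled out here, so both the rule and the function name `recommend_from_cluster` are assumptions made for illustration.

```python
def recommend_from_cluster(query_index, titles, labels, k=5):
    """Return up to k titles that share a cluster with the queried recipe.

    titles and labels are parallel lists; labels[i] is the cluster id
    assigned to titles[i]. The queried recipe itself is excluded.
    """
    query_cluster = labels[query_index]
    same_cluster = [
        title
        for index, title in enumerate(titles)
        if labels[index] == query_cluster and index != query_index
    ]
    return same_cluster[:k]


# Hypothetical titles and cluster labels, used only to exercise the function.
titles = ["Butter Chicken", "Chicken Curry", "Chocolate Cake", "Paneer Tikka", "Brownies"]
labels = [0, 0, 1, 0, 1]
print(recommend_from_cluster(0, titles, labels, k=2))
```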
The retrieved and recommended dishes were evaluated using the ‘precision at k’ metric, where
‘k’ is the number of recommendations made to the user. A dish was considered relevant if it
belongs to the same cuisine as the recipe most relevant to the query, and the precision at k
was computed for the given query on that basis.
This proxy for relevance was used because no ground-truth information on whether a recipe was
relevant was available. For the same reason, other metrics such as recall were not computed.
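The evaluation described above can be sketched as follows, using the same-cuisine rule as the relevance proxy. The function name `precision_at_k` and the sample cuisines are hypothetical.

```python
def precision_at_k(recommended_cuisines, target_cuisine, k):
    """Fraction of the top-k recommendations that share the target cuisine.

    recommended_cuisines lists the cuisine of each recommended dish in
    ranked order; target_cuisine is the cuisine of the recipe judged most
    relevant to the query. If fewer than k dishes are available, the
    missing slots count as misses because the denominator stays k.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = recommended_cuisines[:k]
    hits = sum(1 for cuisine in top_k if cuisine == target_cuisine)
    return hits / k


# Hypothetical recommendation list for a query whose top match is Indian.
recommended = ["Indian", "Indian", "Italian", "Indian", "Thai"]
print(precision_at_k(recommended, "Indian", k=5))
```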
beautifulsoup4==4.7.1
fuzzywuzzy==0.18.0
matplotlib==3.4.2
numpy==1.19.1
pandas==1.2.0
requests==2.22.0
scikit-learn==0.23.1
scipy==1.5.4