perfume_project

The Problem Statement:

When a user logs in to a fragrance website, print a list of perfumes they might like to try next.

Where Do These Recommendations Come From?

Using cosine similarities, find the user most similar to the logged-in user, find the perfumes that similar user has reviewed and liked but the logged-in user has not yet tried, and print those.
If the logged-in user turns out to be one of the biggest reviewers, that is, someone who has tried a lot of perfumes, the most similar user may not have tried as many, and the list of recommendations will be empty. If that happens, we find the 3 most similar perfumes for each of the top 3 perfumes the logged-in user liked, and print that list of 9 perfumes to try next.
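The two steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code: the `ratings` layout (`{user: {perfume: score}}`), the function names, and the rule that "liked" means a score above 0 are all assumptions made for the example.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(target, vectors, n=1):
    """Return the n keys whose vectors are closest to vectors[target]."""
    others = [k for k in vectors if k != target]
    others.sort(key=lambda k: cosine(vectors[target], vectors[k]), reverse=True)
    return others[:n]

def invert(ratings):
    """Turn {user: {perfume: score}} into {perfume: {user: score}}."""
    items = {}
    for user, scores in ratings.items():
        for perfume, score in scores.items():
            items.setdefault(perfume, {})[user] = score
    return items

def recommend(user, ratings, like_threshold=0.0):
    """User-based recommendations, falling back to item-based ones."""
    tried = ratings[user]
    # Step 1: perfumes the most similar user liked that `user` has not tried.
    neighbours = most_similar(user, ratings, n=1)
    if neighbours:
        recs = [p for p, s in ratings[neighbours[0]].items()
                if s > like_threshold and p not in tried]
        if recs:
            return recs
    # Step 2 (fallback): for each of the user's top 3 perfumes,
    # take the 3 most similar perfumes (at most 9 in total).
    items = invert(ratings)
    top3 = sorted(tried, key=tried.get, reverse=True)[:3]
    recs = []
    for perfume in top3:
        for similar in most_similar(perfume, items, n=3):
            if similar not in recs:  # assumption: drop duplicates
                recs.append(similar)
    return recs

# Hypothetical sample data: sentiment scores in [-1, 1].
ratings = {
    "u1": {"A": 0.9, "B": 0.8},
    "u2": {"A": 0.8, "B": 0.7, "C": 0.9, "D": -0.5},
    "u3": {"D": 0.9, "E": 0.6},
}
print(recommend("u1", ratings))
```

Because this sketch removes duplicates, the fallback can return fewer than 9 perfumes. It also does not exclude perfumes the user has already tried in the fallback step; whether to do so is left open here.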
Explanation
Why compare only to the most similar user, and not to the three most similar users or so?
There is a valid case for using the next most similar users, and I might add this to the model while capping the number of perfumes recommended, so as not to overwhelm the customer or lose credibility. But I was afraid this would stretch the predictions to the point where they are no longer relevant, since they rest on cosine similarities computed on top of VADER sentiment scores. The logged-in user might end up hating a recommendation and lose faith in the feature, which would then become a nuisance.
Extension
Of course, when this model is deployed to the fragrance website, recommendations should always include recently launched perfumes and new arrivals. These perfumes, however, need to share notes with the perfumes the logged-in user liked most, or share other features such as wearability in a certain season.
This latter case has not been addressed in this project; it would be its own project, or an extension to this one in the near future.

Mechanics:

Find cosine similarities between users, and find cosine similarities between items. Write functions that search on the criteria described above, then wrap everything in very easy-to-use functions.
Note
The collected review texts did not have user ratings attached. The VADER library was used to assign each review a sentiment score between -1 and 1, that is, from negative, through neutral, to positive.
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains. (Source: vaderSentiment GitHub)
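As an illustration of how a review could be turned into a score, here is a small sketch. It assumes the score used is VADER's `compound` value, which is the one that ranges from -1 to 1; the helper names and the 0.05 cut-offs for labelling (the convention suggested in the vaderSentiment README) are assumptions, not something this project is confirmed to use.

```python
def make_analyzer():
    """Create a VADER analyzer (requires `pip install vaderSentiment`)."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer()

def review_score(text, analyzer):
    """Return the compound sentiment score of a review, in [-1, 1]."""
    # polarity_scores returns a dict with "neg", "neu", "pos", "compound".
    return analyzer.polarity_scores(text)["compound"]

def sentiment_label(score, threshold=0.05):
    """Map a compound score to a coarse label (assumed cut-offs)."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# Usage, once vaderSentiment is installed:
#   analyzer = make_analyzer()
#   score = review_score("I love this scent, it lasts all day!", analyzer)
#   print(score, sentiment_label(score))
```

The analyzer is passed in as an argument so that it is built once and reused across all reviews rather than recreated per call.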

Other Mechanics to Be Attempted Next, or in Later Projects:

Manually label a few hundred or a few thousand reviews. Build a model to predict ratings for the rest of the reviews from text analysis. Then run the cosine similarity recommenders again.
Evaluate performance using field and content knowledge to assess the quality of the recommendations.
Explore how neural networks for unsupervised learning on text data can be used to group similar reviews about each perfume, and thereby group similar users. The same can be done for items.

PostScript: Getting The Data

Scraping: I gathered links to the most popular designers, then for each designer I collected links to the perfumes they made, then went to each page and collected the perfume name, the reviews, and the customer IDs (without customer names).
I ended up with over 5000 links. For ease of handling, I randomly selected 500 of these links to run the recommender on. That still yielded 33036 reviews.
This recommender should perform better with the whole dataset, which I intend to use in the near future.
Notes on the review text: in addition to cleaning unwanted characters out of the review text, I found many reviews not written in English. Whether this is a problem is up to you. VADER scores these reviews as neutral, so they can be filtered out by their neutral score; alternatively, they can be dropped after detecting the language with a language detection library. Be careful, though: the detector will misclassify some reviews as non-English if the text isn't squeaky clean, for example if it contains strange characters or stray spaces.
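A rough sketch of the clean-then-detect option is shown below. The function names and the cleaning rule are assumptions for illustration, and the language detector is passed in as a callable so that any library can be plugged in; for example, `detect` from the `langdetect` package returns codes such as "en" and raises an exception when it finds nothing to work with.

```python
import re

def clean_review(text):
    """Replace unusual symbols with spaces, then collapse whitespace."""
    # Keep letters, digits, whitespace and basic punctuation only.
    text = re.sub(r"[^\w\s.,!?'()-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def keep_english(reviews, detect):
    """Return cleaned reviews that `detect` classifies as English.

    `detect` is any callable mapping text to a language code,
    e.g. `from langdetect import detect` (an assumed choice).
    """
    kept = []
    for review in reviews:
        cleaned = clean_review(review)
        if not cleaned:
            continue  # nothing left after cleaning
        try:
            language = detect(cleaned)
        except Exception:
            continue  # detector could not decide; drop the review
        if language == "en":
            kept.append(cleaned)
    return kept
```

Cleaning before detection matters here because, as noted above, stray characters and odd spacing are what tend to push English reviews into the wrong language bucket.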