
Building a Recommendation System with Hugging Face Transformers

Image by jcomp on Freepik

 

In the modern era, we rely on the software in our phones and computers. Many applications, such as e-commerce, movie streaming, and gaming platforms, have changed how we live by making everyday tasks easier. To improve the experience further, these businesses often build recommendation features on top of their data.


The basis of a recommendation system is to predict what the user might be interested in based on some input. The system provides the closest items based on either the similarity between the items (content-based filtering) or the users' behavior (collaborative filtering).
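To make the distinction concrete, here is a toy sketch of content-based filtering with made-up items: each item is described by genre tags, and we recommend whichever item overlaps most with one the user liked (Jaccard similarity). Collaborative filtering would instead score items from a user-item interaction matrix.

<code># Toy content-based filtering with made-up items (illustration only)
items = {
    "Anime A": {"Comedy", "School"},
    "Anime B": {"Comedy", "Slice of Life"},
    "Anime C": {"Action", "Sci-Fi"},
}

def jaccard(a, b):
    # Overlap of two tag sets: |intersection| / |union|
    return len(a & b) / len(a | b)

liked = "Anime A"
scores = {name: jaccard(items[liked], tags)
          for name, tags in items.items() if name != liked}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
# [('Anime B', 0.333...), ('Anime C', 0.0)]</code>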

Among the many approaches to recommendation system architecture, one option is the Hugging Face Transformers package. If you are not familiar with it, Hugging Face Transformers is an open-source Python package that provides APIs to easily access pre-trained NLP models supporting tasks such as text processing, generation, and many others.

This article will use the Hugging Face Transformers package to develop a simple recommendation system based on embedding similarity. Let’s get started.

 

Develop a Recommendation System with Hugging Face Transformers

 
Before we start the tutorial, we need to install the required packages. To do that, you can use the following code:

<code>pip install transformers torch pandas scikit-learn</code>

 

For the Torch installation, you can select the version suitable for your environment via the PyTorch website; an example command is shown below.
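For example, at the time of writing, a CPU-only installation looks roughly like this (check the PyTorch website for the exact command matching your OS and CUDA version):

<code>pip install torch --index-url https://download.pytorch.org/whl/cpu</code>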

As for the example dataset, we will use the anime recommendation dataset from Kaggle.

Once the environment and the dataset are ready, we can start the tutorial. First, we need to read the dataset and prepare it.

<code>import pandas as pd

# Read the dataset and drop rows with missing values
df = pd.read_csv('anime.csv')
df = df.dropna()

# Combine the available metadata into one text feature
# (astype(str) guards against the episodes column being read as numeric)
df['description'] = (
    df['name'] + ' ' + df['genre'] + ' ' + df['type']
    + ' episodes: ' + df['episodes'].astype(str)
)</code>

 

In the code above, we read the dataset with Pandas and drop all rows with missing data. Then we create a feature called “description” that combines the available information, such as the name, genre, type, and episode count; this new column becomes the basis for our recommendation system. It would be better to have more complete information, such as the anime plot and summary, but let's be content with this for now. A quick look at the new column is shown below.
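Before embedding, it is worth inspecting what the combined text looks like; the exact output depends on your copy of the dataset:

<code># Inspect the first combined description (output depends on the dataset)
print(df['description'].iloc[0])</code>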

Next, we will use Hugging Face Transformers to load an embedding model and transform the text into numerical vectors. Specifically, we will use sentence embeddings, which represent a whole sentence as a single vector.

The recommendation system will be based on the embeddings of all the anime “descriptions”, which we will compute shortly. We will use cosine similarity, which measures the similarity of two vectors. By measuring the similarity between each anime “description” embedding and the embedding of the user's query, we can recommend the most relevant items.

The embedding similarity approach sounds simple, but it can be powerful compared to classic recommendation models, as it captures the semantic relationships between words and provides contextual meaning for the recommendation process. A small worked example of cosine similarity follows.
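As a quick refresher, cosine similarity is the dot product of two vectors divided by the product of their norms, so it measures direction rather than magnitude. A minimal sketch with made-up vectors:

<code>import numpy as np

# cos(a, b) = (a . b) / (||a|| * ||b||)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a (dot product is 0)

def cos_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cos_sim(a, b))  # 1.0: identical direction, regardless of length
print(cos_sim(a, c))  # 0.0: no similarity</code>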

For this tutorial, we will use a sentence-transformers embedding model from Hugging Face. To transform sentences into embeddings, we will use the following code, which tokenizes the text, runs the model, and mean-pools the token embeddings.

<code>from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    # Average the token embeddings, ignoring padding via the attention mask
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

def get_embeddings(sentences):
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

    # Disable gradient tracking since we only run inference
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Pool token embeddings into one sentence embedding
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # L2-normalize so dot products equal cosine similarities
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

    return sentence_embeddings</code>

 

You can try the embedding process and inspect the resulting vectors with the following code. However, I will not show the output here, as it is pretty long.

<code>sentences = ['Some great movie', 'Another funny movie']
result = get_embeddings(sentences)
print("Sentence embeddings:")
print(result)</code>

 

To make things easier, Hugging Face maintains the sentence-transformers Python package for sentence embeddings, which condenses the whole transformation process into about three lines of code. Install the necessary package using the command below.

<code>pip install -U sentence-transformers</code>

 

Then, we can transform the whole anime “description” column with the following code.

<code>from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Encode every anime description into a fixed-size embedding vector
anime_embeddings = model.encode(df['description'].tolist())</code>
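If you want a sanity check, encode returns a NumPy array with one row per description; for all-MiniLM-L6-v2, each embedding has 384 dimensions:

<code>print(anime_embeddings.shape)
# (number_of_rows, 384) for all-MiniLM-L6-v2</code>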

 

With the embedding database ready, we can create a function that takes the user input and performs cosine similarity as a recommendation system.

<code>from sklearn.metrics.pairwise import cosine_similarity

def get_recommendations(query, embeddings, df, top_n=5):
    # Embed the query, score it against every item embedding,
    # and return the top_n rows with the highest similarity
    query_embedding = model.encode([query])
    similarities = cosine_similarity(query_embedding, embeddings)
    top_indices = similarities[0].argsort()[-top_n:][::-1]
    return df.iloc[top_indices]</code>
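A note on the design: argsort sorts every similarity score, which is fine at this dataset's scale. For a much larger catalog, you could select the top results without a full sort, for example with NumPy's argpartition; a sketch, not part of the original tutorial:

<code>import numpy as np

def top_k_indices(scores, k=5):
    # argpartition places the k largest scores last in O(n),
    # then we fully sort only those k entries
    idx = np.argpartition(scores, -k)[-k:]
    return idx[np.argsort(scores[idx])[::-1]]</code>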

 

Now that everything is ready, we can try the recommendation system. Here is an example of retrieving the top five anime recommendations for a user query.

<code>query = "Funny anime I can watch with friends"
recommendations = get_recommendations(query, anime_embeddings, df)
print(recommendations[['name', 'genre']])</code>

 

<code>Output>>
                                         name  \
7363  Sentou Yousei Shoujo Tasukete! Mave-chan   
8140            Anime TV de Hakken! Tamagotchi   
4294      SKET Dance: SD Character Flash Anime   
1061                        Isshuukan Friends.   
2850                       Oshiete! Galko-chan   

                                             genre  
7363  Comedy, Parody, Sci-Fi, Shounen, Super Power  
8140          Comedy, Fantasy, Kids, Slice of Life  
4294                       Comedy, School, Shounen  
1061        Comedy, School, Shounen, Slice of Life  
2850                 Comedy, School, Slice of Life </code>

 

All of the results are comedy anime, matching our request for something funny. Judging from the genres, most of them, with tags such as school and slice of life, are also suitable to watch with friends. Of course, the recommendations would be even better if we had more detailed information. Finally, if you plan to reuse the system, a small optional step is to cache the embeddings, as sketched below.
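As an optional extra step beyond the tutorial, you can save the computed embeddings so the catalog is not re-encoded on every run; a minimal sketch using a hypothetical file name:

<code>import numpy as np

# Save the item embeddings once (hypothetical file name)
np.save('anime_embeddings.npy', anime_embeddings)

# Later runs can reload them instead of re-encoding the catalog
anime_embeddings = np.load('anime_embeddings.npy')</code>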
 

Conclusion

 
A recommendation system is a tool for predicting what users might be interested in based on their input. Using Hugging Face Transformers, we built a recommendation system with an embedding and cosine similarity approach. The embedding approach is powerful because it can account for the text's semantic relationships and contextual meaning.
 
 

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
