21  Transformers

The HuggingFace ecosystem in 2025: The transformers library now provides access to over 900,000 models on the Hub. Key developments since this chapter was first written:

  • Llama 3 / Llama 3.1 / Llama 3.2 — Meta’s open models significantly narrowed the gap with GPT-4. The 8B version is the new baseline for open-source general use.
  • Mistral 7B / Mixtral 8x7B — efficient models from Mistral AI; strong for code and reasoning relative to size
  • Gemma 2 (9B, 27B) — Google DeepMind’s open models; competitive with much larger predecessors
  • Phi-4 (14B) — Microsoft’s “small but mighty” model; punches above its weight on reasoning benchmarks
  • Qwen 2.5 — Alibaba’s 7B–72B series with strong multilingual support

The Inference API (now “Serverless Inference”) lets you call any hosted model via HTTP without loading weights locally. For production deployments, Inference Endpoints spins up a dedicated container for your model.
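The HTTP round trip is simple enough to sketch with the standard library alone. A hedged example follows: the model id, payload, and response shape are assumptions for illustration (check the Hub page of your chosen model for the exact payload it expects), and no request is actually sent until the function is called with a valid token:

```python
import json
import urllib.request

# Hypothetical model id; the URL pattern is the documented Serverless
# Inference endpoint at the time of writing.
MODEL_ID = "facebook/bart-large-cnn"
API_URL = f"https://api-inference.huggingface.co/models/{MODEL_ID}"

def summarize(text: str, token: str) -> str:
    """POST text to the hosted model and return the generated summary."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps({"inputs": text}).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Assumed response shape for a summarization model
        return json.load(resp)[0]["summary_text"]
```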

GPT-4o comparison: Many of the tasks shown in this chapter (summarization, NER, QA) can now be performed comparably by open 7–8B models via the HuggingFace pipeline when run on suitable hardware.

Consider the following two sentences:
- She waited at the river bank
- She was looking at her bank account

Under GloVe and Word2Vec embeddings, both uses of the word ‘bank’ would have the same vector representation. That is a problem, because ‘bank’ refers to two completely different things depending on context. Fixed embedding schemes such as Word2Vec cannot resolve this.

Transformer-based language models solve this by creating context-specific embeddings. Instead of a static word-to-vector lookup, they produce dynamic embeddings that depend on the context: you provide the model the entire sentence, and it returns an embedding for each word computed by paying attention to all the other words in whose context that word appears.

Transformers use an ‘attention mechanism’ to compute the embedding for a word in a sentence by also considering the words around it. By combining the transformer architecture with self-supervised learning, these models have achieved tremendous success, as is evident in the popularity of large language models. The transformer architecture has been successfully applied to vision and audio tasks as well, and is currently all the rage, to the point of displacing many earlier deep learning architectures.
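The mechanics of attention can be sketched in a few lines of NumPy. This toy example uses random vectors and skips the learned query/key/value projections a real transformer would apply, so it only illustrates the shape of the computation:

```python
import numpy as np

# Three tokens, four-dimensional embeddings (random stand-ins)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))

# Real transformers derive Q, K, V via learned projections;
# we use X itself to keep the sketch minimal.
Q = K = V = X

scores = Q @ K.T / np.sqrt(Q.shape[1])         # token-to-token similarity
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # softmax: each row sums to 1
contextual = weights @ V                       # context-aware vector per token

print(contextual.shape)                        # prints (3, 4)
```

Each row of `contextual` is a weighted mix of all token vectors, which is exactly why the same word can end up with different embeddings in different sentences.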

21.1 Attention is All You Need

A seminal 2017 paper by Vaswani et al. from the Google Brain team introduced and popularized the transformer architecture. The paper represented a turning point for deep learning practitioners, and transformers were soon applied to a wide variety of problems. The original paper can be downloaded from https://arxiv.org/abs/1706.03762.


The original paper on transformers makes difficult reading for a non-technical audience. A more intuitive and simpler explanation was provided by Jay Alammar in a blog post that became immensely popular and widely praised. Jay’s blog post is available at https://jalammar.github.io/illustrated-transformer/.


The core idea behind self-attention is to derive the embedding for a word from all the words that surround it, including the order in which they appear. There is a lot of matrix algebra involved, but the essence is to take into account the other words before and after a given word, and use their embeddings as weights in computing the context-sensitive embedding for that word.

This means the same word will have a different embedding vector when used in different sentences, and the model needs the entire sentence or document as input to compute the embedding for a word. These computations are compute-heavy, as the number of weights and biases explodes compared to a traditional FCNN or RNN. Transformer models are trained with self-supervision on large amounts of text (generally public domain text), and require computational capabilities beyond the reach of the average person. These models tend to have billions of parameters, and are appropriately called ‘Large Language Models’, or LLMs for short.

Large corporations such as Google, Facebook, OpenAI and others have built their own LLMs, some of which are open source, and others not. Models that are not open sourced can be accessed through APIs, which means users send their data to the LLM provider (such as OpenAI), and the provider returns the answer. These providers charge for usage based on the volume of data they process.

Models that are open sourced can be downloaded in their entirety on the user’s infrastructure, and run locally without incremental cost except that of the user’s hardware and compute costs.

LLMs come in a few different flavors, and current practice draws the distinctions below. These can change rapidly as ever more advanced models are released:
- Foundational Models – base models; trained only to predict the next word, so not usable out of the box
- Instruction Tuned Models – trained further to follow instructions
- Fine-tuned Models – trained on additional text data specific to the user’s situation

The line demarcating the above can be fuzzy and the LLM space is evolving rapidly with different vendors competing to meet their users’ needs in the most efficient way.

21.2 Sentence Transformers

(https://www.sbert.net/)

Sentence-BERT (the sentence-transformers library) allows the creation of sentence embeddings based on transformer models, including nearly all models available on HuggingFace. A ‘sentence’ here does not mean a literal sentence; it refers to any text.

Once we have embeddings available, there is little limit to what we can do with them. We can pass the embeddings to traditional or network-based models for classification or regression, or cluster the text data using any clustering method such as k-means or hierarchical clustering.
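As a sketch of that last point, clustering document vectors takes only a few lines with scikit-learn. The array below is a random stand-in for what `model.encode()` would return for a corpus:

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for document embeddings: 100 documents x 384 dimensions
embeddings = np.random.default_rng(42).normal(size=(100, 384))

# Assign each document to one of five clusters
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(embeddings)
print(kmeans.labels_[:10])   # cluster id for the first ten documents
```

With real embeddings, documents in the same cluster tend to share a topic, which makes this a quick way to explore an unlabeled corpus.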

We will start with sentence BERT, and look at some examples of the kinds of problems we can solve with it.

21.2.1 Get some text data first

We import about 10,000 random articles collected by scraping the web for articles about cybersecurity. Some items are long, some are short, and some are not really articles at all, just ads or other website notices.
Local saving and loading of models

Save with:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-roberta-large-v1')

model.save(path)

Load with:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(path)
Code
# Set default locations for downloaded models
# If you are running things on your own hardware,
# you can ignore this cell completely.

# import os

# if os.name != 'nt':  # Do this only if in a non-Windows environment
    
#     if 'instructor' in os.getcwd(): # Set default model locations when logged in as instructor
#         os.environ['TRANSFORMERS_CACHE'] = '/home/instructor/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/instructor/shared/huggingface'
            
#     else: # Set default model locations when logged in as a student
#         os.environ['TRANSFORMERS_CACHE'] = '/home/jovyan/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/jovyan/shared/huggingface'
        
Code
# Usual library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tqdm
import torch
import gc
Code
pwd
'C:\\Users\\user\\Google Drive\\jupyter'
Code
# Import the data from a pickle file
df = pd.read_pickle('sample.pkl')
Code
# How many rows and columns in our dataframe
df.shape
(10117, 7)
Code
# We look at the dataframe below.  The column of interest to us is the column titled 'text'
df.head(3)
title summary_x URL keywords summary_y text published_date
0 Friday Squid Blogging: On Squid Brains <p>Interesting <i>National Geographic</i> <a h... https://www.schneier.com/blog/archives/2021/08... working,school,technologist,security,schneier,... About Bruce SchneierI am a public-interest tec... About Bruce Schneier\n\nI am a public-interest... 2021-08-20 21:18:14
1 More on Apple’s iPhone Backdoor <p>In this post, I&#8217;ll collect links on A... https://www.schneier.com/blog/archives/2021/08... service,using,wiserposted,iphone,security,appl... More on Apple’s iPhone BackdoorIn this post, I... More on Apple’s iPhone Backdoor\n\nIn this pos... 2021-08-20 13:54:51
2 T-Mobile Data Breach <p>It&#8217;s a <a href="https://www.wired.com... https://www.schneier.com/blog/archives/2021/08... tmobiles,numbers,data,tmobile,security,schneie... It’s a big one:As first reported by Motherboar... It’s a big one:\n\nAs first reported by Mother... 2021-08-19 11:17:56
Code
# We create a dataframe with just the story text, and call it corpus

corpus = df[['text']]
Code
corpus
text
0 About Bruce Schneier\n\nI am a public-interest...
1 More on Apple’s iPhone Backdoor\n\nIn this pos...
2 It’s a big one:\n\nAs first reported by Mother...
3 Apple’s NeuralHash Algorithm Has Been Reverse-...
4 Upcoming Speaking Engagements\n\nThis is a cur...
... ...
10112 Nigerian automotive tech company Autochek toda...
10113 — The Starters — Apple Inc. and Tesla Inc. hav...
10114 — Hello friends, and welcome back to Week in R...
10115 — Factorial, a startup out of Barcelona that h...
10116 But it’s not totally clear whether rural Ameri...

10117 rows × 1 columns

Code
# Next, we examine how long the articles are.  Perhaps we want to 
# throw out the outliers, ie really short articles, which may 
# not really be articles, and also very long articles.
# 
# We do this below, looking at the mean and distribution of article lengths

article_lengths = pd.Series(len(x.split()) for x in corpus.text)
plt.figure(figsize=(14, 9))
sns.histplot(article_lengths)
article_lengths.describe()
count    10117.000000
mean       559.145003
std        501.310623
min          0.000000
25%        293.000000
50%        450.000000
75%        724.000000
max       8807.000000
dtype: float64

Code
# Let us just keep the regular sized articles, ie the middle 50%. We are still 
# left with a sizable number in our corpus.

corpus = corpus[(article_lengths>article_lengths.quantile(.25)) & (article_lengths<article_lengths.quantile(.75))]
len(corpus)
5050
Code
# Next we look at the distribution again

article_lengths = pd.Series(len(x.split()) for x in corpus.text)
plt.figure(figsize=(14, 9))
sns.histplot(article_lengths)
article_lengths.describe()
count    5050.000000
mean      468.483960
std       121.253301
min       294.000000
25%       358.000000
50%       450.000000
75%       565.000000
max       723.000000
dtype: float64

Our code becomes really slow if we use all 5,000+ remaining articles, so we randomly pick just 100 articles from the corpus. This is just so we can finish the demos in time. When you have more time, you can run the code on all the articles too.

Code
# We take only a sample of the entire corpus
# If we want to consider the entire set, we do not need to run this cell

corpus = corpus.sample(100)
Code
# Let us print out a random article

print(corpus.text.iloc[35])
Vancouver, British Columbia— Plurilock Security Inc. (TSXV: PLUR) (OTCQB: PLCKF) and related subsidiaries (“Plurilock” or the “Company”), an identity-centric cybersecurity solutions provider for workforces, has entered into definitive asset purchase agreements (the “Agreements”) dated October 21, 2021 to acquire certain assets (the “Purchased Assets”) of CloudCodes Software Private Limited (“CloudCodes”), an award winning cloud access security broker (“CASB”) based in India (the “Acquisition”).

Since 2011, CloudCodes has provided innovative cloud security SaaS solutions for protecting email and group collaboration platforms, offering single-sign-on (SSO), multi-factor authentication (MFA), and cloud data loss prevention (DLP) solutions. CloudCodes earned approximately CAD$576k in product revenue for its year ended March 31, 2021.

Following the Acquisition, CloudCodes’ existing customers will have access to a larger public organization with adequate financial resources, deep security, IT, AI capabilities and expertise, and the Company’s world-class sales team while Plurilock will gain a larger market presence in the international cybersecurity space and enter the growing CASB segment. In addition, Plurilock, through its Indian subsidiary, Plurilock Security Private Limited (“PSP”) will obtain a technical product team and a new office in Pune, India to complement its office in Mumbai, India.

The Acquisition will add additional functionality within Plurilock’s product portfolio, with CloudCodes’ CASB solution offered as an early access product under the name of Plurilock CLOUD. This additional technology solution creates new opportunity for Plurilock’s customers for a cost-effective cloud security solution and a path to integrate low-friction, high-security behavioral biometric identity with SSO and cloud security functionality. As a result, it is expected that the Acquisition will accelerate Plurilock’s sales growth and cement the Company’s position in the growing zero trust market.

“The acquisition of CloudCodes provides us with an award-winning CASB solution with broad customer adoption across small, medium and large enterprises. Businesses, especially small businesses, continue to face security risks with workforces that are working in a post-COVID, remote-centric world, and it has never been more important to secure cloud resources such as corporate email and file sharing,” said Ian L. Paterson, CEO of Plurilock. “This acquisition aligns with our commitment to becoming the premier cybersecurity solutions provider in the market, acquiring critical technology to enhance organizations’ zero trust architecture. We are looking forward to adding the CloudCodes product to our robust product portfolio and integrating their staff into our growing team, as we continue to develop cutting edge technology that empowers organizations to operate safely and securely while reducing friction for users.”

“We are pleased to join the Plurilock family of companies,” said Debasish Pramanik, co-founder of CloudCodes. “This transaction offers an opportunity to expand the use of our signature product in the North American market and join a fast-growing organization with deep security and IT expertise that is developing the next generation of cybersecurity solutions that can revolutionize the industry.”

Once the Acquisition is completed, CloudCodes assets will be transferred into the Plurilock family of companies, under the guidance of Plurilock’s management team.

Terms of Agreements

The Company and its subsidiaries, Plurilock Security Solutions Inc. and PSP, entered into the Agreements with CloudCodes whereby the Company will acquire the Purchased Assets. Pursuant to the terms of the Agreements, the Company has agreed to pay CloudCodes aggregate consideration of US$1,700,000 payable as follows: (i) US$1,000,000 in cash payable on closing; and (ii) US$700,000 in common shares of Plurilock (the “Consideration Shares”), less any deferred revenue. The Consideration Shares will be issued at a deemed price of C$0.59 per share and will be placed in escrow for 18 months to satisfy any indemnification obligations to the Company.

The Acquisition is subject to customary closing conditions and receipt of the approval of the TSXV. The Company expects to close the Acquisition on or around October 31, 2021.

About CloudCodes

CloudCodes is an internationally based Cloud Security SaaS platform company, offering a product that protects email and group collaboration platforms like Microsoft 365 and Google Workspace, while providing SSO, MFA and DLP functionality.

About Plurilock

Plurilock provides identity-centric cybersecurity for today’s workforces. The Plurilock family of companies enables organizations to operate safely and securely while reducing cybersecurity friction. Plurilock offers world-class IT and cybersecurity solutions through its Solutions Division, paired with proprietary, AI-driven and cloud-friendly security through its Technology Division. Together, the Plurilock family of companies delivers persistent identity assurance with unmatched ease of use.

21.2.2 Embeddings/Feature Extraction

Feature extraction means obtaining the embedding vectors for a given text from a pre-trained model. Once you have the embeddings, which are numerical representations of the text, many possibilities open up: you can compare documents for similarity, match questions to answers, use the embeddings as features in downstream classifiers or regressors, or cluster similar documents with any algorithm.

Difference between word embeddings and document embeddings
So far, we have been talking of word embeddings, which means we have a large embedding vector for every single word in our text data. What do we mean when we say sentence or document embedding? A sentence’s embedding is derived from the embeddings of all the words in the sentence. The embedding vectors are generally averaged (‘mean-pooled’), though other techniques such as ‘max-pooling’ are also available. It is surprising that we spend so much effort computing separate embeddings for words considering context and word order, and then just mash everything up with an average to get a single vector for the entire sentence, or even the document. It is equally surprising that this approach works remarkably well for a large number of tasks.
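Mean pooling itself is nothing more than a column-wise average over the token embeddings. A minimal sketch with random stand-in vectors (a real run would use the per-token output of a transformer model):

```python
import numpy as np

# Pretend token embeddings for a five-token sentence, eight dimensions each
token_embeddings = np.random.default_rng(1).normal(size=(5, 8))

mean_pooled = token_embeddings.mean(axis=0)   # one vector for the whole sentence
max_pooled = token_embeddings.max(axis=0)     # alternative: element-wise maximum

print(mean_pooled.shape)                      # prints (8,)
```

However many tokens the sentence has, the pooled result always has the model's embedding dimensionality, which is what lets us compare documents of different lengths.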

Fortunately for us, the sentence-transformers library knows how to compute mean-pooled (or other) representations of entire documents based on the pre-trained model used. Effectively, we reduce the entire document to a single vector with 768 or some similar number of dimensions.

Let us look at this in action.

First, we get embeddings for our corpus using a specific model. We use ‘all-MiniLM-L6-v2’ for symmetric queries, and any of the MSMARCO models for asymmetric queries. The difference is that in symmetric queries the query and the target sentences are roughly the same length, while in asymmetric queries the query is much shorter than the documents it is matched against.

This is based upon the documentation on sentence-bert’s website.
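Whichever model family is chosen, retrieval ultimately reduces to comparing the query vector against each document vector, usually with cosine similarity. A toy sketch with made-up three-dimensional vectors standing in for `model.encode()` output:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up vectors standing in for an encoded query and two documents
query_vec = np.array([0.2, 0.9, 0.1])
doc_vecs = np.array([[0.1, 0.8, 0.2],    # close in direction to the query
                     [0.9, 0.1, 0.0]])   # pointing a different way

sims = [cosine_sim(query_vec, d) for d in doc_vecs]
best = int(np.argmax(sims))
print(best)   # prints 0: the first document is the better match
```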

Code
# Toy example with just three sentences to see what embeddings look like

from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('all-MiniLM-L6-v2') #for symmetric queries
model = SentenceTransformer('msmarco-distilroberta-base-v2') #for asymmetric queries
#Our sentences we like to encode
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']

#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

#Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
Sentence: This framework generates embeddings for each input sentence
Embedding: [-1.94027126e-01 -1.22946411e-01 -1.03667654e-01 -5.60734332e-01
  1.10684156e-01  6.79868519e-01 -6.36458471e-02 -7.55182922e-01
  7.56757021e-01  2.64225334e-01 -1.42991528e-01  3.98469239e-01
  1.76254734e-01 -1.42204142e+00 -2.50023663e-01  6.46462897e-03
  4.95950967e-01  4.63492602e-01 -1.50225936e-02  8.64237368e-01
  1.83195844e-01 -8.47510993e-01 -7.40249932e-01 -1.01876450e+00
 -1.04469287e+00  5.33529818e-01  7.04183996e-01  3.23025554e-01
 -1.34202325e+00 -1.40403613e-01 -1.69761360e-01  9.34997141e-01
 -3.45071107e-01  4.92122434e-02  1.28702326e-02 -1.90801427e-01
  5.31530082e-01 -3.53034407e-01 -9.99688327e-01  1.29575178e-01
  8.10617507e-01  5.22234797e-01 -7.57190168e-01 -2.42323861e-01
  4.81890917e-01 -2.24909961e-01  5.87174833e-01 -9.55266654e-01
 -2.80447155e-01 -5.75490184e-02  1.38305819e+00 -6.43579885e-02
 -2.80887455e-01 -2.96108842e-01  6.02366269e-01 -6.88801527e-01
 -3.63944560e-01  1.24548979e-01  1.68449059e-01 -3.52236390e-01
 -5.34670591e-01  1.07049860e-01  1.89601004e-01  4.98377889e-01
  5.57314813e-01  9.96691082e-03  1.11395024e-01 -3.20706338e-01
 -5.68632662e-01 -2.54595071e-01 -1.17989182e-01  2.34521136e-01
  4.05370779e-02 -8.24391186e-01  6.77566350e-01 -8.15773308e-01
  6.42071486e-01 -7.75032520e-01 -2.13417113e-01  6.85814440e-01
  1.00933135e+00  3.57063770e-01 -4.13770676e-01  3.37253094e-01
 -3.41039188e-02 -3.45317245e-01  2.80249473e-02  9.73951519e-01
 -6.43461645e-02 -6.06842160e-01 -3.48319650e-01 -5.75613379e-02
 -6.01034939e-01  1.48180962e+00  2.74765521e-01  6.42698586e-01
  2.52264529e-01 -1.33694637e+00  2.61821836e-01 -1.21891707e-01
  1.12433112e+00  3.23991627e-01  1.90715790e-01  1.06098376e-01
 -5.28269231e-01  1.66739047e-01  4.35670942e-01  3.07411522e-01
 -7.34456956e-01 -2.05261961e-01  1.22825503e-01  1.61016196e-01
  4.43147391e-01  2.64934242e-01  8.47648621e-01 -7.37874135e-02
  2.99923062e-01  3.89373690e-01  3.17179821e-02  5.00585735e-01
 -2.81464159e-01 -8.12774718e-01 -5.90420187e-01 -1.62012696e-01
 -6.17273927e-01  3.92245650e-01  6.67506993e-01  7.01212466e-01
 -1.29788291e+00  4.20975447e-01  6.82982877e-02  1.05026746e+00
  1.90296426e-01  1.57451317e-01 -1.27690539e-01  1.70817673e-01
 -5.59714615e-01  2.86618054e-01  6.88185275e-01  1.76241621e-01
 -2.90350705e-01  5.54080665e-01  3.52999896e-01 -9.71634805e-01
  5.82876742e-01  7.67536610e-02 -8.55224058e-02  1.64016068e-01
 -4.47867244e-01 -2.59355217e-01  1.27354398e-01  9.79057074e-01
  3.73845607e-01  2.00498570e-02  3.08634132e-01 -8.47880661e-01
 -2.75358230e-01  4.34403330e-01  6.07397497e-01  1.44445404e-01
  3.02737325e-01 -8.48591924e-02  7.59577528e-02  2.25079998e-01
  3.31507593e-01 -3.65941495e-01  4.87931490e-01 -2.12545112e-01
 -6.65542066e-01  3.48111510e-01  2.20464692e-01 -3.09980959e-01
 -8.39646101e-01 -3.30511957e-01 -4.15750414e-01 -2.79508859e-01
 -1.40072629e-01  1.84452698e-01 -1.54586315e-01  5.54982841e-01
 -5.79781711e-01 -3.45990062e-01 -1.88777968e-01 -1.06845282e-01
 -3.00893903e-01 -4.41065788e-01  6.00923777e-01  4.12963390e-01
  5.86519182e-01  2.00733587e-01  1.36600316e+00 -1.49683118e-01
 -1.08713202e-01 -5.95987380e-01 -3.16461809e-02 -6.61388695e-01
  7.37694204e-01  7.15092123e-02 -3.63184452e-01 -6.92547485e-02
  2.76804715e-01 -9.55267191e-01 -9.52274427e-02  4.58616495e-01
 -4.26264793e-01 -4.42463070e-01  1.27647474e-01 -9.39838886e-01
 -1.15567088e-01 -6.55211329e-01  7.31721699e-01 -1.57167566e+00
 -1.10542095e+00 -9.03355658e-01 -5.43098509e-01  7.95553029e-01
 -7.08044181e-03 -2.85227060e-01  9.28429782e-01  9.71573889e-02
 -3.96223694e-01  4.94155139e-01  5.37390769e-01 -3.39529753e-01
  3.68308216e-01 -1.28579482e-01 -1.05017090e+00  4.17593837e-01
  2.48604313e-01 -9.68254879e-02 -3.59232217e-01 -1.08622682e+00
 -1.00478329e-01  2.23072171e-01 -4.37571257e-01  1.38826239e+00
  7.68635273e-01 -1.42441198e-01  6.20768249e-01 -2.65000969e-01
  1.35475969e+00  2.88145393e-01 -1.43894047e-01 -2.99536616e-01
  6.31552264e-02 -2.51712114e-01 -1.38677716e-01 -5.41011631e-01
  1.47185221e-01 -1.49833515e-01 -7.15740383e-01  2.88314521e-01
 -6.38389051e-01  3.16053420e-01  7.71043360e-01  1.43179849e-01
  1.48211978e-02  4.73498583e-01  8.03198099e-01 -1.08405840e+00
 -5.70261180e-01 -4.76540141e-02  5.26882231e-01 -2.81869620e-01
 -1.13989723e+00 -7.62864292e-01  2.67658662e-03 -5.99309146e-01
  5.08213304e-02  3.48600075e-02 -1.31661296e-01  3.43350202e-01
  1.47039965e-01  3.29475522e-01 -2.65227765e-01 -1.64056107e-01
  1.84712335e-01 -1.64587155e-01  2.68281907e-01 -1.01048581e-01
  3.19146842e-01 -1.23158330e-02  8.56841505e-01  2.03407288e-01
 -3.81547332e-01 -6.64151132e-01  1.32862270e+00  3.04318994e-01
  3.39265078e-01  4.92733121e-01 -1.24012269e-01 -7.18624413e-01
  7.86116779e-01 -1.71105146e-01 -6.88624561e-01 -5.21284103e-01
  3.24477136e-01 -6.42667353e-01 -4.49099392e-01 -1.64437735e+00
 -1.15677512e+00  1.04355657e+00 -3.67201120e-01  4.36934233e-01
 -3.68611068e-01 -5.88484347e-01  1.77582696e-01  4.92794275e-01
 -1.17947571e-01 -3.62115175e-01 -8.98680031e-01  1.27371326e-01
  1.12385474e-01  7.67848730e-01 -5.89435279e-01 -1.44601986e-01
 -1.09177661e+00  8.49221230e-01  5.22653639e-01  2.08491504e-01
 -5.28513193e-01 -4.64428753e-01  4.48831111e-01  5.75599611e-01
 -3.98134351e-01  9.21166122e-01  3.45954180e-01 -1.62111804e-01
 -1.04399741e-01 -2.50324935e-01  3.00042212e-01 -6.02201462e-01
  1.75128534e-01  4.32529122e-01 -5.86885870e-01 -3.32548052e-01
 -3.95463258e-01 -5.57754755e-01 -4.48471338e-01 -2.77211517e-01
  8.81520510e-02 -6.36177719e-01  3.19960952e-01  7.60709226e-01
  3.15277666e-01  4.44414705e-01  6.47632718e-01 -2.63870377e-02
  5.25060177e-01 -1.61295444e-01 -1.55720651e-01  8.36495817e-01
 -9.65523064e-01  3.01889300e-01  1.69886515e-01 -3.05459499e-02
  1.86375126e-01  1.04048140e-01  2.38540974e-02 -6.64686680e-01
  7.24217117e-01  3.38430315e-01 -5.57187736e-01 -6.26726449e-01
  2.66006470e-01  7.35096037e-01 -9.07033145e-01  3.59426349e-01
  6.95876300e-01 -7.21453428e-01  2.58155048e-01  5.54192603e-01
  5.41482761e-04 -7.63940573e-01  3.79112154e-01  1.46436945e-01
  5.97151697e-01 -7.88239002e-01 -3.30818325e-01  3.47732455e-01
 -9.91573274e-01  1.00135529e+00 -7.29097128e-01  6.53858840e-01
  8.88706267e-01  7.92914554e-02  5.46956718e-01 -1.07456028e+00
  2.40587711e-01 -2.27806821e-01 -5.90185463e-01  1.93599775e-01
 -2.94735581e-01  6.93159640e-01  5.71026504e-01 -5.83263189e-02
  7.66058385e-01 -1.13303483e+00 -2.08590925e-01  7.20142305e-01
  5.14323413e-01 -2.17918679e-01  5.63963242e-02  1.05543637e+00
  3.60624075e-01 -8.86408508e-01  6.73737109e-01 -6.02494895e-01
  2.93800205e-01  3.85887951e-01 -7.39043728e-02 -6.01254046e-01
  4.93471593e-01  4.23360497e-01  4.78618175e-01 -2.05091592e-02
  1.23453252e-01  3.61531049e-01 -6.02922976e-01  3.94695520e-01
 -6.17296934e-01  4.17496830e-01  1.88823715e-01  6.38141155e-01
 -2.95579195e-01 -1.13238625e-01 -2.92233139e-01  1.89026613e-02
  8.18741992e-02  2.89110005e-01 -4.24625039e-01  1.15595661e-01
  1.10594714e+00 -4.42896038e-01  9.07224491e-02 -3.24043751e-01
 -1.41674364e-02  3.84770662e-01 -1.10513262e-01  1.47955164e-01
  3.03255673e-02  9.41580951e-01  7.44941950e-01 -5.16233027e-01
 -1.07049954e+00 -3.39610100e-01 -9.81280506e-01  1.63678247e-02
 -3.09417546e-01  8.38646412e-01 -2.57466435e-01  2.66416728e-01
  8.29470217e-01  1.18659770e+00  4.45776910e-01 -5.46342313e-01
  3.46238047e-01  4.82638836e-01 -2.03869224e-01 -4.86987345e-02
 -1.36196777e-01  7.27507114e-01 -2.94585973e-01  4.04031157e-01
 -4.61239845e-01  1.53372660e-01  5.77554286e-01 -1.07579343e-01
 -1.07114184e+00 -6.10307753e-01 -1.70739844e-01  2.83243328e-01
 -2.24987030e-01  3.85358900e-01 -7.83190504e-02 -6.51502684e-02
 -4.53457981e-01  1.75708488e-01  9.54947174e-01 -4.80354220e-01
  3.67996283e-02  3.07653725e-01  9.76266861e-01 -2.82786489e-01
 -5.11633098e-01 -5.04429340e-01  2.25381449e-01  5.29595613e-01
 -1.00188531e-01  3.30978967e-02 -4.25292045e-01 -2.50480801e-01
  7.80557394e-01 -3.06186169e-01  1.09467424e-01 -6.34019434e-01
  3.03106278e-01 -1.41973698e+00 -4.36300397e-01  3.82955313e-01
  2.25167990e-01 -3.14564019e-01 -2.14847505e-01 -7.26124108e-01
  4.01522785e-01  1.61230147e-01 -2.14475319e-01 -8.14741179e-02
  1.44952789e-01  4.35495615e-01  1.60962373e-01  8.42103720e-01
  4.83167648e-01 -1.81479957e-02 -3.72209549e-01 -8.54204893e-02
 -1.25429153e+00  6.33920580e-02 -3.04254025e-01  1.19560093e-01
 -4.54789966e-01 -6.71517909e-01  8.25446323e-02  8.15794468e-02
  8.27028692e-01 -3.20302039e-01 -6.09916687e-01 -2.28958473e-01
 -3.23811322e-01 -5.48929155e-01 -7.08900273e-01  5.72744966e-01
 -9.07650590e-02  2.64598966e-01  2.70573050e-01 -9.85758960e-01
 -2.44654134e-01 -3.91785651e-01  2.55578756e-01 -6.70406878e-01
 -1.21352875e+00 -3.58353883e-01  9.98406172e-01  6.14020884e-01
  5.54472543e-02  2.67769605e-01  6.59718096e-01  6.53219372e-02
 -4.38049883e-01  9.86246109e-01 -2.51958340e-01  7.89942980e-01
 -7.73840129e-01  5.97827852e-01 -2.22646594e-01  4.02279533e-02
 -2.87521333e-01  3.42817307e-01 -2.41310239e-01  1.77004576e-01
 -9.65369642e-02  9.10423577e-01  4.00543928e-01  5.33569120e-02
 -2.18828022e-01 -2.59987801e-01 -2.06984386e-01  3.85516196e-01
  9.66344357e-01  2.62666583e-01 -5.70590794e-01  9.91979778e-01
  2.98639029e-01  9.17680323e-01 -9.80460465e-01 -5.94711006e-02
 -9.55575332e-02  8.68820190e-01 -6.75058305e-01 -2.41460040e-01
 -8.95355940e-01  4.71445829e-01 -2.14758113e-01  5.96137702e-01
 -6.81212023e-02 -1.22940755e-02 -3.48113567e-01  9.15871263e-02
 -8.74246836e-01 -6.46880984e-01 -2.76604384e-01 -4.86592144e-01
  3.61363381e-01 -4.31284517e-01 -2.53118962e-01 -2.11931407e-01
  7.04253241e-02  1.43149868e-01 -7.21811831e-01 -7.77530134e-01
 -2.66693115e-01  2.54974961e-02  3.14531595e-01  2.98289031e-01
  4.59118724e-01  4.35666889e-01 -6.02146268e-01 -3.29307169e-01
  2.72133678e-01  2.44671479e-02  3.10772389e-01 -6.65003121e-01
  3.58248562e-01  3.00383627e-01 -3.64194423e-01 -5.12525737e-01
  2.16460541e-01  5.01621068e-01  2.53829032e-01 -1.22401452e+00
  4.61754054e-01 -1.53161451e-01 -2.68886209e-01  1.27812326e+00
 -1.07412553e+00 -4.94798303e-01  6.21693552e-01  4.18770790e-01
  7.43999183e-01  2.84353107e-01  1.35036871e-01  8.22463810e-01
  5.11462271e-01 -2.76414454e-01  3.26247245e-01 -4.85349864e-01
  4.11561877e-01 -1.19246654e-01 -1.61334530e-01 -7.34282315e-01
 -9.41174507e-01 -1.15899551e+00 -2.58182764e-01 -4.81391102e-01
  1.41962335e-01 -1.07252918e-01 -2.61298269e-02 -4.07726765e-01
  3.95175695e-01  9.52931941e-01 -6.57295436e-02 -5.97879887e-01
 -4.26192760e-01  2.06618801e-01  6.77784741e-01 -1.12915194e+00
 -7.80462995e-02  3.37206364e-01 -6.69075772e-02  6.15010798e-01
 -2.87119716e-01 -2.27136135e-01 -2.42563352e-01 -2.03058645e-01
 -2.77406633e-01 -3.84487003e-01  1.71700701e-01  1.32659745e+00
 -1.54341653e-01 -9.40678045e-02 -3.46466780e-01  3.97526532e-01
 -3.61106247e-01  1.07136858e+00 -7.35428035e-01  4.52006727e-01
 -3.94796461e-01 -5.93080342e-01 -1.30981520e-01  2.37584129e-01
 -5.63736558e-01  7.58668244e-01  9.55792367e-01  3.89002115e-01
  6.69343948e-01 -4.48577017e-01 -5.99645674e-01 -5.11237085e-01
 -6.01219475e-01 -3.33563328e-01  3.43445688e-02  1.24906890e-01
 -3.98856193e-01 -4.00449544e-01 -1.91573918e-01 -9.40701723e-01
 -1.97318971e-01 -1.99874625e-01 -3.46654914e-02 -1.74211666e-01
 -9.32460129e-01 -6.68434799e-02  3.58897686e-01  2.40670264e-01
  1.68707371e-01  2.12407067e-01  3.82851698e-02  4.22058821e-01
  7.49818563e-01 -6.04370773e-01 -5.07282317e-01  6.40344441e-01
 -4.69703197e-01 -6.06814682e-01 -2.10751593e-01  5.21379858e-02
 -1.81016140e-02  3.84092331e-01 -1.14480209e+00 -3.46426129e-01
  4.44303304e-01  3.00263196e-01  9.76041034e-02  1.52969763e-01
  1.78943232e-01 -2.96392560e-01 -4.73998755e-01 -6.50664628e-01
 -1.90126374e-01  1.75953805e-01  1.06422436e+00  6.82281494e-01
  6.07434690e-01 -4.69581038e-01  2.85444587e-01  1.47231007e+00
  6.49958193e-01 -4.16353196e-01 -2.71410137e-01 -4.02401328e-01
  4.31929082e-01 -1.11652696e+00  9.89714801e-01 -4.93843794e-01
  2.96220332e-01  6.49991155e-01  1.71276465e-01 -3.89997095e-01
  2.96082497e-01 -6.96498632e-01  1.15289032e+00  5.26634753e-01
 -1.92738521e+00 -2.08714798e-01  2.58085877e-01 -2.02861592e-01
 -7.30242729e-01  9.42804396e-01 -1.71018064e-01  4.25120860e-01
  5.78499913e-01  5.67792714e-01 -3.58646393e-01 -4.07528400e-01
  1.21926451e+00 -4.26342487e-01  4.62184846e-03  9.98993576e-01]

Sentence: Sentences are passed as a list of string.
Embedding: [-1.16906703e-01 -3.39529991e-01  2.95595676e-01  6.28463686e-01
  ...
 -1.86682150e-01 -4.06430632e-01  4.99121279e-01  1.71999395e+00]

Sentence: The quick brown fox jumps over the lazy dog.
Embedding: [-2.68969119e-01 -5.03524899e-01 -1.75523773e-01  2.02556327e-01
  ...
 -1.29827034e+00 -2.78865784e-01 -3.06518644e-01  6.44666135e-01]
Code
embedding.shape
(768,)
Code
%%time
# Use our data

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2') #for symmetric queries
# model = SentenceTransformer('msmarco-distilroberta-base-v2') #for asymmetric queries

# The sentences we want to encode
sentences = list(corpus.text)

#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)
CPU times: user 13.4 s, sys: 5.68 s, total: 19.1 s
Wall time: 19.9 s
Code
# At this point, the variable embeddings contains all our embeddings, one row per document.
# So we expect 100 rows, and as many columns as the dimensionality of the chosen model's
# embedding space (384 for all-MiniLM-L6-v2).

embeddings.shape
(100, 384)

21.2.3 Cosine similarity between sentences

We can compute the cosine similarity between document embeddings, which gives us a measure of how similar the sentences or documents are.
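Under the hood, the cosine similarity of two embeddings is just the dot product of their L2-normalized vectors, which is what `util.cos_sim` computes for every pair. A minimal NumPy sketch of the formula:

```python
import numpy as np

# Cosine similarity is the dot product of the two L2-normalized vectors.
def cos_sim(a, b):
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

v = np.array([1.0, 2.0, 3.0])
print(cos_sim(v, v))                                        # identical vectors, approx 1.0
print(cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # orthogonal vectors, 0.0
```

Because the embeddings are dense and the measure ignores vector length, two documents score close to 1 when their embeddings point in the same direction, regardless of how long the documents are.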

The code below uses brute force to find the most similar sentences: it compares every pair of embeddings, so compute and memory grow quadratically, and it will not run if the number of sentences is very large.

Code
from sentence_transformers import util
distances = util.cos_sim(embeddings, embeddings)
distances.shape
torch.Size([100, 100])
Code
df_dist = pd.DataFrame(distances, columns = corpus.index, index = corpus.index)
df_dist
9920 3172 7798 7335 31 4229 2514 5830 8326 2855 ... 2747 8549 8341 293 9501 5493 5258 1937 6687 6335
9920 1.000000 0.135734 0.224861 0.298378 0.150421 0.079714 0.159787 0.327536 0.321123 0.375711 ... 0.095475 0.389827 0.469558 0.429317 0.272564 0.128299 0.036300 0.251787 0.106914 0.340685
3172 0.135734 1.000000 0.094552 0.162400 0.151268 0.148340 0.123678 0.103126 0.187191 0.216326 ... 0.253598 0.136567 0.200792 0.082866 0.131629 0.076341 0.149354 0.169703 0.043003 0.274530
7798 0.224861 0.094552 1.000000 0.261107 0.184356 0.133621 0.131916 0.271464 0.386157 0.267628 ... 0.224716 0.266434 0.210274 0.274348 0.217656 0.287046 0.164005 0.341288 0.310254 0.258850
7335 0.298378 0.162400 0.261107 1.000000 0.238793 0.051851 0.088298 0.303140 0.381001 0.247187 ... 0.178367 0.321581 0.230840 0.161895 0.245882 0.115410 0.020552 0.346345 0.078837 0.328493
31 0.150421 0.151268 0.184356 0.238793 1.000000 0.046454 0.114700 0.199142 0.276150 0.218646 ... 0.168328 0.218759 0.226452 0.076760 0.178390 0.075329 0.159295 0.202835 0.129672 0.151179
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5493 0.128299 0.076341 0.287046 0.115410 0.075329 0.125446 0.084448 0.131338 0.237292 0.313341 ... 0.201184 0.045626 0.191044 0.164987 0.196647 1.000000 0.133942 0.178779 0.022278 0.112217
5258 0.036300 0.149354 0.164005 0.020552 0.159295 0.528685 0.129603 0.210878 0.080393 0.145991 ... 0.416838 0.112070 0.152700 0.085756 0.178735 0.133942 1.000000 0.095838 0.126565 0.206801
1937 0.251787 0.169703 0.341288 0.346345 0.202835 0.109112 0.197919 0.304981 0.306736 0.316769 ... 0.131771 0.246408 0.194218 0.181308 0.208915 0.178779 0.095838 1.000000 0.074157 0.300806
6687 0.106914 0.043003 0.310254 0.078837 0.129672 0.019954 0.174099 0.158508 0.204196 0.082941 ... 0.189893 0.119275 0.162651 0.106629 0.052256 0.022278 0.126565 0.074157 1.000000 0.176698
6335 0.340685 0.274530 0.258850 0.328493 0.151179 0.134998 0.114199 0.237868 0.213965 0.325430 ... 0.201053 0.322407 0.410377 0.326511 0.205842 0.112217 0.206801 0.300806 0.176698 1.000000

100 rows × 100 columns

At this point, we can use stack to rearrange the data and identify similar articles, but stack fails if you have a lot of documents, since it materializes every one of the n² pairs as rows. Let us see how stack does the job.

Code
# Using stack
df_dist = df_dist.stack().reset_index()
df_dist.columns = ['article', 'similar_article', 'similarity']
df_dist = df_dist.sort_values(by = ['article', 'similarity'], ascending = [True, False])
df_dist
article similar_article similarity
404 31 31 1.000000
462 31 1631 0.422828
414 31 9949 0.395747
411 31 6627 0.390846
483 31 2289 0.346313
... ... ... ...
3327 9970 5720 0.009801
3367 9970 7427 0.006469
3331 9970 9176 -0.004562
3399 9970 6335 -0.010530
3310 9970 1157 -0.023462

10000 rows × 3 columns

Code
# Let us reset our df_dist dataframe
df_dist = pd.DataFrame(distances, columns = corpus.index, index = corpus.index)
Code
from tqdm import tqdm
# Using a loop
top_n = 21
temp = []
for col in tqdm(range(len(df_dist))):
    t = pd.DataFrame(df_dist.iloc[:, col].sort_values(ascending = False)[:top_n]).stack().reset_index()
    t.columns = ['similar_article', 'article', 'similarity']
    t = t[['article', 'similar_article', 'similarity']]
    temp.append(t)

pd.concat(temp)
100%|██████████| 100/100 [00:00<00:00, 446.37it/s]
article similar_article similarity
0 9920 9920 1.000000
1 9920 6147 0.513100
2 9920 470 0.498859
3 9920 466 0.492767
4 9920 7427 0.479613
... ... ... ...
16 6335 7485 0.343014
17 6335 9920 0.340685
18 6335 6233 0.337505
19 6335 1559 0.336447
20 6335 6627 0.334233

2100 rows × 3 columns
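The loop above sorts one column at a time, so the full n² table of pairs never exists at once. With plain NumPy, the same per-row top-n selection can be sketched using `argpartition`, which avoids even a full sort of each row (shown here on a small toy matrix, not the book's corpus):

```python
import numpy as np

rng = np.random.default_rng(0)
sim = rng.random((5, 5))      # toy stand-in for the 100x100 similarity matrix
sim = (sim + sim.T) / 2       # make it symmetric, like a similarity matrix
np.fill_diagonal(sim, 1.0)    # every article is maximally similar to itself

top_n = 3
# argpartition moves the top_n entries of each row to the front, without a full sort
part = np.argpartition(-sim, top_n - 1, axis=1)[:, :top_n]
# then order just those top_n entries by descending similarity
order = np.argsort(-np.take_along_axis(sim, part, axis=1), axis=1)
top_idx = np.take_along_axis(part, order, axis=1)
print(top_idx)  # row i's first entry is i itself (similarity 1.0)
```

For n rows, a full sort of each row costs O(n log n) per row, while `argpartition` is O(n), which matters once the corpus grows beyond toy sizes.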

21.2.4 Semantic paraphrasing

This finds similar articles like the prior method, but more efficiently: util.paraphrase_mining compares the corpus in chunks and keeps only the top-scoring pairs, instead of building and sorting the full similarity matrix.
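Conceptually, paraphrase mining scores every pair of documents and returns the highest-scoring pairs; the library version additionally processes the corpus in chunks so the full n×n matrix never exists in memory. A minimal (unchunked) NumPy sketch of the idea, returning `[score, i, j]`-style entries analogous to `util.paraphrase_mining`:

```python
import numpy as np
from itertools import combinations

def mine_paraphrases(emb, top_k=3):
    # normalize rows so dot products are cosine similarities
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    scored = [(float(emb[i] @ emb[j]), i, j)
              for i, j in combinations(range(len(emb)), 2)]
    # best-scoring pairs first, like util.paraphrase_mining's output
    return sorted(scored, reverse=True)[:top_k]

# toy embeddings: rows 0 and 1 point the same way, row 2 is orthogonal
emb = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
print(mine_paraphrases(emb, top_k=1))  # [(1.0, 0, 1)]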

Code
%%time
from sentence_transformers import SentenceTransformer, util

# model = SentenceTransformer('all-MiniLM-L6-v2')

# Single list of sentences - Possible tens of thousands of sentences
sentences = list(corpus.text)

paraphrases = util.paraphrase_mining(model, sentences)
    
paraphrases[:10]
CPU times: user 13.3 s, sys: 5.34 s, total: 18.6 s
Wall time: 18.5 s
[[0.997217059135437, 25, 95],
 [0.8632327318191528, 9, 22],
 [0.8306697607040405, 13, 88],
 [0.7193999290466309, 7, 39],
 [0.7128127217292786, 41, 49],
 [0.6752821803092957, 8, 34],
 [0.6559962630271912, 29, 92],
 [0.6505193114280701, 22, 74],
 [0.6476482152938843, 11, 39],
 [0.6302182674407959, 12, 91]]
Code
print(sentences[13])
A cyber attack in Iran left petrol stations across the country crippled, disrupting fuel sales and defacing electronic billboards to display messages challenging the regime's ability to distribute gasoline.

Posts and videos circulated on social media showed messages that said, "Khamenei! Where is our gas?" — a reference to the country's supreme leader Ayatollah Ali Khamenei. Other signs read, "Free gas in Jamaran gas station," with gas pumps showing the words "cyberattack 64411" when attempting to purchase fuel, semi-official Iranian Students' News Agency (ISNA) news agency reported.

Abolhassan Firouzabadi, the head of Iran's Supreme Cyberspace Council, said the attacks were "probably" state-sponsored but added it was too early to determine which country carried out the intrusions.

Although no country or group has so far claimed responsibility for the incident, the attacks mark the second time digital billboards have been altered to display similar messaging.

In July 2021, Iranian Railways and the Ministry of Roads and Urban Development systems became the subject of targeted cyber attacks, displaying alerts about train delays and cancellations and urging passengers to call the phone number 64411 for further information. It's worth noting that the phone number belongs to the office of Ali Khamenei that supposedly handles questions about Islamic law.

The attacks involved the use of a never-before-seen reusable data-wiping malware called "Meteor."

Cybersecurity firm Check Point later attributed the train attack to a "regime opposition" threat actor that self-identifies as "Indra" — referring to the Hindu god of lightning, thunder, and war — and is believed to have ties to hacktivist and other cybercriminal groups, in addition to linking the malware to prior attacks targeting Syrian petroleum companies in early 2020.

"Aiming to bring a stop to the horrors of [Quds Force] and its murderous proxies in the region," the group's official Twitter account bio reads.

"While most attacks against a nation's sensitive networks are indeed the work of other governments, the truth is that there is no magic shield that prevents a non-state sponsored entity from creating the same kind of havoc, and harming critical infrastructure in order to make a statement," Check Point noted in July.
Code
print(sentences[19])
A hacker known only as “Mr. A” was picked up by authorities at a South Korean airport after getting stuck in the country due to COVID-19 travel restrictions.

Another alleged member of the TrickBot gang has been apprehended, this time when trying to leave South Korea, according to published reports.

The Russian national, who is an alleged developer of the notorious crimeware, reportedly had been trapped in South Korea since February 2020 due to COVID-19 travel restrictions. Seoul-based news outlet KBS News reported that the individual, identified only as “Mr A”, was arrested at a South Korea airport last week. Mr. A is believed to have worked as a web browser developer for the TrickBot crime syndicate while he lived in Russia in 2016.

Recorded Future’s The Record, who reported on the incident, cited the KBS report and said the accused criminal hacker was forced to spend more than a year in South Korea in order to renew his passport delaying his departure.

His arrest was the result of an investigation U.S. authorities began into TrickBot during his time in South Korea after the botnet was used “to facilitate ransomware attacks across the US throughout 2020,” according to the report.

Ever-Evolving Threat

TrickBot is a sophisticated malware first developed in 2016 to steal online banking credentials. Since then, it has evolved as operators have added new features.

The malware, once a simple banking trojan, is now a module-based crimeware platform leased as a malware-as-a-service solution to cybercriminals. TrickBot is typically leveraged against corporations and public infrastructure. The evolution and success of the TrickBot platform has pushed authorities to crack down on the criminals behind TrickBot beginning last year.

In February, authorities took alleged TrickBot developer Alla Witte into custody in Miami. Witte is known in cybercrime circles as “Max” and a main TrickBot coder, according to the Department of Justice (DoJ). Witte is believed responsible for developing TrickBot’s ransomware-related functionality, including control, deployment and payments, authorities said at the time of her arrest.

Her colleague, Mr. A, was arraigned in a Seoul court last Wednesday on an international arrest warrant and extradition request to the United States, according to The Record, citing the KBS news report. However, the suspect is fighting the extradition, with his lawyer claiming that if it happens, Mr. A “will be subjected to excessive punishment,” according to the report.

Industry-Driven Disruption

Prior to the official investigation and crackdown by the DoJ and related arrests, an earlier attempt to foil TrickBot’s operations came from Microsoft and some technology partners.

Last October, the tech giant and others used a court order they’d obtained to cut off key infrastructure to TrickBot operations so its operators no longer could initiate new infections or activate ransomware already dropped into computer systems.

Microsoft, ESET, Lumen’s Black Lotus Labs, NTT Ltd., Symantec and others were responsible for the coordinated legal and technical action to disrupt the group’s activity–which turned out to be a temporary scenario as TrickBot’s cybercriminals soon regrouped and resumed operations.

It’s time to evolve threat hunting into a pursuit of adversaries. JOIN Threatpost and Cybersixgill for Threat Hunting to Catch Adversaries, Not Just Stop Attacks and get a guided tour of the dark web and learn how to track threat actors before their next attack. REGISTER NOW for the LIVE discussion on Sept. 22 at 2 p.m. EST with Cybersixgill’s Sumukh Tendulkar and Edan Cohen, along with independent researcher and vCISO Chris Roberts and Threatpost host Becky Bracken.
Code
# Free up memory (guard in case the paraphrase-mining cell was not run this session)
import gc

if 'paraphrases' in globals():
    del paraphrases
gc.collect()

21.2.6 Clustering

If we know the embeddings, we can do clustering just like we can for regular tabular data.

21.2.6.1 KMeans

Code
from sklearn.cluster import KMeans

# Perform kmean clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters, n_init='auto')
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = [[] for _ in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(sentences[sentence_id])
Code
clustering_model.labels_.shape
(100,)
Code
cluster_assignment
array([1, 0, 3, 0, 0, 3, 3, 3, 0, 1, 1, 0, 2, 0, 0, 4, 3, 4, 1, 0, 1, 0,
       1, 3, 0, 4, 2, 2, 2, 1, 0, 0, 3, 3, 0, 2, 3, 2, 3, 0, 4, 3, 1, 0,
       3, 0, 4, 0, 2, 3, 0, 4, 4, 3, 2, 3, 3, 3, 3, 2, 4, 2, 0, 3, 3, 2,
       3, 2, 2, 0, 4, 0, 4, 2, 1, 1, 0, 0, 3, 3, 4, 2, 1, 0, 1, 4, 4, 3,
       0, 1, 4, 2, 1, 2, 3, 4, 4, 1, 4, 2], dtype=int32)
Code
pd.Series(cluster_assignment).value_counts()
0    25
3    25
2    18
4    17
1    15
Name: count, dtype: int64
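We picked num_clusters = 5 arbitrarily. A common way to choose the number of clusters is the silhouette score: fit KMeans for a range of k values and keep the one that scores highest. Here is a minimal sketch on synthetic stand-in embeddings (the blob data below is illustrative, not the corpus embeddings above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in "embeddings": three well-separated blobs in 768 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.05, size=(30, 768)) for c in (0.0, 1.0, 2.0)])

# Score each candidate k and keep the best
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On real sentence embeddings the silhouette curve is usually much flatter than on toy blobs, so treat the score as a guide rather than a rule.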

21.3 Huggingface Pipeline function

The Huggingface pipeline function wraps everything together (tokenizer, model, and post-processing) for a number of common NLP tasks.

The format for the commands is as below:

from transformers import pipeline

# Using default model and tokenizer for the task
pipeline("<task-name>")

# Using a user-specified model
pipeline("<task-name>", model="<model_name>")

# Using custom model/tokenizer as str
pipeline('<task-name>', model='<model name>', tokenizer='<tokenizer_name>')

By default, pipeline selects a particular pretrained model that has been fine-tuned for the specified task. The model is downloaded and cached when you create the pipeline object; if you rerun the command, the cached model is used and there is no need to download it again.

Pipelines are made of:

  • A tokenizer in charge of mapping raw textual input to tokens.
  • A model to make predictions from the inputs.
  • Some (optional) post-processing for enhancing the model’s output.
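The post-processing step is typically lightweight. For a sentiment-style classification pipeline, for example, it amounts to a softmax over the model's raw logits followed by a label lookup. A numpy sketch (the logits and label map below are made-up illustrations, not output from a real model):

```python
import numpy as np

def postprocess(logits, id2label):
    # Softmax the raw logits (subtracting the max for numerical stability),
    # then report the highest-scoring label, mimicking a classification pipeline
    exp = np.exp(logits - np.max(logits))
    probs = exp / exp.sum()
    best = int(np.argmax(probs))
    return {'label': id2label[best], 'score': float(probs[best])}

# Hypothetical logits from a two-class sentiment head
print(postprocess(np.array([-1.2, 3.4]), {0: 'NEGATIVE', 1: 'POSITIVE'}))
```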

Some of the currently available pipelines are:

  • feature-extraction (get the vector representation of a text)
  • fill-mask
  • ner (named entity recognition)
  • question-answering
  • sentiment-analysis
  • summarization
  • text-generation
  • translation
  • zero-shot-classification

Each pipeline has a default model, which can be found in https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py

Pipeline                       Default Model
“feature-extraction”           “distilbert-base-cased”
“fill-mask”                    “distilroberta-base”
“ner”                          “dbmdz/bert-large-cased-finetuned-conll03-english”
“question-answering”           “distilbert-base-cased-distilled-squad”
“summarization”                “sshleifer/distilbart-cnn-12-6”
“translation”                  “t5-base”
“text-generation”              “gpt2”
“text2text-generation”         “t5-base”
“zero-shot-classification”     “facebook/bart-large-mnli”
“conversational”               “microsoft/DialoGPT-medium”

First, some library imports

Code
# First, some library imports
import gc

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tqdm
import torch

from transformers import AutoTokenizer, AutoModel, pipeline
Code
from platform import python_version

print(python_version())
3.12.4
Code
mytext = """
Panther Labs, an early stage startup that specializes in detection and response analytics, has raised a whopping $120 million in a new round of funding led by hedge fund Coatue Management.

Panther Labs said the Series B investment was raised at a $1.4 billion valuation, putting the company among a growing list of ‘unicorn’ cybersecurity startups.

In addition to Coatue Management, Panther Labs scored investments from ICONIQ Growth and Snowflake Ventures along with money from existing investors Lightspeed Venture Partners, S28 Capital, and Innovation Endeavors.

The company previously raised $15 million in a September 2020 Series A round.

The San Francisco firm, which was founded by Airbnb and AWS alumni, styles itself as a “cloud-scale security analytics platform” that helps organizations prevent breaches by providing actionable insights from large volumes of data.

The Panther product can be used by security teams to perform continuous security monitoring, gain security visibility across cloud and on-premise infrastructure, and build data lakes for incident response investigations.

In the last year, Panther claims its customer roster grew by 300 percent, including deals with big companies like Dropbox, Zapier and Snyk.

Panther Labs said the new funding will be used to speed up product development, expand go-to-marketing initiatives and scale support for its customers.

Related: Panther Labs Launches Open-Source Cloud-Native SIEM

Related: CyCognito Snags $100 Million for Attack Surface Management
"""
Code
print(len(mytext.split()))
print(mytext)
224

Panther Labs, an early stage startup that specializes in detection and response analytics, has raised a whopping $120 million in a new round of funding led by hedge fund Coatue Management.

Panther Labs said the Series B investment was raised at a $1.4 billion valuation, putting the company among a growing list of ‘unicorn’ cybersecurity startups.

In addition to Coatue Management, Panther Labs scored investments from ICONIQ Growth and Snowflake Ventures along with money from existing investors Lightspeed Venture Partners, S28 Capital, and Innovation Endeavors.

The company previously raised $15 million in a September 2020 Series A round.

The San Francisco firm, which was founded by Airbnb and AWS alumni, styles itself as a “cloud-scale security analytics platform” that helps organizations prevent breaches by providing actionable insights from large volumes of data.

The Panther product can be used by security teams to perform continuous security monitoring, gain security visibility across cloud and on-premise infrastructure, and build data lakes for incident response investigations.

In the last year, Panther claims its customer roster grew by 300 percent, including deals with big companies like Dropbox, Zapier and Snyk.

Panther Labs said the new funding will be used to speed up product development, expand go-to-marketing initiatives and scale support for its customers.

Related: Panther Labs Launches Open-Source Cloud-Native SIEM

Related: CyCognito Snags $100 Million for Attack Surface Management
Code
# No need to run the code below as symlinks have been defined by the Jupyterhub team

# Set default locations for downloaded models
# import os

# if os.name != 'nt':  # Do this only if in a non-Windows environment
    
#     if 'instructor' in os.getcwd(): # Set default model locations when logged in as instructor
#         os.environ['TRANSFORMERS_CACHE'] = '/home/instructor/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/instructor/shared/huggingface'
            
#     else: # Set default model locations when logged in as a student
#         os.environ['TRANSFORMERS_CACHE'] = '/home/jovyan/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/jovyan/shared/huggingface'
        

21.3.1 Embeddings/Feature Extraction

Feature extraction allows us to obtain token-level embeddings for a sentence. These are similar in spirit to the embeddings obtained from sentence-transformers (and identical if you use the same underlying model and pooling strategy).

Code
feature_extraction = pipeline('feature-extraction')
features = feature_extraction("i am awesome")
features = np.squeeze(features)
print(features.shape)
No model was supplied, defaulted to distilbert/distilbert-base-cased and revision 935ac13 (https://huggingface.co/distilbert/distilbert-base-cased).
Using a pipeline without specifying a model name and revision in production is not recommended.
(5, 768)
Code
# Mean-pool over the token axis to get a single sentence vector,
# analogous to `model.encode` in sentence-transformers
features = np.mean(features, axis=0)
Code
features.shape
(768,)
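With one pooled vector per text, semantic similarity reduces to cosine similarity between vectors. A quick numpy sketch (the random vectors below stand in for real 768-dimensional embeddings):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two 1-d embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
v1 = rng.normal(size=768)
v2 = v1 + rng.normal(scale=0.1, size=768)  # a near-duplicate of v1
v3 = rng.normal(size=768)                  # an unrelated vector

print(cosine_sim(v1, v2))  # close to 1
print(cosine_sim(v1, v3))  # close to 0
```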
Code
# Let us try feature extraction on mytext
features = feature_extraction(mytext)
features = np.squeeze(features)
print(features.shape)
(322, 768)
Code
# Free up memory
del feature_extraction
gc.collect()
93

21.3.2 Fill Mask

Code
fill_mask = pipeline('fill-mask') 
fill_mask('New York is a <mask>')
No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[{'score': 0.1009121835231781,
  'token': 8018,
  'token_str': ' joke',
  'sequence': 'New York is a joke'},
 {'score': 0.04816760495305061,
  'token': 4593,
  'token_str': ' democracy',
  'sequence': 'New York is a democracy'},
 {'score': 0.04618655890226364,
  'token': 7319,
  'token_str': ' mess',
  'sequence': 'New York is a mess'},
 {'score': 0.04198974370956421,
  'token': 20812,
  'token_str': ' circus',
  'sequence': 'New York is a circus'},
 {'score': 0.024249661713838577,
  'token': 43689,
  'token_str': ' wasteland',
  'sequence': 'New York is a wasteland'}]
Code
fill_mask = pipeline('fill-mask', model = 'distilroberta-base')
fill_mask('New <mask> is a great city')
Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[{'score': 0.4224233627319336,
  'token': 469,
  'token_str': ' York',
  'sequence': 'New York is a great city'},
 {'score': 0.23672223091125488,
  'token': 4942,
  'token_str': ' Orleans',
  'sequence': 'New Orleans is a great city'},
 {'score': 0.08853647857904434,
  'token': 3123,
  'token_str': ' Jersey',
  'sequence': 'New Jersey is a great city'},
 {'score': 0.06783472746610641,
  'token': 3534,
  'token_str': ' Delhi',
  'sequence': 'New Delhi is a great city'},
 {'score': 0.03218536078929901,
  'token': 12050,
  'token_str': ' Haven',
  'sequence': 'New Haven is a great city'}]
Code
fill_mask('Joe Biden is a good <mask>')
[{'score': 0.09071354568004608,
  'token': 2173,
  'token_str': ' guy',
  'sequence': 'Joe Biden is a good guy'},
 {'score': 0.07118388265371323,
  'token': 1441,
  'token_str': ' friend',
  'sequence': 'Joe Biden is a good friend'},
 {'score': 0.03984031453728676,
  'token': 30443,
  'token_str': ' listener',
  'sequence': 'Joe Biden is a good listener'},
 {'score': 0.03301309794187546,
  'token': 28587,
  'token_str': ' liar',
  'sequence': 'Joe Biden is a good liar'},
 {'score': 0.030751319602131844,
  'token': 313,
  'token_str': ' man',
  'sequence': 'Joe Biden is a good man'}]
Code
fill_mask('Joe Biden is in a good <mask>')
[{'score': 0.8292393088340759,
  'token': 6711,
  'token_str': ' mood',
  'sequence': 'Joe Biden is in a good mood'},
 {'score': 0.040497832000255585,
  'token': 3989,
  'token_str': ' shape',
  'sequence': 'Joe Biden is in a good shape'},
 {'score': 0.02688208967447281,
  'token': 317,
  'token_str': ' place',
  'sequence': 'Joe Biden is in a good place'},
 {'score': 0.024331938475370407,
  'token': 1514,
  'token_str': ' spot',
  'sequence': 'Joe Biden is in a good spot'},
 {'score': 0.013950899243354797,
  'token': 737,
  'token_str': ' position',
  'sequence': 'Joe Biden is in a good position'}]

21.3.3 Sentiment Analysis (+ve/-ve)

Code
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

classifier("It was sort of ok")
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
[{'label': 'POSITIVE', 'score': 0.9996662139892578}]
Code
classifier(mytext)
[{'label': 'POSITIVE', 'score': 0.8596639633178711}]
Code
# Free up memory
del classifier
gc.collect()
33

21.3.4 Named Entity Recognition

Identify tokens as belonging to one of 9 classes:

  • O, Outside of a named entity
  • B-MISC, Beginning of a miscellaneous entity right after another miscellaneous entity
  • I-MISC, Miscellaneous entity
  • B-PER, Beginning of a person’s name right after another person’s name
  • I-PER, Person’s name
  • B-ORG, Beginning of an organisation right after another organisation
  • I-ORG, Organisation
  • B-LOC, Beginning of a location right after another location
  • I-LOC, Location
Code
ner = pipeline("ner") 

ner("Seattle is a city in Washington where Microsoft is headquartered")
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[{'entity': 'I-LOC',
  'score': 0.99756324,
  'index': 1,
  'word': 'Seattle',
  'start': 0,
  'end': 7},
 {'entity': 'I-LOC',
  'score': 0.9981115,
  'index': 6,
  'word': 'Washington',
  'start': 21,
  'end': 31},
 {'entity': 'I-ORG',
  'score': 0.999338,
  'index': 8,
  'word': 'Microsoft',
  'start': 38,
  'end': 47}]
Code
ner(mytext)
[{'entity': 'I-ORG',
  'score': 0.99932563,
  'index': 1,
  'word': 'Panther',
  'start': 1,
  'end': 8},
 {'entity': 'I-ORG',
  'score': 0.9993229,
  'index': 2,
  'word': 'Labs',
  'start': 9,
  'end': 13},
 {'entity': 'I-ORG',
  'score': 0.9992663,
  'index': 37,
  'word': 'Co',
  'start': 171,
  'end': 173},
 {'entity': 'I-ORG',
  'score': 0.9986853,
  'index': 38,
  'word': '##at',
  'start': 173,
  'end': 175},
 {'entity': 'I-ORG',
  'score': 0.999196,
  'index': 39,
  'word': '##ue',
  'start': 175,
  'end': 177},
 {'entity': 'I-ORG',
  'score': 0.99944323,
  'index': 40,
  'word': 'Management',
  'start': 178,
  'end': 188},
 {'entity': 'I-ORG',
  'score': 0.9994549,
  'index': 42,
  'word': 'Panther',
  'start': 191,
  'end': 198},
 {'entity': 'I-ORG',
  'score': 0.9986261,
  'index': 43,
  'word': 'Labs',
  'start': 199,
  'end': 203},
 {'entity': 'I-ORG',
  'score': 0.99832755,
  'index': 85,
  'word': 'Co',
  'start': 367,
  'end': 369},
 {'entity': 'I-ORG',
  'score': 0.9989543,
  'index': 86,
  'word': '##at',
  'start': 369,
  'end': 371},
 {'entity': 'I-ORG',
  'score': 0.99904543,
  'index': 87,
  'word': '##ue',
  'start': 371,
  'end': 373},
 {'entity': 'I-ORG',
  'score': 0.99918145,
  'index': 88,
  'word': 'Management',
  'start': 374,
  'end': 384},
 {'entity': 'I-ORG',
  'score': 0.99947304,
  'index': 90,
  'word': 'Panther',
  'start': 386,
  'end': 393},
 {'entity': 'I-ORG',
  'score': 0.9986386,
  'index': 91,
  'word': 'Labs',
  'start': 394,
  'end': 398},
 {'entity': 'I-ORG',
  'score': 0.9969086,
  'index': 95,
  'word': 'I',
  'start': 423,
  'end': 424},
 {'entity': 'I-ORG',
  'score': 0.98679113,
  'index': 96,
  'word': '##CO',
  'start': 424,
  'end': 426},
 {'entity': 'I-ORG',
  'score': 0.9962644,
  'index': 97,
  'word': '##NI',
  'start': 426,
  'end': 428},
 {'entity': 'I-ORG',
  'score': 0.9870978,
  'index': 98,
  'word': '##Q',
  'start': 428,
  'end': 429},
 {'entity': 'I-ORG',
  'score': 0.995076,
  'index': 99,
  'word': 'Growth',
  'start': 430,
  'end': 436},
 {'entity': 'I-ORG',
  'score': 0.997384,
  'index': 101,
  'word': 'Snow',
  'start': 441,
  'end': 445},
 {'entity': 'I-ORG',
  'score': 0.99732804,
  'index': 102,
  'word': '##f',
  'start': 445,
  'end': 446},
 {'entity': 'I-ORG',
  'score': 0.9969291,
  'index': 103,
  'word': '##lake',
  'start': 446,
  'end': 450},
 {'entity': 'I-ORG',
  'score': 0.99730384,
  'index': 104,
  'word': 'Ventures',
  'start': 451,
  'end': 459},
 {'entity': 'I-ORG',
  'score': 0.99798065,
  'index': 111,
  'word': 'Lights',
  'start': 501,
  'end': 507},
 {'entity': 'I-ORG',
  'score': 0.9802942,
  'index': 112,
  'word': '##pe',
  'start': 507,
  'end': 509},
 {'entity': 'I-ORG',
  'score': 0.99478084,
  'index': 113,
  'word': '##ed',
  'start': 509,
  'end': 511},
 {'entity': 'I-ORG',
  'score': 0.99712026,
  'index': 114,
  'word': 'Venture',
  'start': 512,
  'end': 519},
 {'entity': 'I-ORG',
  'score': 0.99780315,
  'index': 115,
  'word': 'Partners',
  'start': 520,
  'end': 528},
 {'entity': 'I-ORG',
  'score': 0.9866433,
  'index': 117,
  'word': 'S',
  'start': 530,
  'end': 531},
 {'entity': 'I-ORG',
  'score': 0.97416526,
  'index': 118,
  'word': '##28',
  'start': 531,
  'end': 533},
 {'entity': 'I-ORG',
  'score': 0.9915843,
  'index': 119,
  'word': 'Capital',
  'start': 534,
  'end': 541},
 {'entity': 'I-ORG',
  'score': 0.9983632,
  'index': 122,
  'word': 'Innovation',
  'start': 547,
  'end': 557},
 {'entity': 'I-ORG',
  'score': 0.9993075,
  'index': 123,
  'word': 'End',
  'start': 558,
  'end': 561},
 {'entity': 'I-ORG',
  'score': 0.9934894,
  'index': 124,
  'word': '##eavor',
  'start': 561,
  'end': 566},
 {'entity': 'I-ORG',
  'score': 0.98961776,
  'index': 125,
  'word': '##s',
  'start': 566,
  'end': 567},
 {'entity': 'I-LOC',
  'score': 0.99653375,
  'index': 143,
  'word': 'San',
  'start': 653,
  'end': 656},
 {'entity': 'I-LOC',
  'score': 0.99250937,
  'index': 144,
  'word': 'Francisco',
  'start': 657,
  'end': 666},
 {'entity': 'I-ORG',
  'score': 0.9983175,
  'index': 151,
  'word': 'Air',
  'start': 694,
  'end': 697},
 {'entity': 'I-ORG',
  'score': 0.98135924,
  'index': 152,
  'word': '##b',
  'start': 697,
  'end': 698},
 {'entity': 'I-ORG',
  'score': 0.6833769,
  'index': 153,
  'word': '##n',
  'start': 698,
  'end': 699},
 {'entity': 'I-ORG',
  'score': 0.9928785,
  'index': 154,
  'word': '##b',
  'start': 699,
  'end': 700},
 {'entity': 'I-ORG',
  'score': 0.998475,
  'index': 156,
  'word': 'A',
  'start': 705,
  'end': 706},
 {'entity': 'I-ORG',
  'score': 0.99682593,
  'index': 157,
  'word': '##WS',
  'start': 706,
  'end': 708},
 {'entity': 'I-ORG',
  'score': 0.80408233,
  'index': 192,
  'word': 'Panther',
  'start': 886,
  'end': 893},
 {'entity': 'I-ORG',
  'score': 0.995609,
  'index': 231,
  'word': 'Panther',
  'start': 1122,
  'end': 1129},
 {'entity': 'I-ORG',
  'score': 0.9984397,
  'index': 247,
  'word': 'Drop',
  'start': 1218,
  'end': 1222},
 {'entity': 'I-ORG',
  'score': 0.9981306,
  'index': 248,
  'word': '##box',
  'start': 1222,
  'end': 1225},
 {'entity': 'I-ORG',
  'score': 0.99752074,
  'index': 250,
  'word': 'Z',
  'start': 1227,
  'end': 1228},
 {'entity': 'I-ORG',
  'score': 0.96972084,
  'index': 251,
  'word': '##ap',
  'start': 1228,
  'end': 1230},
 {'entity': 'I-ORG',
  'score': 0.99131715,
  'index': 252,
  'word': '##ier',
  'start': 1230,
  'end': 1233},
 {'entity': 'I-ORG',
  'score': 0.9980101,
  'index': 254,
  'word': 'S',
  'start': 1238,
  'end': 1239},
 {'entity': 'I-ORG',
  'score': 0.9695136,
  'index': 255,
  'word': '##ny',
  'start': 1239,
  'end': 1241},
 {'entity': 'I-ORG',
  'score': 0.99053967,
  'index': 256,
  'word': '##k',
  'start': 1241,
  'end': 1242},
 {'entity': 'I-ORG',
  'score': 0.9990858,
  'index': 258,
  'word': 'Panther',
  'start': 1245,
  'end': 1252},
 {'entity': 'I-ORG',
  'score': 0.99652547,
  'index': 259,
  'word': 'Labs',
  'start': 1253,
  'end': 1257},
 {'entity': 'I-ORG',
  'score': 0.99833304,
  'index': 290,
  'word': 'Panther',
  'start': 1407,
  'end': 1414},
 {'entity': 'I-ORG',
  'score': 0.9907589,
  'index': 291,
  'word': 'Labs',
  'start': 1415,
  'end': 1419},
 {'entity': 'I-ORG',
  'score': 0.98082525,
  'index': 306,
  'word': 'Cy',
  'start': 1469,
  'end': 1471},
 {'entity': 'I-ORG',
  'score': 0.9829427,
  'index': 307,
  'word': '##C',
  'start': 1471,
  'end': 1472},
 {'entity': 'I-ORG',
  'score': 0.9704884,
  'index': 308,
  'word': '##og',
  'start': 1472,
  'end': 1474},
 {'entity': 'I-ORG',
  'score': 0.87991095,
  'index': 309,
  'word': '##ni',
  'start': 1474,
  'end': 1476},
 {'entity': 'I-ORG',
  'score': 0.97091585,
  'index': 310,
  'word': '##to',
  'start': 1476,
  'end': 1478}]
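Notice how single words get split into WordPiece fragments: 'Coatue' comes back as 'Co', '##at', '##ue'. Recent transformers releases can merge these for you (pipeline("ner", aggregation_strategy="simple")); conceptually the merge just glues '##' continuation tokens onto the preceding token. A pure-Python sketch using fragments taken from the output above:

```python
def merge_subwords(entities):
    # Glue WordPiece continuation tokens ('##...') onto the preceding token,
    # a rough sketch of what aggregation_strategy='simple' does in the ner pipeline
    merged = []
    for ent in entities:
        if ent['word'].startswith('##') and merged:
            prev = merged[-1]
            prev['word'] += ent['word'][2:]   # strip the '##' marker
            prev['end'] = ent['end']          # extend the character span
        else:
            merged.append(dict(ent))
    return merged

pieces = [
    {'entity': 'I-ORG', 'word': 'Co',   'start': 171, 'end': 173},
    {'entity': 'I-ORG', 'word': '##at', 'start': 173, 'end': 175},
    {'entity': 'I-ORG', 'word': '##ue', 'start': 175, 'end': 177},
    {'entity': 'I-ORG', 'word': 'Management', 'start': 178, 'end': 188},
]
print(merge_subwords(pieces))
# [{'entity': 'I-ORG', 'word': 'Coatue', 'start': 171, 'end': 177},
#  {'entity': 'I-ORG', 'word': 'Management', 'start': 178, 'end': 188}]
```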
Code
# Free up memory
del ner
gc.collect()
33

21.3.5 Question Answering

Code
from transformers import pipeline

question_answerer = pipeline("question-answering") 

question_answerer(
    question="Where do I work?",
    context="My name is Mukul and I work at NYU Tandon in Brooklyn",
)
No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
{'score': 0.7861828804016113, 'start': 31, 'end': 41, 'answer': 'NYU Tandon'}
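The start and end fields are character offsets into the context string, so the answer text can always be recovered by slicing (shown here with the result above):

```python
context = "My name is Mukul and I work at NYU Tandon in Brooklyn"
result = {'score': 0.7861828804016113, 'start': 31, 'end': 41, 'answer': 'NYU Tandon'}

# The answer text is exactly the [start:end] slice of the context
print(context[result['start']:result['end']])  # NYU Tandon
```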
Code
print(mytext)

Panther Labs, an early stage startup that specializes in detection and response analytics, has raised a whopping $120 million in a new round of funding led by hedge fund Coatue Management.

Panther Labs said the Series B investment was raised at a $1.4 billion valuation, putting the company among a growing list of ‘unicorn’ cybersecurity startups.

In addition to Coatue Management, Panther Labs scored investments from ICONIQ Growth and Snowflake Ventures along with money from existing investors Lightspeed Venture Partners, S28 Capital, and Innovation Endeavors.

The company previously raised $15 million in a September 2020 Series A round.

The San Francisco firm, which was founded by Airbnb and AWS alumni, styles itself as a “cloud-scale security analytics platform” that helps organizations prevent breaches by providing actionable insights from large volumes of data.

The Panther product can be used by security teams to perform continuous security monitoring, gain security visibility across cloud and on-premise infrastructure, and build data lakes for incident response investigations.

In the last year, Panther claims its customer roster grew by 300 percent, including deals with big companies like Dropbox, Zapier and Snyk.

Panther Labs said the new funding will be used to speed up product development, expand go-to-marketing initiatives and scale support for its customers.

Related: Panther Labs Launches Open-Source Cloud-Native SIEM

Related: CyCognito Snags $100 Million for Attack Surface Management
Code
question_answerer(
    question = "How much did Panther Labs raise",
    context = mytext,
)
{'score': 0.02731592394411564,
 'start': 249,
 'end': 261,
 'answer': '$1.4 billion'}
Code
question_answerer(
    question = "How much did Panther Labs raise previously",
    context = mytext,
)
{'score': 0.6693971753120422,
 'start': 600,
 'end': 611,
 'answer': '$15 million'}
Code
question_answerer(
    question = "Who founded Panther Labs",
    context = mytext,
)
{'score': 2.9083132176310755e-05,
 'start': 694,
 'end': 715,
 'answer': 'Airbnb and AWS alumni'}
Code
# Free up memory
del question_answerer
gc.collect()
73

21.3.6 Summarization

Code
from transformers import pipeline

summarizer = pipeline("summarization")

summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
tokenizer_config.json: 100%|██████████| 26.0/26.0 [00:00<00:00, 76.4kB/s]
vocab.json: 100%|██████████| 899k/899k [00:00<00:00, 10.5MB/s]
merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 17.8MB/s]
[{'summary_text': ' America suffers an increasingly serious decline in the number of engineering graduates and a lack of well-educated engineers . Rapidly developing economies such as China and India, as well as other industrial countries in Europe and Asia, continue to encourage and advance the teaching of engineering . Both China and India graduate six and eight times as many traditional engineers as does the United States .'}]
Code
mytext = """
Panther Labs, an early stage startup that specializes in detection and response analytics, has raised a whopping $120 million in a new round of funding led by hedge fund Coatue Management.

Panther Labs said the Series B investment was raised at a $1.4 billion valuation, putting the company among a growing list of ‘unicorn’ cybersecurity startups.

In addition to Coatue Management, Panther Labs scored investments from ICONIQ Growth and Snowflake Ventures along with money from existing investors Lightspeed Venture Partners, S28 Capital, and Innovation Endeavors.

The company previously raised $15 million in a September 2020 Series A round.

The San Francisco firm, which was founded by Airbnb and AWS alumni, styles itself as a “cloud-scale security analytics platform” that helps organizations prevent breaches by providing actionable insights from large volumes of data.

The Panther product can be used by security teams to perform continuous security monitoring, gain security visibility across cloud and on-premise infrastructure, and build data lakes for incident response investigations.

In the last year, Panther claims its customer roster grew by 300 percent, including deals with big companies like Dropbox, Zapier and Snyk.

Panther Labs said the new funding will be used to speed up product development, expand go-to-marketing initiatives and scale support for its customers.

Related: Panther Labs Launches Open-Source Cloud-Native SIEM

Related: CyCognito Snags $100 Million for Attack Surface Management
"""
Code
summarizer(mytext)
[{'summary_text': ' Panther Labs is a ‘cloud-scale security analytics platform’ that helps organizations prevent breaches by providing actionable insights from large volumes of data . The San Francisco startup claims its customer roster grew by 300 percent in the last year, including deals with Dropbox, Zapier and Snyk . The new funding will be used to speed up product development and expand go-to-marketing initiatives .'}]
Code
# Free up memory
del summarizer
gc.collect()
0

21.3.6.1 Try a different model

Code
from transformers import pipeline
import torch
summarizer = pipeline(task="summarization",
                      model="facebook/bart-large-cnn",
                      torch_dtype=torch.bfloat16)

Model info: ‘bart-large-cnn’

Code
%%time
summarizer(mytext, min_length=10, max_length=100)
CPU times: user 2min 53s, sys: 852 ms, total: 2min 54s
Wall time: 2min 56s
[{'summary_text': 'Panther Labs, an early stage startup that specializes in detection and response analytics, has raised a whopping $120 million in a new round of funding. The Series B investment was raised at a $1.4 billion valuation, putting the company among a growing list of ‘unicorn’ cybersecurity startups.'}]
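BART-based summarizers can only attend to a limited context (about 1,024 tokens), so documents longer than that must be split and summarized in pieces. Below is a rough word-based chunker sketch; the 400-word limit is an arbitrary stand-in, since the real constraint is counted in tokens, not words.

```python
def chunk_text(text, max_words=400):
    """Split text into chunks of at most max_words words (a rough proxy
    for the model's ~1024-token limit; exact limits are token-based)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# Each chunk would then be summarized separately, e.g.:
# partials = [summarizer(c, max_length=60)[0]["summary_text"] for c in chunk_text(long_doc)]
# final = summarizer(" ".join(partials))[0]["summary_text"]
chunks = chunk_text("word " * 1000)
print(len(chunks))  # 3
```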
Code
# Free up memory
del summarizer
gc.collect()
0

21.3.7 Translation

Code
# No need to run the code below as symlinks have been defined by the Jupyterhub team

# Set default locations for downloaded models
# import os

# if os.name != 'nt':  # Do this only if in a non-Windows environment
    
#     if 'instructor' in os.getcwd(): # Set default model locations when logged in as instructor
#         os.environ['TRANSFORMERS_CACHE'] = '/home/instructor/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/instructor/shared/huggingface'
            
#     else: # Set default model locations when logged in as a student
#         os.environ['TRANSFORMERS_CACHE'] = '/home/jovyan/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/jovyan/shared/huggingface'
        
Code
translator = pipeline("translation_en_to_fr")
translator("I do not speak French")
No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
config.json: 100%|██████████| 1.21k/1.21k [00:00<00:00, 3.74MB/s]
model.safetensors: 100%|██████████| 892M/892M [00:08<00:00, 105MB/s]  
generation_config.json: 100%|██████████| 147/147 [00:00<00:00, 523kB/s]
spiece.model: 100%|██████████| 792k/792k [00:00<00:00, 23.1MB/s]
tokenizer.json: 100%|██████████| 1.39M/1.39M [00:00<00:00, 29.0MB/s]
/opt/conda/envs/mggy8413/lib/python3.11/site-packages/transformers/models/t5/tokenization_t5_fast.py:160: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(
[{'translation_text': 'Je ne parle pas français'}]
Code
# Free up memory
del translator
gc.collect()
33

21.3.7.1 Translation using NLLB (No Language Left Behind)

Code
# First, some memory cleanup using garbage collector

# del summarizer
# del feature_extraction
# del fill_mask
# del classifier
# del ner
# del question_answerer
# del translator
gc.collect()
0
Code
from transformers import pipeline 
import torch
Code
translator = pipeline(task="translation",
                      model="facebook/nllb-200-distilled-600M",
                      torch_dtype=torch.bfloat16) 

Model info: ‘nllb-200-distilled-600M’ (NLLB: No Language Left Behind).

Code
text = """Change will not come if we wait for some other person or some other time. We are the ones we've been waiting for. We are the change that we seek."""
Code
text_translated = translator(text,
                             src_lang="eng_Latn",
                             tgt_lang="fra_Latn")

To translate between other languages, look up the language codes on the page Languages in FLORES-200.

For example:

- Afrikaans: afr_Latn
- Chinese: zho_Hans
- Egyptian Arabic: arz_Arab
- French: fra_Latn
- German: deu_Latn
- Greek: ell_Grek
- Hindi: hin_Deva
- Indonesian: ind_Latn
- Italian: ita_Latn
- Japanese: jpn_Jpan
- Korean: kor_Hang
- Mandarin Chinese (Standard Beijing): cmn_Hans
- Mandarin Chinese (Taiwanese): cmn_Hant
- Persian: pes_Arab
- Portuguese: por_Latn
- Russian: rus_Cyrl
- Spanish: spa_Latn
- Swahili: swh_Latn
- Thai: tha_Thai
- Turkish: tur_Latn
- Vietnamese: vie_Latn
- Yue Chinese (Hong Kong Cantonese): yue_Hant
- Zulu: zul_Latn
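For convenience, a small name-to-code mapping saves looking the codes up each time. This is a hand-made helper, not part of transformers; it covers only a subset of the codes listed above.

```python
# Convenience mapping over a handful of FLORES-200 codes
FLORES = {
    "French": "fra_Latn", "German": "deu_Latn", "Hindi": "hin_Deva",
    "Japanese": "jpn_Jpan", "Korean": "kor_Hang", "Spanish": "spa_Latn",
}

def flores_code(language):
    try:
        return FLORES[language]
    except KeyError:
        raise ValueError(f"Unknown language {language!r}; see the FLORES-200 list")

# Usage with the translator above:
# translator(text, src_lang="eng_Latn", tgt_lang=flores_code("German"))
print(flores_code("German"))  # deu_Latn
```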

Code
text_translated
[{'translation_text': 'Le changement ne viendra pas si nous attendons une autre personne ou un autre moment. Nous sommes ceux que nous avons attendus. Nous sommes le changement que nous cherchons.'}]
Code
%%time
print(text)
translator(text, src_lang="eng_Latn", tgt_lang="hin_Deva")
Change will not come if we wait for some other person or some other time. We are the ones we've been waiting for. We are the change that we seek.
CPU times: user 25 s, sys: 135 ms, total: 25.1 s
Wall time: 25.5 s
[{'translation_text': 'परिवर्तन नहीं आएगा अगर हम किसी और व्यक्ति या किसी अन्य समय का इंतजार करें. हम वही हैं जिनका हम इंतजार कर रहे हैं. हम वही बदलाव हैं जिसकी हम तलाश कर रहे हैं।'}]
Code
%%time
print(text)
translator(text, src_lang="eng_Latn", tgt_lang="kor_Hang")
Change will not come if we wait for some other person or some other time. We are the ones we've been waiting for. We are the change that we seek.
CPU times: user 18.9 s, sys: 102 ms, total: 19 s
Wall time: 19.1 s
[{'translation_text': '변화는 다른 사람을 기다린다면 오지 않을 것입니다. 우리는 우리가 기다린 사람들입니다. 우리는 우리가 추구하는 변화입니다.'}]
Code
%%time
print(text)
translator(text, src_lang="eng_Latn", tgt_lang="yue_Hant")
Change will not come if we wait for some other person or some other time. We are the ones we've been waiting for. We are the change that we seek.
CPU times: user 13.6 s, sys: 71.4 ms, total: 13.7 s
Wall time: 13.8 s
[{'translation_text': '如果我哋等到另一個人或者其他時間 我哋就唔會改變'}]
Code
del translator
gc.collect()
165

21.3.8 Text Generation

Code
# No need to run the code below as symlinks have been defined by the Jupyterhub team

# Set default locations for downloaded models
# import os

# if os.name != 'nt':  # Do this only if in a non-Windows environment
    
#     if 'instructor' in os.getcwd(): # Set default model locations when logged in as instructor
#         os.environ['TRANSFORMERS_CACHE'] = '/home/instructor/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/instructor/shared/huggingface'
            
#     else: # Set default model locations when logged in as a student
#         os.environ['TRANSFORMERS_CACHE'] = '/home/jovyan/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/jovyan/shared/huggingface'
        
Code
generator = pipeline("text-generation")

generator("In this course, we will teach you how to", max_length = 100, num_return_sequences=4)
No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': "In this course, we will teach you how to create, use, and use a system for creating your own apps, with some guidance from our CTO, Andrew.\n\nIf you would like to make a suggestion for a new course, please do not hesitate to contact me.\n\nDisclaimer\n\nBy clicking 'help' we represent and encourage your interest in the development of our app, and if you find it useful or helpful please help out. We also take all advice from our technical"},
 {'generated_text': 'In this course, we will teach you how to apply the knowledge contained in your own training to make your own unique, unique work.\n\nClick Here to Register!\n\nAs you can see, one of the most common mistakes people make is to think that no one is doing your training. It is far from true. In this course, you will teach all you need to be a successful professional:\n\n• Know the exact workout and set up. This will be your foundation for training'},
 {'generated_text': 'In this course, we will teach you how to make a solid foundation for a successful and successful business.\n\n\nWe will build your website in the time of your choice, so that it helps you out and not hinder you. We will also explain how to build a solid community to get started building your website.\n\n\nWe will teach you how to keep yourself on the right track, and how to get you going fast. We will also describe how to design a good social media campaign. We'},
 {'generated_text': 'In this course, we will teach you how to build a strong online presence through social media, in your daily routine, and how to keep your own personal information secure on social media.\n\nLearn to build a strong online presence by making it your central field of inquiry, a key value in your company, and especially a tool of collaboration.\n\nLearn to build a strong online presence by making it your central field of inquiry, a key value in your company, and especially a tool of collaboration'}]
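The four different continuations come from sampling rather than greedy decoding: at each step the model draws the next token from a temperature-scaled, optionally top-k-truncated distribution. The sketch below illustrates that mechanism on toy logits; the function and numbers are illustrative, not the transformers library's internal implementation.

```python
import math
import random

def sample_top_k(logits, k=2, temperature=1.0, rng=None):
    """Illustrative top-k sampling: keep the k largest logits, apply
    temperature, softmax, then draw one token index."""
    rng = rng or random.Random(0)
    scaled = [x / temperature for x in logits]
    top = sorted(range(len(scaled)), key=lambda i: scaled[i])[-k:]  # k best ids
    m = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - m) for i in top]  # unnormalized softmax
    return rng.choices(top, weights=weights, k=1)[0]

token = sample_top_k([2.0, 0.5, 1.5, -1.0], k=2)
print(token)  # one of 0 or 2 (the two highest-logit positions)
```

Lower temperatures sharpen the distribution toward the top token; higher temperatures (and larger k) produce more varied text, which is why repeated calls give different continuations.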
Code
del generator
gc.collect()
172

21.3.9 Zero Shot Classification

Code
# Set default locations for downloaded models
# import os

# if os.name != 'nt':  # Do this only if in a non-Windows environment
    
#     if 'instructor' in os.getcwd(): # Set default model locations when logged in as instructor
#         os.environ['TRANSFORMERS_CACHE'] = '/home/instructor/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/instructor/shared/huggingface'
            
#     else: # Set default model locations when logged in as a student
#         os.environ['TRANSFORMERS_CACHE'] = '/home/jovyan/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/jovyan/shared/huggingface'
        
Code
from transformers import pipeline

classifier = pipeline("zero-shot-classification")

classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445993661880493, 0.1119738519191742, 0.04342673718929291]}
Code
classifier(mytext, candidate_labels=["education", "politics", "business"])
{'sequence': '\nPanther Labs, an early stage startup that specializes in detection and response analytics, has raised a whopping $120 million in a new round of funding led by hedge fund Coatue Management.\n\nPanther Labs said the Series B investment was raised at a $1.4 billion valuation, putting the company among a growing list of ‘unicorn’ cybersecurity startups.\n\nIn addition to Coatue Management, Panther Labs scored investments from ICONIQ Growth and Snowflake Ventures along with money from existing investors Lightspeed Venture Partners, S28 Capital, and Innovation Endeavors.\n\nThe company previously raised $15 million in a September 2020 Series A round.\n\nThe San Francisco firm, which was founded by Airbnb and AWS alumni, styles itself as a “cloud-scale security analytics platform” that helps organizations prevent breaches by providing actionable insights from large volumes of data.\n\nThe Panther product can be used by security teams to perform continuous security monitoring, gain security visibility across cloud and on-premise infrastructure, and build data lakes for incident response investigations.\n\nIn the last year, Panther claims its customer roster grew by 300 percent, including deals with big companies like Dropbox, Zapier and Snyk.\n\nPanther Labs said the new funding will be used to speed up product development, expand go-to-marketing initiatives and scale support for its customers.\n\nRelated: Panther Labs Launches Open-Source Cloud-Native SIEM\n\nRelated: CyCognito Snags $100 Million for Attack Surface Management\n',
 'labels': ['business', 'politics', 'education'],
 'scores': [0.8694897890090942, 0.06767456978559494, 0.0628357082605362]}
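Under the hood, zero-shot classification recasts the task as natural-language inference: the input is paired with one hypothesis per candidate label, and the NLI model's entailment probability becomes that label's score. The sketch below builds those hypotheses with the pipeline's default template string.

```python
# The pipeline pairs the input with one NLI hypothesis per candidate label.
# "This example is {}." is the pipeline's default hypothesis_template;
# classifier(...) accepts a hypothesis_template argument to customise it.
def build_hypotheses(labels, template="This example is {}."):
    return [template.format(label) for label in labels]

hyps = build_hypotheses(["education", "politics", "business"])
print(hyps[0])  # This example is education.
```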
Code
del classifier
gc.collect()
73

21.4 Audio Classification

Code
import librosa

# File path of the audio file
file_path = r"futuristic-timelapse-11951.mp3"

# Load the audio file
audio_data, sample_rate = librosa.load(file_path)

# Print the shape of the audio data and the sample rate
print("Shape of audio data:", audio_data.shape)
print("Sample rate:", sample_rate)
Shape of audio data: (2646720,)
Sample rate: 22050
Code
audio_data
array([ 1.6763806e-08,  1.8626451e-08,  3.7252903e-09, ...,
       -1.2981509e-06,  3.9851147e-07,  1.4643756e-06], dtype=float32)
Code
import librosa
from IPython.display import Audio

# File path of the audio file
# file_path = 'audio_file.mp3'

# Load the audio file
audio_data, sample_rate = librosa.load(file_path, sr=None)

# Play the audio
Audio(data=audio_data, rate=sample_rate)
Code
sample_rate
44100
Code
resampled_audio = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=16000)
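Resampling scales the number of samples by the ratio of the rates, target_sr / orig_sr. A quick sanity check on the arithmetic, using a roughly 120-second clip like the one loaded above (the duration is approximate, and librosa's exact output length can differ by a sample due to rounding):

```python
# Resampling scales the sample count by target_sr / orig_sr
duration_s = 120                            # roughly the clip length above
orig_sr, target_sr = 44100, 16000

n_orig = duration_s * orig_sr               # samples at the original rate
n_expected = n_orig * target_sr // orig_sr  # samples after resampling
print(n_expected)  # 1920000
```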
Code
from transformers import pipeline
Code
zero_shot_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused")
Code
candidate_labels = ["Sound of a dog",
                    "Sound of vacuum cleaner"]
Code
zero_shot_classifier(resampled_audio,
                     candidate_labels=candidate_labels)
[{'score': 0.990260899066925, 'label': 'Sound of a dog'},
 {'score': 0.009739157743752003, 'label': 'Sound of vacuum cleaner'}]
Code
candidate_labels = ["Sound of a child crying",
                    "Sound of vacuum cleaner",
                    "Sound of a bird singing",
                    "Sound of an airplane",
                    "Sound of techno music"]
Code
zero_shot_classifier(resampled_audio,
                     candidate_labels=candidate_labels)
[{'score': 0.9880968332290649, 'label': 'Sound of techno music'},
 {'score': 0.008278903551399708, 'label': 'Sound of an airplane'},
 {'score': 0.0028479923494160175, 'label': 'Sound of a child crying'},
 {'score': 0.0007249810150824487, 'label': 'Sound of a bird singing'},
 {'score': 5.125454845256172e-05, 'label': 'Sound of vacuum cleaner'}]
Code
# Free up memory
del zero_shot_classifier
gc.collect()
1284

21.4.1 Automatic Speech Recognition

Whisper large-v3 (2024): The distil-whisper/distil-small.en model used here is Hugging Face's distilled version of OpenAI's Whisper, optimised for English. For multilingual transcription or maximum accuracy, use Whisper large-v3 (openai/whisper-large-v3); it supports 99 languages and achieves near-human accuracy on many benchmarks. faster-whisper (pip install faster-whisper) is a CTranslate2-based reimplementation that runs roughly 4× faster at comparable accuracy.

Code
import librosa
from IPython.display import Audio

# File path of the audio file
# file_path = 'audio_file.mp3'

# Load the audio file
audio_data, sample_rate = librosa.load("stereo_file.wav", sr=None)

# Play the audio
Audio(data=audio_data, rate=sample_rate)
Code
sample_rate
16000
Code
from transformers import pipeline
Code
asr = pipeline(task="automatic-speech-recognition",
               model="distil-whisper/distil-small.en")
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Info about distil-whisper/distil-small.en

Code
asr.feature_extractor.sampling_rate
16000
Code
asr(audio_data)
{'text': ' Chapter 16 I might have told you of the beginning of this liaison in a few lines, but I wanted you to see every step by which we came. I too agree to whatever Marguerite wished.'}
Code
del asr
gc.collect()
36

21.5 Vision

Multimodal models in 2025: The vision tasks in this chapter (object detection, image captioning, VQA) use single-purpose models from 2022–2023. Modern multimodal foundation models can handle all of these tasks through a single interface:

| Model | Access | Strengths |
|---|---|---|
| GPT-4o | OpenAI API | Best overall vision + language; reads documents, charts, screenshots |
| Gemini 2.0 Flash | Google AI API | Fast multimodal; excellent at structured extraction from images |
| LLaVA / LLaVA-Next | HuggingFace / Ollama | Best open-source vision model; ollama pull llava |
| IDEFICS 2 | HuggingFaceM4/idefics2-8b (HuggingFace) | Open multimodal; strong on document understanding |

The DETR (object detection), BLIP (captioning), and BLIP-VQA models shown below remain good choices when you need a dedicated, fine-tunable pipeline for a specific vision task.

Setup the helper functions first

Code
# This is the helper.py from deeplearning.ai - functions created instead of loading helper module

import io
import matplotlib.pyplot as plt
import requests
import inflect
from PIL import Image

def load_image_from_url(url):
    return Image.open(requests.get(url, stream=True).raw)

def render_results_in_image(in_pil_img, in_results):
    plt.figure(figsize=(16, 10))
    plt.imshow(in_pil_img)

    ax = plt.gca()

    for prediction in in_results:

        x, y = prediction['box']['xmin'], prediction['box']['ymin']
        w = prediction['box']['xmax'] - prediction['box']['xmin']
        h = prediction['box']['ymax'] - prediction['box']['ymin']

        ax.add_patch(plt.Rectangle((x, y),
                                   w,
                                   h,
                                   fill=False,
                                   color="green",
                                   linewidth=2))
        ax.text(
           x,
           y,
           f"{prediction['label']}: {round(prediction['score']*100, 1)}%",
           color='red'
        )

    plt.axis("off")

    # Save the modified image to a BytesIO object
    img_buf = io.BytesIO()
    plt.savefig(img_buf, format='png',
                bbox_inches='tight',
                pad_inches=0)
    img_buf.seek(0)
    modified_image = Image.open(img_buf)

    # Close the plot to prevent it from being displayed
    plt.close()

    return modified_image

def summarize_predictions_natural_language(predictions):
    summary = {}
    p = inflect.engine()

    # Count detections per label
    for prediction in predictions:
        label = prediction['label']
        summary[label] = summary.get(label, 0) + 1

    # Build phrases like "four wine glasses"; inflect handles irregular
    # plurals ("person" -> "people") correctly, unlike appending "s"
    parts = [f"{p.number_to_words(count)} {p.plural(label, count)}"
             for label, count in summary.items()]

    if not parts:
        return "In this image, nothing was detected."
    if len(parts) == 1:
        return f"In this image, there are {parts[0]}."
    return ("In this image, there are "
            + ", ".join(parts[:-1]) + " and " + parts[-1] + ".")


##### To ignore warnings #####
import warnings
import logging
from transformers import logging as hf_logging

def ignore_warnings():
    # Ignore specific Python warnings
    warnings.filterwarnings("ignore", message="Some weights of the model checkpoint")
    warnings.filterwarnings("ignore", message="Could not find image processor class")
    warnings.filterwarnings("ignore", message="The `max_size` parameter is deprecated")

    # Adjust logging for libraries using the logging module
    logging.basicConfig(level=logging.ERROR)
    hf_logging.set_verbosity_error()

########
Code
from transformers import pipeline
Code
# Here is some code that suppresses warning messages.
from transformers.utils import logging
logging.set_verbosity_error()

# from helper import ignore_warnings
# ignore_warnings()

21.5.1 Object Detection

Code
od_pipe = pipeline("object-detection", "facebook/detr-resnet-50")

Info about facebook/detr-resnet-50

Explore more of the Hugging Face Hub for more object detection models

Code
from PIL import Image

raw_image = Image.open('20240321_194345.jpg')
raw_image

Code
import numpy as np
np.array(raw_image).shape
(2160, 2880, 3)
Code
np.array(raw_image)
array([[[152, 126, 109],
        [148, 122, 105],
        [143, 120, 102],
        ...,
        [ 98,  40,  29],
        [ 94,  37,  28],
        [ 94,  37,  28]],

       [[151, 126, 106],
        [149, 124, 104],
        [144, 122, 101],
        ...,
        [ 94,  36,  25],
        [ 93,  36,  27],
        [ 95,  38,  29]],

       [[149, 124, 102],
        [150, 125, 103],
        [146, 124, 101],
        ...,
        [ 92,  34,  23],
        [ 92,  35,  26],
        [ 95,  38,  29]],

       ...,

       [[ 57,  59,  58],
        [ 54,  56,  55],
        [ 56,  58,  57],
        ...,
        [ 11,  11,  11],
        [ 12,  14,  11],
        [ 13,  15,  12]],

       [[ 51,  53,  52],
        [ 50,  52,  51],
        [ 53,  55,  54],
        ...,
        [  7,   7,   5],
        [  7,   9,   6],
        [  8,  10,   7]],

       [[ 64,  66,  65],
        [ 63,  65,  64],
        [ 62,  64,  63],
        ...,
        [  5,   5,   3],
        [  6,   8,   5],
        [  7,   9,   6]]], dtype=uint8)
Code
pipeline_output = od_pipe(raw_image)
Code
pipeline_output
[{'score': 0.9962156414985657,
  'label': 'wine glass',
  'box': {'xmin': 678, 'ymin': 1975, 'xmax': 873, 'ymax': 2159}},
 {'score': 0.9778583645820618,
  'label': 'person',
  'box': {'xmin': 1122, 'ymin': 900, 'xmax': 1446, 'ymax': 2145}},
 {'score': 0.974013090133667,
  'label': 'wine glass',
  'box': {'xmin': 1039, 'ymin': 1154, 'xmax': 1174, 'ymax': 1355}},
 {'score': 0.9941200613975525,
  'label': 'person',
  'box': {'xmin': 2263, 'ymin': 620, 'xmax': 2879, 'ymax': 2143}},
 {'score': 0.9233799576759338,
  'label': 'person',
  'box': {'xmin': 2232, 'ymin': 976, 'xmax': 2317, 'ymax': 1074}},
 {'score': 0.9445099830627441,
  'label': 'person',
  'box': {'xmin': 1786, 'ymin': 823, 'xmax': 1946, 'ymax': 1124}},
 {'score': 0.9947138428688049,
  'label': 'wine glass',
  'box': {'xmin': 1684, 'ymin': 1374, 'xmax': 1873, 'ymax': 1735}},
 {'score': 0.9555243253707886,
  'label': 'cup',
  'box': {'xmin': 1409, 'ymin': 1278, 'xmax': 1537, 'ymax': 1539}},
 {'score': 0.97939133644104,
  'label': 'person',
  'box': {'xmin': 1412, 'ymin': 900, 'xmax': 1507, 'ymax': 1085}},
 {'score': 0.9982106685638428,
  'label': 'person',
  'box': {'xmin': 683, 'ymin': 590, 'xmax': 1312, 'ymax': 2138}},
 {'score': 0.9981060028076172,
  'label': 'person',
  'box': {'xmin': 2, 'ymin': 623, 'xmax': 971, 'ymax': 2139}},
 {'score': 0.9994183778762817,
  'label': 'person',
  'box': {'xmin': 1318, 'ymin': 697, 'xmax': 1993, 'ymax': 2136}},
 {'score': 0.9274911284446716,
  'label': 'person',
  'box': {'xmin': 1796, 'ymin': 819, 'xmax': 2003, 'ymax': 1215}},
 {'score': 0.9437280297279358,
  'label': 'wine glass',
  'box': {'xmin': 1410, 'ymin': 1270, 'xmax': 1539, 'ymax': 1541}},
 {'score': 0.9910680651664734,
  'label': 'person',
  'box': {'xmin': 1701, 'ymin': 675, 'xmax': 2766, 'ymax': 2135}}]
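The raw detections above mix confident boxes with marginal, near-duplicate ones (note the two overlapping wine-glass boxes around xmin 1409/1410). It is common to filter by score before counting; below is a small sketch using an arbitrary 0.9 threshold, not a DETR default.

```python
from collections import Counter

def filter_and_count(predictions, min_score=0.9):
    """Drop detections below min_score, then tally the surviving labels."""
    kept = [p for p in predictions if p["score"] >= min_score]
    return kept, Counter(p["label"] for p in kept)

# Toy predictions in the same shape as the pipeline output above
sample = [
    {"score": 0.996, "label": "wine glass", "box": {}},
    {"score": 0.520, "label": "person",     "box": {}},
    {"score": 0.978, "label": "person",     "box": {}},
]
kept, counts = filter_and_count(sample)
print(counts)  # Counter({'wine glass': 1, 'person': 1})
```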
Code
processed_image = render_results_in_image(
    raw_image, 
    pipeline_output)
Code
processed_image

Code
text = summarize_predictions_natural_language(pipeline_output)
Code
text
'In this image, there are four wine glasses, ten people and one cup.'
Code
# Free up memory
del od_pipe
gc.collect()
3460

21.5.2 Image Captioning

Code
from transformers import BlipForConditionalGeneration
from transformers import AutoProcessor
Code
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

Info about Salesforce/blip-image-captioning-base

Code
processor = AutoProcessor.from_pretrained(
    "Salesforce/blip-image-captioning-base")
Code
# Load image
from PIL import Image

image = Image.open("20240321_194345.jpg")

image

21.5.2.1 Conditional Image Captioning

Code
text = "a photograph of"
inputs = processor(image, text, return_tensors="pt")
Code
print(np.array(image).shape)
print(inputs['pixel_values'].shape)
(2160, 2880, 3)
torch.Size([1, 3, 384, 384])
Code
out = model.generate(**inputs)
/opt/conda/envs/mggy8413/lib/python3.11/site-packages/transformers/generation/utils.py:1133: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
Code
out
tensor([[30522,  1037,  9982,  1997,  1037,  2177,  1997,  2273,  5948,  4511,
           102]])
Code
print(processor.decode(out[0], skip_special_tokens=True))
a photograph of a group of men drinking wine

21.5.2.2 Unconditional Image Captioning

Code
inputs = processor(image,return_tensors="pt")
Code
out = model.generate(**inputs)
Code
print(processor.decode(out[0], skip_special_tokens=True))
a group of men standing around a table
Code
# Free up memory
del processor
del model
gc.collect()
183

21.6 Visual Question Answering

Alternative: LLaVA for open-ended VQA: The BLIP-VQA model below is specialised and efficient. For open-ended visual question answering that requires reasoning, LLaVA (Large Language and Vision Assistant) is the open-source alternative:

import ollama
response = ollama.chat(
    model='llava',
    messages=[{
        'role': 'user',
        'content': 'How many people are in this image?',
        'images': ['20240321_194345.jpg']
    }]
)
print(response['message']['content'])
Code
# Suppressing warnings
from transformers.utils import logging
logging.set_verbosity_error()

import warnings
warnings.filterwarnings("ignore", message="Using the model-agnostic default `max_length`")
Code
from transformers import BlipForQuestionAnswering
from transformers import AutoProcessor
Code
model = BlipForQuestionAnswering.from_pretrained(
    "Salesforce/blip-vqa-base")

Info about Salesforce/blip-vqa-base

Code
processor = AutoProcessor.from_pretrained(
    "Salesforce/blip-vqa-base")
Code
# Load image
from PIL import Image

image = Image.open('20240321_194345.jpg')

image

Code
# Write the question you want to ask to the model about the image.
question = "how many dogs are in the picture?"
Code
inputs = processor(image, question, return_tensors="pt")

out = model.generate(**inputs)
Code
print(processor.decode(out[0], skip_special_tokens=True))
0
Code
question = "how many people in the picture?"
inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
8
Code
question = "how many people wearing glasses in the picture?"
inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
3
Code
question = "what are the people in the picture doing?"
inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
drinking
Code
question = "what is the picture about?"
inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
people drinking
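Each question above repeats the same three-step pattern: encode the image–question pair with the processor, generate, then decode. As a sketch, that pattern can be factored into a small helper (the `ask` name is ours; `model` and `processor` are the BLIP objects loaded earlier):

```python
def ask(model, processor, image, question):
    """Encode the (image, question) pair, generate an answer, and decode it."""
    inputs = processor(image, question, return_tensors="pt")
    out = model.generate(**inputs)
    return processor.decode(out[0], skip_special_tokens=True)

# Usage with the objects loaded above:
# print(ask(model, processor, image, "what is the picture about?"))
```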
Code
# Free up memory
import gc

del processor
del model
gc.collect()
40

21.7 OPTIONAL - Chatbot

21.7.1 Build the chatbot pipeline using 🤗 Transformers Library

Initialize a chatbot instance and pass it a “conversation”. The conversation object encapsulates the chat history as text.

The chatbot responds and appends its response to the “conversation”. If you keep passing the updated conversation back, the chatbot keeps responding to its own replies.

Note: The Conversation class and conversational pipeline were deprecated in transformers 4.38 (early 2024) and removed in 4.45+. For multi-turn conversation with modern models, use the standard text-generation pipeline with a list of messages:

from transformers import pipeline
pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "What are some fun activities in New York?"},
]
reply = pipe(messages, max_new_tokens=256)
print(reply[0]['generated_text'][-1]['content'])

The facebook/blenderbot-400M-distill model and Conversation pattern below are preserved for historical reference but will fail on transformers >= 4.45.
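As a sketch of how multi-turn state works with the message-list format: there is no hidden memory; the full history is simply re-sent on every call. Here `chat_turn` and `generate_reply` are hypothetical names, and the toy echo function stands in for the pipeline call above so the example runs without downloading a model:

```python
def chat_turn(messages, user_text, generate_reply):
    """Append the user's message, get a reply, and record it in the history."""
    messages.append({"role": "user", "content": user_text})
    reply = generate_reply(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply

messages = [{"role": "system", "content": "You are a helpful assistant."}]
echo = lambda msgs: "(reply to: %s)" % msgs[-1]["content"]  # toy generator
chat_turn(messages, "What are some fun activities in New York?", echo)
chat_turn(messages, "What else do you recommend?", echo)
# messages now holds system + 2 user + 2 assistant entries; passing the whole
# list to the pipeline on each turn is what gives the model its "memory".
```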

Code
# Here is some code that suppresses warning messages.
from transformers.utils import logging
logging.set_verbosity_error()
Code
from transformers import pipeline
import torch
  • Define the conversation pipeline
Code
chatbot = pipeline(task="conversational",
                   model="facebook/blenderbot-400M-distill")
config.json: 100%|██████████| 1.57k/1.57k [00:00<00:00, 1.11MB/s]
pytorch_model.bin: 100%|██████████| 730M/730M [00:13<00:00, 55.5MB/s] 
generation_config.json: 100%|██████████| 347/347 [00:00<00:00, 1.03MB/s]
tokenizer_config.json: 100%|██████████| 1.15k/1.15k [00:00<00:00, 3.85MB/s]
vocab.json: 100%|██████████| 127k/127k [00:00<00:00, 113MB/s]
merges.txt: 100%|██████████| 62.9k/62.9k [00:00<00:00, 71.4MB/s]
added_tokens.json: 100%|██████████| 16.0/16.0 [00:00<00:00, 42.1kB/s]
special_tokens_map.json: 100%|██████████| 772/772 [00:00<00:00, 1.64MB/s]
tokenizer.json: 100%|██████████| 310k/310k [00:00<00:00, 6.31MB/s]

Info about ‘blenderbot-400M-distill’

Code
user_message = """
What are some fun activities I can do in the summer in New York?
"""
Code
from transformers import Conversation
Code
conversation = Conversation(user_message)
Code
print(conversation)
Conversation id: 3d02c9a1-8d0e-40d1-9e8c-42194d5013bb
user: 
What are some fun activities I can do in the summer in New York?

Code
conversation = chatbot(conversation)
conversation
Conversation id: 3d02c9a1-8d0e-40d1-9e8c-42194d5013bb
user: 
What are some fun activities I can do in the summer in New York?

assistant:  I'm not sure, but I know that New York is the most populous city in the United States.
Code
conversation = chatbot(conversation)
conversation
Conversation id: 3d02c9a1-8d0e-40d1-9e8c-42194d5013bb
user: 
What are some fun activities I can do in the summer in New York?

assistant:  I'm not sure, but I know that New York is the most populous city in the United States.
assistant:  New York City has a population of 8,537,673. That's a lot of people!
Code
conversation = chatbot(conversation)
Code
print(conversation)
Conversation id: 3d02c9a1-8d0e-40d1-9e8c-42194d5013bb
user: 
What are some fun activities I can do in the summer in New York?

assistant:  I'm not sure, but I know that New York is the most populous city in the United States.
assistant:  New York City has a population of 8,537,673. That's a lot of people!
assistant:  I know! It's also the most densely populated major city in North America.
  • You can send the chatbot a follow-up question with:
print(chatbot(Conversation("What else do you recommend?")))
  • However, the chatbot may respond with something unrelated, because a fresh Conversation object carries no memory of prior turns.

  • To include the previous chat history in the LLM’s context, add a new message to the existing conversation object instead.

Code
conversation.add_message(
    {"role": "user",
     "content": """
Where all have you been?
"""
    })
Code
print(conversation)
Conversation id: 3d02c9a1-8d0e-40d1-9e8c-42194d5013bb
user: 
What are some fun activities I can do in the summer in New York?

assistant:  I'm not sure, but I know that New York is the most populous city in the United States.
assistant:  New York City has a population of 8,537,673. That's a lot of people!
assistant:  I know! It's also the most densely populated major city in North America.
user: 
Where all have you been?

Code
conversation = chatbot(conversation)

print(conversation)
Conversation id: 3d02c9a1-8d0e-40d1-9e8c-42194d5013bb
user: 
What are some fun activities I can do in the summer in New York?

assistant:  I'm not sure, but I know that New York is the most populous city in the United States.
assistant:  New York City has a population of 8,537,673. That's a lot of people!
assistant:  I know! It's also the most densely populated major city in North America.
user: 
Where all have you been?

assistant:  I've been to Manhattan, Brooklyn, Queens, The Bronx, and Staten Island.