21  Transformers

The HuggingFace ecosystem in 2025: The transformers library now provides access to over 900,000 models on the Hub. Key developments since this chapter was first written:

  • Llama 3 / Llama 3.1 / Llama 3.2 — Meta’s open models significantly narrowed the gap with GPT-4. The 8B version is the new baseline for open-source general use.
  • Mistral 7B / Mixtral 8x7B — efficient models from Mistral AI; strong for code and reasoning relative to size
  • Gemma 2 (9B, 27B) — Google DeepMind’s open models; competitive with much larger predecessors
  • Phi-4 (14B) — Microsoft’s “small but mighty” model; punches above its weight on reasoning benchmarks
  • Qwen 2.5 — Alibaba’s 7B–72B series with strong multilingual support

The Inference API (now “Serverless Inference”) lets you call any hosted model via HTTP without loading weights locally. For production deployments, Inference Endpoints spins up a dedicated container for your model.
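The HTTP round trip is simple enough to sketch with the standard library alone. A hedged example follows: the model id, payload, and response shape are assumptions for illustration (check the Hub page of your chosen model for the exact payload it expects), and no request is actually sent until the function is called with a valid token:

```python
import json
import urllib.request

# Hypothetical model id; the URL pattern is the documented Serverless
# Inference endpoint at the time of writing.
MODEL_ID = "facebook/bart-large-cnn"
API_URL = f"https://api-inference.huggingface.co/models/{MODEL_ID}"

def summarize(text: str, token: str) -> str:
    """POST text to the hosted model and return the generated summary."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps({"inputs": text}).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Assumed response shape for a summarization model
        return json.load(resp)[0]["summary_text"]
```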

GPT-4o comparison: Many of the tasks shown in this chapter (summarization, NER, QA) can now be performed comparably by open 7–8B models via the HuggingFace pipeline when run on suitable hardware.

Consider the following two sentences:
- She waited at the river bank
- She was looking at her bank account

Under GloVe and Word2Vec embeddings, both uses of the word ‘bank’ would have the same vector representation. That is a problem, because ‘bank’ refers to two completely different things depending on context. Fixed embedding schemes such as Word2Vec cannot resolve this.

Transformer-based language models solve this by creating context-specific embeddings. Instead of a static word-to-vector lookup, they produce dynamic embeddings that depend on the context: you provide the model the entire sentence, and it returns an embedding for each word computed by paying attention to all the other words in whose context that word appears.

Transformers use an ‘attention mechanism’ to compute the embedding for a word in a sentence by also considering the words around it. By combining the transformer architecture with self-supervised learning, these models have achieved tremendous success, as is evident in the popularity of large language models. The transformer architecture has been successfully applied to vision and audio tasks as well, and is currently all the rage, to the point of displacing many earlier deep learning architectures.
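The mechanics of attention can be sketched in a few lines of NumPy. This toy example uses random vectors and skips the learned query/key/value projections a real transformer would apply, so it only illustrates the shape of the computation:

```python
import numpy as np

# Three tokens, four-dimensional embeddings (random stand-ins)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))

# Real transformers derive Q, K, V via learned projections;
# we use X itself to keep the sketch minimal.
Q = K = V = X

scores = Q @ K.T / np.sqrt(Q.shape[1])         # token-to-token similarity
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # softmax: each row sums to 1
contextual = weights @ V                       # context-aware vector per token

print(contextual.shape)                        # prints (3, 4)
```

Each row of `contextual` is a weighted mix of all token vectors, which is exactly why the same word can end up with different embeddings in different sentences.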

21.1 Attention is All You Need

A seminal 2017 paper by Vaswani et al. from the Google Brain team introduced and popularized the transformer architecture. The paper represented a turning point for deep learning practitioners, and transformers were soon applied to a wide variety of problems. The original paper can be downloaded from https://arxiv.org/abs/1706.03762.


The original paper on transformers makes difficult reading for a non-technical audience. A more intuitive and simpler explanation was provided by Jay Alammar in a blog post that became immensely popular and widely praised. Jay’s blog post is available at https://jalammar.github.io/illustrated-transformer/.


The core idea behind self-attention is to derive the embedding for a word from all the words that surround it, including the order in which they appear. There is a lot of matrix algebra involved, but the essence is to take into account the other words before and after a given word, and use their embeddings as weights in computing the context-sensitive embedding for that word.

This means the same word will have a different embedding vector when used in different sentences, and the model needs the entire sentence or document as input to compute the embedding for a word. These computations are compute-heavy, as the number of weights and biases explodes compared to a traditional FCNN or RNN. Transformer models are trained with self-supervision on large amounts of text (generally public domain text), and require computational capabilities beyond the reach of the average person. These models tend to have billions of parameters, and are appropriately called ‘Large Language Models’, or LLMs for short.

Large corporations such as Google, Facebook, OpenAI and others have built their own LLMs, some of which are open source, and others not. Models that are not open sourced can be accessed through APIs, which means users send their data to the LLM provider (such as OpenAI), and the provider returns the answer. These providers charge for usage based on the volume of data they process.

Models that are open sourced can be downloaded in their entirety on the user’s infrastructure, and run locally without incremental cost except that of the user’s hardware and compute costs.

LLMs come in a few different flavors, and current practice draws the distinctions below. These can change rapidly as ever more advanced models are released:
- Foundational Models – base models; trained only to predict the next word, so not usable out of the box
- Instruction Tuned Models – trained further to follow instructions
- Fine-tuned Models – trained on additional text data specific to the user’s situation

The line demarcating the above can be fuzzy and the LLM space is evolving rapidly with different vendors competing to meet their users’ needs in the most efficient way.

21.2 Sentence Transformers

(https://www.sbert.net/)

Sentence-BERT (the sentence-transformers library) allows the creation of sentence embeddings based on transformer models, including nearly all models available on HuggingFace. A ‘sentence’ here does not mean a literal sentence; it refers to any text.

Once we have embeddings available, there is little limit to what we can do with them. We can pass the embeddings to traditional or network-based models for classification or regression, or cluster the text data using any clustering method such as k-means or hierarchical clustering.
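As a sketch of that last point, clustering document vectors takes only a few lines with scikit-learn. The array below is a random stand-in for what `model.encode()` would return for a corpus:

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for document embeddings: 100 documents x 384 dimensions
embeddings = np.random.default_rng(42).normal(size=(100, 384))

# Assign each document to one of five clusters
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(embeddings)
print(kmeans.labels_[:10])   # cluster id for the first ten documents
```

With real embeddings, documents in the same cluster tend to share a topic, which makes this a quick way to explore an unlabeled corpus.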

We will start with sentence BERT, and look at some examples of the kinds of problems we can solve with it.

21.2.1 Get some text data first

We import about 10,000 random articles collected by scraping the web for articles about cybersecurity. Some items are long, some are short, and some are not really articles at all, just ads or other website notices.
Local saving and loading of models

Save with:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-roberta-large-v1')

model.save(path)

Load with:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(path)
Code
# Set default locations for downloaded models
# If you are running things on your own hardware,
# you can ignore this cell completely.

# import os

# if os.name != 'nt':  # Do this only if in a non-Windows environment
    
#     if 'instructor' in os.getcwd(): # Set default model locations when logged in as instructor
#         os.environ['TRANSFORMERS_CACHE'] = '/home/instructor/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/instructor/shared/huggingface'
            
#     else: # Set default model locations when logged in as a student
#         os.environ['TRANSFORMERS_CACHE'] = '/home/jovyan/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/jovyan/shared/huggingface'
        
Code
# Usual library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tqdm
import torch
import gc
Code
pwd
'C:\\Users\\user\\Google Drive\\jupyter'
Code
# Import the data from a pickle file
df = pd.read_pickle('sample.pkl')
Code
# How many rows and columns in our dataframe
df.shape
(10117, 7)
Code
# We look at the dataframe below.  The column of interest to us is the column titled 'text'
df.head(3)
title summary_x URL keywords summary_y text published_date
0 Friday Squid Blogging: On Squid Brains <p>Interesting <i>National Geographic</i> <a h... https://www.schneier.com/blog/archives/2021/08... working,school,technologist,security,schneier,... About Bruce SchneierI am a public-interest tec... About Bruce Schneier\n\nI am a public-interest... 2021-08-20 21:18:14
1 More on Apple’s iPhone Backdoor <p>In this post, I&#8217;ll collect links on A... https://www.schneier.com/blog/archives/2021/08... service,using,wiserposted,iphone,security,appl... More on Apple’s iPhone BackdoorIn this post, I... More on Apple’s iPhone Backdoor\n\nIn this pos... 2021-08-20 13:54:51
2 T-Mobile Data Breach <p>It&#8217;s a <a href="https://www.wired.com... https://www.schneier.com/blog/archives/2021/08... tmobiles,numbers,data,tmobile,security,schneie... It’s a big one:As first reported by Motherboar... It’s a big one:\n\nAs first reported by Mother... 2021-08-19 11:17:56
Code
# We create a dataframe with just the story text, and call it corpus

corpus = df[['text']]
Code
corpus
text
0 About Bruce Schneier\n\nI am a public-interest...
1 More on Apple’s iPhone Backdoor\n\nIn this pos...
2 It’s a big one:\n\nAs first reported by Mother...
3 Apple’s NeuralHash Algorithm Has Been Reverse-...
4 Upcoming Speaking Engagements\n\nThis is a cur...
... ...
10112 Nigerian automotive tech company Autochek toda...
10113 — The Starters — Apple Inc. and Tesla Inc. hav...
10114 — Hello friends, and welcome back to Week in R...
10115 — Factorial, a startup out of Barcelona that h...
10116 But it’s not totally clear whether rural Ameri...

10117 rows × 1 columns

Code
# Next, we examine how long the articles are.  Perhaps we want to 
# throw out the outliers, ie really short articles, which may 
# not really be articles, and also very long articles.
# 
# We do this below, looking at the mean and distribution of article lengths

article_lengths = pd.Series(len(x.split()) for x in corpus.text)
plt.figure(figsize=(14, 9))
sns.histplot(article_lengths)
article_lengths.describe()
count    10117.000000
mean       559.145003
std        501.310623
min          0.000000
25%        293.000000
50%        450.000000
75%        724.000000
max       8807.000000
dtype: float64

Code
# Let us just keep the regular sized articles, ie the middle 50%. We are still 
# left with a sizable number in our corpus.

corpus = corpus[(article_lengths>article_lengths.quantile(.25)) & (article_lengths<article_lengths.quantile(.75))]
len(corpus)
5050
Code
# Next we look at the distribution again

article_lengths = pd.Series(len(x.split()) for x in corpus.text)
plt.figure(figsize=(14, 9))
sns.histplot(article_lengths)
article_lengths.describe()
count    5050.000000
mean      468.483960
std       121.253301
min       294.000000
25%       358.000000
50%       450.000000
75%       565.000000
max       723.000000
dtype: float64

Our code becomes really slow if we use all 5,000+ remaining articles, so we randomly pick just 100 articles from the corpus. This is just so we can finish the demos in time. When you have more time, you can run the code on all the articles too.

Code
# We take only a sample of the entire corpus
# If we want to consider the entire set, we do not need to run this cell

corpus = corpus.sample(100)
Code
# Let us print out a random article

print(corpus.text.iloc[35])
Vancouver, British Columbia— Plurilock Security Inc. (TSXV: PLUR) (OTCQB: PLCKF) and related subsidiaries (“Plurilock” or the “Company”), an identity-centric cybersecurity solutions provider for workforces, has entered into definitive asset purchase agreements (the “Agreements”) dated October 21, 2021 to acquire certain assets (the “Purchased Assets”) of CloudCodes Software Private Limited (“CloudCodes”), an award winning cloud access security broker (“CASB”) based in India (the “Acquisition”).

Since 2011, CloudCodes has provided innovative cloud security SaaS solutions for protecting email and group collaboration platforms, offering single-sign-on (SSO), multi-factor authentication (MFA), and cloud data loss prevention (DLP) solutions. CloudCodes earned approximately CAD$576k in product revenue for its year ended March 31, 2021.

Following the Acquisition, CloudCodes’ existing customers will have access to a larger public organization with adequate financial resources, deep security, IT, AI capabilities and expertise, and the Company’s world-class sales team while Plurilock will gain a larger market presence in the international cybersecurity space and enter the growing CASB segment. In addition, Plurilock, through its Indian subsidiary, Plurilock Security Private Limited (“PSP”) will obtain a technical product team and a new office in Pune, India to complement its office in Mumbai, India.

The Acquisition will add additional functionality within Plurilock’s product portfolio, with CloudCodes’ CASB solution offered as an early access product under the name of Plurilock CLOUD. This additional technology solution creates new opportunity for Plurilock’s customers for a cost-effective cloud security solution and a path to integrate low-friction, high-security behavioral biometric identity with SSO and cloud security functionality. As a result, it is expected that the Acquisition will accelerate Plurilock’s sales growth and cement the Company’s position in the growing zero trust market.

“The acquisition of CloudCodes provides us with an award-winning CASB solution with broad customer adoption across small, medium and large enterprises. Businesses, especially small businesses, continue to face security risks with workforces that are working in a post-COVID, remote-centric world, and it has never been more important to secure cloud resources such as corporate email and file sharing,” said Ian L. Paterson, CEO of Plurilock. “This acquisition aligns with our commitment to becoming the premier cybersecurity solutions provider in the market, acquiring critical technology to enhance organizations’ zero trust architecture. We are looking forward to adding the CloudCodes product to our robust product portfolio and integrating their staff into our growing team, as we continue to develop cutting edge technology that empowers organizations to operate safely and securely while reducing friction for users.”

“We are pleased to join the Plurilock family of companies,” said Debasish Pramanik, co-founder of CloudCodes. “This transaction offers an opportunity to expand the use of our signature product in the North American market and join a fast-growing organization with deep security and IT expertise that is developing the next generation of cybersecurity solutions that can revolutionize the industry.”

Once the Acquisition is completed, CloudCodes assets will be transferred into the Plurilock family of companies, under the guidance of Plurilock’s management team.

Terms of Agreements

The Company and its subsidiaries, Plurilock Security Solutions Inc. and PSP, entered into the Agreements with CloudCodes whereby the Company will acquire the Purchased Assets. Pursuant to the terms of the Agreements, the Company has agreed to pay CloudCodes aggregate consideration of US$1,700,000 payable as follows: (i) US$1,000,000 in cash payable on closing; and (ii) US$700,000 in common shares of Plurilock (the “Consideration Shares”), less any deferred revenue. The Consideration Shares will be issued at a deemed price of C$0.59 per share and will be placed in escrow for 18 months to satisfy any indemnification obligations to the Company.

The Acquisition is subject to customary closing conditions and receipt of the approval of the TSXV. The Company expects to close the Acquisition on or around October 31, 2021.

About CloudCodes

CloudCodes is an internationally based Cloud Security SaaS platform company, offering a product that protects email and group collaboration platforms like Microsoft 365 and Google Workspace, while providing SSO, MFA and DLP functionality.

About Plurilock

Plurilock provides identity-centric cybersecurity for today’s workforces. The Plurilock family of companies enables organizations to operate safely and securely while reducing cybersecurity friction. Plurilock offers world-class IT and cybersecurity solutions through its Solutions Division, paired with proprietary, AI-driven and cloud-friendly security through its Technology Division. Together, the Plurilock family of companies delivers persistent identity assurance with unmatched ease of use.

21.2.2 Embeddings/Feature Extraction

Feature extraction means obtaining the embedding vectors for a given text from a pre-trained model. Once you have the embeddings, which are numerical representations of the text, many possibilities open up: you can compare documents for similarity, match questions to answers, use the embeddings as features in downstream classifiers or regressors, or cluster similar documents with any algorithm.

Difference between word embeddings and document embeddings
So far, we have been talking of word embeddings, which means we have a large embedding vector for every single word in our text data. What do we mean when we say sentence or document embedding? A sentence’s embedding is derived from the embeddings of all the words in the sentence. The embedding vectors are generally averaged (‘mean-pooled’), though other techniques such as ‘max-pooling’ are also available. It is surprising that we spend so much effort computing separate embeddings for words considering context and word order, and then just mash everything up with an average to get a single vector for the entire sentence, or even the document. It is equally surprising that this approach works remarkably well for a large number of tasks.
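Mean pooling itself is nothing more than a column-wise average over the token embeddings. A minimal sketch with random stand-in vectors (a real run would use the per-token output of a transformer model):

```python
import numpy as np

# Pretend token embeddings for a five-token sentence, eight dimensions each
token_embeddings = np.random.default_rng(1).normal(size=(5, 8))

mean_pooled = token_embeddings.mean(axis=0)   # one vector for the whole sentence
max_pooled = token_embeddings.max(axis=0)     # alternative: element-wise maximum

print(mean_pooled.shape)                      # prints (8,)
```

However many tokens the sentence has, the pooled result always has the model's embedding dimensionality, which is what lets us compare documents of different lengths.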

Fortunately for us, the sentence-transformers library knows how to compute mean-pooled (or other) representations of entire documents based on the pre-trained model used. Effectively, we reduce the entire document to a single vector with 768 or some similar number of dimensions.

Let us look at this in action.

First, we get embeddings for our corpus using a specific model. We use ‘all-MiniLM-L6-v2’ for symmetric queries, and any of the MSMARCO models for asymmetric queries. The difference is that in symmetric queries the query and the target sentences are roughly the same length, while in asymmetric queries the query is much shorter than the documents it is matched against.

This is based upon the documentation on sentence-bert’s website.
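Whichever model family is chosen, retrieval ultimately reduces to comparing the query vector against each document vector, usually with cosine similarity. A toy sketch with made-up three-dimensional vectors standing in for `model.encode()` output:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up vectors standing in for an encoded query and two documents
query_vec = np.array([0.2, 0.9, 0.1])
doc_vecs = np.array([[0.1, 0.8, 0.2],    # close in direction to the query
                     [0.9, 0.1, 0.0]])   # pointing a different way

sims = [cosine_sim(query_vec, d) for d in doc_vecs]
best = int(np.argmax(sims))
print(best)   # prints 0: the first document is the better match
```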

Code
# Toy example with just three sentences to see what embeddings look like

from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('all-MiniLM-L6-v2') #for symmetric queries
model = SentenceTransformer('msmarco-distilroberta-base-v2') #for asymmetric queries
#Our sentences we like to encode
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']

#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

#Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
Sentence: This framework generates embeddings for each input sentence
Embedding: [-1.94027126e-01 -1.22946411e-01 -1.03667654e-01 -5.60734332e-01
  1.10684156e-01  6.79868519e-01 -6.36458471e-02 -7.55182922e-01
  7.56757021e-01  2.64225334e-01 -1.42991528e-01  3.98469239e-01
  1.76254734e-01 -1.42204142e+00 -2.50023663e-01  6.46462897e-03
  4.95950967e-01  4.63492602e-01 -1.50225936e-02  8.64237368e-01
  1.83195844e-01 -8.47510993e-01 -7.40249932e-01 -1.01876450e+00
 -1.04469287e+00  5.33529818e-01  7.04183996e-01  3.23025554e-01
 -1.34202325e+00 -1.40403613e-01 -1.69761360e-01  9.34997141e-01
 -3.45071107e-01  4.92122434e-02  1.28702326e-02 -1.90801427e-01
  5.31530082e-01 -3.53034407e-01 -9.99688327e-01  1.29575178e-01
  8.10617507e-01  5.22234797e-01 -7.57190168e-01 -2.42323861e-01
  4.81890917e-01 -2.24909961e-01  5.87174833e-01 -9.55266654e-01
 -2.80447155e-01 -5.75490184e-02  1.38305819e+00 -6.43579885e-02
 -2.80887455e-01 -2.96108842e-01  6.02366269e-01 -6.88801527e-01
 -3.63944560e-01  1.24548979e-01  1.68449059e-01 -3.52236390e-01
 -5.34670591e-01  1.07049860e-01  1.89601004e-01  4.98377889e-01
  5.57314813e-01  9.96691082e-03  1.11395024e-01 -3.20706338e-01
 -5.68632662e-01 -2.54595071e-01 -1.17989182e-01  2.34521136e-01
  4.05370779e-02 -8.24391186e-01  6.77566350e-01 -8.15773308e-01
  6.42071486e-01 -7.75032520e-01 -2.13417113e-01  6.85814440e-01
  1.00933135e+00  3.57063770e-01 -4.13770676e-01  3.37253094e-01
 -3.41039188e-02 -3.45317245e-01  2.80249473e-02  9.73951519e-01
 -6.43461645e-02 -6.06842160e-01 -3.48319650e-01 -5.75613379e-02
 -6.01034939e-01  1.48180962e+00  2.74765521e-01  6.42698586e-01
  2.52264529e-01 -1.33694637e+00  2.61821836e-01 -1.21891707e-01
  1.12433112e+00  3.23991627e-01  1.90715790e-01  1.06098376e-01
 -5.28269231e-01  1.66739047e-01  4.35670942e-01  3.07411522e-01
 -7.34456956e-01 -2.05261961e-01  1.22825503e-01  1.61016196e-01
  4.43147391e-01  2.64934242e-01  8.47648621e-01 -7.37874135e-02
  2.99923062e-01  3.89373690e-01  3.17179821e-02  5.00585735e-01
 -2.81464159e-01 -8.12774718e-01 -5.90420187e-01 -1.62012696e-01
 -6.17273927e-01  3.92245650e-01  6.67506993e-01  7.01212466e-01
 -1.29788291e+00  4.20975447e-01  6.82982877e-02  1.05026746e+00
  1.90296426e-01  1.57451317e-01 -1.27690539e-01  1.70817673e-01
 -5.59714615e-01  2.86618054e-01  6.88185275e-01  1.76241621e-01
 -2.90350705e-01  5.54080665e-01  3.52999896e-01 -9.71634805e-01
  5.82876742e-01  7.67536610e-02 -8.55224058e-02  1.64016068e-01
 -4.47867244e-01 -2.59355217e-01  1.27354398e-01  9.79057074e-01
  3.73845607e-01  2.00498570e-02  3.08634132e-01 -8.47880661e-01
 -2.75358230e-01  4.34403330e-01  6.07397497e-01  1.44445404e-01
  3.02737325e-01 -8.48591924e-02  7.59577528e-02  2.25079998e-01
  3.31507593e-01 -3.65941495e-01  4.87931490e-01 -2.12545112e-01
 -6.65542066e-01  3.48111510e-01  2.20464692e-01 -3.09980959e-01
 -8.39646101e-01 -3.30511957e-01 -4.15750414e-01 -2.79508859e-01
 -1.40072629e-01  1.84452698e-01 -1.54586315e-01  5.54982841e-01
 -5.79781711e-01 -3.45990062e-01 -1.88777968e-01 -1.06845282e-01
 -3.00893903e-01 -4.41065788e-01  6.00923777e-01  4.12963390e-01
  5.86519182e-01  2.00733587e-01  1.36600316e+00 -1.49683118e-01
 -1.08713202e-01 -5.95987380e-01 -3.16461809e-02 -6.61388695e-01
  7.37694204e-01  7.15092123e-02 -3.63184452e-01 -6.92547485e-02
  2.76804715e-01 -9.55267191e-01 -9.52274427e-02  4.58616495e-01
 -4.26264793e-01 -4.42463070e-01  1.27647474e-01 -9.39838886e-01
 -1.15567088e-01 -6.55211329e-01  7.31721699e-01 -1.57167566e+00
 -1.10542095e+00 -9.03355658e-01 -5.43098509e-01  7.95553029e-01
 -7.08044181e-03 -2.85227060e-01  9.28429782e-01  9.71573889e-02
 -3.96223694e-01  4.94155139e-01  5.37390769e-01 -3.39529753e-01
  3.68308216e-01 -1.28579482e-01 -1.05017090e+00  4.17593837e-01
  2.48604313e-01 -9.68254879e-02 -3.59232217e-01 -1.08622682e+00
 -1.00478329e-01  2.23072171e-01 -4.37571257e-01  1.38826239e+00
  7.68635273e-01 -1.42441198e-01  6.20768249e-01 -2.65000969e-01
  1.35475969e+00  2.88145393e-01 -1.43894047e-01 -2.99536616e-01
  6.31552264e-02 -2.51712114e-01 -1.38677716e-01 -5.41011631e-01
  1.47185221e-01 -1.49833515e-01 -7.15740383e-01  2.88314521e-01
 -6.38389051e-01  3.16053420e-01  7.71043360e-01  1.43179849e-01
  1.48211978e-02  4.73498583e-01  8.03198099e-01 -1.08405840e+00
 -5.70261180e-01 -4.76540141e-02  5.26882231e-01 -2.81869620e-01
 -1.13989723e+00 -7.62864292e-01  2.67658662e-03 -5.99309146e-01
  5.08213304e-02  3.48600075e-02 -1.31661296e-01  3.43350202e-01
  1.47039965e-01  3.29475522e-01 -2.65227765e-01 -1.64056107e-01
  1.84712335e-01 -1.64587155e-01  2.68281907e-01 -1.01048581e-01
  3.19146842e-01 -1.23158330e-02  8.56841505e-01  2.03407288e-01
 -3.81547332e-01 -6.64151132e-01  1.32862270e+00  3.04318994e-01
  3.39265078e-01  4.92733121e-01 -1.24012269e-01 -7.18624413e-01
  7.86116779e-01 -1.71105146e-01 -6.88624561e-01 -5.21284103e-01
  3.24477136e-01 -6.42667353e-01 -4.49099392e-01 -1.64437735e+00
 -1.15677512e+00  1.04355657e+00 -3.67201120e-01  4.36934233e-01
 -3.68611068e-01 -5.88484347e-01  1.77582696e-01  4.92794275e-01
 -1.17947571e-01 -3.62115175e-01 -8.98680031e-01  1.27371326e-01
  1.12385474e-01  7.67848730e-01 -5.89435279e-01 -1.44601986e-01
 -1.09177661e+00  8.49221230e-01  5.22653639e-01  2.08491504e-01
 -5.28513193e-01 -4.64428753e-01  4.48831111e-01  5.75599611e-01
 -3.98134351e-01  9.21166122e-01  3.45954180e-01 -1.62111804e-01
 -1.04399741e-01 -2.50324935e-01  3.00042212e-01 -6.02201462e-01
  1.75128534e-01  4.32529122e-01 -5.86885870e-01 -3.32548052e-01
 -3.95463258e-01 -5.57754755e-01 -4.48471338e-01 -2.77211517e-01
  8.81520510e-02 -6.36177719e-01  3.19960952e-01  7.60709226e-01
  3.15277666e-01  4.44414705e-01  6.47632718e-01 -2.63870377e-02
  5.25060177e-01 -1.61295444e-01 -1.55720651e-01  8.36495817e-01
 -9.65523064e-01  3.01889300e-01  1.69886515e-01 -3.05459499e-02
  1.86375126e-01  1.04048140e-01  2.38540974e-02 -6.64686680e-01
  7.24217117e-01  3.38430315e-01 -5.57187736e-01 -6.26726449e-01
  2.66006470e-01  7.35096037e-01 -9.07033145e-01  3.59426349e-01
  6.95876300e-01 -7.21453428e-01  2.58155048e-01  5.54192603e-01
  5.41482761e-04 -7.63940573e-01  3.79112154e-01  1.46436945e-01
  5.97151697e-01 -7.88239002e-01 -3.30818325e-01  3.47732455e-01
 -9.91573274e-01  1.00135529e+00 -7.29097128e-01  6.53858840e-01
  8.88706267e-01  7.92914554e-02  5.46956718e-01 -1.07456028e+00
  2.40587711e-01 -2.27806821e-01 -5.90185463e-01  1.93599775e-01
 -2.94735581e-01  6.93159640e-01  5.71026504e-01 -5.83263189e-02
  7.66058385e-01 -1.13303483e+00 -2.08590925e-01  7.20142305e-01
  5.14323413e-01 -2.17918679e-01  5.63963242e-02  1.05543637e+00
  3.60624075e-01 -8.86408508e-01  6.73737109e-01 -6.02494895e-01
  2.93800205e-01  3.85887951e-01 -7.39043728e-02 -6.01254046e-01
  4.93471593e-01  4.23360497e-01  4.78618175e-01 -2.05091592e-02
  1.23453252e-01  3.61531049e-01 -6.02922976e-01  3.94695520e-01
 -6.17296934e-01  4.17496830e-01  1.88823715e-01  6.38141155e-01
 -2.95579195e-01 -1.13238625e-01 -2.92233139e-01  1.89026613e-02
  8.18741992e-02  2.89110005e-01 -4.24625039e-01  1.15595661e-01
  1.10594714e+00 -4.42896038e-01  9.07224491e-02 -3.24043751e-01
 -1.41674364e-02  3.84770662e-01 -1.10513262e-01  1.47955164e-01
  3.03255673e-02  9.41580951e-01  7.44941950e-01 -5.16233027e-01
 -1.07049954e+00 -3.39610100e-01 -9.81280506e-01  1.63678247e-02
 -3.09417546e-01  8.38646412e-01 -2.57466435e-01  2.66416728e-01
  8.29470217e-01  1.18659770e+00  4.45776910e-01 -5.46342313e-01
  3.46238047e-01  4.82638836e-01 -2.03869224e-01 -4.86987345e-02
 -1.36196777e-01  7.27507114e-01 -2.94585973e-01  4.04031157e-01
 -4.61239845e-01  1.53372660e-01  5.77554286e-01 -1.07579343e-01
 -1.07114184e+00 -6.10307753e-01 -1.70739844e-01  2.83243328e-01
 -2.24987030e-01  3.85358900e-01 -7.83190504e-02 -6.51502684e-02
 -4.53457981e-01  1.75708488e-01  9.54947174e-01 -4.80354220e-01
  3.67996283e-02  3.07653725e-01  9.76266861e-01 -2.82786489e-01
 -5.11633098e-01 -5.04429340e-01  2.25381449e-01  5.29595613e-01
 -1.00188531e-01  3.30978967e-02 -4.25292045e-01 -2.50480801e-01
  7.80557394e-01 -3.06186169e-01  1.09467424e-01 -6.34019434e-01
  3.03106278e-01 -1.41973698e+00 -4.36300397e-01  3.82955313e-01
  2.25167990e-01 -3.14564019e-01 -2.14847505e-01 -7.26124108e-01
  4.01522785e-01  1.61230147e-01 -2.14475319e-01 -8.14741179e-02
  1.44952789e-01  4.35495615e-01  1.60962373e-01  8.42103720e-01
  4.83167648e-01 -1.81479957e-02 -3.72209549e-01 -8.54204893e-02
 -1.25429153e+00  6.33920580e-02 -3.04254025e-01  1.19560093e-01
 -4.54789966e-01 -6.71517909e-01  8.25446323e-02  8.15794468e-02
  8.27028692e-01 -3.20302039e-01 -6.09916687e-01 -2.28958473e-01
 -3.23811322e-01 -5.48929155e-01 -7.08900273e-01  5.72744966e-01
 -9.07650590e-02  2.64598966e-01  2.70573050e-01 -9.85758960e-01
 -2.44654134e-01 -3.91785651e-01  2.55578756e-01 -6.70406878e-01
 -1.21352875e+00 -3.58353883e-01  9.98406172e-01  6.14020884e-01
  5.54472543e-02  2.67769605e-01  6.59718096e-01  6.53219372e-02
 -4.38049883e-01  9.86246109e-01 -2.51958340e-01  7.89942980e-01
 -7.73840129e-01  5.97827852e-01 -2.22646594e-01  4.02279533e-02
 -2.87521333e-01  3.42817307e-01 -2.41310239e-01  1.77004576e-01
 -9.65369642e-02  9.10423577e-01  4.00543928e-01  5.33569120e-02
 -2.18828022e-01 -2.59987801e-01 -2.06984386e-01  3.85516196e-01
  9.66344357e-01  2.62666583e-01 -5.70590794e-01  9.91979778e-01
  2.98639029e-01  9.17680323e-01 -9.80460465e-01 -5.94711006e-02
 -9.55575332e-02  8.68820190e-01 -6.75058305e-01 -2.41460040e-01
 -8.95355940e-01  4.71445829e-01 -2.14758113e-01  5.96137702e-01
 -6.81212023e-02 -1.22940755e-02 -3.48113567e-01  9.15871263e-02
 -8.74246836e-01 -6.46880984e-01 -2.76604384e-01 -4.86592144e-01
  3.61363381e-01 -4.31284517e-01 -2.53118962e-01 -2.11931407e-01
  7.04253241e-02  1.43149868e-01 -7.21811831e-01 -7.77530134e-01
 -2.66693115e-01  2.54974961e-02  3.14531595e-01  2.98289031e-01
  4.59118724e-01  4.35666889e-01 -6.02146268e-01 -3.29307169e-01
  2.72133678e-01  2.44671479e-02  3.10772389e-01 -6.65003121e-01
  3.58248562e-01  3.00383627e-01 -3.64194423e-01 -5.12525737e-01
  2.16460541e-01  5.01621068e-01  2.53829032e-01 -1.22401452e+00
  4.61754054e-01 -1.53161451e-01 -2.68886209e-01  1.27812326e+00
 -1.07412553e+00 -4.94798303e-01  6.21693552e-01  4.18770790e-01
  7.43999183e-01  2.84353107e-01  1.35036871e-01  8.22463810e-01
  5.11462271e-01 -2.76414454e-01  3.26247245e-01 -4.85349864e-01
  4.11561877e-01 -1.19246654e-01 -1.61334530e-01 -7.34282315e-01
 -9.41174507e-01 -1.15899551e+00 -2.58182764e-01 -4.81391102e-01
  1.41962335e-01 -1.07252918e-01 -2.61298269e-02 -4.07726765e-01
  3.95175695e-01  9.52931941e-01 -6.57295436e-02 -5.97879887e-01
 -4.26192760e-01  2.06618801e-01  6.77784741e-01 -1.12915194e+00
 -7.80462995e-02  3.37206364e-01 -6.69075772e-02  6.15010798e-01
 -2.87119716e-01 -2.27136135e-01 -2.42563352e-01 -2.03058645e-01
 -2.77406633e-01 -3.84487003e-01  1.71700701e-01  1.32659745e+00
 -1.54341653e-01 -9.40678045e-02 -3.46466780e-01  3.97526532e-01
 -3.61106247e-01  1.07136858e+00 -7.35428035e-01  4.52006727e-01
 -3.94796461e-01 -5.93080342e-01 -1.30981520e-01  2.37584129e-01
 -5.63736558e-01  7.58668244e-01  9.55792367e-01  3.89002115e-01
  6.69343948e-01 -4.48577017e-01 -5.99645674e-01 -5.11237085e-01
 -6.01219475e-01 -3.33563328e-01  3.43445688e-02  1.24906890e-01
 -3.98856193e-01 -4.00449544e-01 -1.91573918e-01 -9.40701723e-01
 -1.97318971e-01 -1.99874625e-01 -3.46654914e-02 -1.74211666e-01
 -9.32460129e-01 -6.68434799e-02  3.58897686e-01  2.40670264e-01
  1.68707371e-01  2.12407067e-01  3.82851698e-02  4.22058821e-01
  7.49818563e-01 -6.04370773e-01 -5.07282317e-01  6.40344441e-01
 -4.69703197e-01 -6.06814682e-01 -2.10751593e-01  5.21379858e-02
 -1.81016140e-02  3.84092331e-01 -1.14480209e+00 -3.46426129e-01
  4.44303304e-01  3.00263196e-01  9.76041034e-02  1.52969763e-01
  1.78943232e-01 -2.96392560e-01 -4.73998755e-01 -6.50664628e-01
 -1.90126374e-01  1.75953805e-01  1.06422436e+00  6.82281494e-01
  6.07434690e-01 -4.69581038e-01  2.85444587e-01  1.47231007e+00
  6.49958193e-01 -4.16353196e-01 -2.71410137e-01 -4.02401328e-01
  4.31929082e-01 -1.11652696e+00  9.89714801e-01 -4.93843794e-01
  2.96220332e-01  6.49991155e-01  1.71276465e-01 -3.89997095e-01
  2.96082497e-01 -6.96498632e-01  1.15289032e+00  5.26634753e-01
 -1.92738521e+00 -2.08714798e-01  2.58085877e-01 -2.02861592e-01
 -7.30242729e-01  9.42804396e-01 -1.71018064e-01  4.25120860e-01
  5.78499913e-01  5.67792714e-01 -3.58646393e-01 -4.07528400e-01
  1.21926451e+00 -4.26342487e-01  4.62184846e-03  9.98993576e-01]

Sentence: Sentences are passed as a list of string.
Embedding: [-1.16906703e-01 -3.39529991e-01  2.95595676e-01  6.28463686e-01
  ...
 -1.86682150e-01 -4.06430632e-01  4.99121279e-01  1.71999395e+00]

Sentence: The quick brown fox jumps over the lazy dog.
Embedding: [-2.68969119e-01 -5.03524899e-01 -1.75523773e-01  2.02556327e-01
  ...
 -1.29827034e+00 -2.78865784e-01 -3.06518644e-01  6.44666135e-01]
Code
embedding.shape
(768,)
Code
%%time
# Use our data

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2') #for symmetric queries
# model = SentenceTransformer('msmarco-distilroberta-base-v2') #for asymmetric queries

# The sentences we want to encode
sentences = list(corpus.text)

#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)
CPU times: user 13.4 s, sys: 5.68 s, total: 19.1 s
Wall time: 19.9 s
Code
# At this point, the variable embeddings contains all our embeddings, one row per document.
# So we expect 100 rows, and as many columns as the dimensionality of the chosen model's
# embedding space (384 for all-MiniLM-L6-v2).

embeddings.shape
(100, 384)

21.2.3 Cosine similarity between sentences

We can compute the cosine similarity between document embeddings, which gives us a measure of how similar the sentences or documents are.
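Under the hood, the cosine similarity of two embeddings is just the dot product of their L2-normalized vectors, which is what `util.cos_sim` computes for every pair. A minimal NumPy sketch of the formula:

```python
import numpy as np

# Cosine similarity is the dot product of the two L2-normalized vectors.
def cos_sim(a, b):
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

v = np.array([1.0, 2.0, 3.0])
print(cos_sim(v, v))                                        # identical vectors, approx 1.0
print(cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # orthogonal vectors, 0.0
```

Because the embeddings are dense and the measure ignores vector length, two documents score close to 1 when their embeddings point in the same direction, regardless of how long the documents are.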

The code below uses brute force to find the most similar sentences: it compares every pair of embeddings, so compute and memory grow quadratically, and it will not run if the number of sentences is very large.

Code
from sentence_transformers import util
distances = util.cos_sim(embeddings, embeddings)
distances.shape
torch.Size([100, 100])
Code
df_dist = pd.DataFrame(distances, columns = corpus.index, index = corpus.index)
df_dist
9920 3172 7798 7335 31 4229 2514 5830 8326 2855 ... 2747 8549 8341 293 9501 5493 5258 1937 6687 6335
9920 1.000000 0.135734 0.224861 0.298378 0.150421 0.079714 0.159787 0.327536 0.321123 0.375711 ... 0.095475 0.389827 0.469558 0.429317 0.272564 0.128299 0.036300 0.251787 0.106914 0.340685
3172 0.135734 1.000000 0.094552 0.162400 0.151268 0.148340 0.123678 0.103126 0.187191 0.216326 ... 0.253598 0.136567 0.200792 0.082866 0.131629 0.076341 0.149354 0.169703 0.043003 0.274530
7798 0.224861 0.094552 1.000000 0.261107 0.184356 0.133621 0.131916 0.271464 0.386157 0.267628 ... 0.224716 0.266434 0.210274 0.274348 0.217656 0.287046 0.164005 0.341288 0.310254 0.258850
7335 0.298378 0.162400 0.261107 1.000000 0.238793 0.051851 0.088298 0.303140 0.381001 0.247187 ... 0.178367 0.321581 0.230840 0.161895 0.245882 0.115410 0.020552 0.346345 0.078837 0.328493
31 0.150421 0.151268 0.184356 0.238793 1.000000 0.046454 0.114700 0.199142 0.276150 0.218646 ... 0.168328 0.218759 0.226452 0.076760 0.178390 0.075329 0.159295 0.202835 0.129672 0.151179
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5493 0.128299 0.076341 0.287046 0.115410 0.075329 0.125446 0.084448 0.131338 0.237292 0.313341 ... 0.201184 0.045626 0.191044 0.164987 0.196647 1.000000 0.133942 0.178779 0.022278 0.112217
5258 0.036300 0.149354 0.164005 0.020552 0.159295 0.528685 0.129603 0.210878 0.080393 0.145991 ... 0.416838 0.112070 0.152700 0.085756 0.178735 0.133942 1.000000 0.095838 0.126565 0.206801
1937 0.251787 0.169703 0.341288 0.346345 0.202835 0.109112 0.197919 0.304981 0.306736 0.316769 ... 0.131771 0.246408 0.194218 0.181308 0.208915 0.178779 0.095838 1.000000 0.074157 0.300806
6687 0.106914 0.043003 0.310254 0.078837 0.129672 0.019954 0.174099 0.158508 0.204196 0.082941 ... 0.189893 0.119275 0.162651 0.106629 0.052256 0.022278 0.126565 0.074157 1.000000 0.176698
6335 0.340685 0.274530 0.258850 0.328493 0.151179 0.134998 0.114199 0.237868 0.213965 0.325430 ... 0.201053 0.322407 0.410377 0.326511 0.205842 0.112217 0.206801 0.300806 0.176698 1.000000

100 rows × 100 columns

At this point, we can use stack to rearrange the data and identify similar articles, but stack fails if you have a lot of documents, since it materializes every one of the n² pairs as rows. Let us see how stack does the job.

Code
# Using stack
df_dist = df_dist.stack().reset_index()
df_dist.columns = ['article', 'similar_article', 'similarity']
df_dist = df_dist.sort_values(by = ['article', 'similarity'], ascending = [True, False])
df_dist
article similar_article similarity
404 31 31 1.000000
462 31 1631 0.422828
414 31 9949 0.395747
411 31 6627 0.390846
483 31 2289 0.346313
... ... ... ...
3327 9970 5720 0.009801
3367 9970 7427 0.006469
3331 9970 9176 -0.004562
3399 9970 6335 -0.010530
3310 9970 1157 -0.023462

10000 rows × 3 columns

Code
# Let us reset our df_dist dataframe
df_dist = pd.DataFrame(distances, columns = corpus.index, index = corpus.index)
Code
from tqdm import tqdm
# Using a loop
top_n = 21
temp = []
for col in tqdm(range(len(df_dist))):
    t = pd.DataFrame(df_dist.iloc[:, col].sort_values(ascending = False)[:top_n]).stack().reset_index()
    t.columns = ['similar_article', 'article', 'similarity']
    t = t[['article', 'similar_article', 'similarity']]
    temp.append(t)

pd.concat(temp)
100%|██████████| 100/100 [00:00<00:00, 446.37it/s]
article similar_article similarity
0 9920 9920 1.000000
1 9920 6147 0.513100
2 9920 470 0.498859
3 9920 466 0.492767
4 9920 7427 0.479613
... ... ... ...
16 6335 7485 0.343014
17 6335 9920 0.340685
18 6335 6233 0.337505
19 6335 1559 0.336447
20 6335 6627 0.334233

2100 rows × 3 columns
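The loop above sorts one column at a time, so the full n² table of pairs never exists at once. With plain NumPy, the same per-row top-n selection can be sketched using `argpartition`, which avoids even a full sort of each row (shown here on a small toy matrix, not the book's corpus):

```python
import numpy as np

rng = np.random.default_rng(0)
sim = rng.random((5, 5))      # toy stand-in for the 100x100 similarity matrix
sim = (sim + sim.T) / 2       # make it symmetric, like a similarity matrix
np.fill_diagonal(sim, 1.0)    # every article is maximally similar to itself

top_n = 3
# argpartition moves the top_n entries of each row to the front, without a full sort
part = np.argpartition(-sim, top_n - 1, axis=1)[:, :top_n]
# then order just those top_n entries by descending similarity
order = np.argsort(-np.take_along_axis(sim, part, axis=1), axis=1)
top_idx = np.take_along_axis(part, order, axis=1)
print(top_idx)  # row i's first entry is i itself (similarity 1.0)
```

For n rows, a full sort of each row costs O(n log n) per row, while `argpartition` is O(n), which matters once the corpus grows beyond toy sizes.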

21.2.4 Semantic paraphrasing

This finds similar articles like the prior method, but more efficiently: util.paraphrase_mining compares the corpus in chunks and keeps only the top-scoring pairs, instead of building and sorting the full similarity matrix.
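Conceptually, paraphrase mining scores every pair of documents and returns the highest-scoring pairs; the library version additionally processes the corpus in chunks so the full n×n matrix never exists in memory. A minimal (unchunked) NumPy sketch of the idea, returning `[score, i, j]`-style entries analogous to `util.paraphrase_mining`:

```python
import numpy as np
from itertools import combinations

def mine_paraphrases(emb, top_k=3):
    # normalize rows so dot products are cosine similarities
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    scored = [(float(emb[i] @ emb[j]), i, j)
              for i, j in combinations(range(len(emb)), 2)]
    # best-scoring pairs first, like util.paraphrase_mining's output
    return sorted(scored, reverse=True)[:top_k]

# toy embeddings: rows 0 and 1 point the same way, row 2 is orthogonal
emb = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
print(mine_paraphrases(emb, top_k=1))  # [(1.0, 0, 1)]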

Code
%%time
from sentence_transformers import SentenceTransformer, util

# model = SentenceTransformer('all-MiniLM-L6-v2')

# Single list of sentences - Possible tens of thousands of sentences
sentences = list(corpus.text)

paraphrases = util.paraphrase_mining(model, sentences)
    
paraphrases[:10]
CPU times: user 13.3 s, sys: 5.34 s, total: 18.6 s
Wall time: 18.5 s
[[0.997217059135437, 25, 95],
 [0.8632327318191528, 9, 22],
 [0.8306697607040405, 13, 88],
 [0.7193999290466309, 7, 39],
 [0.7128127217292786, 41, 49],
 [0.6752821803092957, 8, 34],
 [0.6559962630271912, 29, 92],
 [0.6505193114280701, 22, 74],
 [0.6476482152938843, 11, 39],
 [0.6302182674407959, 12, 91]]
Code
print(sentences[13])
A cyber attack in Iran left petrol stations across the country crippled, disrupting fuel sales and defacing electronic billboards to display messages challenging the regime's ability to distribute gasoline.

Posts and videos circulated on social media showed messages that said, "Khamenei! Where is our gas?" — a reference to the country's supreme leader Ayatollah Ali Khamenei. Other signs read, "Free gas in Jamaran gas station," with gas pumps showing the words "cyberattack 64411" when attempting to purchase fuel, semi-official Iranian Students' News Agency (ISNA) news agency reported.

Abolhassan Firouzabadi, the head of Iran's Supreme Cyberspace Council, said the attacks were "probably" state-sponsored but added it was too early to determine which country carried out the intrusions.

Although no country or group has so far claimed responsibility for the incident, the attacks mark the second time digital billboards have been altered to display similar messaging.

In July 2021, Iranian Railways and the Ministry of Roads and Urban Development systems became the subject of targeted cyber attacks, displaying alerts about train delays and cancellations and urging passengers to call the phone number 64411 for further information. It's worth noting that the phone number belongs to the office of Ali Khamenei that supposedly handles questions about Islamic law.

The attacks involved the use of a never-before-seen reusable data-wiping malware called "Meteor."

Cybersecurity firm Check Point later attributed the train attack to a "regime opposition" threat actor that self-identifies as "Indra" — referring to the Hindu god of lightning, thunder, and war — and is believed to have ties to hacktivist and other cybercriminal groups, in addition to linking the malware to prior attacks targeting Syrian petroleum companies in early 2020.

"Aiming to bring a stop to the horrors of [Quds Force] and its murderous proxies in the region," the group's official Twitter account bio reads.

"While most attacks against a nation's sensitive networks are indeed the work of other governments, the truth is that there is no magic shield that prevents a non-state sponsored entity from creating the same kind of havoc, and harming critical infrastructure in order to make a statement," Check Point noted in July.
Code
print(sentences[19])
A hacker known only as “Mr. A” was picked up by authorities at a South Korean airport after getting stuck in the country due to COVID-19 travel restrictions.

Another alleged member of the TrickBot gang has been apprehended, this time when trying to leave South Korea, according to published reports.

The Russian national, who is an alleged developer of the notorious crimeware, reportedly had been trapped in South Korea since February 2020 due to COVID-19 travel restrictions. Seoul-based news outlet KBS News reported that the individual, identified only as “Mr A”, was arrested at a South Korea airport last week. Mr. A is believed to have worked as a web browser developer for the TrickBot crime syndicate while he lived in Russia in 2016.

Recorded Future’s The Record, who reported on the incident, cited the KBS report and said the accused criminal hacker was forced to spend more than a year in South Korea in order to renew his passport delaying his departure.

His arrest was the result of an investigation U.S. authorities began into TrickBot during his time in South Korea after the botnet was used “to facilitate ransomware attacks across the US throughout 2020,” according to the report.

Ever-Evolving Threat

TrickBot is a sophisticated malware first developed in 2016 to steal online banking credentials. Since then, it has evolved as operators have added new features.

The malware, once a simple banking trojan, is now a module-based crimeware platform leased as a malware-as-a-service solution to cybercriminals. TrickBot is typically leveraged against corporations and public infrastructure. The evolution and success of the TrickBot platform has pushed authorities to crack down on the criminals behind TrickBot beginning last year.

In February, authorities took alleged TrickBot developer Alla Witte into custody in Miami. Witte is known in cybercrime circles as “Max” and a main TrickBot coder, according to the Department of Justice (DoJ). Witte is believed responsible for developing TrickBot’s ransomware-related functionality, including control, deployment and payments, authorities said at the time of her arrest.

Her colleague, Mr. A, was arraigned in a Seoul court last Wednesday on an international arrest warrant and extradition request to the United States, according to The Record, citing the KBS news report. However, the suspect is fighting the extradition, with his lawyer claiming that if it happens, Mr. A “will be subjected to excessive punishment,” according to the report.

Industry-Driven Disruption

Prior to the official investigation and crackdown by the DoJ and related arrests, an earlier attempt to foil TrickBot’s operations came from Microsoft and some technology partners.

Last October, the tech giant and others used a court order they’d obtained to cut off key infrastructure to TrickBot operations so its operators no longer could initiate new infections or activate ransomware already dropped into computer systems.

Microsoft, ESET, Lumen’s Black Lotus Labs, NTT Ltd., Symantec and others were responsible for the coordinated legal and technical action to disrupt the group’s activity–which turned out to be a temporary scenario as TrickBot’s cybercriminals soon regrouped and resumed operations.

It’s time to evolve threat hunting into a pursuit of adversaries. JOIN Threatpost and Cybersixgill for Threat Hunting to Catch Adversaries, Not Just Stop Attacks and get a guided tour of the dark web and learn how to track threat actors before their next attack. REGISTER NOW for the LIVE discussion on Sept. 22 at 2 p.m. EST with Cybersixgill’s Sumukh Tendulkar and Edan Cohen, along with independent researcher and vCISO Chris Roberts and Threatpost host Becky Bracken.
Code
# Free up memory (guard in case the paraphrase-mining cell was not run this session)
import gc

if 'paraphrases' in globals():
    del paraphrases
gc.collect()

21.2.6 Clustering

If we know the embeddings, we can do clustering just like we can for regular tabular data.

21.2.6.1 KMeans

Code
from sklearn.cluster import KMeans

# Perform kmean clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters, n_init='auto')
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = [[] for _ in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(sentences[sentence_id])
Code
clustering_model.labels_.shape
(100,)
Code
cluster_assignment
array([1, 0, 3, 0, 0, 3, 3, 3, 0, 1, 1, 0, 2, 0, 0, 4, 3, 4, 1, 0, 1, 0,
       1, 3, 0, 4, 2, 2, 2, 1, 0, 0, 3, 3, 0, 2, 3, 2, 3, 0, 4, 3, 1, 0,
       3, 0, 4, 0, 2, 3, 0, 4, 4, 3, 2, 3, 3, 3, 3, 2, 4, 2, 0, 3, 3, 2,
       3, 2, 2, 0, 4, 0, 4, 2, 1, 1, 0, 0, 3, 3, 4, 2, 1, 0, 1, 4, 4, 3,
       0, 1, 4, 2, 1, 2, 3, 4, 4, 1, 4, 2], dtype=int32)
Code
pd.Series(cluster_assignment).value_counts()
0    25
3    25
2    18
4    17
1    15
Name: count, dtype: int64
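We picked num_clusters = 5 arbitrarily. A common way to choose the number of clusters is the silhouette score: fit KMeans for a range of k values and keep the one that scores highest. Here is a minimal sketch on synthetic stand-in embeddings (the blob data below is illustrative, not the corpus embeddings above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in "embeddings": three well-separated blobs in 768 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.05, size=(30, 768)) for c in (0.0, 1.0, 2.0)])

# Score each candidate k and keep the best
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On real sentence embeddings the silhouette curve is usually much flatter than on toy blobs, so treat the score as a guide rather than a rule.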

21.3 Huggingface Pipeline function

The Huggingface pipeline function wraps everything together (tokenizer, model, and post-processing) for a number of common NLP tasks.

The format for the commands is as below:

from transformers import pipeline

# Using default model and tokenizer for the task
pipeline("<task-name>")

# Using a user-specified model
pipeline("<task-name>", model="<model_name>")

# Using custom model/tokenizer as str
pipeline('<task-name>', model='<model name>', tokenizer='<tokenizer_name>')

By default, pipeline selects a particular pretrained model that has been fine-tuned for the specified task. The model is downloaded and cached when you create the pipeline object; if you rerun the command, the cached model is used and there is no need to download it again.

Pipelines are made of:

  • A tokenizer in charge of mapping raw textual input to tokens.
  • A model to make predictions from the inputs.
  • Some (optional) post-processing for enhancing the model’s output.
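The post-processing step is typically lightweight. For a sentiment-style classification pipeline, for example, it amounts to a softmax over the model's raw logits followed by a label lookup. A numpy sketch (the logits and label map below are made-up illustrations, not output from a real model):

```python
import numpy as np

def postprocess(logits, id2label):
    # Softmax the raw logits (subtracting the max for numerical stability),
    # then report the highest-scoring label, mimicking a classification pipeline
    exp = np.exp(logits - np.max(logits))
    probs = exp / exp.sum()
    best = int(np.argmax(probs))
    return {'label': id2label[best], 'score': float(probs[best])}

# Hypothetical logits from a two-class sentiment head
print(postprocess(np.array([-1.2, 3.4]), {0: 'NEGATIVE', 1: 'POSITIVE'}))
```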

Some of the currently available pipelines are:

  • feature-extraction (get the vector representation of a text)
  • fill-mask
  • ner (named entity recognition)
  • question-answering
  • sentiment-analysis
  • summarization
  • text-generation
  • translation
  • zero-shot-classification

Each pipeline has a default model, which can be found in https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py

Pipeline                       Default Model
“feature-extraction”           “distilbert-base-cased”
“fill-mask”                    “distilroberta-base”
“ner”                          “dbmdz/bert-large-cased-finetuned-conll03-english”
“question-answering”           “distilbert-base-cased-distilled-squad”
“summarization”                “sshleifer/distilbart-cnn-12-6”
“translation”                  “t5-base”
“text-generation”              “gpt2”
“text2text-generation”         “t5-base”
“zero-shot-classification”     “facebook/bart-large-mnli”
“conversational”               “microsoft/DialoGPT-medium”

First, some library imports

Code
# First, some library imports
import gc

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tqdm
import torch

from transformers import AutoTokenizer, AutoModel, pipeline
Code
from platform import python_version

print(python_version())
3.12.4
Code
mytext = """
Panther Labs, an early stage startup that specializes in detection and response analytics, has raised a whopping $120 million in a new round of funding led by hedge fund Coatue Management.

Panther Labs said the Series B investment was raised at a $1.4 billion valuation, putting the company among a growing list of ‘unicorn’ cybersecurity startups.

In addition to Coatue Management, Panther Labs scored investments from ICONIQ Growth and Snowflake Ventures along with money from existing investors Lightspeed Venture Partners, S28 Capital, and Innovation Endeavors.

The company previously raised $15 million in a September 2020 Series A round.

The San Francisco firm, which was founded by Airbnb and AWS alumni, styles itself as a “cloud-scale security analytics platform” that helps organizations prevent breaches by providing actionable insights from large volumes of data.

The Panther product can be used by security teams to perform continuous security monitoring, gain security visibility across cloud and on-premise infrastructure, and build data lakes for incident response investigations.

In the last year, Panther claims its customer roster grew by 300 percent, including deals with big companies like Dropbox, Zapier and Snyk.

Panther Labs said the new funding will be used to speed up product development, expand go-to-marketing initiatives and scale support for its customers.

Related: Panther Labs Launches Open-Source Cloud-Native SIEM

Related: CyCognito Snags $100 Million for Attack Surface Management
"""
Code
print(len(mytext.split()))
print(mytext)
224

Panther Labs, an early stage startup that specializes in detection and response analytics, has raised a whopping $120 million in a new round of funding led by hedge fund Coatue Management.

Panther Labs said the Series B investment was raised at a $1.4 billion valuation, putting the company among a growing list of ‘unicorn’ cybersecurity startups.

In addition to Coatue Management, Panther Labs scored investments from ICONIQ Growth and Snowflake Ventures along with money from existing investors Lightspeed Venture Partners, S28 Capital, and Innovation Endeavors.

The company previously raised $15 million in a September 2020 Series A round.

The San Francisco firm, which was founded by Airbnb and AWS alumni, styles itself as a “cloud-scale security analytics platform” that helps organizations prevent breaches by providing actionable insights from large volumes of data.

The Panther product can be used by security teams to perform continuous security monitoring, gain security visibility across cloud and on-premise infrastructure, and build data lakes for incident response investigations.

In the last year, Panther claims its customer roster grew by 300 percent, including deals with big companies like Dropbox, Zapier and Snyk.

Panther Labs said the new funding will be used to speed up product development, expand go-to-marketing initiatives and scale support for its customers.

Related: Panther Labs Launches Open-Source Cloud-Native SIEM

Related: CyCognito Snags $100 Million for Attack Surface Management
Code
# No need to run the code below as symlinks have been defined by the Jupyterhub team

# Set default locations for downloaded models
# import os

# if os.name != 'nt':  # Do this only if in a non-Windows environment
    
#     if 'instructor' in os.getcwd(): # Set default model locations when logged in as instructor
#         os.environ['TRANSFORMERS_CACHE'] = '/home/instructor/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/instructor/shared/huggingface'
            
#     else: # Set default model locations when logged in as a student
#         os.environ['TRANSFORMERS_CACHE'] = '/home/jovyan/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/jovyan/shared/huggingface'
        

21.3.1 Embeddings/Feature Extraction

Feature extraction allows us to obtain token-level embeddings for a sentence. These are similar in spirit to the embeddings obtained from sentence-transformers (and identical if you use the same underlying model and pooling strategy).

Code
feature_extraction = pipeline('feature-extraction')
features = feature_extraction("i am awesome")
features = np.squeeze(features)
print(features.shape)
No model was supplied, defaulted to distilbert/distilbert-base-cased and revision 935ac13 (https://huggingface.co/distilbert/distilbert-base-cased).
Using a pipeline without specifying a model name and revision in production is not recommended.
(5, 768)
Code
# Mean-pool over the token axis to get a single sentence vector,
# analogous to `model.encode` in sentence-transformers
features = np.mean(features, axis=0)
Code
features.shape
(768,)
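With one pooled vector per text, semantic similarity reduces to cosine similarity between vectors. A quick numpy sketch (the random vectors below stand in for real 768-dimensional embeddings):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two 1-d embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
v1 = rng.normal(size=768)
v2 = v1 + rng.normal(scale=0.1, size=768)  # a near-duplicate of v1
v3 = rng.normal(size=768)                  # an unrelated vector

print(cosine_sim(v1, v2))  # close to 1
print(cosine_sim(v1, v3))  # close to 0
```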
Code
# Let us try feature extraction on mytext
features = feature_extraction(mytext)
features = np.squeeze(features)
print(features.shape)
(322, 768)
Code
# Free up memory
del feature_extraction
gc.collect()
93

21.3.2 Fill Mask

Code
fill_mask = pipeline('fill-mask') 
fill_mask('New York is a <mask>')
No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[{'score': 0.1009121835231781,
  'token': 8018,
  'token_str': ' joke',
  'sequence': 'New York is a joke'},
 {'score': 0.04816760495305061,
  'token': 4593,
  'token_str': ' democracy',
  'sequence': 'New York is a democracy'},
 {'score': 0.04618655890226364,
  'token': 7319,
  'token_str': ' mess',
  'sequence': 'New York is a mess'},
 {'score': 0.04198974370956421,
  'token': 20812,
  'token_str': ' circus',
  'sequence': 'New York is a circus'},
 {'score': 0.024249661713838577,
  'token': 43689,
  'token_str': ' wasteland',
  'sequence': 'New York is a wasteland'}]
Code
fill_mask = pipeline('fill-mask', model = 'distilroberta-base')
fill_mask('New <mask> is a great city')
Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[{'score': 0.4224233627319336,
  'token': 469,
  'token_str': ' York',
  'sequence': 'New York is a great city'},
 {'score': 0.23672223091125488,
  'token': 4942,
  'token_str': ' Orleans',
  'sequence': 'New Orleans is a great city'},
 {'score': 0.08853647857904434,
  'token': 3123,
  'token_str': ' Jersey',
  'sequence': 'New Jersey is a great city'},
 {'score': 0.06783472746610641,
  'token': 3534,
  'token_str': ' Delhi',
  'sequence': 'New Delhi is a great city'},
 {'score': 0.03218536078929901,
  'token': 12050,
  'token_str': ' Haven',
  'sequence': 'New Haven is a great city'}]
Code
fill_mask('Joe Biden is a good <mask>')
[{'score': 0.09071354568004608,
  'token': 2173,
  'token_str': ' guy',
  'sequence': 'Joe Biden is a good guy'},
 {'score': 0.07118388265371323,
  'token': 1441,
  'token_str': ' friend',
  'sequence': 'Joe Biden is a good friend'},
 {'score': 0.03984031453728676,
  'token': 30443,
  'token_str': ' listener',
  'sequence': 'Joe Biden is a good listener'},
 {'score': 0.03301309794187546,
  'token': 28587,
  'token_str': ' liar',
  'sequence': 'Joe Biden is a good liar'},
 {'score': 0.030751319602131844,
  'token': 313,
  'token_str': ' man',
  'sequence': 'Joe Biden is a good man'}]
Code
fill_mask('Joe Biden is in a good <mask>')
[{'score': 0.8292393088340759,
  'token': 6711,
  'token_str': ' mood',
  'sequence': 'Joe Biden is in a good mood'},
 {'score': 0.040497832000255585,
  'token': 3989,
  'token_str': ' shape',
  'sequence': 'Joe Biden is in a good shape'},
 {'score': 0.02688208967447281,
  'token': 317,
  'token_str': ' place',
  'sequence': 'Joe Biden is in a good place'},
 {'score': 0.024331938475370407,
  'token': 1514,
  'token_str': ' spot',
  'sequence': 'Joe Biden is in a good spot'},
 {'score': 0.013950899243354797,
  'token': 737,
  'token_str': ' position',
  'sequence': 'Joe Biden is in a good position'}]

21.3.3 Sentiment Analysis (+ve/-ve)

Code
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

classifier("It was sort of ok")
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
[{'label': 'POSITIVE', 'score': 0.9996662139892578}]
Code
classifier(mytext)
[{'label': 'POSITIVE', 'score': 0.8596639633178711}]
Code
# Free up memory
del classifier
gc.collect()
33

21.3.4 Named Entity Recognition

Identify tokens as belonging to one of 9 classes:

  • O, Outside of a named entity
  • B-MISC, Beginning of a miscellaneous entity right after another miscellaneous entity
  • I-MISC, Miscellaneous entity
  • B-PER, Beginning of a person’s name right after another person’s name
  • I-PER, Person’s name
  • B-ORG, Beginning of an organisation right after another organisation
  • I-ORG, Organisation
  • B-LOC, Beginning of a location right after another location
  • I-LOC, Location
Code
ner = pipeline("ner") 

ner("Seattle is a city in Washington where Microsoft is headquartered")
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[{'entity': 'I-LOC',
  'score': 0.99756324,
  'index': 1,
  'word': 'Seattle',
  'start': 0,
  'end': 7},
 {'entity': 'I-LOC',
  'score': 0.9981115,
  'index': 6,
  'word': 'Washington',
  'start': 21,
  'end': 31},
 {'entity': 'I-ORG',
  'score': 0.999338,
  'index': 8,
  'word': 'Microsoft',
  'start': 38,
  'end': 47}]
Code
ner(mytext)
[{'entity': 'I-ORG',
  'score': 0.99932563,
  'index': 1,
  'word': 'Panther',
  'start': 1,
  'end': 8},
 {'entity': 'I-ORG',
  'score': 0.9993229,
  'index': 2,
  'word': 'Labs',
  'start': 9,
  'end': 13},
 {'entity': 'I-ORG',
  'score': 0.9992663,
  'index': 37,
  'word': 'Co',
  'start': 171,
  'end': 173},
 {'entity': 'I-ORG',
  'score': 0.9986853,
  'index': 38,
  'word': '##at',
  'start': 173,
  'end': 175},
 {'entity': 'I-ORG',
  'score': 0.999196,
  'index': 39,
  'word': '##ue',
  'start': 175,
  'end': 177},
 {'entity': 'I-ORG',
  'score': 0.99944323,
  'index': 40,
  'word': 'Management',
  'start': 178,
  'end': 188},
 {'entity': 'I-ORG',
  'score': 0.9994549,
  'index': 42,
  'word': 'Panther',
  'start': 191,
  'end': 198},
 {'entity': 'I-ORG',
  'score': 0.9986261,
  'index': 43,
  'word': 'Labs',
  'start': 199,
  'end': 203},
 {'entity': 'I-ORG',
  'score': 0.99832755,
  'index': 85,
  'word': 'Co',
  'start': 367,
  'end': 369},
 {'entity': 'I-ORG',
  'score': 0.9989543,
  'index': 86,
  'word': '##at',
  'start': 369,
  'end': 371},
 {'entity': 'I-ORG',
  'score': 0.99904543,
  'index': 87,
  'word': '##ue',
  'start': 371,
  'end': 373},
 {'entity': 'I-ORG',
  'score': 0.99918145,
  'index': 88,
  'word': 'Management',
  'start': 374,
  'end': 384},
 {'entity': 'I-ORG',
  'score': 0.99947304,
  'index': 90,
  'word': 'Panther',
  'start': 386,
  'end': 393},
 {'entity': 'I-ORG',
  'score': 0.9986386,
  'index': 91,
  'word': 'Labs',
  'start': 394,
  'end': 398},
 {'entity': 'I-ORG',
  'score': 0.9969086,
  'index': 95,
  'word': 'I',
  'start': 423,
  'end': 424},
 {'entity': 'I-ORG',
  'score': 0.98679113,
  'index': 96,
  'word': '##CO',
  'start': 424,
  'end': 426},
 {'entity': 'I-ORG',
  'score': 0.9962644,
  'index': 97,
  'word': '##NI',
  'start': 426,
  'end': 428},
 {'entity': 'I-ORG',
  'score': 0.9870978,
  'index': 98,
  'word': '##Q',
  'start': 428,
  'end': 429},
 {'entity': 'I-ORG',
  'score': 0.995076,
  'index': 99,
  'word': 'Growth',
  'start': 430,
  'end': 436},
 {'entity': 'I-ORG',
  'score': 0.997384,
  'index': 101,
  'word': 'Snow',
  'start': 441,
  'end': 445},
 {'entity': 'I-ORG',
  'score': 0.99732804,
  'index': 102,
  'word': '##f',
  'start': 445,
  'end': 446},
 {'entity': 'I-ORG',
  'score': 0.9969291,
  'index': 103,
  'word': '##lake',
  'start': 446,
  'end': 450},
 {'entity': 'I-ORG',
  'score': 0.99730384,
  'index': 104,
  'word': 'Ventures',
  'start': 451,
  'end': 459},
 {'entity': 'I-ORG',
  'score': 0.99798065,
  'index': 111,
  'word': 'Lights',
  'start': 501,
  'end': 507},
 {'entity': 'I-ORG',
  'score': 0.9802942,
  'index': 112,
  'word': '##pe',
  'start': 507,
  'end': 509},
 {'entity': 'I-ORG',
  'score': 0.99478084,
  'index': 113,
  'word': '##ed',
  'start': 509,
  'end': 511},
 {'entity': 'I-ORG',
  'score': 0.99712026,
  'index': 114,
  'word': 'Venture',
  'start': 512,
  'end': 519},
 {'entity': 'I-ORG',
  'score': 0.99780315,
  'index': 115,
  'word': 'Partners',
  'start': 520,
  'end': 528},
 {'entity': 'I-ORG',
  'score': 0.9866433,
  'index': 117,
  'word': 'S',
  'start': 530,
  'end': 531},
 {'entity': 'I-ORG',
  'score': 0.97416526,
  'index': 118,
  'word': '##28',
  'start': 531,
  'end': 533},
 {'entity': 'I-ORG',
  'score': 0.9915843,
  'index': 119,
  'word': 'Capital',
  'start': 534,
  'end': 541},
 {'entity': 'I-ORG',
  'score': 0.9983632,
  'index': 122,
  'word': 'Innovation',
  'start': 547,
  'end': 557},
 {'entity': 'I-ORG',
  'score': 0.9993075,
  'index': 123,
  'word': 'End',
  'start': 558,
  'end': 561},
 {'entity': 'I-ORG',
  'score': 0.9934894,
  'index': 124,
  'word': '##eavor',
  'start': 561,
  'end': 566},
 {'entity': 'I-ORG',
  'score': 0.98961776,
  'index': 125,
  'word': '##s',
  'start': 566,
  'end': 567},
 {'entity': 'I-LOC',
  'score': 0.99653375,
  'index': 143,
  'word': 'San',
  'start': 653,
  'end': 656},
 {'entity': 'I-LOC',
  'score': 0.99250937,
  'index': 144,
  'word': 'Francisco',
  'start': 657,
  'end': 666},
 {'entity': 'I-ORG',
  'score': 0.9983175,
  'index': 151,
  'word': 'Air',
  'start': 694,
  'end': 697},
 {'entity': 'I-ORG',
  'score': 0.98135924,
  'index': 152,
  'word': '##b',
  'start': 697,
  'end': 698},
 {'entity': 'I-ORG',
  'score': 0.6833769,
  'index': 153,
  'word': '##n',
  'start': 698,
  'end': 699},
 {'entity': 'I-ORG',
  'score': 0.9928785,
  'index': 154,
  'word': '##b',
  'start': 699,
  'end': 700},
 {'entity': 'I-ORG',
  'score': 0.998475,
  'index': 156,
  'word': 'A',
  'start': 705,
  'end': 706},
 {'entity': 'I-ORG',
  'score': 0.99682593,
  'index': 157,
  'word': '##WS',
  'start': 706,
  'end': 708},
 {'entity': 'I-ORG',
  'score': 0.80408233,
  'index': 192,
  'word': 'Panther',
  'start': 886,
  'end': 893},
 {'entity': 'I-ORG',
  'score': 0.995609,
  'index': 231,
  'word': 'Panther',
  'start': 1122,
  'end': 1129},
 {'entity': 'I-ORG',
  'score': 0.9984397,
  'index': 247,
  'word': 'Drop',
  'start': 1218,
  'end': 1222},
 {'entity': 'I-ORG',
  'score': 0.9981306,
  'index': 248,
  'word': '##box',
  'start': 1222,
  'end': 1225},
 {'entity': 'I-ORG',
  'score': 0.99752074,
  'index': 250,
  'word': 'Z',
  'start': 1227,
  'end': 1228},
 {'entity': 'I-ORG',
  'score': 0.96972084,
  'index': 251,
  'word': '##ap',
  'start': 1228,
  'end': 1230},
 {'entity': 'I-ORG',
  'score': 0.99131715,
  'index': 252,
  'word': '##ier',
  'start': 1230,
  'end': 1233},
 {'entity': 'I-ORG',
  'score': 0.9980101,
  'index': 254,
  'word': 'S',
  'start': 1238,
  'end': 1239},
 {'entity': 'I-ORG',
  'score': 0.9695136,
  'index': 255,
  'word': '##ny',
  'start': 1239,
  'end': 1241},
 {'entity': 'I-ORG',
  'score': 0.99053967,
  'index': 256,
  'word': '##k',
  'start': 1241,
  'end': 1242},
 {'entity': 'I-ORG',
  'score': 0.9990858,
  'index': 258,
  'word': 'Panther',
  'start': 1245,
  'end': 1252},
 {'entity': 'I-ORG',
  'score': 0.99652547,
  'index': 259,
  'word': 'Labs',
  'start': 1253,
  'end': 1257},
 {'entity': 'I-ORG',
  'score': 0.99833304,
  'index': 290,
  'word': 'Panther',
  'start': 1407,
  'end': 1414},
 {'entity': 'I-ORG',
  'score': 0.9907589,
  'index': 291,
  'word': 'Labs',
  'start': 1415,
  'end': 1419},
 {'entity': 'I-ORG',
  'score': 0.98082525,
  'index': 306,
  'word': 'Cy',
  'start': 1469,
  'end': 1471},
 {'entity': 'I-ORG',
  'score': 0.9829427,
  'index': 307,
  'word': '##C',
  'start': 1471,
  'end': 1472},
 {'entity': 'I-ORG',
  'score': 0.9704884,
  'index': 308,
  'word': '##og',
  'start': 1472,
  'end': 1474},
 {'entity': 'I-ORG',
  'score': 0.87991095,
  'index': 309,
  'word': '##ni',
  'start': 1474,
  'end': 1476},
 {'entity': 'I-ORG',
  'score': 0.97091585,
  'index': 310,
  'word': '##to',
  'start': 1476,
  'end': 1478}]
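Notice how single words get split into WordPiece fragments: 'Coatue' comes back as 'Co', '##at', '##ue'. Recent transformers releases can merge these for you (pipeline("ner", aggregation_strategy="simple")); conceptually the merge just glues '##' continuation tokens onto the preceding token. A pure-Python sketch using fragments taken from the output above:

```python
def merge_subwords(entities):
    # Glue WordPiece continuation tokens ('##...') onto the preceding token,
    # a rough sketch of what aggregation_strategy='simple' does in the ner pipeline
    merged = []
    for ent in entities:
        if ent['word'].startswith('##') and merged:
            prev = merged[-1]
            prev['word'] += ent['word'][2:]   # strip the '##' marker
            prev['end'] = ent['end']          # extend the character span
        else:
            merged.append(dict(ent))
    return merged

pieces = [
    {'entity': 'I-ORG', 'word': 'Co',   'start': 171, 'end': 173},
    {'entity': 'I-ORG', 'word': '##at', 'start': 173, 'end': 175},
    {'entity': 'I-ORG', 'word': '##ue', 'start': 175, 'end': 177},
    {'entity': 'I-ORG', 'word': 'Management', 'start': 178, 'end': 188},
]
print(merge_subwords(pieces))
# [{'entity': 'I-ORG', 'word': 'Coatue', 'start': 171, 'end': 177},
#  {'entity': 'I-ORG', 'word': 'Management', 'start': 178, 'end': 188}]
```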
Code
# Free up memory
del ner
gc.collect()
33

21.3.5 Question Answering

Code
from transformers import pipeline

question_answerer = pipeline("question-answering") 

question_answerer(
    question="Where do I work?",
    context="My name is Mukul and I work at NYU Tandon in Brooklyn",
)
No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
{'score': 0.7861828804016113, 'start': 31, 'end': 41, 'answer': 'NYU Tandon'}
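The start and end fields are character offsets into the context string, so the answer text can always be recovered by slicing (shown here with the result above):

```python
context = "My name is Mukul and I work at NYU Tandon in Brooklyn"
result = {'score': 0.7861828804016113, 'start': 31, 'end': 41, 'answer': 'NYU Tandon'}

# The answer text is exactly the [start:end] slice of the context
print(context[result['start']:result['end']])  # NYU Tandon
```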
Code
print(mytext)

Panther Labs, an early stage startup that specializes in detection and response analytics, has raised a whopping $120 million in a new round of funding led by hedge fund Coatue Management.

Panther Labs said the Series B investment was raised at a $1.4 billion valuation, putting the company among a growing list of ‘unicorn’ cybersecurity startups.

In addition to Coatue Management, Panther Labs scored investments from ICONIQ Growth and Snowflake Ventures along with money from existing investors Lightspeed Venture Partners, S28 Capital, and Innovation Endeavors.

The company previously raised $15 million in a September 2020 Series A round.

The San Francisco firm, which was founded by Airbnb and AWS alumni, styles itself as a “cloud-scale security analytics platform” that helps organizations prevent breaches by providing actionable insights from large volumes of data.

The Panther product can be used by security teams to perform continuous security monitoring, gain security visibility across cloud and on-premise infrastructure, and build data lakes for incident response investigations.

In the last year, Panther claims its customer roster grew by 300 percent, including deals with big companies like Dropbox, Zapier and Snyk.

Panther Labs said the new funding will be used to speed up product development, expand go-to-marketing initiatives and scale support for its customers.

Related: Panther Labs Launches Open-Source Cloud-Native SIEM

Related: CyCognito Snags $100 Million for Attack Surface Management
Code
question_answerer(
    question = "How much did Panther Labs raise",
    context = mytext,
)
{'score': 0.02731592394411564,
 'start': 249,
 'end': 261,
 'answer': '$1.4 billion'}
Code
question_answerer(
    question = "How much did Panther Labs raise previously",
    context = mytext,
)
{'score': 0.6693971753120422,
 'start': 600,
 'end': 611,
 'answer': '$15 million'}
Code
question_answerer(
    question = "Who founded Panther Labs",
    context = mytext,
)
{'score': 2.9083132176310755e-05,
 'start': 694,
 'end': 715,
 'answer': 'Airbnb and AWS alumni'}
Code
# Free up memory
del question_answerer
gc.collect()
73

21.3.6 Summarization

Code
from transformers import pipeline

summarizer = pipeline("summarization")

summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
tokenizer_config.json: 100%|██████████| 26.0/26.0 [00:00<00:00, 76.4kB/s]
vocab.json: 100%|██████████| 899k/899k [00:00<00:00, 10.5MB/s]
merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 17.8MB/s]
[{'summary_text': ' America suffers an increasingly serious decline in the number of engineering graduates and a lack of well-educated engineers . Rapidly developing economies such as China and India, as well as other industrial countries in Europe and Asia, continue to encourage and advance the teaching of engineering . Both China and India graduate six and eight times as many traditional engineers as does the United States .'}]
Code
mytext = """
Panther Labs, an early stage startup that specializes in detection and response analytics, has raised a whopping $120 million in a new round of funding led by hedge fund Coatue Management.

Panther Labs said the Series B investment was raised at a $1.4 billion valuation, putting the company among a growing list of ‘unicorn’ cybersecurity startups.

In addition to Coatue Management, Panther Labs scored investments from ICONIQ Growth and Snowflake Ventures along with money from existing investors Lightspeed Venture Partners, S28 Capital, and Innovation Endeavors.

The company previously raised $15 million in a September 2020 Series A round.

The San Francisco firm, which was founded by Airbnb and AWS alumni, styles itself as a “cloud-scale security analytics platform” that helps organizations prevent breaches by providing actionable insights from large volumes of data.

The Panther product can be used by security teams to perform continuous security monitoring, gain security visibility across cloud and on-premise infrastructure, and build data lakes for incident response investigations.

In the last year, Panther claims its customer roster grew by 300 percent, including deals with big companies like Dropbox, Zapier and Snyk.

Panther Labs said the new funding will be used to speed up product development, expand go-to-marketing initiatives and scale support for its customers.

Related: Panther Labs Launches Open-Source Cloud-Native SIEM

Related: CyCognito Snags $100 Million for Attack Surface Management
"""
Code
summarizer(mytext)
[{'summary_text': ' Panther Labs is a ‘cloud-scale security analytics platform’ that helps organizations prevent breaches by providing actionable insights from large volumes of data . The San Francisco startup claims its customer roster grew by 300 percent in the last year, including deals with Dropbox, Zapier and Snyk . The new funding will be used to speed up product development and expand go-to-marketing initiatives .'}]
Code
# Free up memory
del summarizer
gc.collect()
0

21.3.6.1 Try a different model

Code
from transformers import pipeline
import torch
summarizer = pipeline(task="summarization",
                      model="facebook/bart-large-cnn",
                      torch_dtype=torch.bfloat16)

Model info: ‘bart-large-cnn’

Code
%%time
summarizer(mytext, min_length=10, max_length=100)
CPU times: user 2min 53s, sys: 852 ms, total: 2min 54s
Wall time: 2min 56s
[{'summary_text': 'Panther Labs, an early stage startup that specializes in detection and response analytics, has raised a whopping $120 million in a new round of funding. The Series B investment was raised at a $1.4 billion valuation, putting the company among a growing list of ‘unicorn’ cybersecurity startups.'}]
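BART-based summarizers can only attend to a limited context (about 1,024 tokens), so documents longer than that must be split and summarized in pieces. Below is a rough word-based chunker sketch; the 400-word limit is an arbitrary stand-in, since the real constraint is counted in tokens, not words.

```python
def chunk_text(text, max_words=400):
    """Split text into chunks of at most max_words words (a rough proxy
    for the model's ~1024-token limit; exact limits are token-based)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# Each chunk would then be summarized separately, e.g.:
# partials = [summarizer(c, max_length=60)[0]["summary_text"] for c in chunk_text(long_doc)]
# final = summarizer(" ".join(partials))[0]["summary_text"]
chunks = chunk_text("word " * 1000)
print(len(chunks))  # 3
```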
Code
# Free up memory
del summarizer
gc.collect()
0

21.3.7 Translation

Code
# No need to run the code below as symlinks have been defined by the Jupyterhub team

# Set default locations for downloaded models
# import os

# if os.name != 'nt':  # Do this only if in a non-Windows environment
    
#     if 'instructor' in os.getcwd(): # Set default model locations when logged in as instructor
#         os.environ['TRANSFORMERS_CACHE'] = '/home/instructor/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/instructor/shared/huggingface'
            
#     else: # Set default model locations when logged in as a student
#         os.environ['TRANSFORMERS_CACHE'] = '/home/jovyan/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/jovyan/shared/huggingface'
        
Code
translator = pipeline("translation_en_to_fr")
translator("I do not speak French")
No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
config.json: 100%|██████████| 1.21k/1.21k [00:00<00:00, 3.74MB/s]
model.safetensors: 100%|██████████| 892M/892M [00:08<00:00, 105MB/s]  
generation_config.json: 100%|██████████| 147/147 [00:00<00:00, 523kB/s]
spiece.model: 100%|██████████| 792k/792k [00:00<00:00, 23.1MB/s]
tokenizer.json: 100%|██████████| 1.39M/1.39M [00:00<00:00, 29.0MB/s]
/opt/conda/envs/mggy8413/lib/python3.11/site-packages/transformers/models/t5/tokenization_t5_fast.py:160: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(
[{'translation_text': 'Je ne parle pas français'}]
Code
# Free up memory
del translator
gc.collect()
33

21.3.7.1 Translation using NLLB (No Language Left Behind)

Code
# First, some memory cleanup using garbage collector

# del summarizer
# del feature_extraction
# del fill_mask
# del classifier
# del ner
# del question_answerer
# del translator
gc.collect()
0
Code
from transformers import pipeline 
import torch
Code
translator = pipeline(task="translation",
                      model="facebook/nllb-200-distilled-600M",
                      torch_dtype=torch.bfloat16) 

Model info: ‘nllb-200-distilled-600M’ (NLLB: No Language Left Behind).

Code
text = """Change will not come if we wait for some other person or some other time. We are the ones we've been waiting for. We are the change that we seek."""
Code
text_translated = translator(text,
                             src_lang="eng_Latn",
                             tgt_lang="fra_Latn")

To translate between other languages, look up the language codes on the page Languages in FLORES-200.

For example:

- Afrikaans: afr_Latn
- Chinese: zho_Hans
- Egyptian Arabic: arz_Arab
- French: fra_Latn
- German: deu_Latn
- Greek: ell_Grek
- Hindi: hin_Deva
- Indonesian: ind_Latn
- Italian: ita_Latn
- Japanese: jpn_Jpan
- Korean: kor_Hang
- Mandarin Chinese (Standard Beijing): cmn_Hans
- Mandarin Chinese (Taiwanese): cmn_Hant
- Persian: pes_Arab
- Portuguese: por_Latn
- Russian: rus_Cyrl
- Spanish: spa_Latn
- Swahili: swh_Latn
- Thai: tha_Thai
- Turkish: tur_Latn
- Vietnamese: vie_Latn
- Yue Chinese (Hong Kong Cantonese): yue_Hant
- Zulu: zul_Latn
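For convenience, a small name-to-code mapping saves looking the codes up each time. This is a hand-made helper, not part of transformers; it covers only a subset of the codes listed above.

```python
# Convenience mapping over a handful of FLORES-200 codes
FLORES = {
    "French": "fra_Latn", "German": "deu_Latn", "Hindi": "hin_Deva",
    "Japanese": "jpn_Jpan", "Korean": "kor_Hang", "Spanish": "spa_Latn",
}

def flores_code(language):
    try:
        return FLORES[language]
    except KeyError:
        raise ValueError(f"Unknown language {language!r}; see the FLORES-200 list")

# Usage with the translator above:
# translator(text, src_lang="eng_Latn", tgt_lang=flores_code("German"))
print(flores_code("German"))  # deu_Latn
```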

Code
text_translated
[{'translation_text': 'Le changement ne viendra pas si nous attendons une autre personne ou un autre moment. Nous sommes ceux que nous avons attendus. Nous sommes le changement que nous cherchons.'}]
Code
%%time
print(text)
translator(text, src_lang="eng_Latn", tgt_lang="hin_Deva")
Change will not come if we wait for some other person or some other time. We are the ones we've been waiting for. We are the change that we seek.
CPU times: user 25 s, sys: 135 ms, total: 25.1 s
Wall time: 25.5 s
[{'translation_text': 'परिवर्तन नहीं आएगा अगर हम किसी और व्यक्ति या किसी अन्य समय का इंतजार करें. हम वही हैं जिनका हम इंतजार कर रहे हैं. हम वही बदलाव हैं जिसकी हम तलाश कर रहे हैं।'}]
Code
%%time
print(text)
translator(text, src_lang="eng_Latn", tgt_lang="kor_Hang")
Change will not come if we wait for some other person or some other time. We are the ones we've been waiting for. We are the change that we seek.
CPU times: user 18.9 s, sys: 102 ms, total: 19 s
Wall time: 19.1 s
[{'translation_text': '변화는 다른 사람을 기다린다면 오지 않을 것입니다. 우리는 우리가 기다린 사람들입니다. 우리는 우리가 추구하는 변화입니다.'}]
Code
%%time
print(text)
translator(text, src_lang="eng_Latn", tgt_lang="yue_Hant")
Change will not come if we wait for some other person or some other time. We are the ones we've been waiting for. We are the change that we seek.
CPU times: user 13.6 s, sys: 71.4 ms, total: 13.7 s
Wall time: 13.8 s
[{'translation_text': '如果我哋等到另一個人或者其他時間 我哋就唔會改變'}]
Code
del translator
gc.collect()
165

21.3.8 Text Generation

Code
# No need to run the code below as symlinks have been defined by the Jupyterhub team

# Set default locations for downloaded models
# import os

# if os.name != 'nt':  # Do this only if in a non-Windows environment
    
#     if 'instructor' in os.getcwd(): # Set default model locations when logged in as instructor
#         os.environ['TRANSFORMERS_CACHE'] = '/home/instructor/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/instructor/shared/huggingface'
            
#     else: # Set default model locations when logged in as a student
#         os.environ['TRANSFORMERS_CACHE'] = '/home/jovyan/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/jovyan/shared/huggingface'
        
Code
generator = pipeline("text-generation")

generator("In this course, we will teach you how to", max_length = 100, num_return_sequences=4)
No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': "In this course, we will teach you how to create, use, and use a system for creating your own apps, with some guidance from our CTO, Andrew.\n\nIf you would like to make a suggestion for a new course, please do not hesitate to contact me.\n\nDisclaimer\n\nBy clicking 'help' we represent and encourage your interest in the development of our app, and if you find it useful or helpful please help out. We also take all advice from our technical"},
 {'generated_text': 'In this course, we will teach you how to apply the knowledge contained in your own training to make your own unique, unique work.\n\nClick Here to Register!\n\nAs you can see, one of the most common mistakes people make is to think that no one is doing your training. It is far from true. In this course, you will teach all you need to be a successful professional:\n\n• Know the exact workout and set up. This will be your foundation for training'},
 {'generated_text': 'In this course, we will teach you how to make a solid foundation for a successful and successful business.\n\n\nWe will build your website in the time of your choice, so that it helps you out and not hinder you. We will also explain how to build a solid community to get started building your website.\n\n\nWe will teach you how to keep yourself on the right track, and how to get you going fast. We will also describe how to design a good social media campaign. We'},
 {'generated_text': 'In this course, we will teach you how to build a strong online presence through social media, in your daily routine, and how to keep your own personal information secure on social media.\n\nLearn to build a strong online presence by making it your central field of inquiry, a key value in your company, and especially a tool of collaboration.\n\nLearn to build a strong online presence by making it your central field of inquiry, a key value in your company, and especially a tool of collaboration'}]
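The four different continuations come from sampling rather than greedy decoding: at each step the model draws the next token from a temperature-scaled, optionally top-k-truncated distribution. The sketch below illustrates that mechanism on toy logits; the function and numbers are illustrative, not the transformers library's internal implementation.

```python
import math
import random

def sample_top_k(logits, k=2, temperature=1.0, rng=None):
    """Illustrative top-k sampling: keep the k largest logits, apply
    temperature, softmax, then draw one token index."""
    rng = rng or random.Random(0)
    scaled = [x / temperature for x in logits]
    top = sorted(range(len(scaled)), key=lambda i: scaled[i])[-k:]  # k best ids
    m = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - m) for i in top]  # unnormalized softmax
    return rng.choices(top, weights=weights, k=1)[0]

token = sample_top_k([2.0, 0.5, 1.5, -1.0], k=2)
print(token)  # one of 0 or 2 (the two highest-logit positions)
```

Lower temperatures sharpen the distribution toward the top token; higher temperatures (and larger k) produce more varied text, which is why repeated calls give different continuations.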
Code
del generator
gc.collect()
172

21.3.9 Zero Shot Classification

Code
# Set default locations for downloaded models
# import os

# if os.name != 'nt':  # Do this only if in a non-Windows environment
    
#     if 'instructor' in os.getcwd(): # Set default model locations when logged in as instructor
#         os.environ['TRANSFORMERS_CACHE'] = '/home/instructor/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/instructor/shared/huggingface'
            
#     else: # Set default model locations when logged in as a student
#         os.environ['TRANSFORMERS_CACHE'] = '/home/jovyan/shared/huggingface'
#         os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/home/jovyan/shared/huggingface'
        
Code
from transformers import pipeline

classifier = pipeline("zero-shot-classification")

classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445993661880493, 0.1119738519191742, 0.04342673718929291]}
Code
classifier(mytext, candidate_labels=["education", "politics", "business"])
{'sequence': '\nPanther Labs, an early stage startup that specializes in detection and response analytics, has raised a whopping $120 million in a new round of funding led by hedge fund Coatue Management.\n\nPanther Labs said the Series B investment was raised at a $1.4 billion valuation, putting the company among a growing list of ‘unicorn’ cybersecurity startups.\n\nIn addition to Coatue Management, Panther Labs scored investments from ICONIQ Growth and Snowflake Ventures along with money from existing investors Lightspeed Venture Partners, S28 Capital, and Innovation Endeavors.\n\nThe company previously raised $15 million in a September 2020 Series A round.\n\nThe San Francisco firm, which was founded by Airbnb and AWS alumni, styles itself as a “cloud-scale security analytics platform” that helps organizations prevent breaches by providing actionable insights from large volumes of data.\n\nThe Panther product can be used by security teams to perform continuous security monitoring, gain security visibility across cloud and on-premise infrastructure, and build data lakes for incident response investigations.\n\nIn the last year, Panther claims its customer roster grew by 300 percent, including deals with big companies like Dropbox, Zapier and Snyk.\n\nPanther Labs said the new funding will be used to speed up product development, expand go-to-marketing initiatives and scale support for its customers.\n\nRelated: Panther Labs Launches Open-Source Cloud-Native SIEM\n\nRelated: CyCognito Snags $100 Million for Attack Surface Management\n',
 'labels': ['business', 'politics', 'education'],
 'scores': [0.8694897890090942, 0.06767456978559494, 0.0628357082605362]}
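Under the hood, zero-shot classification recasts the task as natural-language inference: the input is paired with one hypothesis per candidate label, and the NLI model's entailment probability becomes that label's score. The sketch below builds those hypotheses with the pipeline's default template string.

```python
# The pipeline pairs the input with one NLI hypothesis per candidate label.
# "This example is {}." is the pipeline's default hypothesis_template;
# classifier(...) accepts a hypothesis_template argument to customise it.
def build_hypotheses(labels, template="This example is {}."):
    return [template.format(label) for label in labels]

hyps = build_hypotheses(["education", "politics", "business"])
print(hyps[0])  # This example is education.
```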
Code
del classifier
gc.collect()
73

21.4 Audio Classification

Code
import librosa

# File path of the audio file
file_path = r"futuristic-timelapse-11951.mp3"

# Load the audio file
audio_data, sample_rate = librosa.load(file_path)

# Print the shape of the audio data and the sample rate
print("Shape of audio data:", audio_data.shape)
print("Sample rate:", sample_rate)
Shape of audio data: (2646720,)
Sample rate: 22050
Code
audio_data
array([ 1.6763806e-08,  1.8626451e-08,  3.7252903e-09, ...,
       -1.2981509e-06,  3.9851147e-07,  1.4643756e-06], dtype=float32)
Code
import librosa
from IPython.display import Audio

# File path of the audio file
# file_path = 'audio_file.mp3'

# Load the audio file
audio_data, sample_rate = librosa.load(file_path, sr=None)

# Play the audio
Audio(data=audio_data, rate=sample_rate)
Code
sample_rate
44100
Code
resampled_audio = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=16000)
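Resampling scales the number of samples by the ratio of the rates, target_sr / orig_sr. A quick sanity check on the arithmetic, using a roughly 120-second clip like the one loaded above (the duration is approximate, and librosa's exact output length can differ by a sample due to rounding):

```python
# Resampling scales the sample count by target_sr / orig_sr
duration_s = 120                            # roughly the clip length above
orig_sr, target_sr = 44100, 16000

n_orig = duration_s * orig_sr               # samples at the original rate
n_expected = n_orig * target_sr // orig_sr  # samples after resampling
print(n_expected)  # 1920000
```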
Code
from transformers import pipeline
Code
zero_shot_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused")
Code
candidate_labels = ["Sound of a dog",
                    "Sound of vacuum cleaner"]
Code
zero_shot_classifier(resampled_audio,
                     candidate_labels=candidate_labels)
[{'score': 0.990260899066925, 'label': 'Sound of a dog'},
 {'score': 0.009739157743752003, 'label': 'Sound of vacuum cleaner'}]
Code
candidate_labels = ["Sound of a child crying",
                    "Sound of vacuum cleaner",
                    "Sound of a bird singing",
                    "Sound of an airplane",
                    "Sound of techno music"]
Code
zero_shot_classifier(resampled_audio,
                     candidate_labels=candidate_labels)
[{'score': 0.9880968332290649, 'label': 'Sound of techno music'},
 {'score': 0.008278903551399708, 'label': 'Sound of an airplane'},
 {'score': 0.0028479923494160175, 'label': 'Sound of a child crying'},
 {'score': 0.0007249810150824487, 'label': 'Sound of a bird singing'},
 {'score': 5.125454845256172e-05, 'label': 'Sound of vacuum cleaner'}]
Code
# Free up memory
del zero_shot_classifier
gc.collect()
1284

21.4.1 Automatic Speech Recognition

Whisper large-v3 (2024): The distil-whisper/distil-small.en model used here is Hugging Face's distilled version of OpenAI's Whisper, optimised for English. For multilingual transcription or maximum accuracy, use Whisper large-v3 (openai/whisper-large-v3); it supports 99 languages and achieves near-human accuracy on many benchmarks. faster-whisper (pip install faster-whisper) is a CTranslate2-based reimplementation that runs roughly 4× faster at comparable accuracy.

Code
import librosa
from IPython.display import Audio

# File path of the audio file
# file_path = 'audio_file.mp3'

# Load the audio file
audio_data, sample_rate = librosa.load("stereo_file.wav", sr=None)

# Play the audio
Audio(data=audio_data, rate=sample_rate)
Code
sample_rate
16000
Code
from transformers import pipeline
Code
asr = pipeline(task="automatic-speech-recognition",
               model="distil-whisper/distil-small.en")
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Info about distil-whisper/distil-small.en

Code
asr.feature_extractor.sampling_rate
16000
Code
asr(audio_data)
{'text': ' Chapter 16 I might have told you of the beginning of this liaison in a few lines, but I wanted you to see every step by which we came. I too agree to whatever Marguerite wished.'}
Code
del asr
gc.collect()
36

21.5 Vision

Multimodal models in 2025: The vision tasks in this chapter (object detection, image captioning, VQA) use single-purpose models from 2022–2023. Modern multimodal foundation models can handle all of these tasks through a single interface:

| Model | Access | Strengths |
|---|---|---|
| GPT-4o | OpenAI API | Best overall vision + language; reads documents, charts, screenshots |
| Gemini 2.0 Flash | Google AI API | Fast multimodal; excellent at structured extraction from images |
| LLaVA / LLaVA-Next | HuggingFace / Ollama | Best open-source vision model; ollama pull llava |
| IDEFICS 2 | HuggingFaceM4/idefics2-8b (HuggingFace) | Open multimodal; strong on document understanding |

The DETR (object detection), BLIP (captioning), and BLIP-VQA models shown below remain good choices when you need a dedicated, fine-tunable pipeline for a specific vision task.

Setup the helper functions first

Code
# This is the helper.py from deeplearning.ai - functions created instead of loading helper module

import io
import matplotlib.pyplot as plt
import requests
import inflect
from PIL import Image

def load_image_from_url(url):
    return Image.open(requests.get(url, stream=True).raw)

def render_results_in_image(in_pil_img, in_results):
    plt.figure(figsize=(16, 10))
    plt.imshow(in_pil_img)

    ax = plt.gca()

    for prediction in in_results:

        x, y = prediction['box']['xmin'], prediction['box']['ymin']
        w = prediction['box']['xmax'] - prediction['box']['xmin']
        h = prediction['box']['ymax'] - prediction['box']['ymin']

        ax.add_patch(plt.Rectangle((x, y),
                                   w,
                                   h,
                                   fill=False,
                                   color="green",
                                   linewidth=2))
        ax.text(
           x,
           y,
           f"{prediction['label']}: {round(prediction['score']*100, 1)}%",
           color='red'
        )

    plt.axis("off")

    # Save the modified image to a BytesIO object
    img_buf = io.BytesIO()
    plt.savefig(img_buf, format='png',
                bbox_inches='tight',
                pad_inches=0)
    img_buf.seek(0)
    modified_image = Image.open(img_buf)

    # Close the plot to prevent it from being displayed
    plt.close()

    return modified_image

def summarize_predictions_natural_language(predictions):
    summary = {}
    p = inflect.engine()

    # Count detections per label
    for prediction in predictions:
        label = prediction['label']
        summary[label] = summary.get(label, 0) + 1

    # Build phrases like "four wine glasses"; inflect handles irregular
    # plurals ("person" -> "people") correctly, unlike appending "s"
    parts = [f"{p.number_to_words(count)} {p.plural(label, count)}"
             for label, count in summary.items()]

    if not parts:
        return "In this image, nothing was detected."
    if len(parts) == 1:
        return f"In this image, there are {parts[0]}."
    return ("In this image, there are "
            + ", ".join(parts[:-1]) + " and " + parts[-1] + ".")


##### To ignore warnings #####
import warnings
import logging
from transformers import logging as hf_logging

def ignore_warnings():
    # Ignore specific Python warnings
    warnings.filterwarnings("ignore", message="Some weights of the model checkpoint")
    warnings.filterwarnings("ignore", message="Could not find image processor class")
    warnings.filterwarnings("ignore", message="The `max_size` parameter is deprecated")

    # Adjust logging for libraries using the logging module
    logging.basicConfig(level=logging.ERROR)
    hf_logging.set_verbosity_error()

########
Code
from transformers import pipeline
Code
# Here is some code that suppresses warning messages.
from transformers.utils import logging
logging.set_verbosity_error()

# from helper import ignore_warnings
# ignore_warnings()

21.5.1 Object Detection

Code
od_pipe = pipeline("object-detection", "facebook/detr-resnet-50")

Info about facebook/detr-resnet-50

Explore more of the Hugging Face Hub for more object detection models

Code
from PIL import Image

raw_image = Image.open('20240321_194345.jpg')
raw_image

Code
import numpy as np
np.array(raw_image).shape
(2160, 2880, 3)
Code
np.array(raw_image)
array([[[152, 126, 109],
        [148, 122, 105],
        [143, 120, 102],
        ...,
        [ 98,  40,  29],
        [ 94,  37,  28],
        [ 94,  37,  28]],

       [[151, 126, 106],
        [149, 124, 104],
        [144, 122, 101],
        ...,
        [ 94,  36,  25],
        [ 93,  36,  27],
        [ 95,  38,  29]],

       [[149, 124, 102],
        [150, 125, 103],
        [146, 124, 101],
        ...,
        [ 92,  34,  23],
        [ 92,  35,  26],
        [ 95,  38,  29]],

       ...,

       [[ 57,  59,  58],
        [ 54,  56,  55],
        [ 56,  58,  57],
        ...,
        [ 11,  11,  11],
        [ 12,  14,  11],
        [ 13,  15,  12]],

       [[ 51,  53,  52],
        [ 50,  52,  51],
        [ 53,  55,  54],
        ...,
        [  7,   7,   5],
        [  7,   9,   6],
        [  8,  10,   7]],

       [[ 64,  66,  65],
        [ 63,  65,  64],
        [ 62,  64,  63],
        ...,
        [  5,   5,   3],
        [  6,   8,   5],
        [  7,   9,   6]]], dtype=uint8)
Code
pipeline_output = od_pipe(raw_image)
Code
pipeline_output
[{'score': 0.9962156414985657,
  'label': 'wine glass',
  'box': {'xmin': 678, 'ymin': 1975, 'xmax': 873, 'ymax': 2159}},
 {'score': 0.9778583645820618,
  'label': 'person',
  'box': {'xmin': 1122, 'ymin': 900, 'xmax': 1446, 'ymax': 2145}},
 {'score': 0.974013090133667,
  'label': 'wine glass',
  'box': {'xmin': 1039, 'ymin': 1154, 'xmax': 1174, 'ymax': 1355}},
 {'score': 0.9941200613975525,
  'label': 'person',
  'box': {'xmin': 2263, 'ymin': 620, 'xmax': 2879, 'ymax': 2143}},
 {'score': 0.9233799576759338,
  'label': 'person',
  'box': {'xmin': 2232, 'ymin': 976, 'xmax': 2317, 'ymax': 1074}},
 {'score': 0.9445099830627441,
  'label': 'person',
  'box': {'xmin': 1786, 'ymin': 823, 'xmax': 1946, 'ymax': 1124}},
 {'score': 0.9947138428688049,
  'label': 'wine glass',
  'box': {'xmin': 1684, 'ymin': 1374, 'xmax': 1873, 'ymax': 1735}},
 {'score': 0.9555243253707886,
  'label': 'cup',
  'box': {'xmin': 1409, 'ymin': 1278, 'xmax': 1537, 'ymax': 1539}},
 {'score': 0.97939133644104,
  'label': 'person',
  'box': {'xmin': 1412, 'ymin': 900, 'xmax': 1507, 'ymax': 1085}},
 {'score': 0.9982106685638428,
  'label': 'person',
  'box': {'xmin': 683, 'ymin': 590, 'xmax': 1312, 'ymax': 2138}},
 {'score': 0.9981060028076172,
  'label': 'person',
  'box': {'xmin': 2, 'ymin': 623, 'xmax': 971, 'ymax': 2139}},
 {'score': 0.9994183778762817,
  'label': 'person',
  'box': {'xmin': 1318, 'ymin': 697, 'xmax': 1993, 'ymax': 2136}},
 {'score': 0.9274911284446716,
  'label': 'person',
  'box': {'xmin': 1796, 'ymin': 819, 'xmax': 2003, 'ymax': 1215}},
 {'score': 0.9437280297279358,
  'label': 'wine glass',
  'box': {'xmin': 1410, 'ymin': 1270, 'xmax': 1539, 'ymax': 1541}},
 {'score': 0.9910680651664734,
  'label': 'person',
  'box': {'xmin': 1701, 'ymin': 675, 'xmax': 2766, 'ymax': 2135}}]
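The raw detections above mix confident boxes with marginal, near-duplicate ones (note the two overlapping wine-glass boxes around xmin 1409/1410). It is common to filter by score before counting; below is a small sketch using an arbitrary 0.9 threshold, not a DETR default.

```python
from collections import Counter

def filter_and_count(predictions, min_score=0.9):
    """Drop detections below min_score, then tally the surviving labels."""
    kept = [p for p in predictions if p["score"] >= min_score]
    return kept, Counter(p["label"] for p in kept)

# Toy predictions in the same shape as the pipeline output above
sample = [
    {"score": 0.996, "label": "wine glass", "box": {}},
    {"score": 0.520, "label": "person",     "box": {}},
    {"score": 0.978, "label": "person",     "box": {}},
]
kept, counts = filter_and_count(sample)
print(counts)  # Counter({'wine glass': 1, 'person': 1})
```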
Code
processed_image = render_results_in_image(
    raw_image, 
    pipeline_output)
Code
processed_image

Code
text = summarize_predictions_natural_language(pipeline_output)
Code
text
'In this image, there are four wine glasses, ten people and one cup.'
Code
# Free up memory
del od_pipe
gc.collect()
3460

21.5.2 Image Captioning

Code
from transformers import BlipForConditionalGeneration
from transformers import AutoProcessor
Code
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

Info about Salesforce/blip-image-captioning-base

Code
processor = AutoProcessor.from_pretrained(
    "Salesforce/blip-image-captioning-base")
Code
# Load image
from PIL import Image

image = Image.open("20240321_194345.jpg")

image

21.5.2.1 Conditional Image Captioning

Code
text = "a photograph of"
inputs = processor(image, text, return_tensors="pt")
Code
print(np.array(image).shape)
print(inputs['pixel_values'].shape)
(2160, 2880, 3)
torch.Size([1, 3, 384, 384])
Code
out = model.generate(**inputs)
/opt/conda/envs/mggy8413/lib/python3.11/site-packages/transformers/generation/utils.py:1133: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
Code
out
tensor([[30522,  1037,  9982,  1997,  1037,  2177,  1997,  2273,  5948,  4511,
           102]])
Code
print(processor.decode(out[0], skip_special_tokens=True))
a photograph of a group of men drinking wine

21.5.2.2 Unconditional Image Captioning

Code
inputs = processor(image,return_tensors="pt")
Code
out = model.generate(**inputs)
Code
print(processor.decode(out[0], skip_special_tokens=True))
a group of men standing around a table
Code
# Free up memory
del processor
del model
gc.collect()
183

21.6 Visual Question Answering

Alternative: LLaVA for open-ended VQA: The BLIP-VQA model below is specialised and efficient. For open-ended visual question answering that requires reasoning, LLaVA (Large Language and Vision Assistant) is the open-source alternative:

import ollama
response = ollama.chat(
    model='llava',
    messages=[{
        'role': 'user',
        'content': 'How many people are in this image?',
        'images': ['20240321_194345.jpg']
    }]
)
print(response['message']['content'])
Code
# Suppressing warnings
from transformers.utils import logging
logging.set_verbosity_error()

import warnings
warnings.filterwarnings("ignore", message="Using the model-agnostic default `max_length`")
Code
from transformers import BlipForQuestionAnswering
from transformers import AutoProcessor
Code
model = BlipForQuestionAnswering.from_pretrained(
    "Salesforce/blip-vqa-base")

Info about Salesforce/blip-vqa-base

Code
processor = AutoProcessor.from_pretrained(
    "Salesforce/blip-vqa-base")
Code
# Load image
from PIL import Image

image = Image.open('20240321_194345.jpg')

image

Code
# Write the question you want to ask to the model about the image.
question = "how many dogs are in the picture?"
Code
inputs = processor(image, question, return_tensors="pt")

out = model.generate(**inputs)
Code
print(processor.decode(out[0], skip_special_tokens=True))
0
Code
question = "how many people in the picture?"
inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
8
Code
question = "how many people wearing glasses in the picture?"
inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
3
Code
question = "what are the people in the picture doing?"
inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
drinking
Code
question = "what is the picture about?"
inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
people drinking
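Each question above repeats the same three-step pattern: encode the image–question pair with the processor, generate, then decode. As a sketch, that pattern can be factored into a small helper (the `ask` name is ours; `model` and `processor` are the BLIP objects loaded earlier):

```python
def ask(model, processor, image, question):
    """Encode the (image, question) pair, generate an answer, and decode it."""
    inputs = processor(image, question, return_tensors="pt")
    out = model.generate(**inputs)
    return processor.decode(out[0], skip_special_tokens=True)

# Usage with the objects loaded above:
# print(ask(model, processor, image, "what is the picture about?"))
```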
Code
# Free up memory
import gc

del processor
del model
gc.collect()
40

21.7 OPTIONAL - Chatbot

21.7.1 Build the chatbot pipeline using 🤗 Transformers Library

Initialize a chatbot instance and pass it a “conversation”. The conversation object encapsulates the chat history as text.

The chatbot responds and appends its response to the “conversation”. If you keep passing the updated conversation back, the chatbot keeps responding to its own replies.

Note: The Conversation class and conversational pipeline were deprecated in transformers 4.38 (early 2024) and removed in 4.45+. For multi-turn conversation with modern models, use the standard text-generation pipeline with a list of messages:

from transformers import pipeline
pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "What are some fun activities in New York?"},
]
reply = pipe(messages, max_new_tokens=256)
print(reply[0]['generated_text'][-1]['content'])

The facebook/blenderbot-400M-distill model and Conversation pattern below are preserved for historical reference but will fail on transformers >= 4.45.
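As a sketch of how multi-turn state works with the message-list format: there is no hidden memory; the full history is simply re-sent on every call. Here `chat_turn` and `generate_reply` are hypothetical names, and the toy echo function stands in for the pipeline call above so the example runs without downloading a model:

```python
def chat_turn(messages, user_text, generate_reply):
    """Append the user's message, get a reply, and record it in the history."""
    messages.append({"role": "user", "content": user_text})
    reply = generate_reply(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply

messages = [{"role": "system", "content": "You are a helpful assistant."}]
echo = lambda msgs: "(reply to: %s)" % msgs[-1]["content"]  # toy generator
chat_turn(messages, "What are some fun activities in New York?", echo)
chat_turn(messages, "What else do you recommend?", echo)
# messages now holds system + 2 user + 2 assistant entries; passing the whole
# list to the pipeline on each turn is what gives the model its "memory".
```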

Code
# Here is some code that suppresses warning messages.
from transformers.utils import logging
logging.set_verbosity_error()
Code
from transformers import pipeline
import torch
  • Define the conversation pipeline
Code
chatbot = pipeline(task="conversational",
                   model="facebook/blenderbot-400M-distill")
config.json: 100%|██████████| 1.57k/1.57k [00:00<00:00, 1.11MB/s]
pytorch_model.bin: 100%|██████████| 730M/730M [00:13<00:00, 55.5MB/s] 
generation_config.json: 100%|██████████| 347/347 [00:00<00:00, 1.03MB/s]
tokenizer_config.json: 100%|██████████| 1.15k/1.15k [00:00<00:00, 3.85MB/s]
vocab.json: 100%|██████████| 127k/127k [00:00<00:00, 113MB/s]
merges.txt: 100%|██████████| 62.9k/62.9k [00:00<00:00, 71.4MB/s]
added_tokens.json: 100%|██████████| 16.0/16.0 [00:00<00:00, 42.1kB/s]
special_tokens_map.json: 100%|██████████| 772/772 [00:00<00:00, 1.64MB/s]
tokenizer.json: 100%|██████████| 310k/310k [00:00<00:00, 6.31MB/s]

Info about ‘blenderbot-400M-distill’

Code
user_message = """
What are some fun activities I can do in the summer in New York?
"""
Code
from transformers import Conversation
Code
conversation = Conversation(user_message)
Code
print(conversation)
Conversation id: 3d02c9a1-8d0e-40d1-9e8c-42194d5013bb
user: 
What are some fun activities I can do in the summer in New York?

Code
conversation = chatbot(conversation)
conversation
Conversation id: 3d02c9a1-8d0e-40d1-9e8c-42194d5013bb
user: 
What are some fun activities I can do in the summer in New York?

assistant:  I'm not sure, but I know that New York is the most populous city in the United States.
Code
conversation = chatbot(conversation)
conversation
Conversation id: 3d02c9a1-8d0e-40d1-9e8c-42194d5013bb
user: 
What are some fun activities I can do in the summer in New York?

assistant:  I'm not sure, but I know that New York is the most populous city in the United States.
assistant:  New York City has a population of 8,537,673. That's a lot of people!
Code
conversation = chatbot(conversation)
Code
print(conversation)
Conversation id: 3d02c9a1-8d0e-40d1-9e8c-42194d5013bb
user: 
What are some fun activities I can do in the summer in New York?

assistant:  I'm not sure, but I know that New York is the most populous city in the United States.
assistant:  New York City has a population of 8,537,673. That's a lot of people!
assistant:  I know! It's also the most densely populated major city in North America.
  • You can send the chatbot a follow-up question with:
print(chatbot(Conversation("What else do you recommend?")))
  • However, the chatbot may respond with something unrelated, because a fresh Conversation object carries no memory of prior turns.

  • To include the previous chat history in the LLM’s context, add a new message to the existing conversation object instead.

Code
conversation.add_message(
    {"role": "user",
     "content": """
Where all have you been?
"""
    })
Code
print(conversation)
Conversation id: 3d02c9a1-8d0e-40d1-9e8c-42194d5013bb
user: 
What are some fun activities I can do in the summer in New York?

assistant:  I'm not sure, but I know that New York is the most populous city in the United States.
assistant:  New York City has a population of 8,537,673. That's a lot of people!
assistant:  I know! It's also the most densely populated major city in North America.
user: 
Where all have you been?

Code
conversation = chatbot(conversation)

print(conversation)
Conversation id: 3d02c9a1-8d0e-40d1-9e8c-42194d5013bb
user: 
What are some fun activities I can do in the summer in New York?

assistant:  I'm not sure, but I know that New York is the most populous city in the United States.
assistant:  New York City has a population of 8,537,673. That's a lot of people!
assistant:  I know! It's also the most densely populated major city in North America.
user: 
Where all have you been?

assistant:  I've been to Manhattan, Brooklyn, Queens, The Bronx, and Staten Island.