1) Understanding Sentence Similarity
When we say two sentences are similar, humans often look at shared words or overall meaning:
The cat is sleeping on my chair.
The cat sleeps in my bed.
Both sentences share the subject (“cat”) and the action (“sleep”), even though the locations differ. Computers, however, need numbers to “understand” language. We convert words or sentences into distributed representations, i.e., vectors, and then compare those vectors.
Cosine Similarity
The most common metric for comparing two vectors x and y is cosine similarity:
$$\cos(\theta) = \frac{\langle x,\,y\rangle}{\|x\|\;\|y\|}$$
⟨x, y⟩ is the dot product of x and y, and ‖x‖ and ‖y‖ are their lengths (magnitudes). The dot product reflects both the vectors' magnitudes and the angle between them; dividing by the lengths removes the effect of magnitude and leaves only the directional agreement.
The result ranges from -1 (opposite) through 0 (orthogonal) to 1 (identical direction).
Visual Intuition
- 1.0: vectors point in exactly the same direction
- 0.0: vectors are perpendicular (no similarity)
- -1.0: vectors point in opposite directions
Code Example: Computing Cosine Similarity in NumPy
import numpy as np
def cosine_similarity(x, y):
    # Dot product divided by the product of the vector lengths
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
# Similar example
x_vect = np.array([0.1, 0.13])
y_vect = np.array([0.02, 0.2])
print(cosine_similarity(x_vect, y_vect))
# → 0.8493
# Orthogonal example: the directions are perpendicular, so cosine ≈ 0 (the tiny scale of x doesn't matter, only direction does)
x_vect = np.array([1e-15, -1e-16])
y_vect = np.array([0.02, 0.2])
print(cosine_similarity(x_vect, y_vect))
# → ~0.0
# Opposite-direction example
x_vect = np.array([-0.1, -0.1])
y_vect = np.array([0.02, 0.2])
print(cosine_similarity(x_vect, y_vect))
# → -0.7739
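Code Example: Cosine Similarity on Sentence Embeddings (optional sketch)
The vectors above are toy numbers; in practice, x and y would be sentence embeddings. The sketch below is one way to compare the two example sentences from the introduction. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, neither of which is otherwise used in this tutorial, and the exact score will depend on the chosen model.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed extra dependency

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Encode both sentences into dense vectors (distributed representations)
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
embeddings = model.encode([
    "The cat is sleeping on my chair.",
    "The cat sleeps in my bed.",
])
# Paraphrases like these should score noticeably higher than unrelated pairs
print(cosine_similarity(embeddings[0], embeddings[1]))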
2) Extractive Summarization with TF-IDF
Whereas similarity measures how two texts relate, summarization condenses a longer text into its most important points. There are two broad methods:
- Extractive Summarization: selects and concatenates sentences directly from the original text, preserving the exact wording.
- Abstractive Summarization: generates new sentences that capture the gist; this is the approach used by modern generative models.
In this tutorial, we focus on extractive summarization using TF-IDF combined with cosine similarity.
TF-IDF Refresher
- Term Frequency (TF): TF(t, d) = (count of term t in document d) / (total terms in d). Higher when a term appears more often in that document.
- Inverse Document Frequency (IDF): IDF(t) = log(N / (number of documents containing t)), where N is the total number of documents. Higher when a term is rare across the corpus.
- TF-IDF Score: TF-IDF(t, d) = TF(t, d) × IDF(t). Highlights terms that are frequent in one document yet rare across the corpus (a small worked example follows this list).
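Code Example: TF-IDF by Hand
To make the formulas concrete, here is a minimal worked sketch that computes TF, IDF, and TF-IDF directly from the definitions above for a tiny two-document corpus. Note that scikit-learn's TfidfVectorizer (used in the next example) applies smoothing and normalization, so its exact numbers will differ.
import math

docs = [
    ["the", "cat", "sleeps", "on", "the", "chair"],
    ["the", "cat", "sleeps", "in", "the", "bed"],
]

def tf(term, doc):
    # (count of term t in document d) / (total terms in d)
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(N / number of documents containing t)
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" occurs often but appears in every document, so IDF = 0 and TF-IDF = 0
print(tf_idf("the", docs[0], docs))    # → 0.0
# "chair" is rare across the corpus, so it gets a positive score
print(tf_idf("chair", docs[0], docs))  # → (1/6) * log(2) ≈ 0.1155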
Code Example: Extractive Summarization in Python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
documents = [
    "We are demonstrating summarization using TF-IDF.",
    "Summarization means extracting the key points of a text.",
    "Extractive summarization uses the original text unchanged.",
    "TF-IDF (Term Frequency-Inverse Document Frequency) checks term importance."
]
# 1. Vectorize the documents
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
# 2. Compute pairwise cosine similarities
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
# 3. Choose a target sentence to summarize
target_idx = 0 # summarizing the first document
# 4. Rank all sentences by similarity to the target (the target itself ranks first with similarity 1.0)
sims = list(enumerate(cosine_sim[target_idx]))
sims = sorted(sims, key=lambda x: x[1], reverse=True)
# 5. Skip the target itself and pick the top n sentence(s) as the summary
n = 1
summary = [documents[i] for i, _ in sims[1:n+1]]
print("\n".join(summary))
# → "TF-IDF (Term Frequency-Inverse Document Frequency) checks term importance."
Conclusion
- Sentence similarity: embed text as vectors and measure cosine similarity to quantify likeness.
- Extractive summarization: use TF-IDF to score and select the most representative sentences.
Stay tuned!
