The Road to LLM Advent Calendar 2023, Day 5: Measuring Sentence Similarity & Extractive Summarization

1) Understanding Sentence Similarity

When we say two sentences are similar, humans often look at shared words or overall meaning:

The cat is sleeping on my chair.
The cat sleeps in my bed.

Both sentences share the subject (“cat”) and the action (“sleep”), even though the locations differ. Computers, however, need numbers to “understand” language. We convert words or sentences into distributed representations, i.e., vectors, and then compare those vectors.
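
As a toy illustration, a simple bag-of-words encoding turns each sentence into a vector of word counts over a shared vocabulary. This is a minimal sketch of the idea (real systems use learned embeddings); notice that "sleeping" and "sleeps" count as unrelated words here, which is one reason learned embeddings work better in practice:

sentences = [
    "the cat is sleeping on my chair",
    "the cat sleeps in my bed",
]

# Build a shared vocabulary, then represent each sentence as word counts
vocab = sorted({word for s in sentences for word in s.split()})
vectors = [[s.split().count(word) for word in vocab] for s in sentences]

print(vocab)
# → ['bed', 'cat', 'chair', 'in', 'is', 'my', 'on', 'sleeping', 'sleeps', 'the']
print(vectors[0])
# → [0, 1, 1, 0, 1, 1, 1, 1, 0, 1]
print(vectors[1])
# → [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]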

Cosine Similarity

The most common metric for comparing two vectors x and y is cosine similarity:

$$\cos(\theta) = \frac{\langle x,\,y\rangle}{\|x\|\;\|y\|}$$

  • ⟨x, y⟩ is the dot product, which reflects both the vectors' magnitudes and how well their directions align.
  • ‖x‖ and ‖y‖ are the lengths (magnitudes) of x and y.

The result ranges from -1 (opposite) through 0 (orthogonal) to 1 (identical direction).

Visual Intuition

  • 1.0: vectors point in exactly the same direction
  • 0.0: vectors are perpendicular (no similarity)
  • –1.0: vectors point in opposite directions

Code Example: Computing Cosine Similarity in NumPy

import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Similar example
x_vect = np.array([0.1, 0.13])
y_vect = np.array([0.02, 0.2])
print(cosine_similarity(x_vect, y_vect))
# → 0.8493

# Orthogonal example: the directions (10, -1) and (1, 10) are perpendicular,
# and cosine similarity is scale-invariant, so the tiny magnitudes don't matter
x_vect = np.array([1e-15, -1e-16])
y_vect = np.array([0.02, 0.2])
print(cosine_similarity(x_vect, y_vect))
# → ~0.0

# Opposite-direction example
x_vect = np.array([-0.1, -0.1])
y_vect = np.array([0.02, 0.2])
print(cosine_similarity(x_vect, y_vect))
# → -0.7739

2) Extractive Summarization with TF-IDF

Whereas similarity measures how two texts relate, summarization condenses a longer text into its most important points. There are two broad methods:

  1. Extractive Summarization
    Selects and concatenates sentences directly from the original text. Preserves exact wording.
  2. Abstractive Summarization
    Generates new sentences that capture the gist. Used by modern generative models.

In this tutorial, we focus on extractive summarization using TF-IDF combined with cosine similarity.

TF-IDF Refresher

  • Term Frequency (TF): TF(t, d) = (count of term t in document d) / (total terms in d). Higher when a term appears more often in that document.
  • Inverse Document Frequency (IDF): IDF(t) = log(N / (number of documents containing t)). Higher when a term is rare across the corpus.
  • TF-IDF Score: TF-IDF(t, d) = TF(t, d) × IDF(t). Highlights terms that are both common in one document and rare across the corpus.
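
To make these formulas concrete, here is a minimal hand-rolled sketch of the textbook definitions above (note that scikit-learn's TfidfVectorizer, used in the next example, applies smoothing and normalization by default, so its numbers will differ):

import math

def tf(term, doc):
    # Relative frequency of `term` in `doc` (a list of tokens)
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Log of total documents over documents containing `term`
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

corpus = [
    "the cat is sleeping on my chair".split(),
    "the cat sleeps in my bed".split(),
]

for term in ["cat", "chair"]:
    print(term, round(tf(term, corpus[0]) * idf(term, corpus), 4))
# → cat 0.0     ("cat" appears in every document, so IDF = log(1) = 0)
# → chair 0.099 ("chair" is rare across the corpus, so it scores higher)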

Code Example: Extractive Summarization in Python

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "We are demonstrating summarization using TF-IDF.",
    "Summarization means extracting the key points of a text.",
    "Extractive summarization uses the original text unchanged.",
    "TF-IDF (Term Frequency-Inverse Document Frequency) checks term importance."
]

# 1. Vectorize the documents
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# 2. Compute pairwise cosine similarities
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# 3. Choose a target sentence to summarize
target_idx = 0  # summarizing the first document

# 4. Sort sentences by similarity (excluding itself)
sims = list(enumerate(cosine_sim[target_idx]))
sims = sorted(sims, key=lambda x: x[1], reverse=True)

# 5. Pick the top sentence(s) as the summary
n = 1
summary = [documents[i] for i, _ in sims[1:n+1]]
print("\n".join(summary))
# → "TF-IDF (Term Frequency-Inverse Document Frequency) checks term importance."

Conclusion

  • Sentence similarity: embed text as vectors and measure cosine similarity to quantify likeness.
  • Extractive summarization: use TF-IDF to score and select the most representative sentences.

Stay tuned!

