BLEU score in Python from scratch

The BLEU score is a metric used to evaluate the quality of machine-generated translations by comparing them to one or more reference translations. In this article, we will explore three different ways to calculate the BLEU score in Python.

Option 1: Using NLTK

The Natural Language Toolkit (NLTK) is a widely used library for natural language processing in Python. It provides various tools and algorithms for text processing, including the calculation of the BLEU score.

import nltk

def calculate_bleu_score(candidate, references):
    # sentence_bleu expects token lists: references first, candidate second
    candidate = candidate.split()
    references = [reference.split() for reference in references]
    return nltk.translate.bleu_score.sentence_bleu(references, candidate)

candidate = "The cat is on the mat"
references = ["The cat is sitting on the mat", "The cat is lying on the mat"]
bleu_score = calculate_bleu_score(candidate, references)
print("BLEU Score:", bleu_score)

In this code, we first import the NLTK library. Then, we define a function calculate_bleu_score that takes a candidate translation and a list of reference translations as input. We split the candidate and reference translations into lists of tokens and pass them to NLTK's sentence_bleu function, with the references as the first argument and the candidate as the second. Finally, we print the BLEU score. Note that sentence_bleu uses n-grams up to n = 4 by default, so for short sentences with no matching 4-grams (as in this example) the unsmoothed score collapses to nearly zero and NLTK emits a warning.
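To avoid that near-zero collapse, NLTK provides a SmoothingFunction whose methods redistribute a small amount of probability mass to zero-count n-grams. A minimal sketch using method1 (which replaces zero numerators with a small epsilon):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

candidate = "The cat is on the mat".split()
references = [r.split() for r in ["The cat is sitting on the mat",
                                  "The cat is lying on the mat"]]

# method1 substitutes a small epsilon for zero n-gram counts,
# so the geometric mean no longer collapses to zero
smoother = SmoothingFunction()
score = sentence_bleu(references, candidate,
                      smoothing_function=smoother.method1)
print("Smoothed BLEU:", score)
```

Other smoothing methods (method2 through method7) implement the variants surveyed by Chen and Cherry; which one is appropriate depends on the evaluation setup.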

Option 2: Implementing the BLEU Score Algorithm

If you prefer to implement the BLEU score algorithm from scratch without using any external libraries, you can follow this approach.
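For reference, the quantity we are implementing is the standard BLEU definition from Papineni et al.: a geometric mean of modified n-gram precisions $p_n$ (here up to $n = 4$, with uniform weights $w_n = 1/4$), scaled by a brevity penalty:

```latex
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\qquad
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{4} w_n \log p_n\right)
```

where $c$ is the candidate length and $r$ is the effective reference length (the reference length closest to $c$).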

import math
from collections import Counter

def ngram_counts(tokens, n):
    # Count every n-gram (as a tuple) in a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def calculate_bleu_score(candidate, references):
    candidate = candidate.split()
    references = [reference.split() for reference in references]

    candidate_length = len(candidate)
    # Effective reference length: the reference length closest to the
    # candidate's (ties broken toward the shorter reference)
    reference_length = min(
        (len(reference) for reference in references),
        key=lambda length: (abs(length - candidate_length), length),
    )

    log_precisions = []
    for n in range(1, 5):
        candidate_counts = ngram_counts(candidate, n)

        # Clip each candidate n-gram count by the maximum number of times
        # that n-gram appears in any single reference
        max_reference_counts = Counter()
        for reference in references:
            for ngram, count in ngram_counts(reference, n).items():
                max_reference_counts[ngram] = max(max_reference_counts[ngram], count)

        clipped = sum(min(count, max_reference_counts[ngram])
                      for ngram, count in candidate_counts.items())
        total = sum(candidate_counts.values())

        if clipped == 0:
            return 0.0  # any zero n-gram precision makes the geometric mean zero
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalize candidates shorter than the reference
    brevity_penalty = (1.0 if candidate_length > reference_length
                       else math.exp(1 - reference_length / candidate_length))

    return brevity_penalty * math.exp(sum(log_precisions) / 4)

candidate = "The cat is on the mat"
references = ["The cat is sitting on the mat", "The cat is lying on the mat"]
bleu_score = calculate_bleu_score(candidate, references)
print("BLEU Score:", bleu_score)

In this code, we first split the candidate and reference translations into token lists and record their lengths; the effective reference length is the one closest to the candidate's length. For each n from 1 to 4, we count the candidate's n-grams and clip each count by the maximum number of times that n-gram appears in any single reference, which prevents a candidate from being rewarded for repeating a word more often than any reference does. Dividing the summed clipped counts by the total candidate n-gram count gives the modified precision p_n. Finally, we compute the brevity penalty, which adjusts the score downward when the candidate is shorter than the reference, and multiply it by the geometric mean of the four precisions. Note that if any p_n is zero, the geometric mean, and thus the whole score, is zero: with the example sentences above, no 4-gram of the candidate appears in either reference, so the unsmoothed score is 0.0.
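The clipping step is easiest to see on the classic degenerate case from the original BLEU paper: a candidate that just repeats a common word. Each word is credited at most as many times as it appears in the reference:

```python
from collections import Counter

# Degenerate candidate that repeats "the"; without clipping, its
# unigram precision would be a perfect 7/7
candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()

candidate_counts = Counter(candidate)   # {'the': 7}
reference_counts = Counter(reference)   # 'the' appears only twice

# Clip each candidate count by its count in the reference
clipped = sum(min(count, reference_counts[word])
              for word, count in candidate_counts.items())
precision = clipped / len(candidate)    # 2 / 7, not 7 / 7
print("Clipped unigram precision:", precision)
```

Clipping is what makes BLEU's precision "modified": it rewards covering the reference, not spamming its most frequent tokens.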

Option 3: Using the SacreBLEU Library

The SacreBLEU library is a popular choice for calculating the BLEU score in Python. Unlike the options above, it takes untokenized strings and applies its own standardized tokenization internally, which makes scores reproducible across papers and toolkits.

import sacrebleu

def calculate_bleu_score(candidates, reference_streams):
    # corpus_bleu expects a list of hypothesis strings and a list of
    # reference streams: reference_streams[i] holds the i-th reference
    # for every hypothesis
    return sacrebleu.corpus_bleu(candidates, reference_streams).score

candidates = ["The cat is on the mat"]
reference_streams = [["The cat is sitting on the mat"],
                     ["The cat is lying on the mat"]]
bleu_score = calculate_bleu_score(candidates, reference_streams)
print("BLEU Score:", bleu_score)

In this code, we import the SacreBLEU library and use its corpus_bleu function. Its inputs are shaped differently from NLTK's: the first argument is a list of untokenized hypothesis strings, and the second is a list of reference streams, where the i-th stream contains the i-th reference for every hypothesis. The returned object's score attribute is on a 0-100 scale, unlike NLTK's 0-1. Finally, we print the BLEU score.

After exploring these three options, the SacreBLEU library stands out as the simplest and most reproducible solution for calculating the BLEU score in Python. It abstracts away the details of tokenization and the algorithm itself and provides a straightforward interface. Therefore, option 3 is the recommended choice for calculating the BLEU score in Python.
