Any way to highlight given words in a PDF document via Python?

When working with PDF documents in Python, there are several ways to highlight specific words or phrases. In this article, we will explore three different approaches to achieve this task.

Option 1: PyPDF2 and ReportLab

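The first option uses the PyPDF2 library to extract the text of each page and the ReportLab library to draw a highlighted marker that is stamped onto pages containing a match. PyPDF2's plain text extraction does not report where a word sits on the page, so this approach flags matching pages rather than highlighting each word in place.

import io

import PyPDF2
from reportlab.lib.colors import black, yellow
from reportlab.pdfgen import canvas

def make_overlay(page, found):
    # Build a one-page PDF in memory, sized like the source page,
    # carrying a yellow box that lists the matched words.
    width = float(page.mediabox.width)
    height = float(page.mediabox.height)
    label = ", ".join(found)
    buf = io.BytesIO()
    c = canvas.Canvas(buf, pagesize=(width, height))
    c.setFont("Helvetica", 12)
    c.setFillColor(yellow)
    c.rect(98, 96, c.stringWidth(label, "Helvetica", 12) + 4, 16, stroke=0, fill=1)
    c.setFillColor(black)
    c.drawString(100, 100, label)
    c.save()
    buf.seek(0)
    return PyPDF2.PdfReader(buf).pages[0]

def highlight_pdf(input_file, output_file, words):
    # PdfReader/PdfWriter replace the older PdfFileReader/PdfFileWriter,
    # which were removed in PyPDF2 3.0.
    reader = PyPDF2.PdfReader(input_file)
    writer = PyPDF2.PdfWriter()

    for page in reader.pages:
        text = page.extract_text() or ""
        found = [word for word in words if word in text]
        if found:
            # Stamp the marker on top of the original page content.
            page.merge_page(make_overlay(page, found))
        # Keep every page, matched or not, so the output is complete.
        writer.add_page(page)

    with open(output_file, "wb") as f:
        writer.write(f)

In this code snippet, we first import the necessary libraries. The helper make_overlay builds a one-page PDF in memory with ReportLab: it sets the fill color to yellow, draws a filled rectangle at a fixed position, and writes the matched words on top of it in black.

The highlight_pdf function takes the input file path, the output file path, and a list of words to highlight as parameters. We use PyPDF2 to read the input PDF and create a writer object, then iterate over the pages and extract the text of each one. If any of the words occurs in that text, we merge the overlay onto the page. Every page, matched or not, is added to the output so that no pages are dropped.

Finally, we write the output PDF to disk. Note that the marker always appears at the same spot (100 points from the left and bottom edges of the page), not over the word itself.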
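
The `word in text` test used in highlight_pdf is a plain substring check, so searching for "cat" also matches "concatenate", and "Python" does not match "python". If that matters, a small helper along the following lines could stand in for it. This is only a sketch: find_words is a hypothetical name, not part of PyPDF2 or ReportLab, and it assumes that whole-word, case-insensitive matching is what you want.

```python
import re

def find_words(text, words):
    """Return the entries of `words` that occur in `text` as whole words,
    ignoring case. (Hypothetical helper, not part of any PDF library.)"""
    found = []
    for word in words:
        # The lookarounds require that no word character touches the match
        # on either side; re.escape keeps characters such as "+" or "."
        # in the search term from being read as regex syntax.
        pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(word)
    return found
```

The result keeps the order of the input list, so it can be used directly to decide whether a page needs a marker and which words to list on it.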

Option 2: PyMuPDF

The second option uses the PyMuPDF library, which can both locate text on a page and add annotations to it.

import fitz  # PyMuPDF

def highlight_pdf(input_file, output_file, words):
    doc = fitz.open(input_file)

    for page in doc:
        for word in words:
            # search_for returns one rectangle per occurrence on the page
            # (an empty list if there is none); matching is case-insensitive.
            for rect in page.search_for(word):
                page.add_highlight_annot(rect)

    # Save to a different path than the input file.
    doc.save(output_file)
    doc.close()

In this code snippet, we import the fitz module from PyMuPDF. We define a function called highlight_pdf that takes the input file path, output file path, and a list of words to highlight as parameters.

We open the input PDF file using fitz.open and iterate over its pages. For each word, we call page.search_for, which returns a list of rectangles, one per occurrence on the page, and pass each rectangle to page.add_highlight_annot. A separate check against page.get_text() is unnecessary: search_for simply returns an empty list when there is no match, and because it is case-insensitive, a case-sensitive pre-check would skip occurrences such as "Python" when searching for "python".

Finally, we save the modified PDF using doc.save and close the document.
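
page.search_for matches substrings, so a search for "art" also highlights part of "start". When only whole words should be marked, one possible approach is to read the page's word list with page.get_text("words"), which yields tuples beginning with (x0, y0, x1, y1, word), and highlight the boxes whose word matches. The sketch below assumes case-insensitive matching and strips surrounding punctuation; matching_boxes and highlight_whole_words are hypothetical names, not PyMuPDF functions.

```python
import string

def matching_boxes(word_tuples, targets):
    """Pick the bounding boxes of entries whose text equals one of `targets`.

    `word_tuples` follows the layout of PyMuPDF's page.get_text("words"):
    (x0, y0, x1, y1, word, ...). The comparison ignores case and any
    punctuation attached to the word. (Hypothetical helper.)
    """
    wanted = {t.lower() for t in targets}
    boxes = []
    for entry in word_tuples:
        x0, y0, x1, y1, word = entry[:5]
        if word.strip(string.punctuation).lower() in wanted:
            boxes.append((x0, y0, x1, y1))
    return boxes

def highlight_whole_words(input_file, output_file, words):
    # Imported here so matching_boxes can be used without PyMuPDF installed.
    import fitz

    doc = fitz.open(input_file)
    for page in doc:
        for box in matching_boxes(page.get_text("words"), words):
            page.add_highlight_annot(fitz.Rect(box))
    doc.save(output_file)
    doc.close()
```

Because this variant compares single words, it does not find multi-word phrases; for those, page.search_for remains the simpler tool.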

Option 3: PyPDF2 and PyFPDF

The third option combines the PyPDF2 library for text extraction and the PyFPDF library for PDF generation. Instead of modifying the original file, it writes the extracted text into a new PDF and shades the lines that contain one of the words, so the original layout, fonts, and images are not preserved.

import PyPDF2
from fpdf import FPDF

class PDF(FPDF):
    def highlight_text(self, text, words):
        # Black text; yellow is used only as the background of matching lines.
        self.set_text_color(0, 0, 0)
        self.set_fill_color(255, 255, 0)

        for line in text.splitlines():
            if not line.strip():
                continue
            # The built-in fonts only cover Latin-1, so replace anything else.
            line = line.encode("latin-1", "replace").decode("latin-1")
            self.set_x(self.l_margin)
            self.multi_cell(0, 10, line, fill=any(word in line for word in words))

def highlight_pdf(input_file, output_file, words):
    reader = PyPDF2.PdfReader(input_file)
    pdf_writer = PDF()

    # One output page per source page, all collected in a single document.
    for page in reader.pages:
        pdf_writer.add_page()
        pdf_writer.set_font("Helvetica", size=12)
        pdf_writer.highlight_text(page.extract_text() or "", words)

    # Write once, after all pages have been added.
    pdf_writer.output(output_file)

In this code snippet, we define a custom class called PDF that extends the FPDF class from PyFPDF. We add a new method called highlight_text that takes the extracted text and a list of words to highlight as parameters.

Inside the highlight_text method, we set the text color to black and the fill color to yellow. We then go through the extracted text line by line and use multi_cell to write each line, turning on the yellow fill only for lines that contain one of the words. Characters outside Latin-1 are replaced first, because the built-in fonts cannot render them.

In the highlight_pdf function, we use PyPDF2 to read the input PDF file and create a single instance of the PDF class. We iterate over each page of the input PDF, add a corresponding page to the new document, set the font, extract the text, and call the highlight_text method to write it out with the matching lines shaded.

Finally, we save the new PDF once, after the loop, so that earlier pages are not overwritten.
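
To shade only the matching words inside running text, each line can first be split into alternating plain and matching segments, which a renderer could then draw one cell at a time. The helper below sketches just the splitting step. split_segments is a hypothetical name, and the case-insensitive matching is an assumption rather than something the code above does.

```python
import re

def split_segments(line, words):
    """Split `line` into (text, is_match) pairs, where is_match marks
    pieces equal to one of `words`, ignoring case. (Hypothetical helper.)"""
    # Drop empty terms; try longer terms first so that "highlighting"
    # wins over "highlight" when both are in the list.
    terms = sorted((w for w in words if w), key=len, reverse=True)
    if not terms:
        return [(line, False)] if line else []
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)

    segments = []
    pos = 0
    for match in pattern.finditer(line):
        if match.start() > pos:
            # Plain text between the previous match and this one.
            segments.append((line[pos:match.start()], False))
        segments.append((match.group(), True))
        pos = match.end()
    if pos < len(line):
        segments.append((line[pos:], False))
    return segments
```

With PyFPDF, each segment could then be drawn with cell(), using get_string_width(text) as the cell width and fill=is_match, although wrapping long lines would have to be handled by hand in that case.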

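After exploring these three options, the best approach depends on the specific requirements of your project. If you need the highlight to sit on the matched word inside the original document, PyMuPDF is the only one of the three that does this directly, because it can search for text and add annotations at the coordinates it finds. Option 1 keeps the original pages but relies on plain text extraction, which carries no position information, and Option 3 generates a new PDF from the extracted text instead of modifying the original. Both are suitable only when flagging matches is enough and exact placement in the original layout does not matter.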
