When working with PDF documents in Python, there are several ways to highlight specific words or phrases. In this article, we will explore three different approaches to achieve this task.
Option 1: PyPDF2 and ReportLab
The first option involves using the PyPDF2 library to extract the text from the PDF document and the ReportLab library to generate a new PDF with the highlighted words.
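import io
import PyPDF2
from reportlab.lib.colors import black, yellow
from reportlab.pdfgen import canvas

def highlight_pdf(input_file, output_file, words):
    pdf = PyPDF2.PdfReader(input_file)
    output = PyPDF2.PdfWriter()
    for page in pdf.pages:
        text = page.extract_text() or ""
        found = [word for word in words if word in text]
        if found:
            # Draw the overlay in memory, sized to match the page.
            buffer = io.BytesIO()
            size = (float(page.mediabox.width), float(page.mediabox.height))
            c = canvas.Canvas(buffer, pagesize=size)
            c.setFont("Helvetica", 12)
            for i, word in enumerate(found):
                y = 100 + 16 * i
                width = c.stringWidth(word, "Helvetica", 12)
                # Yellow box first, then the word in black on top of it.
                c.setFillColor(yellow)
                c.rect(98, y - 3, width + 4, 15, stroke=0, fill=1)
                c.setFillColor(black)
                c.drawString(100, y, word)
            c.save()
            buffer.seek(0)
            # Stamp the overlay onto the original page.
            page.merge_page(PyPDF2.PdfReader(buffer).pages[0])
        output.add_page(page)
    with open(output_file, "wb") as f:
        output.write(f)

In this code snippet, we first import the necessary libraries. Then, we define a function called highlight_pdf that takes the input file path, output file path, and a list of words to highlight as parameters.
We use PyPDF2 to read the input PDF file and create a new PDF writer object. We iterate over each page of the input PDF and extract the text. For each word in the list, we check whether it appears in the extracted text. PyPDF2 returns plain text without coordinates, so this approach cannot place a mark over the word itself. Instead, when a page contains matches, we create an in-memory canvas with ReportLab, draw each matched word on a yellow rectangle at a fixed position, and merge that overlay onto the original page before adding the page to the output PDF.
The code uses the PdfReader and PdfWriter names because PyPDF2 3.0 removed the older PdfFileReader-style API.
Finally, we save the output PDF file.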
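The "word in text" test used above is a plain substring check, so searching for "cat" would also match "category". A minimal sketch of a stricter whole-word check, using only the standard library, might look like the following. The helper name find_words and its ignore_case parameter are illustrative assumptions, not part of PyPDF2 or ReportLab.

```python
import re

def find_words(text, words, ignore_case=False):
    """Return the words that occur in text as whole words, in input order."""
    flags = re.IGNORECASE if ignore_case else 0
    found = []
    for word in words:
        # (?<!\w) and (?!\w) reject matches embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
        if re.search(pattern, text, flags):
            found.append(word)
    return found

page_text = "The category list mentions a cat and a Dog."
print(find_words(page_text, ["cat", "dog", "list"]))
# prints ['cat', 'list']
print(find_words(page_text, ["cat", "dog", "list"], ignore_case=True))
# prints ['cat', 'dog', 'list']
```

The returned list could stand in for the substring test when deciding which words to mark on a page.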
Option 2: PyMuPDF
The second option involves using the PyMuPDF library, which provides a high-level interface for working with PDF documents.
import fitz  # PyMuPDF

def highlight_pdf(input_file, output_file, words):
    doc = fitz.open(input_file)
    for page_num in range(doc.page_count):
        page = doc.load_page(page_num)
        text = page.get_text()
        for word in words:
            if word in text:
                # search_for returns a list with one rectangle per hit.
                rects = page.search_for(word)
                if rects:
                    page.add_highlight_annot(rects)
    doc.save(output_file)
    doc.close()
In this code snippet, we import the fitz module from PyMuPDF. We define a function called highlight_pdf that takes the input file path, output file path, and a list of words to highlight as parameters.
We open the input PDF file using fitz.open and iterate over each page. For each page, we extract the text using page.get_text(). We then check whether each word in the list exists in the extracted text. If it does, we use page.search_for to find the coordinates of the word on the page and page.add_highlight_annot to add a highlight annotation over it. Note that the "word in text" check is case-sensitive while page.search_for ignores case, so a word that appears on the page only with different capitalization is skipped by the pre-check.
Finally, we save the modified PDF using doc.save.
Option 3: PyPDF2 and PyFPDF
The third option combines the PyPDF2 library for text extraction and the PyFPDF library for PDF generation.
import PyPDF2
from fpdf import FPDF

class PDF(FPDF):
    def highlight_text(self, text, words):
        # Black text on a yellow fill; yellow on yellow would be unreadable.
        self.set_text_color(0, 0, 0)
        self.set_fill_color(255, 255, 0)
        for word in words:
            if word in text:
                # Start at the left margin; some fpdf2 releases leave the
                # cursor at the right edge after multi_cell.
                self.set_x(self.l_margin)
                self.multi_cell(0, 10, word, fill=True)

def highlight_pdf(input_file, output_file, words):
    reader = PyPDF2.PdfReader(input_file)
    # One FPDF document for the whole output, one new page per input page.
    pdf_writer = PDF()
    pdf_writer.set_font("Helvetica", size=12)
    for page in reader.pages:
        text = page.extract_text() or ""
        pdf_writer.add_page()
        pdf_writer.highlight_text(text, words)
    pdf_writer.output(output_file)

In this code snippet, we define a custom class called PDF that extends the FPDF class from PyFPDF. We add a new method called highlight_text that takes the extracted text and a list of words to highlight as parameters.
Inside the highlight_text method, we set the text color to black and the fill color to yellow, since yellow text on a yellow fill would be unreadable. We then iterate over the list of words and, for each word found in the extracted text, use multi_cell to draw a highlighted cell containing the word.
In the highlight_pdf function, we use PyPDF2 to read the input PDF file and create a single instance of the PDF class for the output. We set the font, then for each page of the input PDF we extract the text, add a new page to the output, and call the highlight_text method. Note that the result is a newly generated document that lists the matched words on a yellow background; the layout of the original pages is not carried over.
Finally, we save the generated PDF by passing the output path to the output method.
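Because the regenerated PDF is built from extracted text, one way to mark words in context rather than listing them separately is to first split the text into plain and highlighted runs, then render each run with or without a fill. The sketch below covers only the splitting step, using the standard library. The name split_for_highlight is a hypothetical helper, and how the runs are drawn afterwards (for example with FPDF's write and cell methods) is left as an assumption.

```python
import re

def split_for_highlight(text, words):
    """Split text into (segment, is_match) pairs, preserving order."""
    words = [w for w in words if w]
    if not words:
        return [(text, False)] if text else []
    # Longer words first so "highlighting" wins over "highlight".
    alternatives = sorted(words, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(w) for w in alternatives) + ")"
    segments = []
    # With one capturing group, re.split puts the matches at odd indexes.
    for index, part in enumerate(re.split(pattern, text)):
        if part:
            segments.append((part, index % 2 == 1))
    return segments

sample = "PDF tools highlight PDF text"
for segment, is_match in split_for_highlight(sample, ["PDF", "text"]):
    print("[" + segment + "]" if is_match else segment, end="")
print()
# prints: [PDF] tools highlight [PDF] [text]
```

Each pair tells the renderer whether to draw the segment on a filled background, which keeps the matching logic separate from the PDF library in use.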
After exploring these three options, the best approach depends on the specific requirements of your project. PyMuPDF is the only one of the three that locates each word on the page and marks it in place with a highlight annotation, so it is the natural choice when the original layout must be preserved. Options 1 and 3 work from the text extracted by PyPDF2, which carries no position information; they are suitable when stamping or listing the matched words, using ReportLab or PyFPDF respectively, is enough.
11 Responses
Option 2: PyMuPDF seems like the real MVP here! It's got the power and flexibility to highlight PDFs like a boss. 💪🔍
Option 2: PyMuPDF seems like the winner here! So much more functionality and flexibility. Love it!
Option 3 sounds intriguing, but I wonder if it's worth the extra hassle. Thoughts?
Option 3 seems like the perfect combo – PyPDF2 and PyFPDF! Why not have the best of both worlds?
Option 2 seems like the cool kid on the block. PyMuPDF FTW! 🙌🏼😎
Nah, Option 2 might be trendy, but PyMuPDF is the real deal. It's got the power and flexibility that sets it apart. No need to follow the crowd, stick with the best. 💪🔥
Option 1 seems cool, but I'm more of a PyMuPDF person. What about you guys?
PyMuPDF is a great choice! Option 1 may be cool, but PyMuPDF is more versatile and efficient. It's all about personal preference though. What matters is getting the job done, right? Cheers to all the PyMuPDF enthusiasts out there!
I personally think Option 3 is the way to go! PyPDF2 and PyFPDF are a power couple! 💪🏼
Option 3 seems like a hassle. Why not just use Option 1 or 2?
Option 1 seems like a solid choice, but I wonder if Option 2 is faster? 🤔