Merge PDF files

ghz 3 months ago ⋅ 28 views

Is it possible, using Python, to merge separate PDF files?

Assuming so, I need to extend this a little further. I am hoping to loop through folders in a directory and repeat this procedure.

And I may be pushing my luck, but is it possible to exclude a page that is contained in each of the PDFs (my report generation always creates an extra blank page).

Answers

Yes, it is possible to merge separate PDF files in Python using libraries like PyPDF2 or pypdf. Additionally, you can loop through directories to merge PDF files and exclude specific pages.

Below is an example using pypdf, the actively maintained successor to PyPDF2 (PyPDF2 itself is deprecated):

1. Install pypdf:

Install the library with pip:

pip install pypdf

2. Merge PDF files and exclude specific pages:

Here’s an example that merges all the PDF files in one folder and drops the last page of each PDF. Note that it removes the last page unconditionally; it does not check whether that page is actually blank:

import os
from pypdf import PdfReader, PdfWriter

def merge_pdfs_in_folder(folder_path, output_path, exclude_last_page=True):
    pdf_writer = PdfWriter()

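    # Sort the names so the merge order is predictable (os.listdir order is arbitrary)
    for filename in sorted(os.listdir(folder_path)):
        # Match the extension case-insensitively so ".PDF" files are included
        if not filename.lower().endswith('.pdf'):
            continue
        pdf_path = os.path.join(folder_path, filename)
        # Skip the output file itself if it lives in the folder being merged
        if os.path.abspath(pdf_path) == os.path.abspath(output_path):
            continue
        pdf_reader = PdfReader(pdf_path)

        # Determine the number of pages, and drop the last page if requested
        num_pages = len(pdf_reader.pages)
        if exclude_last_page:
            num_pages -= 1  # Note: a one-page PDF then contributes no pages

        # Add the remaining pages to the merged document
        for page_num in range(num_pages):
            pdf_writer.add_page(pdf_reader.pages[page_num])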

    # Write the merged PDF to the output file
    with open(output_path, 'wb') as output_pdf:
        pdf_writer.write(output_pdf)

    print(f"PDFs from {folder_path} merged into {output_path}")

# Example usage
folder_to_merge = "path/to/folder"  # Folder containing PDF files
output_pdf = "merged_output.pdf"  # Output merged PDF file
merge_pdfs_in_folder(folder_to_merge, output_pdf)

Explanation:

  1. Loop through files: The function merge_pdfs_in_folder() goes through each PDF in the specified folder.
  2. Merge PDFs: It uses PdfReader to read each PDF and PdfWriter to collect the pages into a single output document.
  3. Exclude the last page: If the exclude_last_page parameter is True (the default), the script drops the last page of every PDF without checking whether it is actually blank.
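
The page-selection step can also be written as a small standalone helper, which makes it easier to exclude pages other than the last one. This is only a sketch: pages_to_keep and its exclude parameter are hypothetical names, not part of pypdf. It returns the 0-based indices to copy, which you could then pass to pdf_reader.pages[...] one at a time.

```python
def pages_to_keep(num_pages, exclude=(-1,)):
    """Return the 0-based page indices to keep, in order.

    `exclude` may contain negative indices, counted from the end
    as in normal Python indexing (-1 is the last page).
    """
    excluded = set()
    for index in exclude:
        # Convert negative indices to their positive equivalent
        normalised = index + num_pages if index < 0 else index
        # Ignore indices that fall outside the document
        if 0 <= normalised < num_pages:
            excluded.add(normalised)
    return [i for i in range(num_pages) if i not in excluded]


print(pages_to_keep(5))           # [0, 1, 2, 3]
print(pages_to_keep(5, (0, -1)))  # [1, 2, 3]
print(pages_to_keep(1))           # []
```

The last call shows the edge case to keep in mind: a one-page PDF contributes nothing once its only page is excluded.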

3. Extend to loop through multiple directories:

If you want to loop through multiple directories and merge the PDFs in each one, you can modify the code like this:

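def merge_pdfs_in_directories(base_directory, output_directory):
    # Create the output directory if it does not exist yet
    os.makedirs(output_directory, exist_ok=True)
    for root, dirs, files in os.walk(base_directory):
        # Only merge folders that actually contain PDF files
        if any(name.lower().endswith('.pdf') for name in files):
            # normpath strips a trailing slash so basename is never empty
            folder_name = os.path.basename(os.path.normpath(root))
            output_pdf = os.path.join(output_directory, f"merged_{folder_name}.pdf")
            merge_pdfs_in_folder(root, output_pdf)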

# Example usage
base_dir = "path/to/main_directory"  # Base directory containing subfolders
output_dir = "path/to/output_directory"  # Where to save the merged PDFs
merge_pdfs_in_directories(base_dir, output_dir)

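This walks base_directory itself and every subfolder beneath it, merges the PDFs found in each folder, and saves one file per folder to output_directory as merged_<folder name>.pdf. Two caveats: folders that share a name at different depths produce the same output filename, so the later one overwrites the earlier, and output_directory should sit outside base_directory so the merged files are not picked up on a later run.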
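
One way to make output names much less likely to collide is to build them from each folder's path relative to the base directory instead of from its basename alone. The helper below is a sketch; output_name_for is a hypothetical name, and replacing path separators with underscores is just one possible naming scheme.

```python
import os


def output_name_for(root, base_directory):
    """Build an output filename from root's path relative to base_directory."""
    rel = os.path.relpath(root, base_directory)
    if rel == os.curdir:
        # root is the base directory itself, so use the base directory's own name
        label = os.path.basename(os.path.abspath(base_directory))
    else:
        # Flatten nested paths such as "2023/q1" into "2023_q1"
        label = rel.replace(os.sep, "_")
    return f"merged_{label}.pdf"


print(output_name_for(os.path.join("reports", "2023", "q1"), "reports"))
print(output_name_for(os.path.join("reports", "2024", "q1"), "reports"))
```

With this scheme the two q1 folders map to merged_2023_q1.pdf and merged_2024_q1.pdf rather than both writing to merged_q1.pdf.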

Notes:

  • To exclude a page other than the last one, replace the exclude_last_page check with a test on page_num inside the loop, for example skipping page_num == 0 to drop the first page.
  • To skip only pages that are actually blank, inspect each page’s content (for example its extracted text) before adding it. The script above does not do this; it removes the last page regardless of what it contains.
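
A simple blank-page check can be based on extracted text. The sketch below assumes a page counts as blank when extract_text() returns nothing but whitespace. That is a heuristic, not a definitive test: a page containing only a scanned image or vector drawing has no extractable text and would be wrongly treated as blank. The function names is_probably_blank and non_blank_pages are hypothetical.

```python
def is_probably_blank(page):
    """Heuristic: a page is 'blank' if it yields no extractable text.

    `page` is any object with an extract_text() method, such as a
    pypdf page taken from PdfReader(...).pages.
    """
    text = page.extract_text() or ""
    return text.strip() == ""


def non_blank_pages(pages):
    """Return the pages that do not look blank, preserving their order."""
    return [page for page in pages if not is_probably_blank(page)]
```

In the merge function you could then loop over non_blank_pages(pdf_reader.pages) and pass each page to pdf_writer.add_page(), instead of dropping the last page by position.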