Reading A Big File With Python

ghz · 11 hours ago · 1 view

I'm trying to read some files in a directory, which currently has 10 text files. The number of files grows over time, and the total size is around 400 MB at the moment.

File contents are in the format:

student_name:student_ID:date_of_join:anotherfield1:anotherfield2

In case of a match, I have to print out the whole line. Here's what I've tried.

findvalue = "student_id" #this is users input alphanumeric
directory = "./RecordFolder"
for filename in os.listdir(directory):
    with open(os.path.join(directory, filename)) as f:
        for line in f:
            if findvalue in line:
                print line

This works, but it takes a lot of time. How can I reduce the run time?

Answer

Your current approach is functional but can indeed be slow, especially when the number of files and the size of the files grow. Here are several ways you can optimize this:

1. Use mmap for memory-mapped file access:

Instead of reading the file line by line through Python's normal I/O layer, you can use mmap to map the file into memory and search it directly. This reduces copying and system-call overhead, which can make scanning large files faster.

import mmap
import os

findvalue = b"student_id"  # mmap exposes raw bytes, so search with a bytes pattern
directory = "./RecordFolder"

for filename in os.listdir(directory):
    filepath = os.path.join(directory, filename)
    if os.path.getsize(filepath) == 0:
        continue  # mmap cannot map an empty file
    with open(filepath, 'rb') as f:  # binary mode: mmap works on the raw file
        # Memory-map the file to enable efficient searching
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for line in iter(mm.readline, b""):  # read line by line from the mapping
                if findvalue in line:
                    print(line.decode('utf-8'), end='')
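
If you only need the matching lines and want to avoid iterating over every line at all, a variant of the same idea (a sketch, using the same findvalue and directory assumptions as above) scans the mapped buffer with mm.find() and extracts the surrounding line only when there is a hit:

import mmap
import os

findvalue = b"student_id"
directory = "./RecordFolder"

for filename in os.listdir(directory):
    filepath = os.path.join(directory, filename)
    if os.path.getsize(filepath) == 0:
        continue  # mmap cannot map an empty file
    with open(filepath, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(findvalue)
            while pos != -1:
                # Extract the full line containing this match
                start = mm.rfind(b"\n", 0, pos) + 1
                end = mm.find(b"\n", pos)
                if end == -1:
                    end = len(mm)
                print(mm[start:end].decode('utf-8'))
                # Continue searching after the end of this line
                pos = mm.find(findvalue, end)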

2. Multi-threading or Multi-processing:

When you are searching many files, parallelizing the work can improve performance. Python's concurrent.futures module makes it easy to run the search with threads or processes.

Here’s how you could use ThreadPoolExecutor to parallelize the file search:

import os
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def search_in_file(filename, findvalue):
    with open(filename, 'r') as f:
        for line in f:
            if findvalue in line:
                print(line)

def main():
    findvalue = "student_id"
    directory = "./RecordFolder"
    
    files = [os.path.join(directory, filename) for filename in os.listdir(directory)]

    with ThreadPoolExecutor(max_workers=4) as executor:
        # partial binds findvalue to the worker; unlike a lambda, it also
        # pickles cleanly if you later switch to ProcessPoolExecutor
        executor.map(partial(search_in_file, findvalue=findvalue), files)

if __name__ == '__main__':
    main()

This code opens the files concurrently using 4 worker threads, which can help when the search is I/O-bound; because the threads share one disk and the GIL, the benefit varies with your storage. If the per-line work becomes CPU-bound (heavy parsing, for example), switch to ProcessPoolExecutor so the work spreads across CPU cores.
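
For illustration, here is a minimal sketch of the same search using ProcessPoolExecutor, assuming the same search_in_file helper and directory layout as above. Because process workers receive their arguments by pickling, the worker must stay a top-level function:

import os
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def search_in_file(filename, findvalue):
    with open(filename, 'r') as f:
        for line in f:
            if findvalue in line:
                print(line)

def main():
    findvalue = "student_id"
    directory = "./RecordFolder"
    files = [os.path.join(directory, filename) for filename in os.listdir(directory)]

    # Each worker process searches one file at a time; partial is picklable
    # (unlike a lambda), so it can cross the process boundary.
    with ProcessPoolExecutor(max_workers=4) as executor:
        executor.map(partial(search_in_file, findvalue=findvalue), files)

if __name__ == '__main__':
    main()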

3. Use grep in Python (If you're working on Linux or macOS):

If you're on a UNIX-like system (Linux/macOS), you can use the grep command-line tool, which is highly optimized for searching within files, through Python using subprocess.

import glob
import os
import subprocess

findvalue = "student_id"
directory = "./RecordFolder"

# Build the file list in Python and pass arguments as a list (no shell),
# so the user-supplied search term is never interpreted by the shell
files = glob.glob(os.path.join(directory, "*.txt"))
if files:
    subprocess.run(["grep", "-H", "--", findvalue, *files])

grep is heavily optimized for this kind of scan, and calling it from Python via subprocess is often much faster than looping over the lines in pure Python.
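
If you want the matching lines back in Python rather than printed straight to the terminal, one option (a sketch, assuming grep is available on PATH) is to capture its output:

import glob
import os
import subprocess

findvalue = "student_id"
directory = "./RecordFolder"

files = glob.glob(os.path.join(directory, "*.txt"))
if files:
    # grep exits with status 1 when nothing matches, so don't use check=True
    result = subprocess.run(["grep", "-H", "--", findvalue, *files],
                            capture_output=True, text=True)
    for line in result.stdout.splitlines():
        print(line)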

4. Optimize Disk I/O with Buffered Reading:

Although you're already reading the files line by line, another way to speed things up is to give open() a larger read buffer via its buffering parameter. Note that Python already buffers reads (typically about 8 KB by default), so the buffer needs to be noticeably larger than that to make a difference.

findvalue = "student_id"
directory = "./RecordFolder"

buffer_size = 8192  # 8KB buffer
for filename in os.listdir(directory):
    with open(os.path.join(directory, filename), 'r', buffering=buffer_size) as f:
        for line in f:
            if findvalue in line:
                print(line)

A larger buffer can reduce the number of read system calls on big files, although the gains are often modest precisely because reads are buffered by default.
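
A more aggressive variant of the same idea is to read each file in large binary chunks and search those directly, carrying over a partial line between reads so matches never get split across a chunk boundary. This is only a sketch; the chunk size and UTF-8 encoding are assumptions:

import os

findvalue = b"student_id"  # search as bytes to avoid decoding every chunk
directory = "./RecordFolder"
chunk_size = 4 * 1024 * 1024  # read 4 MB at a time

for filename in os.listdir(directory):
    with open(os.path.join(directory, filename), 'rb') as f:
        leftover = b""
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            data = leftover + chunk
            # Everything up to the last newline is complete lines; keep the rest
            last_newline = data.rfind(b"\n")
            complete, leftover = data[:last_newline + 1], data[last_newline + 1:]
            for line in complete.splitlines():
                if findvalue in line:
                    print(line.decode('utf-8'))
        # Check whatever remains after the final chunk (a last line with no newline)
        if findvalue in leftover:
            print(leftover.decode('utf-8'))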

5. Use os.scandir() Instead of os.listdir():

os.scandir() is usually faster than os.listdir() combined with per-file checks, because it yields DirEntry objects whose cached metadata lets calls like is_file() avoid an extra stat() on most platforms.

import os

findvalue = "student_id"
directory = "./RecordFolder"

for entry in os.scandir(directory):
    if entry.is_file() and entry.name.endswith('.txt'):  # Ensure it's a file and a .txt file
        with open(entry.path, 'r') as f:
            for line in f:
                if findvalue in line:
                    print(line)

This avoids a separate stat() system call per file for the is_file() check and gives you entry.path directly instead of building it with os.path.join().

6. Search for Specific Lines:

If the "findvalue" can be associated with a specific pattern (for instance, a specific field or position in the line), using regular expressions to match those patterns can sometimes improve performance compared to using in.

import os
import re

findvalue = "student_id"
directory = "./RecordFolder"
# Match findvalue only in the second colon-separated field (the student_ID column)
pattern = re.compile(r'^[^:]*:' + re.escape(findvalue) + r':')

for filename in os.listdir(directory):
    with open(os.path.join(directory, filename), 'r') as f:
        for line in f:
            if pattern.search(line):
                print(line)

This won't necessarily beat a plain in check on raw speed; its value is precision, since it only matches the intended field, which matters when the search term is short or common enough to appear in other columns.

7. Profiling and Performance Monitoring:

If the performance still doesn't meet expectations, you can profile your code to see where the bottleneck lies using the cProfile module or timeit.

import cProfile
import os

def search_in_directory(directory, findvalue):
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename), 'r') as f:
            for line in f:
                if findvalue in line:
                    print(line)

cProfile.run('search_in_directory("./RecordFolder", "student_id")')

This will give you an idea of where the time is being spent (I/O, CPU, etc.).
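
Since timeit is also mentioned above, here is a minimal sketch of timing the search on a directory. It assumes search_in_directory is defined as in the cProfile example, and the paths are placeholders:

import timeit

# Time three full scans of the directory and report the total seconds
elapsed = timeit.timeit(
    'search_in_directory("./RecordFolder", "student_id")',
    globals=globals(),
    number=3,
)
print(f"3 runs took {elapsed:.2f}s")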


Summary:

  • Use mmap for memory-mapped file access to speed up reading.
  • Multi-threading or multi-processing with ThreadPoolExecutor or ProcessPoolExecutor can reduce runtime for large numbers of files.
  • If you're on a UNIX-like system, use grep through subprocess for an optimized search.
  • Increase buffer sizes and use os.scandir() for better performance in file I/O operations.
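
Putting a few of these together, a reasonable starting point (a sketch, not a drop-in solution; the directory name and .txt filter are assumptions from the question) is os.scandir() plus a larger read buffer and a plain substring check:

import os

def search_records(directory, findvalue, buffer_size=1024 * 1024):
    """Print every line in the directory's .txt files that contains findvalue."""
    for entry in os.scandir(directory):
        if entry.is_file() and entry.name.endswith('.txt'):
            with open(entry.path, 'r', buffering=buffer_size) as f:
                for line in f:
                    if findvalue in line:
                        print(line, end='')

search_records("./RecordFolder", "student_id")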