I'm trying to read some files in a directory, which currently has 10 text files. Over time the number of files increases, and the total size right now is around 400 MB.
File contents are in the format:
student_name:student_ID:date_of_join:anotherfield1:anotherfield2
In case of a match, I have to print out the whole line. Here's what I've tried.
findvalue = "student_id" #this is users input alphanumeric
directory = "./RecordFolder"
for filename in os.listdir(directory):
with open(os.path.join(directory, filename)) as f:
for line in f:
if findvalue in line:
print line
This works, but it takes a lot of time. How can I reduce the run time?
Answer
Your current approach works, but it can indeed become slow as the number and size of the files grow. Here are several ways you can optimize it:
1. Use mmap for memory-mapped file access:
Instead of reading each file line by line through Python's I/O layer, you can map the whole file into memory with mmap and let the operating system page it in as needed. Searching the mapped buffer cuts down the per-line read overhead, which helps for large files.
import mmap
import os

findvalue = b"student_id"  # search as bytes, since mmap yields bytes
directory = "./RecordFolder"

for filename in os.listdir(directory):
    filepath = os.path.join(directory, filename)
    if os.path.getsize(filepath) == 0:
        continue  # mmap cannot map an empty file
    with open(filepath, 'rb') as f:
        # Memory-map the file; mmap objects have no splitlines(), so iterate with readline()
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for line in iter(mm.readline, b""):
                if findvalue in line:
                    print(line.decode('utf-8'), end='')  # decode bytes back to text
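If even that is not fast enough, you can skip line iteration entirely and let mmap's find() locate the matches in the raw buffer, decoding only the lines that contain a hit. This is a minimal sketch, assuming UTF-8 files and the same directory layout as above:
import mmap
import os

findvalue = b"student_id"
directory = "./RecordFolder"

for filename in os.listdir(directory):
    filepath = os.path.join(directory, filename)
    if os.path.getsize(filepath) == 0:
        continue  # mmap cannot map an empty file
    with open(filepath, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(findvalue)
            while pos != -1:
                # Recover the boundaries of the line containing the hit
                start = mm.rfind(b"\n", 0, pos) + 1
                end = mm.find(b"\n", pos)
                if end == -1:
                    end = len(mm)
                print(mm[start:end].decode('utf-8'))
                pos = mm.find(findvalue, end)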
2. Multi-threading or Multi-processing:
Reading many files is mostly I/O-bound, so running the per-file searches concurrently can improve throughput. Python's concurrent.futures module makes it easy to parallelize your code with threads or processes.
Here's how you could use ThreadPoolExecutor to parallelize the file search:
import os
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def search_in_file(filepath, findvalue):
    with open(filepath, 'r') as f:
        for line in f:
            if findvalue in line:
                print(line, end='')

def main():
    findvalue = "student_id"
    directory = "./RecordFolder"
    files = [os.path.join(directory, filename) for filename in os.listdir(directory)]
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Run the search for each file on the thread pool; list() forces
        # completion and surfaces any exception raised in a worker
        list(executor.map(partial(search_in_file, findvalue=findvalue), files))

if __name__ == '__main__':
    main()
This opens files concurrently with 4 worker threads, which helps when the work is I/O-bound. If the matching itself ever becomes CPU-bound, you can switch to ProcessPoolExecutor for better scalability; the sketch below shows the changes that requires.
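Note that ProcessPoolExecutor has to pickle the callable and its arguments, so the worker must be a top-level function (a lambda won't work), and it's cleaner to return matches from the workers instead of printing inside them. A rough sketch of that variant:
import os
from concurrent.futures import ProcessPoolExecutor

def search_in_file(filepath, findvalue):
    # Return the matching lines instead of printing from the worker process
    matches = []
    with open(filepath, 'r') as f:
        for line in f:
            if findvalue in line:
                matches.append(line.rstrip('\n'))
    return matches

def main():
    findvalue = "student_id"
    directory = "./RecordFolder"
    files = [os.path.join(directory, name) for name in os.listdir(directory)]
    with ProcessPoolExecutor(max_workers=4) as executor:
        # map() pickles search_in_file and its arguments for each worker
        for matches in executor.map(search_in_file, files, [findvalue] * len(files)):
            for line in matches:
                print(line)

if __name__ == '__main__':
    main()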
3. Use grep from Python (If you're working on Linux or macOS):
If you're on a UNIX-like system (Linux/macOS), you can call the grep command-line tool, which is highly optimized for searching within files, from Python via subprocess.
import glob
import os
import subprocess

findvalue = "student_id"
directory = "./RecordFolder"

# Pass the arguments as a list (no shell=True) so that user input is never
# interpreted by the shell; -F treats the search value as a literal string
files = glob.glob(os.path.join(directory, "*.txt"))
subprocess.run(["grep", "-H", "-F", "--", findvalue] + files)
grep is implemented in heavily optimized C, so calling it from Python with subprocess can speed up the search considerably.
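If you want the matching lines back inside Python rather than printed straight to the terminal, you can capture grep's output. Note that grep exits with status 1 when nothing matches, so a non-zero return code is not necessarily an error here:
import glob
import os
import subprocess

findvalue = "student_id"
directory = "./RecordFolder"

files = glob.glob(os.path.join(directory, "*.txt"))
# capture_output=True collects stdout so the matches can be post-processed in Python
result = subprocess.run(["grep", "-H", "-F", "--", findvalue] + files,
                        capture_output=True, text=True)
for line in result.stdout.splitlines():
    print(line)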
4. Optimize Disk I/O with Buffered Reading:
Although you're already reading the files line by line, another lever is the buffer size, which you can set through the buffering parameter of open(). Keep in mind that Python already buffers reads by default (typically around 8 KB), so the buffer has to be considerably larger than that to change anything.
findvalue = "student_id"
directory = "./RecordFolder"
buffer_size = 8192 # 8KB buffer
for filename in os.listdir(directory):
with open(os.path.join(directory, filename), 'r', buffering=buffer_size) as f:
for line in f:
if findvalue in line:
print(line)
A larger buffer can reduce the number of read system calls on large files, but the benefit is workload-dependent, so it's worth measuring; see the sketch below.
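As a rough way to check whether a bigger buffer actually helps on your data, time one pass over a single file with each setting. This is only a sketch: the file name is a placeholder for one of your record files, and you should run it twice so the OS page cache is equally warm for both settings:
import time

def scan(path, findvalue, buffering):
    hits = 0
    with open(path, 'r', buffering=buffering) as f:
        for line in f:
            if findvalue in line:
                hits += 1
    return hits

path = "./RecordFolder/records1.txt"  # placeholder: point this at one of your files
for buffering in (-1, 1024 * 1024):   # -1 means the default buffer size
    start = time.perf_counter()
    scan(path, "student_id", buffering)
    print(buffering, round(time.perf_counter() - start, 3), "seconds")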
5. Use os.scandir() Instead of os.listdir():
os.scandir() is faster than os.listdir() because it returns an iterator of DirEntry objects whose name, path, and file type are available from the directory scan itself, usually without an extra system call per entry.
import os

findvalue = "student_id"
directory = "./RecordFolder"

for entry in os.scandir(directory):
    if entry.is_file() and entry.name.endswith('.txt'):  # only regular .txt files
        with open(entry.path, 'r') as f:
            for line in f:
                if findvalue in line:
                    print(line, end='')
Because entry.path is precomputed and DirEntry.is_file() usually relies on information cached during the scan, this avoids repeated os.path.join() and stat() calls.
6. Search for Specific Fields:
For a plain substring, the in operator is generally as fast as a regular expression. A compiled regex starts to pay off when findvalue should only match a particular field (for example, the student ID in the second colon-separated column): anchoring the pattern to that field avoids false positives from the other columns.
import os
import re

findvalue = "student_id"
directory = "./RecordFolder"

# Match findvalue only as the second colon-separated field (the student ID);
# re.escape() keeps user input from being interpreted as regex syntax
pattern = re.compile(r'^[^:]*:' + re.escape(findvalue) + ':')

for filename in os.listdir(directory):
    with open(os.path.join(directory, filename), 'r') as f:
        for line in f:
            if pattern.search(line):
                print(line, end='')
Anchoring the match to a field keeps hits out of the wrong columns; for an unanchored substring search, though, plain in is typically at least as fast as a regex.
7. Profiling and Performance Monitoring:
If the performance still doesn't meet expectations, profile the code to see where the bottleneck lies, using the cProfile module or timeit.
import cProfile
import os

def search_in_directory(directory, findvalue):
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename), 'r') as f:
            for line in f:
                if findvalue in line:
                    print(line, end='')

cProfile.run('search_in_directory("./RecordFolder", "student_id")')
This will give you an idea of where the time is being spent (I/O, CPU, etc.).
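For a quick end-to-end timing rather than a full profile, timeit also works; for example, to time one complete pass over the directory using the function defined above:
import timeit

# number=1 because a single pass over ~400 MB of files already takes long enough to measure
elapsed = timeit.timeit(
    'search_in_directory("./RecordFolder", "student_id")',
    globals=globals(),
    number=1,
)
print(f"search took {elapsed:.2f} s")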
Summary:
- Use mmap for memory-mapped file access to speed up reading.
- Multi-threading or multi-processing with ThreadPoolExecutor or ProcessPoolExecutor can reduce runtime for large numbers of files.
- If you're on a UNIX-like system, use grep through subprocess for an optimized search.
- Increase buffer sizes and use os.scandir() for better performance in file I/O operations.