In the big data world, many of us handle large data files. When a file is very big (above 10 GB) it is difficult to process as a single file, so we need to split it into several smaller chunks and then process them. There are several ways to split a large file; here I cover two of them:
- Using Unix commands to split a file
- A typical Pythonic way: read the file line by line and write the lines to new files based on a line count
1. Using Unix commands:
Unix has a built-in split command for splitting a file into smaller files.
split [options] filename prefix
Options:
There are several options in the split command (see the man page). The main ones split by number of lines or by size:
-l <lines per output file>   -b <bytes per output file>
Example: split -l 1000 myfile.csv
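If you also want to split by size and control the output file names, GNU split takes a prefix as the last argument (the prefix below is just illustrative):
Example: split -b 100M myfile.csv chunk_
This writes pieces of roughly 100 MB named chunk_aa, chunk_ab, and so on.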
How to call Unix commands from Python?
Python's built-in os module can do this, but I recommend the subprocess module for calling Unix commands from Python code. The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes.
code:
import subprocess

cmd = ["split", "-l", "10000", "myfile.csv"]
subprocess.check_call(cmd)
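On Python 3.5 or later, subprocess.run is the usual interface for this. The sketch below assumes GNU split and uses an illustrative output prefix chunk_:

import subprocess

# check=True makes Python raise CalledProcessError if split exits with a
# non-zero status, so failures are not silently ignored.
subprocess.run(["split", "-l", "10000", "myfile.csv", "chunk_"], check=True)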
2. Using a Python file object to split files:
chunked_file_line_count = 1000
file_name = "mylargefile.csv"

with open(file_name) as file_obj:
    line_count = 0
    file_count = 0
    print("================ Splitting file ======================== " + file_name)
    chunked_file_object = open("file_" + str(file_count) + ".csv", "w")
    for line in file_obj:
        chunked_file_object.write(line)
        line_count = line_count + 1
        if line_count == chunked_file_line_count:
            # current chunk is full: close it and start the next one
            chunked_file_object.close()
            file_count = file_count + 1
            line_count = 0
            print("writing file " + str(file_count))
            chunked_file_object = open("file_" + str(file_count) + ".csv", "w")
    chunked_file_object.close()
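The same idea can be written a bit more compactly with itertools.islice, which reads the next chunk of lines in one call. This is only a sketch under the same assumptions (the file name and chunk size are illustrative):

import itertools

chunk_size = 1000
file_count = 0
with open("mylargefile.csv") as file_obj:
    while True:
        # islice pulls at most chunk_size lines from the open file each pass
        lines = list(itertools.islice(file_obj, chunk_size))
        if not lines:
            break
        with open("file_" + str(file_count) + ".csv", "w") as chunk_file:
            chunk_file.writelines(lines)
        file_count += 1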