In the big data world, many of us handle large data files. When a file is very big (above 10 GB), it is difficult to process as a single file, so we need to split it into several smaller chunks and then process those.
A file can be split in plain Python by reading it line by line and writing the lines to a new file, starting a new output file whenever the line count reaches the chunk size.
There are several ways to split a large file; this post covers two of them:
- Using Unix commands to split a file
- Using a Python file object
1. Using Unix commands:
Unix has a built-in split command that splits a file into smaller files.
split [options] filename prefix
Options:
The split command has several options (see man split). The main ones split by number of lines or by size:
-l N       put N lines in each output file
-b SIZE    put SIZE bytes in each output file
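Example: split -l 1000 myfile.csv
This writes 1000 lines to each output file. Because no prefix is given, the files get the default names xaa, xab, and so on.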
How to call Unix commands from Python?
Python has the built-in os module for this, but I recommend using the subprocess module to call Unix commands from Python code. The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes.
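code:
import subprocess

# Split myfile.csv into chunks of 10000 lines each (output files: xaa, xab, ...).
# check_call raises CalledProcessError if split exits with a non-zero status.
cmd = ["split", "-l", "10000", "myfile.csv"]
subprocess.check_call(cmd)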
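The call above can be wrapped in a small helper that also passes an output prefix and reports which chunk files were produced. This is only a sketch: the function name split_with_unix and its parameters are made up for illustration, and it assumes a split binary is available on the PATH and that nothing else in the target directory already starts with the chosen prefix.

```python
import glob
import subprocess


def split_with_unix(path, lines_per_chunk, prefix):
    """Run `split -l` on path and return the sorted list of chunk files.

    Hypothetical helper: chunk files are found by globbing for the prefix,
    so unrelated files that share the prefix would be listed as well.
    """
    cmd = ["split", "-l", str(lines_per_chunk), path, prefix]
    # Raises subprocess.CalledProcessError if split exits with an error.
    subprocess.check_call(cmd)
    return sorted(glob.glob(glob.escape(prefix) + "*"))
```

For example, split_with_unix("myfile.csv", 10000, "chunk_") would leave files such as chunk_aa and chunk_ab next to the script and return their names.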
2. Using a Python file object to split files:
chunked_file_line_count = 1000
with open("mylargefile.csv") as file_obj:
    line_count = 0
    file_count = 0
    print("================ Splitting file ======================== " + file_obj.name)
    chunked_file_object = open("file_" + str(file_count) + ".csv", "w")
    for line in file_obj:
        chunked_file_object.write(line)
        line_count = line_count + 1
        if line_count == chunked_file_line_count:
            # The current chunk is full: close it and start the next one.
            file_count = file_count + 1
            line_count = 0
            print("writing file " + str(file_count))
            chunked_file_object.close()
            chunked_file_object = open("file_" + str(file_count) + ".csv", "w")
    # Close the last (possibly partial) chunk.
    chunked_file_object.close()
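The loop above can also be wrapped in a reusable function. The following is a minimal sketch, not the only way to do it: the name split_file_by_lines and the out_dir and prefix parameters are hypothetical, and it uses itertools.islice to pull one chunk of lines at a time, so it holds up to lines_per_chunk lines in memory. Unlike the loop above, it does not create an empty trailing file when the line count is an exact multiple of the chunk size.

```python
import itertools
import os


def split_file_by_lines(path, lines_per_chunk, out_dir, prefix="file_"):
    """Split a text file into chunks of at most lines_per_chunk lines.

    Returns the list of chunk paths that were written.
    """
    chunk_paths = []
    # newline="" keeps the original line endings untouched on read and write.
    with open(path, newline="") as src:
        for index in itertools.count():
            # Pull the next batch of lines; an empty batch means end of file.
            lines = list(itertools.islice(src, lines_per_chunk))
            if not lines:
                break
            chunk_path = os.path.join(out_dir, f"{prefix}{index}.csv")
            with open(chunk_path, "w", newline="") as chunk:
                chunk.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths
```

Calling split_file_by_lines("mylargefile.csv", 1000, ".") would write file_0.csv, file_1.csv, and so on into the current directory, each closed automatically by its with block.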
I had to add "chunked_file_object.close()" to make it work. Edited code below:
print("writing file " + str(file_count))
chunked_file_object.close()
chunked_file_object = open("file_"+str(file_count)+".csv","wb")