Wednesday, 31 August 2016

How to split large files into smaller chunk files using Python?

In the big data world, many of us handle large data files. When a file is very large (above 10 GB), it is difficult to process as a single file, so we need to split it into several smaller chunks and then process those.

There are several ways to split a large file; here I cover two of them:
  • Using Unix commands to split a file
  • The typical Pythonic way

1. Using Unix commands:


Unix has a built-in split command for splitting a file into smaller files.
split [options] filename prefix
Options:

   The split command has several options (see man split). The main ones split by number of lines or by size:
  -l linenumber

  -b bytes

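Example: split -l 1000 myfile.csv

Because no prefix is given here, split writes the 1000-line chunks to the current directory under its default names: xaa, xab, xac, and so on.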

How to call Unix commands from Python?


    Python has a built-in os module that can do this, but I recommend using the subprocess module for calling Unix commands from Python code. The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes.

code:
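 import subprocess
 # split myfile.csv into 10000-line chunks; the input file must come right after the options
 cmd = ["split", "-l", "10000", "myfile.csv"]
 # check_call raises CalledProcessError if split exits with a non-zero status
 subprocess.check_call(cmd)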
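
If you want the chunk file names back in Python, the call above can be wrapped in a small helper. This is only a sketch: the function names build_split_command and split_with_unix and the "chunk_" prefix are my own choices for illustration, and it assumes the split command is installed and that no unrelated files already start with the same prefix.

```python
import glob
import subprocess


def build_split_command(path, lines_per_chunk, prefix):
    """Return the argument list for: split -l N FILE PREFIX."""
    return ["split", "-l", str(lines_per_chunk), path, prefix]


def split_with_unix(path, lines_per_chunk=10000, prefix="chunk_"):
    """Run split and return the chunk files it created, in order."""
    cmd = build_split_command(path, lines_per_chunk, prefix)
    try:
        subprocess.check_call(cmd)
    except FileNotFoundError:
        # raised when the split executable cannot be found
        raise RuntimeError("the split command is not available on this system")
    except subprocess.CalledProcessError as exc:
        raise RuntimeError("split failed with exit code %d" % exc.returncode)
    # split names its output PREFIXaa, PREFIXab, ... so sorting keeps the order;
    # note this also picks up any older files that share the prefix
    return sorted(glob.glob(glob.escape(prefix) + "*"))
```

For example, split_with_unix("myfile.csv", 10000, "chunk_") should return a list such as chunk_aa, chunk_ab, and so on, which you can then loop over to process each chunk.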

2. Using a Python file object to split files:


We can also split files in plain Python: read the file line by line and write each line to a new file, starting a new output file whenever the line count reaches the chunk size.

chunked_file_line_count = 1000
input_file_name = "mylargefile.csv"
with open(input_file_name) as file_obj:
    line_count = 0
    file_count = 0
    print("================ Splitting file ======================== " + input_file_name)
    chunked_file_object = open("file_" + str(file_count) + ".csv", "w")
    for line in file_obj:
        # start the next chunk only when there is another line to write,
        # so no empty trailing file is created
        if line_count == chunked_file_line_count:
            # close the finished chunk so its contents are flushed to disk
            chunked_file_object.close()
            file_count = file_count + 1
            line_count = 0
            print("writing file " + str(file_count))
            chunked_file_object = open("file_" + str(file_count) + ".csv", "w")
        chunked_file_object.write(line)
        line_count = line_count + 1

    chunked_file_object.close()
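
The same idea also works for splitting by size instead of line count, similar to what split -b does. Below is a minimal sketch, not a definitive implementation: the function name split_by_size, the "file_" prefix and the ".bin" suffix are assumptions for illustration. Like split -b, it can cut a line in the middle, and each chunk is held in memory while it is written, so pick a chunk size that fits comfortably in RAM.

```python
def split_by_size(path, chunk_size_bytes, prefix="file_", suffix=".bin"):
    """Split path into files of at most chunk_size_bytes; return their names."""
    chunk_names = []
    # binary mode so that sizes are counted in bytes, not characters
    with open(path, "rb") as source:
        while True:
            data = source.read(chunk_size_bytes)
            if not data:
                # an empty read means the end of the file was reached
                break
            name = "%s%d%s" % (prefix, len(chunk_names), suffix)
            with open(name, "wb") as chunk:
                chunk.write(data)
            chunk_names.append(name)
    return chunk_names
```

Calling split_by_size("mylargefile.csv", 100 * 1024 * 1024) would write chunks of up to 100 MB each, named file_0.bin, file_1.bin, and so on.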

5 comments:

  1. This comment has been removed by the author.

  2. I wish to show thanks to you just for bailing me out of this particular trouble. As a result of checking through the net and meeting techniques that were not productive, I thought my life was done.
    Surya Informatics

  3. I have to add "chunked_file_object.close()" to make it work. Edited code below:

    print("writing file " + str(file_count))
    chunked_file_object.close()
    chunked_file_object = open("file_"+str(file_count)+".csv","wb")
