
Wednesday, 31 August 2016

How to split large files into smaller chunk files using Python?

In the big data world, many of us handle large data files. When a file is very big (above 10 GB), it is difficult to handle as a single file, so we need to split it into several smaller chunks and then process them.

There are several ways to split a large file; here I describe two of them.
  • Using Unix commands to split a file
  • Using Python file objects, the typical Pythonic way

1. Using Unix commands:


Unix has a built-in split command for splitting a file into smaller files.
split [options] filename prefix
Options:

   The split command has several options (see man split). The main ones split by number of lines or by size:
  -l linenumber

  -b bytes

Example: split -l 1000 myfile.csv

With no prefix given, the output files are named xaa, xab, and so on.

How to call Unix commands using Python?


    Python has a built-in os module that can do this, but I recommend using the subprocess module for calling Unix commands from Python code. The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes.

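code:
 import subprocess
 # Equivalent to running: split -l 10000 myfile.csv
 cmd = ["split", "-l", "10000", "myfile.csv"]
 # check_call raises CalledProcessError if split exits with a non-zero status
 subprocess.check_call(cmd)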
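
A slightly more defensive version of the call above might look like the following sketch. The helper names build_split_command and run_split and the chunk_ prefix are illustrative assumptions, not part of the original code, and the sketch assumes the split utility is available on the PATH.

```python
import subprocess


def build_split_command(path, lines_per_chunk, prefix):
    """Build the argument list for: split -l <lines> <path> <prefix>."""
    return ["split", "-l", str(lines_per_chunk), path, prefix]


def run_split(path, lines_per_chunk=10000, prefix="chunk_"):
    """Run split and return True on success, False if it exits non-zero."""
    cmd = build_split_command(path, lines_per_chunk, prefix)
    try:
        subprocess.check_call(cmd)
    except subprocess.CalledProcessError as err:
        # split ran but reported a failure (for example, a missing input file)
        print("split failed with return code", err.returncode)
        return False
    return True
```

Passing the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces. If split itself is not installed, check_call raises an OSError instead, which this sketch leaves to the caller.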

2. Using Python file objects to split files:


We can also split files in the typical Python way: read the file line by line and write each line to a new file, starting a new output file whenever the line count reaches the chunk size.

chunked_file_line_count = 1000
file_name = "mylargefile.csv"
with open(file_name) as file_obj:
    line_count = 0
    file_count = 0
    print("================ Splitting file ======================== " + file_name)
    chunked_file_object = open("file_" + str(file_count) + ".csv", "w")
    for line in file_obj:
        # Start the next chunk only when there is another line to put in it
        if line_count == chunked_file_line_count:
            chunked_file_object.close()
            file_count = file_count + 1
            line_count = 0
            print("writing file " + str(file_count))
            chunked_file_object = open("file_" + str(file_count) + ".csv", "w")
        chunked_file_object.write(line)
        line_count = line_count + 1

    chunked_file_object.close()
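
The same idea can be wrapped in a reusable function. The following is a minimal sketch, not the only way to do it: the function name split_by_lines and its parameters are hypothetical, and it uses itertools.islice to read one chunk of lines at a time instead of counting lines by hand.

```python
from itertools import islice


def split_by_lines(path, lines_per_chunk=1000, prefix="file_"):
    """Split a text file into chunks of at most lines_per_chunk lines.

    Returns the list of chunk file names that were written.
    """
    chunk_names = []
    with open(path) as source:
        while True:
            # islice pulls at most lines_per_chunk lines from the open file
            lines = list(islice(source, lines_per_chunk))
            if not lines:
                break  # nothing left, so no empty trailing chunk is created
            name = prefix + str(len(chunk_names)) + ".csv"
            with open(name, "w") as chunk:
                chunk.writelines(lines)
            chunk_names.append(name)
    return chunk_names
```

For example, split_by_lines("mylargefile.csv", 1000) would write file_0.csv, file_1.csv, and so on into the current directory. Only one chunk of lines is held in memory at a time, so the chunk size should be chosen with available memory in mind.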