Sunday, 3 June 2018

How to Undelete a BigQuery Table

    You can undelete a BigQuery table using table snapshot decorators and the copy command. A deleted table remains available in BigQuery's temporary location for the next 48 hours, so it is possible to retrieve data that was deleted within the last 2 days :)


But you can’t undelete the table in the following situations:

  • You have recreated a new table with the same name
  • The table was deleted more than 2 days ago

Step 1: Find Unix timestamp when the table was alive:


    First find the Unix timestamp when the table was still alive. For example, if you deleted the table today at 10 AM, you can use any timestamp before 10 AM to retrieve it.


Unix/Linux: 
$ date -d '06/12/2012 07:21:22' +"%s"
1339485682
Mac OS :

$ date -juf "%Y-%m-%d %H:%M:%S" "2018-01-08 14:45:00" +%s
1515422700

The above commands return a value in seconds, but the snapshot decorator needs milliseconds, so multiply the result by 1000.
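The same conversion can be sketched in Python. The datetime below is the example value from the Mac command above, interpreted here as UTC; the shell examples use your local timezone, so your seconds value may differ:

```python
import calendar
from datetime import datetime

# The moment the table was still alive (example value from Step 1).
alive_at = datetime(2018, 1, 8, 14, 45, 0)

seconds = calendar.timegm(alive_at.timetuple())  # treat the time as UTC
milliseconds = seconds * 1000                    # snapshot decorators want milliseconds
print(milliseconds)  # 1515422700000
```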



Step 2: Copy table snapshot to temporary table


    Once you have the appropriate timestamp, you can use the bq copy command with a table decorator to copy the deleted table's snapshot. It copies the data as it existed at the timestamp passed in the decorator into a temporary table.

$ bq cp dataset1.table1@1515422700000 dataset1.temp --project project-dev
Waiting on bqjob_r4d8174e2e41ae73_0000014a5af2a028_1 ... (0s)
    Current status: DONE   
Tables 'project-dev:dataset1.table1@1515422700000' successfully copied to    
    'project-dev:dataset1.temp'


Step 3: Verify temporary table data:


$ bq query "select * from dataset1.temp"  --project project-dev
Waiting on bqjob_r5967bea49ed9e97f_0000014a5af34dec_1 ... (0s)
    Current status: DONE


Step 4: Rename the temporary table to actual table name:


$ bq cp dataset1.temp dataset1.table1 --project project-dev
Waiting on bqjob_r3c0bb9302fb81d59_0000014a5af2dc7b_1 ... (0s)
    Current status: DONE   
Tables 'project-dev:dataset1.temp' successfully copied to
    'project-dev:dataset1.table1'


Step 5: Delete the temporary table once you have copied all data to the destination table:


$ bq rm dataset1.temp
rm: remove table 'project-dev:dataset1.temp'? (y/N) y 

Saturday, 26 May 2018

Install apache storm and Zookeeper

Requirements:


  • Install Java
  • Install ZooKeeper
  • Install Storm

Installing Java:

    sudo apt-get install software-properties-common python-software-properties
    sudo add-apt-repository ppa:webupd8team/java
    sudo apt-get update
    sudo apt-get install oracle-java8-installer

    

Check the Java version after a successful installation:


java -version

Setting Java path


Add the following lines to your .bashrc file, using the JAVA_HOME that matches your platform:

Refer https://askubuntu.com/questions/175514/how-to-set-java-home-for-java

# Ubuntu/Linux
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$PATH:$JAVA_HOME/bin

# macOS
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_151.jdk/Contents/Home
export PATH=$PATH:$JAVA_HOME/bin


Installing zookeeper:

# after downloading and extracting the zookeeper-3.4.10 release:
mv zookeeper-3.4.10/ zookeeper
cd zookeeper
mkdir data
cp conf/zoo_sample.cfg conf/zoo.cfg 
bin/zkServer.sh start


Installing Storm:

# after downloading and extracting the apache-storm-1.1.1 release:
mv apache-storm-1.1.1/ apache-storm
cd apache-storm
mkdir data
cd bin/
chmod +x storm

configuring storm.yaml:

vim apache-storm/conf/storm.yaml
storm.zookeeper.servers:     
    - "localhost"
nimbus.seeds: ["localhost"]

Adding Storm path:


Add following line into .bashrc file

export PATH=$PATH:/home/storm_user/storm-pixel/apache-storm/bin


check storm commands after successful installation:

  • storm version
  • storm nimbus 
  • storm supervisor
  • storm ui

Configuring streamparse:



Dependencies:

  • lein
  • storm

Install lein:

# refer http://leiningen.org/
cd /usr/bin
wget https://raw.githubusercontent.com/technomancy/leiningen/stable/bin/lein
chmod +x lein
# run the lein command once; it will download its packages locally
lein version


numpy basics python

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities


Importing Numpy:

import numpy as np

Creating numpy Array:

 d = np.array([1,2,3,4,5])

Numpy range:

  d = np.arange(1,10)  # It will create a numpy array from 1 to 9
   

numpy shape:


It will return the shape of the array: a tuple with the element count along each dimension.
d = np.array([1,2,3])
print d   # array([1, 2, 3])
print d.shape # (3,)

numpy reshape:



It will change the shape of a numpy array. Note that reshape returns a new array; it does not modify d in place.

d = np.arange(1,10)    # array([1, 2, 3, 4, 5, 6, 7, 8, 9])
d.shape                # (9,)
d = d.reshape(3,3)
print d    # array([[1, 2, 3],
           #        [4, 5, 6],
           #        [7, 8, 9]])
In the above example, the array is reshaped into a 3x3 matrix structure.


np.zeros()


    It will create a numpy array filled with zeros. We pass the shape as a tuple, and it creates the matrix array.
np.zeros((3, 3))     # array([[0., 0., 0.],
                     #        [0., 0., 0.],
                     #        [0., 0., 0.]])


np.vstack()


It will stack the elements of a numpy array vertically.

c = np.array([1,2,3])  # array([1, 2, 3])
np.vstack(c)           # array([[1],
                       #        [2],
                       #        [3]])
   

np.eye()


It will create a numpy identity matrix.

 np.eye(3)   # it will create a 3x3 identity matrix
             # array([[1., 0., 0.],
             #        [0., 1., 0.],
             #        [0., 0., 1.]])
        

     
np.dot()


 It will compute the dot product (matrix multiplication) of two matrices.
 np.dot(M1, M2)
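A small worked example with concrete values (M1 and M2 here are illustrative 2x2 matrices):

```python
import numpy as np

M1 = np.array([[1, 2],
               [3, 4]])
M2 = np.array([[5, 6],
               [7, 8]])

# matrix product: each entry is a row of M1 dotted with a column of M2
print(np.dot(M1, M2))
# [[19 22]
#  [43 50]]
```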

np.sum()


 It will sum all the elements in the given array.
M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
np.sum(M)           # 45, the sum of all elements
np.sum(M, axis=0)   # array([12, 15, 18])
    
If axis=0, it will sum column-wise.
If axis=1, it will sum row-wise.
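The two axis values can be compared side by side on the same matrix:

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

print(np.sum(M, axis=0))  # [12 15 18] -> column-wise sums
print(np.sum(M, axis=1))  # [ 6 15 24] -> row-wise sums
```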


np.random.rand()


It will produce arrays of random values drawn uniformly from [0, 1).
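A minimal sketch; the seed call is optional and used here only to make the run repeatable:

```python
import numpy as np

np.random.seed(0)         # optional: fix the seed so the run is repeatable
a = np.random.rand(2, 3)  # 2x3 array of uniform floats in [0, 1)

print(a.shape)                     # (2, 3)
print(((a >= 0) & (a < 1)).all())  # True: every value falls in [0, 1)
```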


np.append()


Append elements to an ndarray. Note that np.append returns a new array; it does not modify A in place.
 A = np.array([1, 2, 3])
 B = np.append(A, 4)             # [1, 2, 3, 4]
 B = np.append(A, [4, 5, 6, 7])  # [1, 2, 3, 4, 5, 6, 7]

Install apache airflow on ubuntu

What is Airflow:

Airflow is a platform to programmatically author, schedule and monitor workflows. This blog contains the following procedure to install Airflow on an Ubuntu/Linux machine.
  1. Installing system dependencies 
  2. Installing airflow with extra packages
  3. Installing airflow meta database
    1. Mysql
    2. Postgres
  4. Installing Rabbitmq (Message broker for CeleryExecutor)
We can use RabbitMQ as the message broker if you are using the CeleryExecutor. For the LocalExecutor there is no need to install a message broker like RabbitMQ/Redis.

1. Installing Dependency packages:


apt-get update && apt-get upgrade -y

sudo apt-get -yqq install git \
    python-dev \
    libkrb5-dev \
    libsasl2-dev \
    libssl-dev \
    libffi-dev \
    build-essential \
    libblas-dev \
    liblapack-dev \
    libpq-dev \
    python-pip \
    python-requests \
    apt-utils \
    curl \
    netcat \
    locales \
    libmysqlclient-dev \
    supervisor
pip install --upgrade pip 

2. Install Apache airflow


pip install PyYAML==3.12
pip install requests==2.18.4
pip install simplejson==3.12.0

pip install apache-airflow[crypto,celery,postgres,hive,hdfs,jdbc,gcp_api,rabbitmq,password,s3,mysql]==1.8.1
pip install celery==3.1.17 

3. Install Meta Database:


i. Install  Mysql


#Installing and enable mysql server
sudo debconf-set-selections <<< 'mysql-server mysql-server/root_password password airflowd2p'
sudo debconf-set-selections <<< 'mysql-server mysql-server/root_password_again password airflowd2p'

sudo apt-get -y install mysql-server    libmysqlclient-dev 

ii. Install PostgreSQL


# Installing and enable postgresql in systemd and starting server
sudo apt-get -y install postgresql \
    postgresql-contrib
update-rc.d postgresql enable

service postgresql start

4. Install RabbitMQ

apt-get update && apt-get upgrade -y
# Install erlang - dependency package for rabbitmq
wget https://packages.erlang-solutions.com/erlang-solutions_1.0_all.deb
sudo dpkg -i erlang-solutions_1.0_all.deb
sudo apt-get update
# Add the rabbitmq server repository
echo "deb https://dl.bintray.com/rabbitmq/debian xenial main" | sudo tee /etc/apt/sources.list.d/bintray.rabbitmq.list
sudo apt-get update

sudo apt-get -yqq install erlang rabbitmq-server

5. Create Rabbitmq users:


#!/usr/bin/env bash
#Creating airflow user, tag, virtual host
rabbitmq-plugins enable rabbitmq_management
rabbitmqctl add_user airflow_user airflow_user
rabbitmqctl add_vhost airflow
# set both tags in one call; set_user_tags overwrites any previously set tags
rabbitmqctl set_user_tags airflow_user airflow_tag administrator

rabbitmqctl set_permissions -p airflow airflow_user ".*" ".*" ".*"


ansible basics for beginners

What is Ansible


Ansible interacts with machines via SSH, so nothing needs to be installed on the client machines. The only prerequisite is that Ansible is installed on the controller machine, with Python available and SSH enabled.

Inventory:


Inventory file:


The inventory file is a simple text file which contains the list of machines to interact with. We can list individual machines or groups of machines. We can also pass commands directly to modules from the command line using the ansible CLI:

ansible <group-name> -i <inventory-filename> -m <module-name> -a <module-params>
 
Inventory:
server1.mycomp.com
server2.mycomp.com
 
[clients] #group name
server3.mycomp.com
server4.mycomp.com  


Ex: 
ansible clients -i inventory -m ping
ansible clients -i inventory -m apt -a "name=mysql-server state=present"

    The inventory file can also be an executable. For example, if you don’t know the number of instances running in AWS, you can simply write a script that returns the names of the running instances from AWS.
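A minimal sketch of such an executable inventory. Ansible runs the script with --list and reads JSON groups from stdout; the host names here are placeholders standing in for a real AWS API call (e.g. via boto3):

```python
#!/usr/bin/env python
import json
import sys

def get_running_hosts():
    # Placeholder: a real script would query the AWS API and return
    # the names of the currently running instances.
    return ["server3.mycomp.com", "server4.mycomp.com"]

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "--list":
        # Ansible expects a JSON mapping of group name -> hosts
        print(json.dumps({"clients": {"hosts": get_running_hosts()}}))
    else:
        print(json.dumps({}))
```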

Ansible play books:


    An Ansible playbook is a simple YAML file which contains the list of tasks to perform on the client machines mentioned in the inventory file.

playbook.yaml
---

- hosts: all
  tasks:
    - name: updating package list
      apt: update_cache=yes cache_valid_time=3600
- hosts: clients
  tasks:
    - name: installing mysql server
      apt: name=mysql-server state=present

In the above code snippet, we used the apt module for updating and installing packages. “hosts: all” means the tasks run on all the host machines mentioned in the inventory file.

We can also perform tasks on a specific group of hosts: “hosts: clients” runs the tasks below it only on the clients group created in the inventory file. The “name” of each task is a human-readable message printed while the task runs, which is very helpful when monitoring the execution.

    Running playbook:

  ansible-playbook -i inventory playbook.yaml

Variables in playbook:


Ansible uses the Jinja2 templating system for dealing with variables.

playbook.yaml
---
- hosts: all
  tasks:
    - name: updating package list
      apt: update_cache=yes cache_valid_time=3600
- hosts: clients
  vars:
    init_script: "create_db.sql"
  tasks:
    - name: installing mysql server
      apt: name=mysql-server state=present
    - name: copying init sql files

      copy: src=/tmp/{{init_script}} dest=/tmp/mysql/{{init_script}}


Variable loops in playbook:


playbook.yaml
---
- hosts: all
  tasks:
    - name: updating package list
      apt: update_cache=yes cache_valid_time=3600
- hosts: clients
  vars:
    init_script: "create_db.sql"
  tasks:
    - name: installing packages
      apt: name={{item}} state=present
      with_items:
        - python 
        - python-pip 
        - vim
    - name: copying init sql files

      copy: src=/tmp/{{init_script}} dest=/tmp/mysql/{{init_script}}

Alternatively, we can combine the variables per host group:

playbook.yaml


---

- hosts: all
  tasks:
    - name: updating package list
      apt: update_cache=yes cache_valid_time=3600
- hosts: clients
  vars:
    init_script: "create_db.sql"
    packages:
      - python
      - python-pip
      - vim
  tasks:
    - name: installing packages
      apt: name={{item}} state=present
      with_items: "{{packages}}"
    - name: copying init sql files
      copy: src=/tmp/{{init_script}} dest=/tmp/mysql/{{init_script}}
     

Directory Group variables:


By default, Ansible looks for directories called “group_vars” and “host_vars” in the same location as the playbook. Any variables defined under the group_vars directory are automatically applied to that specific group.

My folder structure:
    - inventory
    - playbook.yml
    - group_vars
            - all 
            - clients
    - host_vars
            - server.com

In the above folder structure, variables defined in the file called “all” under the group_vars directory are available to all hosts defined in the inventory. If you want to define variables for a specific host, create a file with that hostname under the “host_vars” directory.

Inventory directory:


    Normally the inventory is a simple text file, but it can also be a directory.

     ansible-playbook -i <inventory-dirctory> playbook.yml

  • ansible-playbook -i uat deploy.yml
  • ansible-playbook -i dev deploy.yml
  • ansible-playbook -i prod deploy.yml

Directory structure of inventory folder:
        
        dev
              - hosts
              - group_vars
              - host_vars
        uat
              - hosts
              - group_vars
              - host_vars
        prod
              - hosts
              - group_vars
              - host_vars
        deploy.yml

If there are any text files in your inventory directory, Ansible will treat them as inventory files.

Roles in ansible:


You can use a single playbook file for managing all the tasks of your infrastructure, but at some stage that playbook becomes bigger and harder to manage. For this Ansible has the “roles” feature, so you can split your playbook YAML file in a more modular way.

You can create a directory called “roles” and create playbook modules.

Roles directory structure:

        dev
              - hosts
              - group_vars
              - host_vars
        roles
              - common
                    - defaults
                        - main.yml   # variable values
                    - tasks
                        - main.yml   # list of tasks need to be execute
                    - files
                        - server.py   # file need to be copy
                    - templates
                        - config.py.j2  # template file used for template module
                    - meta
                        - main.yml  # list the dependency task before perform
              - webserver
                    - defaults
                        - main.yml   # variable values
                    - tasks
                        - main.yml   # list of tasks
              - db
                    - tasks
                        - main.yml   # list of tasks
        deploy.yml

deploy.yml

- hosts: database-server
  roles:
    - common
    - db
- hosts: web-server
  roles:
    - common
    - webserver



Here we can break the roles folder down into more modules; this is documented on the Ansible documentation site.
  • The defaults folder contains the variables to be registered
  • The tasks folder contains the tasks to be performed for that group
  • The files folder contains the files to be transferred
  • The templates folder is for the template module
  • The meta folder contains the dependency list for that specific group
    
        Ex:
                main.yml
                --- 
                dependencies:
                    - common
                    - db 

Sunday, 11 February 2018

Introduction to Python Argparse


What are Python command line arguments?


While executing a python script, we can provide additional arguments on the command line. These arguments are passed into the program, and we can access them inside the program with the help of python modules (sys, argparse, etc.). The python "sys" module is one of the traditional and simple ways of handling command line arguments.

my-script.py
import sys
print len(sys.argv)
print sys.argv
print sys.argv[0]
print sys.argv[1]
print sys.argv[2]


$ python my-script.py arg1 arg2
3
['my-script.py', 'arg1', 'arg2']
my-script.py
arg1
arg2


Python "argparse" module:


There are many python modules available for handling command line arguments. One of the most popular is argparse. Argparse was added in python 2.7 as a replacement for optparse. It provides more features than the traditional sys module.

Parsing command line arguments:


The "parse_args" method of the ArgumentParser class is used to parse the command line arguments. By default it takes arguments from sys.argv[1:], but we can also provide our own list.

We define arguments using the add_argument function; parse_args then returns a Namespace object containing the parsed arguments.

import sys
import argparse
parser = argparse.ArgumentParser(description='sample app')
parser.add_argument("name", help="Please enter your name")
args = parser.parse_args()
print args
print args.name
$ python my-script.py -h
usage: my-script.py [-h] name

sample app

positional arguments:
  name        Please enter your name

optional arguments:
  -h, --help  show this help message and exit
$ python my-script.py Jerry
Namespace(name='Jerry')
Jerry
-h or --help is a default feature added to your script when you use the argparse module. It shows the available positional and optional arguments along with the help messages we provided.


argument type:


We can explicitly specify the type of argument argparse should accept. The "type" field specifies the cast: argparse converts the argument value to the specified type while parsing. If the value cannot be converted, it throws an error.

import sys
import argparse
parser = argparse.ArgumentParser(description='sample app')
parser.add_argument("square", type=int, help="Please enter an integer")
args = parser.parse_args()
print args.square**2
$ python my-script.py 4
16

Optional arguments:


When you add a positional argument to the parser, you must provide a value for it, otherwise parsing throws an error. But optional arguments are actually optional: there is no error when running the program without them.

import sys
import argparse
parser = argparse.ArgumentParser(description='sample app')
parser.add_argument("--square", dest="square", default=2, type=int, help="Please enter integer value")
args = parser.parse_args()
print args.square**2
$ python my-script.py --square 4
16
$ python my-script.py
4

If we do not provide the argument on the command line, it takes the value from the default field. None is the default value of the default field.

Short options:


We can define short versions of the optional arguments, which is handy for frequently used options.

import sys
import argparse
parser = argparse.ArgumentParser(description='sample app')
parser.add_argument("-s","--square", dest="square", default=2, type=int)
args = parser.parse_args()
print args.square**2

$ python my-script.py -s 4
16

Argument Actions:


The action field of the add_argument() function specifies what kind of action to perform for that argument. The default value is "store", i.e. store the given value in the destination variable. The following six kinds of actions can be triggered when we add an argument.

  • store - it is a default value of an action field. it will store specified value to destination variable
  • store_const - store the value defined as part of argument specification
  • store_true/store_false - save boolean values to the variables
  • append - save the value to the list
  • append_const - store the value defined in the argument to list
  • version - prints the version details about the program

examples:


import sys
import argparse
parser = argparse.ArgumentParser(description='sample app')
parser.add_argument("-v", "--verbose", action="store_true", default=False)
parser.add_argument("-s","--square", dest="square", default=2, type=int)
parser.add_argument("-a","--add", dest="my_list", default=[], action="append")
args = parser.parse_args()
if args.verbose:
    print "printing verbose output"
print "square value ", args.square**2
print "my list is ", args.my_list
$ python my-script.py -v --square 4 -a 2 -a 3
printing verbose output
square value  4
my list is  ['2', '3']