Friday, 28 July 2017

Informatica Scenario Based Questions - Part 2

1. Consider the following employees data as source

employee_id, salary
10, 1000
20, 2000
30, 3000
40, 5000


Q1. Design a mapping to load the cumulative sum of salaries of employees into the target table.
The target table data should look like this:

employee_id, salary, cumulative_sum
10, 1000, 1000
20, 2000, 3000
30, 3000, 6000
40, 5000, 11000

Solution:

Connect the Source Qualifier to an expression transformation. In the expression transformation, create a variable port V_cum_sal and in the expression editor write V_cum_sal+salary. Then create an output port O_cum_sal and assign V_cum_sal to it.
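The same running total can be sketched outside Informatica with awk (a quick sanity check, assuming the source is a comma-separated employee_id,salary feed):

```shell
# cum plays the role of V_cum_sal: add each salary (field 2) to the running total
printf '10,1000\n20,2000\n30,3000\n40,5000\n' |
awk -F, -v OFS=, '{cum += $2; print $1, $2, cum}'
# last line printed: 40,5000,11000
```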


Q2. Design a mapping to get the previous row salary for the current row. If no previous row exists for the current row, then the previous row salary should be displayed as null.
The output should look like this:

employee_id, salary, pre_row_salary
10, 1000, Null
20, 2000, 1000
30, 3000, 2000
40, 5000, 3000

Solution:

Connect the Source Qualifier to an expression transformation. In the expression transformation, create a variable port V_count and increment it by one for each row entering the transformation. Also create a variable port V_salary and assign the expression IIF(V_count=1,NULL,V_prev_salary) to it. Then create one more variable port V_prev_salary and assign salary to it (the port order matters: V_salary must be evaluated before V_prev_salary is overwritten with the current row's salary). Now create an output port O_prev_salary and assign V_salary to it. Connect the expression transformation to the target ports.

In the expression transformation, the ports will be

employee_id
salary
V_count=V_count+1
V_salary=IIF(V_count=1,NULL,V_prev_salary)
V_prev_salary=salary
O_prev_salary=V_salary
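The previous-row logic above can be checked with a short awk sketch (an illustration, assuming comma-separated employee_id,salary input):

```shell
# prev holds the previous row's salary, like V_prev_salary;
# the first row has no previous row, so print Null
printf '10,1000\n20,2000\n30,3000\n40,5000\n' |
awk -F, -v OFS=, '{print $1, $2, (NR == 1 ? "Null" : prev); prev = $2}'
```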


Q3. Design a mapping to get the next row salary for the current row. If there is no next row for the current row, then the next row salary should be displayed as null.
The output should look like this:

employee_id, salary, next_row_salary
10, 1000, 2000
20, 2000, 3000
30, 3000, 5000
40, 5000, Null

Solution: 

Step1: Connect the source qualifier to two expression transformations. In each expression transformation, create a variable port V_count and in the expression editor write V_count+1. Now create an output port O_count in each expression transformation. In the first expression transformation, assign V_count to O_count. In the second expression transformation, assign V_count-1 to O_count.

In the first expression transformation, the ports will be

employee_id
salary
V_count=V_count+1
O_count=V_count

In the second expression transformation, the ports will be

employee_id
salary
V_count=V_count+1
O_count=V_count-1

Step2: Connect both expression transformations to a joiner transformation and join them on the port O_count. Consider the first expression transformation as the master and the second one as the detail, and specify the join type as Detail Outer Join. Check the Sorted Input property in the joiner transformation; this is required to connect two pipelines that originate from the same source qualifier.

Step3: Pass the output of the joiner transformation to a target table. From the joiner, connect the employee_id and salary ports that come from the first expression transformation to the employee_id and salary ports in the target table. Then connect the salary port that comes from the second expression transformation to the next_row_salary port in the target table.
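An equivalent next-row lookup can be sketched in awk by buffering each row until the following row arrives (an analogy for the self-join, not the joiner itself):

```shell
# Hold the current row; once the next row is read, its salary ($2) is known.
printf '10,1000\n20,2000\n30,3000\n40,5000\n' |
awk -F, -v OFS=, '
  NR > 1 { print id, sal, $2 }     # the next row salary is the current $2
  { id = $1; sal = $2 }            # remember the current row
  END { print id, sal, "Null" }    # the last row has no next row
'
```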


Q4. Design a mapping to find the sum of salaries of all employees and this sum should repeat for all the rows.
The output should look like this:

employee_id, salary, salary_sum
10, 1000, 11000
20, 2000, 11000
30, 3000, 11000
40, 5000, 11000

Solution:

Step1: Connect the source qualifier to the expression transformation. In the expression transformation, create a dummy port and assign value 1 to it.

In the expression transformation, the ports will be

employee_id
salary
O_dummy=1

Step2: Pass the output of expression transformation to aggregator. Create a new port O_sum_salary and in the expression editor write SUM(salary). Do not specify group by on any port.

In the aggregator transformation, the ports will be

salary
O_dummy
O_sum_salary=SUM(salary)

Step3: Pass the outputs of the expression transformation and the aggregator transformation to a joiner transformation and join on the DUMMY port. Check the Sorted Input property in the joiner transformation; this is required to connect two pipelines that originate from the same source qualifier.

Step4: Pass the output of joiner to the target table.
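The whole aggregator-plus-joiner pattern maps onto a two-pass awk sketch: the first pass plays the aggregator (computing SUM(salary)), the second the joiner (attaching the total to every row). The file name emp.csv is just an illustrative choice:

```shell
printf '10,1000\n20,2000\n30,3000\n40,5000\n' > emp.csv
# Pass 1 (NR==FNR): accumulate the total. Pass 2: print each row with the total.
awk -F, -v OFS=, 'NR == FNR { total += $2; next } { print $1, $2, total }' emp.csv emp.csv
```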


2. Consider the following employees table as source

department_no, employee_name
20, R
10, A
10, D
20, P
10, B
10, C
20, Q
20, S


Q1. Design a mapping to load a target table with the following values from the above source?

department_no, employee_list
10, A
10, A,B
10, A,B,C
10, A,B,C,D
20, A,B,C,D,P
20, A,B,C,D,P,Q
20, A,B,C,D,P,Q,R
20, A,B,C,D,P,Q,R,S

Solution:

Step1: Use a sorter transformation and sort the data with department_no and employee_name as the sort keys (sorting on employee_name as well is what makes the names accumulate in alphabetical order), then pass the output to the expression transformation. In the expression transformation, the ports will be

department_no
employee_name
V_employee_list = IIF(ISNULL(V_employee_list),employee_name,V_employee_list||','||employee_name)
O_employee_list = V_employee_list

Step2: Now connect the expression transformation to a target table.
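This cumulative-list logic can be sanity-checked with a sort | awk pipeline (assuming comma-separated department_no,employee_name input; the sort uses both columns):

```shell
# list accumulates names across the whole file, like V_employee_list
printf '20,R\n10,A\n10,D\n20,P\n10,B\n10,C\n20,Q\n20,S\n' |
sort -t, -k1,1n -k2,2 |
awk -F, -v OFS=, '{ list = (list == "" ? $2 : list "," $2); print $1, list }'
```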


Q2. Design a mapping to load a target table with the following values from the above source?

department_no, employee_list
10, A
10, A,B
10, A,B,C
10, A,B,C,D
20, P
20, P,Q
20, P,Q,R
20, P,Q,R,S

Solution:

Step1: Use a sorter transformation and sort the data with department_no and employee_name as the sort keys, then pass the output to the expression transformation. In the expression transformation, the ports will be

department_no
employee_name
V_curr_deptno=department_no
V_employee_list = IIF(V_curr_deptno != V_prev_deptno, employee_name, V_employee_list||','||employee_name)
V_prev_deptno=department_no
O_employee_list = V_employee_list

Step2: Now connect the expression transformation to a target table.
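The department-wise reset can be sketched the same way; the V_curr_deptno/V_prev_deptno comparison becomes a check against awk's prev variable:

```shell
# Restart the list whenever the department number changes
printf '20,R\n10,A\n10,D\n20,P\n10,B\n10,C\n20,Q\n20,S\n' |
sort -t, -k1,1n -k2,2 |
awk -F, -v OFS=, '{ list = ($1 != prev ? $2 : list "," $2); prev = $1; print $1, list }'
```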


Q3. Design a mapping to load a target table with the following values from the above source?

department_no, employee_names
10, A,B,C,D
20, P,Q,R,S

Solution:

The first step is the same as in the above problem. Pass the output of the expression transformation to an aggregator transformation and specify group by on department_no. The aggregator returns the last row of each group by default, which holds the complete employee list for that department. Now connect the aggregator transformation to a target table.
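The aggregator's keep-last-row-per-group behaviour can be sketched by emitting each department's list only once it is complete:

```shell
printf '20,R\n10,A\n10,D\n20,P\n10,B\n10,C\n20,Q\n20,S\n' |
sort -t, -k1,1n -k2,2 |
awk -F, -v OFS=, '
  $1 != prev { if (prev != "") print prev, list; list = "" }  # department changed: emit finished group
  { list = (list == "" ? $2 : list "," $2); prev = $1 }
  END { print prev, list }                                    # emit the last group
'
```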

Informatica Scenario Based Interview Questions with Answers

Informatica Scenario Based Interview Questions with Answers - Part 1

Informatica Scenarios


I have listed below the informatica scenarios that are frequently asked in informatica interviews. Working through these scenario questions helps you gain confidence for the interviews.

1. How to generate sequence numbers using expression transformation?

Solution:
In the expression transformation, create a variable port and increment it by 1. Then assign the variable port to an output port. In the expression transformation, the ports are:
V_count=V_count+1
O_count=V_count

2. Design a mapping to load the first 3 rows from a flat file into a target?

Solution:
You have to assign row numbers to each record. Generate the row numbers either using the expression transformation as mentioned above or using the sequence generator transformation.
Then pass the output to a filter transformation and specify the filter condition as O_count <= 3.
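Outside Informatica, the first-3-rows idea is just a row-number filter; awk's NR plays the role of O_count (head -3 does the same thing):

```shell
# Keep only the rows whose row number is at most 3
printf 'a\nb\nc\nd\ne\n' | awk 'NR <= 3'
```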

3. Design a mapping to load the last 3 rows from a flat file into a target?

Solution:
Consider the source has the following data.
col
a
b
c
d
e

Step1: Assign row numbers to each record. Generate the row numbers using the expression transformation as mentioned above and call the generated row number port O_count. Create a DUMMY output port in the same expression transformation and assign 1 to it, so that the DUMMY output port always returns 1 for each row.

In the expression transformation, the ports are
V_count=V_count+1
O_count=V_count
O_dummy=1

The output of expression transformation will be
col, o_count, o_dummy
a, 1, 1
b, 2, 1
c, 3, 1
d, 4, 1
e, 5, 1

Step2: Pass the output of expression transformation to aggregator and do not specify any group by condition. Create an output port O_total_records in the aggregator and assign O_count port to it. The aggregator will return the last row by default. The output of aggregator contains the DUMMY port which has value 1 and O_total_records port which has the value of total number of records in the source.

In the aggregator transformation, the ports are
O_dummy
O_count
O_total_records=O_count

The output of aggregator transformation will be
O_total_records, O_dummy
5, 1

Step3: Pass the outputs of the expression transformation and the aggregator transformation to a joiner transformation and join on the DUMMY port. Check the Sorted Input property in the joiner transformation; this is required to connect two pipelines that originate from the same source qualifier.

In the joiner transformation, the join condition will be
O_dummy (port from aggregator transformation) = O_dummy (port from expression transformation)

The output of joiner transformation will be
col, o_count, o_total_records
a, 1, 5
b, 2, 5
c, 3, 5
d, 4, 5
e, 5, 5

Step4: Now pass the output of the joiner transformation to a filter transformation and specify the filter condition as O_total_records (port from aggregator) - O_count (port from expression) <= 2.

In the filter transformation, the filter condition will be
O_total_records - O_count <=2

The output of filter transformation will be
col o_count, o_total_records
c, 3, 5
d, 4, 5
e, 5, 5
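The whole last-3-rows mapping (count the rows, then filter on total - rownum <= 2) maps onto a two-pass awk over the same file (tail -3 is the one-liner equivalent; src.txt is an illustrative file name):

```shell
printf 'a\nb\nc\nd\ne\n' > src.txt
# Pass 1: total = number of rows (the aggregator). Pass 2: keep the last 3 (the filter).
awk 'NR == FNR { total = NR; next } total - FNR <= 2' src.txt src.txt
```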

4. Design a mapping to load the first record from a flat file into one table A, the last record from a flat file into table B and the remaining records into table C?

Solution:
This is similar to the above problem; the first 3 steps are the same. In the last step, instead of using the filter transformation, use a router transformation and create two output groups in it.

In the first group, the condition should be O_count=1 and connect the corresponding output group to table A. In the second group, the condition should be O_count=O_total_records and connect the corresponding output group to table B. The output of default group should be connected to table C.

5. Consider the following products data which contain duplicate rows.
A
B
C
C
B
D
B

Q1. Design a mapping to load all unique products in one table and the duplicate rows in another table.
The first table should contain the following output
A
D

The second target should contain the following output
B
B
B
C
C

Solution:
Use a sorter transformation and sort the products data. Pass the output to an expression transformation and create a dummy output port O_dummy and assign 1 to it, so that the DUMMY output port always returns 1 for each row.

The output of expression transformation will be
Product, O_dummy
A, 1
B, 1
B, 1
B, 1
C, 1
C, 1
D, 1

Pass the output of the expression transformation to an aggregator transformation. Check group by on the product port. In the aggregator, create an output port O_count_of_each_product and write the expression COUNT(product).

The output of aggregator will be
Product, O_count_of_each_product
A, 1
B, 3
C, 2
D, 1

Now pass the outputs of the expression transformation and the aggregator transformation to a joiner transformation and join on the product port. Check the Sorted Input property in the joiner transformation; this is required to connect two pipelines that originate from the same source qualifier.

The output of joiner will be
product, O_dummy, O_count_of_each_product
A, 1, 1
B, 1, 3
B, 1, 3
B, 1, 3
C, 1, 2
C, 1, 2
D, 1, 1

Now pass the output of joiner to a router transformation, create one group and specify the group condition as O_dummy=O_count_of_each_product. Then connect this group to one table. Connect the output of default group to another table.
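The router condition O_dummy = O_count_of_each_product simply separates values that occur exactly once from the rest; a two-pass awk over the sorted products file shows the same split (products.txt is an illustrative file name):

```shell
printf 'A\nB\nC\nC\nB\nD\nB\n' | sort > products.txt
# Pass 1: count each product (the aggregator). Pass 2: route rows by the count.
awk 'NR == FNR { cnt[$0]++; next } cnt[$0] == 1' products.txt products.txt   # unique rows: A, D
awk 'NR == FNR { cnt[$0]++; next } cnt[$0] > 1'  products.txt products.txt   # duplicated rows: B B B C C
```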

Q2. Design a mapping to load each product once into one table and the remaining products which are duplicated into another table.
The first table should contain the following output
A
B
C
D

The second table should contain the following output
B
B
C

Solution:
Use a sorter transformation and sort the products data. Pass the output to an expression transformation and create a variable port, V_curr_product, and assign the product port to it. Then create a V_count port and in the expression editor write IIF(V_curr_product=V_prev_product, V_count+1, 1). Create one more variable port V_prev_product and assign the product port to it. Now create an output port O_count and assign V_count to it.

In the expression transformation, the ports are
Product
V_curr_product=product
V_count=IIF(V_curr_product=V_prev_product,V_count+1,1)
V_prev_product=product
O_count=V_count

The output of expression transformation will be
Product, O_count
A, 1
B, 1
B, 2
B, 3
C, 1
C, 2
D, 1

Now Pass the output of expression transformation to a router transformation, create one group and specify the condition as O_count=1. Then connect this group to one table. Connect the output of default group to another table.
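This O_count=1 split is the classic awk seen-before idiom: !seen[$0]++ keeps the first occurrence of each product, and seen[$0]++ keeps the repeats (products.txt is an illustrative file name):

```shell
printf 'A\nB\nC\nC\nB\nD\nB\n' | sort > products.txt
awk '!seen[$0]++' products.txt   # first occurrence of each product
awk 'seen[$0]++'  products.txt   # the remaining duplicates
```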

Wednesday, 28 June 2017

Unix Sed Tutorial: Append, Insert, Replace, and Count File Lines

Sed provides many commands to operate on the lines of a file.
In this article let us review how to append, insert, and replace a line in a file, and how to get the line numbers of a file.
Let us first create ABC.txt file that will be used in all the examples mentioned below.
$cat ABC.txt
Linux Sysadmin
Databases - Oracle, mySQL etc.
Security (Firewall, Network, Online Security etc)
Storage in Linux
Productivity (Too many technologies to explore, not much time available)
Windows- Sysadmin, reboot etc.

Append Lines Using Sed Command

Sed provides the command "a", which appends a line after every line matching the given address or pattern.
Syntax:

#sed 'ADDRESS a\
 Line which you want to append' filename

#sed '/PATTERN/ a\
 Line which you want to append' filename

Sed Append Example 1. Add a line after the 3rd line of the file.

Add the line "Cool gadgets and websites" after the 3rd line. The sed "a" command appends the line after the match.
$ sed '3 a\
> Cool gadgets and websites' ABC.txt

Linux Sysadmin
Databases - Oracle, mySQL etc.
Security (Firewall, Network, Online Security etc)
Cool gadgets and websites
Storage in Linux
Productivity (Too many technologies to explore, not much time available)
Windows- Sysadmin, reboot etc.

Sed Append Example 2. Append a line after every line matching the pattern

The below sed command adds the line "Linux Scripting" after every line that matches the pattern "Sysadmin".
$ sed '/Sysadmin/a \
> Linux Scripting' ABC.txt

Linux Sysadmin
Linux Scripting
Databases - Oracle, mySQL etc.
Security (Firewall, Network, Online Security etc)
Storage in Linux
Productivity (Too many technologies to explore, not much time available)
Windows- Sysadmin, reboot etc.
Linux Scripting

Sed Append Example 3. Append a line at the end of the file

The following example appends the line "Website Design" at the end of the file.
$ sed '$ a\
> Website Design' ABC.txt

Linux Sysadmin
Databases - Oracle, mySQL etc.
Security (Firewall, Network, Online Security etc)
Storage in Linux
Productivity (Too many technologies to explore, not much time available)
Windows- Sysadmin, reboot etc.
Website Design

Insert Lines Using Sed Command

The sed command "i" is used to insert a line before every line matching the given address or pattern.
Syntax:

#sed 'ADDRESS i\
 Line which you want to insert' filename

#sed '/PATTERN/ i\
 Line which you want to insert' filename

Sed Insert Example 1. Add a line before the 4th line of the file.

Add the line "Cool gadgets and websites" before the 4th line. The "a" command appends the line after the match, whereas "i" inserts it before the match.
$ sed '4 i\
> Cool gadgets and websites' ABC.txt

Linux Sysadmin
Databases - Oracle, mySQL etc.
Security (Firewall, Network, Online Security etc)
Cool gadgets and websites
Storage in Linux
Productivity (Too many technologies to explore, not much time available)
Windows- Sysadmin, reboot etc.

Sed Insert Example 2. Insert a line before every line with the pattern

The below sed command adds the line "Linux Scripting" before every line that matches the pattern "Sysadmin".
$ sed '/Sysadmin/i \
> Linux Scripting' ABC.txt

Linux Scripting
Linux Sysadmin
Databases - Oracle, mySQL etc.
Security (Firewall, Network, Online Security etc)
Storage in Linux
Productivity (Too many technologies to explore, not much time available)
Linux Scripting
Windows- Sysadmin, reboot etc.

Sed Insert Example 3. Insert a line before the last line of the file.

Insert the line "Website Design" before the last line of the file.
$ sed '$ i\
> Website Design' ABC.txt
Linux Sysadmin
Databases - Oracle, mySQL etc.
Security (Firewall, Network, Online Security etc)
Storage in Linux
Productivity (Too many technologies to explore, not much time available)
Website Design
Windows- Sysadmin, reboot etc.

Replace Lines Using Sed Command

The "c" command in sed replaces every line matching the given pattern or range with the new line.
Syntax:

#sed 'ADDRESS c\
 new line' filename

#sed '/PATTERN/ c\
 new line' filename

Sed Replace Example 1. Replace the first line of the file

The below command replaces the first line of the file with "The Geek Stuff".
$ sed '1 c\
> The Geek Stuff' ABC.txt

The Geek Stuff
Databases - Oracle, mySQL etc.
Security (Firewall, Network, Online Security etc)
Storage in Linux
Productivity (Too many technologies to explore, not much time available)
Windows- Sysadmin, reboot etc.

Sed Replace Example 2. Replace a line which matches the pattern

Replace every line that matches the pattern "Linux Sysadmin" with "Linux Sysadmin - Scripting".
$ sed '/Linux Sysadmin/c \
> Linux Sysadmin - Scripting' ABC.txt

Linux Sysadmin - Scripting
Databases - Oracle, mySQL etc.
Security (Firewall, Network, Online Security etc)
Storage in Linux
Productivity (Too many technologies to explore, not much time available)
Windows- Sysadmin, reboot etc.

Sed Replace Example 3. Replace the last line of the file

Sed command given below replaces the last line of the file with “Last Line of the file”.
$ sed '$ c\
> Last line of the file' ABC.txt

Linux Sysadmin
Databases - Oracle, mySQL etc.
Security (Firewall, Network, Online Security etc)
Storage in Linux
Productivity (Too many technologies to explore, not much time available)
Last line of the file

Print Line Numbers Using Sed Command

"=" is a sed command that prints the current line number to standard output.
Syntax:

#sed '=' filename
The above sed command syntax prints the line number on the first line and the original line from the file on the next line.
The sed '=' command accepts only one address, so if you want to print line numbers for a range of lines, you must use curly braces.
Syntax:

# sed -n '/PATTERN/,/PATTERN/ {
=
p
}' filename

Sed Line Number Example 1. Find the line number which contains the pattern

The below sed command prints the line number of the line that matches the pattern "Databases".
$ sed -n '/Databases/=' ABC.txt

2

Sed Line Number Example 2. Printing Range of line numbers

Print the line numbers for the lines from the line matching the pattern "Oracle" to the line matching "Productivity".
$ sed -n '/Oracle/,/Productivity/{
> =
> p
> }' ABC.txt

2
Databases - Oracle, mySQL etc.
3
Security (Firewall, Network, Online Security etc)
4
Storage in Linux
5
Productivity (Too many technologies to explore, not much time available)

Sed Line Number Example 3. Print the total number of lines in a file

The line number of the last line of the file is the total number of lines in the file. The address $ specifies the last line of the file.
$ sed -n '$=' ABC.txt

6

Tuesday, 20 June 2017

SORT UNIX COMMAND

Sort Command Examples in Unix / Linux Tutorials

The sort command in unix or linux systems is used to order lines of text. It can sort both numerical values and strings, and it orders the lines of a text file.

The syntax of sort command is:

sort [options] filename

The options are:

-b : Ignores leading spaces in each line
-d : Uses dictionary sort order. Considers only spaces and alphanumeric characters in sorting
-f : Uses case insensitive sorting.
-M : Sorts based on months. Considers only first 3 letters as month. Eg: JAN, FEB
-n : Uses numeric sorting
-R : Sorts the input file randomly.
-r : Reverse order sorting
-k : Sorts file based on the data in the specified field positions.
-u : Suppresses duplicate lines
-t : input field separator

Sort Command Examples:

Before practicing the examples create the below two files in your unix system:

> cat order.txt
Unix distributed 05 server
Linux virtual 3 server
Unix distributed 05 server
Distributed processing 6 system

> cat delim_sort.txt
Mayday|4
Janmon|1
Declast|12

1. Sorting lines of text

The default sort command uses alphabetical order (ASCII order) to sort the file. It treats each line as a string and then sorts the lines.

> sort order.txt
Distributed processing 6 system
Linux virtual 3 server
Unix distributed 05 server
Unix distributed 05 server

2. Sorting based on the field positions.

You can specify the field positions using the -k option of the sort command. The sort command uses space or tab as the default delimiter. To sort based on the data in the second field, run the below command:

> sort -k2 order.txt
Unix distributed 05 server
Unix distributed 05 server
Distributed processing 6 system
Linux virtual 3 server

You can also specify more than one field with the -k option. The below command uses the second through fourth fields to sort the data.

> sort -k2,4 order.txt

3. Numeric sorting

Instead of the default alphabetical sorting order, you can make the sort command sort in numeric order using the -n option. This is shown below:

> sort -nk3 order.txt
Linux virtual 3 server
Unix distributed 05 server
Unix distributed 05 server
Distributed processing 6 system

4. Sort in reverse order

By default, the sort command sorts the data in ascending order. You can change this to descending order using the -r option.

> sort -nrk3 order.txt
Distributed processing 6 system
Unix distributed 05 server
Unix distributed 05 server
Linux virtual 3 server

5. Suppressing duplicates or Print only unique values

You can produce only unique values in the output using the -u option of the sort command.

> sort -u order.txt
Distributed processing 6 system
Linux virtual 3 server
Unix distributed 05 server

Another way is piping the output of sort command to uniq command.

> sort order.txt | uniq

6. Delimited file input

In the second, third and fourth examples we sorted the data based on field positions, where the fields are separated by a space or tab character. What if the fields are separated by some other character? In such cases, we have to specify the input delimiter with the -t option. An example is shown below:

> sort -t'|' -nrk2 delim_sort.txt
Declast|12
Mayday|4
Janmon|1

7. Sorting on months.

We can sort the data month-wise using the -M option of the sort command. This is shown below:

> sort -M delim_sort.txt
Janmon|1
Mayday|4
Declast|12

The -M option treats the first 3 characters of each line as a month name and sorts in month order.

Thursday, 6 April 2017

SUBSTR and INSTR with INFORMATICA

SUBSTR in Informatica is a function that returns a subset of characters from a larger string. We can use this data as part of some additional mapping logic or map it to a target table to be consumed by business. SUBSTR is used primarily within the Expression Transformation in Informatica. This function works perfectly with pattern based string values like zip codes or phone numbers.
Let's take a look at a quick SUBSTR in Informatica example.

PHONE NUMBER EXAMPLE

Let's say we have the below phone numbers passing through our mapping into an expression transformation:
209-555-1234
714-555-5678
515-555-9123
Assume we want to populate a PHONE table along with AREA_CODE and MAIN_LINE fields. SUBSTR in Informatica works perfectly for extracting these pieces of data out of the full phone number.
Let's take a quick look at the syntax we must use:
SUBSTR( string, start [,length] )
Our first two parameters are required, the third is optional.
1. “string” is defined as the character/string that we want to search. Generally we would pass an expression string variable or input port.
2. “start”, defined by an integer, is merely the starting position to begin counting. We can pass a positive or negative value here. If we pass a positive value, we count left to right for our starting position. Conversely, if we pass a negative value, we count right to left for our starting position. The integration service considers a 0 equal to 1, the first character in our string.
3. “length” is an optional parameter. If entered, it must be an integer greater than 0. It tells the integration service how many characters of the string we want to return based on our starting position. If left blank, the entire string will be returned from the start location specified.
Now that we understand the SUBSTR in Informatica syntax, let's continue our phone number example.
Area Code
Using the below SUBSTR in Informatica parameter values, we can return the first three characters from our PHONE_NUMBER data:
SUBSTR(PHONE_NUMBER, 1, 3)
PHONE_NUMBER    AREA_CODE
209-555-1234    209
714-555-5678    714
515-555-9123    515
I named this expression output port OUT_AREA_CODE.
Let's add another expression output port, OUT_MAIN_LINE. We will define it with the below SUBSTR statement. We start at the 5th character of our PHONE_NUMBER and return the next 8 characters.
SUBSTR(PHONE_NUMBER, 5, 8)
PHONE_NUMBER    MAIN_LINE
209-555-1234    555-1234
714-555-5678    555-5678
515-555-9123    555-9123
Putting it all together, our expression transformation will produce the following:
PHONE_NUMBER    AREA_CODE    MAIN_LINE
209-555-1234    209          555-1234
714-555-5678    714          555-5678
515-555-9123    515          555-9123
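awk's substr() takes the same (string, start, length) arguments and also counts positions from 1, so the phone-number split can be sketched on the command line (an analogy, not Informatica itself):

```shell
# substr($0, 1, 3) = area code; substr($0, 5, 8) = main line
printf '209-555-1234\n714-555-5678\n515-555-9123\n' |
awk '{ print $0, substr($0, 1, 3), substr($0, 5, 8) }'
```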
In the expression transformation ports tab, the new fields using SUBSTR are defined as OUT_AREA_CODE and OUT_MAIN_LINE.

COMMON QUESTIONS

Question 1 – What will the SUBSTR in Informatica function return when my "string" value is NULL?
A. When the string value is NULL, SUBSTR will return NULL.
Question 2 – What if my "string" does not follow a fixed character-length pattern? How would I return the domain name of an email address, for example?
A. Many times our data is not simple. It may follow a pattern of some kind, but perhaps not as straightforward as our PHONE_NUMBER example.
In these situations, we need to use the INSTR function to determine our start position, the number of characters to return, or both.
In the case of an email domain, we would need to do something like the below…
SUBSTR(EMAIL_ADDRESS, INSTR(EMAIL_ADDRESS, '@') + 1)
We passed the EMAIL_ADDRESS port into our SUBSTR string value parameter. Since we cannot predict the starting position for every email address ahead of time, I used the INSTR function to get the start position. I passed the same EMAIL_ADDRESS port into INSTR as the string to search in, and the @ symbol as the character to search for.
The INSTR function in Informatica returns the position of the first occurrence of the @ symbol; adding 1 starts the substring just after the @. Since I do not know how long any domain will be, I left the optional SUBSTR length parameter empty so the rest of the string, i.e. the entire domain, is returned.
Using some real data, our results might look something like this:
EMAIL_ADDRESS               DOMAIN
12345Go@gmail.com           gmail.com
hello@hotmail.com           hotmail.com
dataintegration@yahoo.com   yahoo.com
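The INSTR counterpart in awk is index(); combining it with substr() reproduces the domain extraction on the command line (an analogy: index() returns the position of the @, and +1 starts just after it):

```shell
# Print each address followed by everything after the first "@"
printf '12345Go@gmail.com\nhello@hotmail.com\ndataintegration@yahoo.com\n' |
awk '{ print $0, substr($0, index($0, "@") + 1) }'
```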

SUMMARY

SUBSTR in Informatica is a very useful function. It helps us extract specific characters from a string that might be useful on their own. The phone number use case is a perfect example of how SUBSTR can be used on strings with simple, consistent patterns. For more complex patterns we might use the INSTR function in Informatica to complement SUBSTR.

Sunday, 19 March 2017

UNIX AWK COMMAND

AWK stands for 'Aho, Weinberger, and Kernighan'.
Awk is a scripting language used for processing or analyzing text files. We can say that awk is mainly used for grouping data based on a column or field, or on a set of columns, and for reporting data in a useful manner. It also provides BEGIN and END blocks to run code before and after processing the data.
Syntax of awk :
# awk 'pattern {action}' input-file > output-file
Let's take an input file with the following data:
$ cat awk_file
Name,Marks,Max Marks
Ram,200,1000
Shyam,500,1000
Ghyansham,1000
Abharam,800,1000
Hari,600,1000
Ram,400,1000
Example:1 Print all the lines from a file.
By default, awk prints all lines of a file, so to print every line of the above file use the below command:
linuxtechi@mail:~$ awk '{print;}' awk_file
Name,Marks,Max Marks
Ram,200,1000
Shyam,500,1000
Ghyansham,1000
Abharam,800,1000
Hari,600,1000
Ram,400,1000
Example:2 Print only Specific field like 2nd & 3rd.
linuxtechi@mail:~$ awk -F "," '{print $2, $3;}' awk_file
Marks Max Marks
200 1000
500 1000
1000
800 1000
600 1000
400 1000
In the above command we have used the option -F "," which specifies that comma (,) is the field separator in the file.
Example:3 Print the lines which matches the pattern
We want to print the lines which contain the word "Hari" or "Ram".
linuxtechi@mail:~$ awk '/Hari|Ram/' awk_file
Ram,200,1000
Hari,600,1000
Ram,400,1000
Example:4 How to find the unique values in the first (name) column
linuxtechi@mail:~$ awk -F, '{a[$1];}END{for (i in a)print i;}' awk_file
Abharam
Hari
Name
Ghyansham
Ram
Shyam
Example:5  How to find the sum of a particular column for one group.
Syntax: awk -F, '$1=="Item1"{x+=$2;}END{print x}' awk_file
linuxtechi@mail:~$ awk -F, '$1=="Ram"{x+=$2;}END{print x}' awk_file
600
Example:6  How to find the total of all numbers in a column.
For example, we take the 2nd and the 3rd columns.
linuxtechi@mail:~$ awk -F"," '{x+=$2}END{print x}' awk_file
3500
linuxtechi@mail:~$ awk -F"," '{x+=$3}END{print x}' awk_file
5000
Example:7  How to find the sum of individual group records.
E.g. considering the first column, we can do the summation of the second column grouped by name:
linuxtechi@mail:~$ awk -F, '{a[$1]+=$2;}END{for(i in a)print i", "a[i];}' awk_file
Abharam, 800
Hari, 600
Name, 0
Ghyansham, 1000
Ram, 600
Shyam, 500
Example:8 How to find the sums of the second and third columns and append them at the end of the output.
linuxtechi@mail:~$ awk -F"," '{x+=$2;y+=$3;print}END{print "Total,"x,y}' awk_file
Name,Marks,Max Marks
Ram,200,1000
Shyam,500,1000
Ghyansham,1000
Abharam,800,1000
Hari,600,1000
Ram,400,1000
Total,3500 5000
Example:9 How to find the count of entries based on the first column:
linuxtechi@mail:~$ awk -F, '{a[$1]++;}END{for (i in a)print i, a[i];}' awk_file
Abharam 1
Hari 1
Name 1
Ghyansham 1
Ram 2
Shyam 1
Example:10 How to print only the first record of every group:
linuxtechi@mail:~$ awk -F, '!a[$1]++' awk_file
Name,Marks,Max Marks
Ram,200,1000
Shyam,500,1000
Ghyansham,1000
Abharam,800,1000
Hari,600,1000
AWK BEGIN Block
Syntax for the BEGIN block is
# awk 'BEGIN{awk initializing code}{actual awk code}' filename.txt
Let us create a datafile with five columns (Names, total, PPT, Doc, xls); the sample contents were shown as an image and are not reproduced here.
Example:11  How to print the column names along with their corresponding data.
linuxtechi@mail:~$ awk 'BEGIN{print "Names\ttotal\tPPT\tDoc\txls"}{printf "%-s\t%d\t%d\t%d\t%d\n", $1,$2,$3,$4,$5}' datafile
Example:12 How to change the output field separator
The fields in datafile are separated by spaces; in the below example we set OFS to change the output field separator from a space to "|":
linuxtechi@mail:~$ awk 'BEGIN{OFS="|"}{print $1,$2,$3,$4,$5}' datafile
