DWH/ETL Concepts

Thursday, 6 April 2017

SUBSTR and INSTR with INFORMATICA

SUBSTR in Informatica is a function that returns a subset of characters from a larger string. We can use this data as part of some additional mapping logic or map it to a target table to be consumed by the business. SUBSTR is used primarily within the Expression Transformation in Informatica. This function works well with pattern-based string values like zip codes or phone numbers.

Let's take a look at a quick SUBSTR in Informatica example.

PHONE NUMBER EXAMPLE

Let's say we have the below phone numbers passing through our mapping into an expression transformation:

209-555-1234

714-555-5678

515-555-9123

Assume we want to populate a PHONE table along with AREA_CODE and MAIN_LINE fields. SUBSTR in Informatica works perfectly for extracting these pieces of data out of the full phone number.

Let's take a quick look at the syntax we must use:

SUBSTR( string, start [,length] )

Our first two parameters are required, the third is optional.

1. “string” is defined as the character/string that we want to search. Generally we would pass an expression string variable or input port.

2. “start”, defined by an integer, is merely the starting position to begin counting. We can pass a positive or negative value here. If we pass a positive value, we count left to right for our starting position. Conversely, if we pass a negative value, we count right to left for our starting position. The integration service considers a 0 equal to 1, the first character in our string.

3. “length” is an optional parameter. If entered, it must be an integer greater than 0. It tells the integration service how many characters of the string we want to return, counting from our starting position. If omitted, SUBSTR returns everything from the start position to the end of the string.

OK, now that we understand the SUBSTR in Informatica syntax, let's continue our phone number example.

Area Code

Using the below SUBSTR in Informatica parameter values, we can return the first three characters from our PHONE_NUMBER data:

SUBSTR(PHONE_NUMBER, 1, 3)

SUBSTR INFORMATICA AREA CODE

PHONE_NUMBER	AREA_CODE
209-555-1234	209
714-555-5678	714
515-555-9123	515

I named this expression output port OUT_AREA_CODE.

Let's add another expression output port, OUT_MAIN_LINE. We will define it with the below SUBSTR statement. We start at the 5th character of our PHONE_NUMBER and return 8 characters from there.

SUBSTR(PHONE_NUMBER, 5, 8)

SUBSTR INFORMATICA MAIN LINE

PHONE_NUMBER	MAIN_LINE
209-555-1234	555-1234
714-555-5678	555-5678
515-555-9123	555-9123

Putting it all together, our expression transformation will produce the following:

PHONE_NUMBER	AREA_CODE	MAIN_LINE
209-555-1234	209	555-1234
714-555-5678	714	555-5678
515-555-9123	515	555-9123
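
Outside Informatica, the same fixed-position split can be sketched with awk's substr function, which also takes a 1-based start and an optional length. This is only an illustration of the logic, not Informatica syntax; the phones variable is a hypothetical stand-in for the PHONE_NUMBER port.

```shell
# Hypothetical sample input: one phone number per line.
phones='209-555-1234
714-555-5678
515-555-9123'

# substr(s, start, length) in awk is 1-based, like SUBSTR with a positive start.
# Print the full number, the area code and the main line, separated by tabs.
split_phones=$(printf '%s\n' "$phones" |
  awk '{ print $0 "\t" substr($0, 1, 3) "\t" substr($0, 5, 8) }')

printf '%s\n' "$split_phones"
```

Running it should reproduce the three-column table above.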

Below is a snapshot of our expression transformation ports tab. I defined our new fields using SUBSTR as OUT_AREA_CODE and OUT_MAIN_LINE.

SUBSTR INFORMATICA EXPRESSION SETUP

COMMON QUESTIONS

Question 1 – What will the SUBSTR in Informatica function return when my “string” value is NULL?

A. When the string value is NULL, SUBSTR will return NULL.

Question 2 – What if my “string” does not follow a fixed character-length pattern? How would I return the domain name in an email address, for example?

A. Many times our data is not simple. It may follow a pattern of some kind, but perhaps not one as straightforward as our PHONE_NUMBER example.

In these situations, we need to use the INSTR function to determine either our start position, length of characters to return or both.

In the case of an email domain, we would need to do something like the below…

SUBSTR(EMAIL_ADDRESS, INSTR(EMAIL_ADDRESS, '@') + 1)

We passed the EMAIL_ADDRESS port into SUBSTR as the string parameter. Since we cannot predict the starting position for every email address ahead of time, I used the INSTR function to compute it. I passed the same EMAIL_ADDRESS port into INSTR as the string to search in, and the @ symbol as the character to search for.

The INSTR function in Informatica returns the position of the first occurrence of the @ symbol. Adding 1 moves the start position to the first character after the @, so the @ itself is not included in the result. Since I do not know how long any domain will be, I left the optional SUBSTR length parameter empty so that everything from the start position to the end of the string is returned.

Using some real data, our results might look something like this:

EMAIL_ADDRESS	DOMAIN
12345Go@gmail.com	gmail.com
hello@hotmail.com	hotmail.com
dataintegration@yahoo.com	yahoo.com
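
The same "find the marker, then take everything after it" idea can be sketched in awk, where index plays a role similar to INSTR and substr similar to SUBSTR. This is an analogy under that assumption, not Informatica code; the emails variable is a hypothetical stand-in for the EMAIL_ADDRESS port.

```shell
# Hypothetical sample input: one email address per line.
emails='12345Go@gmail.com
hello@hotmail.com
dataintegration@yahoo.com'

# index(s, "@") returns the 1-based position of the first "@" (0 if absent).
# Starting the substring at that position plus 1 skips the "@" itself.
domains=$(printf '%s\n' "$emails" |
  awk '{ at = index($0, "@"); if (at > 0) print substr($0, at + 1) }')

printf '%s\n' "$domains"
```

Lines without an @ are skipped here; that choice is an assumption of the sketch, not something the Informatica expression does.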

SUMMARY

SUBSTR in Informatica is a very useful function. It helps us extract specific characters from a string that might be useful on their own. The phone number use case is a good example of how SUBSTR can be used on strings with simple, consistent patterns. For more complex patterns we might use the INSTR function in Informatica to complement SUBSTR.

Sunday, 19 March 2017

UNIX AWK COMMAND

AWK stands for "Aho, Weinberger, and Kernighan", the surnames of its authors.

Awk is a scripting language used for processing and analyzing text files. It is mainly used to group data by a column (field) or a set of columns, and to report data in a useful manner. It also provides BEGIN and END blocks, which run before the first line and after the last line of input.

Syntax of awk :

# awk 'pattern {action}' input-file > output-file

Let's take an input file with the following data:

$ cat awk_file
Name,Marks,Max Marks
Ram,200,1000
Shyam,500,1000
Ghyansham,1000
Abharam,800,1000
Hari,600,1000
Ram,400,1000

Example:1 Print all the lines from a file.

With no pattern, the action runs on every line, and print with no arguments prints the whole line. So to print every line of the file created above, use the below command:

linuxtechi@mail:~$ awk '{print;}' awk_file
Name,Marks,Max Marks
Ram,200,1000
Shyam,500,1000
Ghyansham,1000
Abharam,800,1000
Hari,600,1000
Ram,400,1000

Example:2 Print only specific fields, like the 2nd & 3rd.

linuxtechi@mail:~$ awk -F "," '{print $2, $3;}' awk_file
Marks Max Marks
200 1000
500 1000
1000
800 1000
600 1000
400 1000

In the above command we have used the option -F "," which specifies that comma (,) is the field separator in the file.

Example:3 Print the lines which match a pattern

I want to print the lines which contain either "Hari" or "Ram".

linuxtechi@mail:~$ awk '/Hari|Ram/' awk_file
Ram,200,1000
Hari,600,1000
Ram,400,1000

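Example:4 How do we find the unique values in the first column (Name)?

Note that for (i in a) visits the array keys in an unspecified order, so the order of the output may differ on your system.

linuxtechi@mail:~$ awk -F, '{a[$1];}END{for (i in a)print i;}' awk_file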
Abharam
Hari
Name
Ghyansham
Ram
Shyam
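
If a stable order is needed and the header row should not be counted as a name, one possible variation is to skip the first line and sort the result. This is a sketch, not part of the original example; the temporary file simply recreates the sample data from this post.

```shell
# Recreate the sample file from this post in a temporary location.
awk_file=$(mktemp)
cat > "$awk_file" <<'EOF'
Name,Marks,Max Marks
Ram,200,1000
Shyam,500,1000
Ghyansham,1000
Abharam,800,1000
Hari,600,1000
Ram,400,1000
EOF

# NR > 1 skips the header row; each distinct name becomes an array key.
# Sorting makes the output order deterministic.
unique_names=$(awk -F, 'NR > 1 { seen[$1] } END { for (name in seen) print name }' "$awk_file" |
  LC_ALL=C sort)

printf '%s\n' "$unique_names"
rm -f "$awk_file"
```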

Example:5 How to find the sum of the entries in a column for a particular key.

Syntax: awk -F, '$1=="Item1"{x+=$2;}END{print x}' awk_file

linuxtechi@mail:~$ awk -F, '$1=="Ram"{x+=$2;}END{print x}' awk_file
600

Example:6 How to find the total of all numbers in a column.

For example, we take the 2nd and the 3rd columns.

linuxtechi@mail:~$ awk -F"," '{x+=$2}END{print x}' awk_file
3500
linuxtechi@mail:~$ awk -F"," '{x+=$3}END{print x}' awk_file
5000

Example:7 How to find the sum of individual group records.

For example, if we group by the first column, we can sum the second column for each distinct value in the first column.

linuxtechi@mail:~$ awk -F, '{a[$1]+=$2;}END{for(i in a)print i", "a[i];}' awk_file
Abharam, 800
Hari, 600
Name, 0
Ghyansham, 1000
Ram, 600
Shyam, 500

Example:8 How to find the sum of all entries in the second and third columns and print it after the last line of the file.

linuxtechi@mail:~$ awk -F"," '{x+=$2;y+=$3;print}END{print "Total,"x,y}' awk_file
Name,Marks,Max Marks
Ram,200,1000
Shyam,500,1000
Ghyansham,1000
Abharam,800,1000
Hari,600,1000
Ram,400,1000
Total,3500 5000
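
The total line above mixes a comma and a space because print separates its arguments with the output field separator, which defaults to a space. A possible variation, sketched here with the same sample data recreated in a temporary file, sets both FS and OFS to a comma so the total line matches the rest of the file.

```shell
# Recreate the sample file from this post in a temporary location.
awk_file=$(mktemp)
cat > "$awk_file" <<'EOF'
Name,Marks,Max Marks
Ram,200,1000
Shyam,500,1000
Ghyansham,1000
Abharam,800,1000
Hari,600,1000
Ram,400,1000
EOF

# FS splits input on commas; OFS joins the arguments of print with commas.
# NR > 1 keeps the header row out of the sums.
totals=$(awk 'BEGIN { FS = OFS = "," }
  NR > 1 { marks += $2; max += $3 }
  { print }
  END { print "Total", marks, max }' "$awk_file")

printf '%s\n' "$totals"
rm -f "$awk_file"
```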

Example:9 How to count the number of entries for each value in the first column:

linuxtechi@mail:~$ awk -F, '{a[$1]++;}END{for (i in a)print i, a[i];}' awk_file
Abharam 1
Hari 1
Name 1
Ghyansham 1
Ram 2
Shyam 1

Example:10 How to print only the first record of every group:

linuxtechi@mail:~$ awk -F, '!a[$1]++' awk_file
Name,Marks,Max Marks
Ram,200,1000
Shyam,500,1000
Ghyansham,1000
Abharam,800,1000
Hari,600,1000

AWK Begin Block

Syntax for BEGIN block is

# awk 'BEGIN{awk initializing code}{actual AWK code}' filename.txt

Let us create a datafile with below contents

datafile for awk

Example:11 How to print column names as a header above the corresponding data.

linuxtechi@mail:~$ awk 'BEGIN{print "Names\ttotal\tPPT\tDoc\txls"}{printf "%-s\t%d\t%d\t%d\t%d\n", $1,$2,$3,$4,$5}' datafile

Example:12 How to change the output field separator

Space is the field separator in the datafile. In the below example we change the separator in the output from space to "|" by setting OFS in the BEGIN block.

linuxtechi@mail:~$ awk 'BEGIN{OFS="|"}{print $1,$2,$3,$4,$5}' datafile
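
Since the contents of datafile are not reproduced here, the following sketch uses a small made-up file with the same five columns (the names and numbers are hypothetical). It combines the two ideas above: a header printed from the BEGIN block and "|" as the output field separator.

```shell
# Hypothetical space-separated data: name, total, PPT, Doc, xls.
datafile=$(mktemp)
cat > "$datafile" <<'EOF'
asha 30 10 12 8
ravi 25 5 15 5
EOF

# BEGIN runs once before any input is read: set OFS and print the header.
# Assigning $1 = $1 makes awk rebuild each record using OFS before printing.
report=$(awk 'BEGIN { OFS = "|"; print "Names", "total", "PPT", "Doc", "xls" }
  { $1 = $1; print }' "$datafile")

printf '%s\n' "$report"
rm -f "$datafile"
```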

Sunday, 26 February 2017

UNIX CUT COMMAND

The Linux command cut is used for text processing. You can use it to extract a portion of text from a file by selecting columns.

This tutorial provides a few practical examples of the cut command that you can use in your day-to-day command line activities.

For most of the examples, we'll be using the following test file.

$ cat test.txt
cat command for file oriented operations.
cp command for copy files or directories.
ls command to list out files and directories with its attributes.

1. Select Column of Characters

To extract only a desired column from a file, use the -c option. The following example displays the 2nd character from each line of the file test.txt.

$ cut -c2 test.txt
a
p
s

As seen above, the characters a, p, s are the second character from each line of the test.txt file.

2. Select Column of Characters using Range

A range of characters can also be extracted from a file by specifying the start and end positions separated by -. The following example extracts the first 3 characters of each line from a file called test.txt.

$ cut -c1-3 test.txt
cat
cp
ls

3. Select Column of Characters using either Start or End Position

Either the start position or the end position alone can be passed to the cut command with the -c option.

The following specifies only the start position, before the '-'. This example extracts from the 3rd character to the end of each line of the test.txt file.

$ cut -c3- test.txt
t command for file oriented operations.
 command for copy files or directories.
 command to list out files and directories with its attributes.

The following specifies only the end position, after the '-'. This example extracts the first 8 characters of each line of the test.txt file.

$ cut -c-8 test.txt
cat comm
cp comma
ls comma

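On the version of cut used here, the entire line is printed when you don't specify a number before or after the '-', as shown below. Recent GNU coreutils releases reject a bare '-' with an "invalid range with no endpoint" error, so use -c1- there instead.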

$ cut -c- test.txt
cat command for file oriented operations.
cp command for copy files or directories.
ls command to list out files and directories with its attributes.

4. Select a Specific Field from a File

Instead of selecting a number of characters, if you would like to extract a whole field, you can combine the options -f and -d. The option -f specifies which field you want to extract, and the option -d specifies the field delimiter used in the input file.

The following example displays only the first field of each line of the /etc/passwd file, using : (colon) as the field delimiter. In this case, the 1st field is the username.

$ cut -d':' -f1 /etc/passwd
root
daemon
bin
sys
sync
games
bala

5. Select Multiple Fields from a File

You can also extract more than one field from a file or stdout. The below example displays the username and home directory of users whose login shell is "/bin/bash".

$ grep "/bin/bash" /etc/passwd | cut -d':' -f1,6
root:/root
bala:/home/bala

To display a range of fields, specify the start field and end field as shown below. In this example, we are selecting fields 1 through 4, 6 and 7.

$ grep "/bin/bash" /etc/passwd | cut -d':' -f1-4,6,7
root:x:0:0:/root:/bin/bash
bala:x:1000:1000:/home/bala:/bin/bash

6. Select Fields Only When a Line Contains the Delimiter

In our /etc/passwd example, if you pass a delimiter other than : (colon), cut will just display the whole line.

In the following example, we've specified the delimiter as | (pipe). Because no line contains a | (pipe), the cut command simply displays each whole line.

$ grep "/bin/bash" /etc/passwd | cut -d'|'  -f1
root:x:0:0:root:/root:/bin/bash
bala:x:1000:1000:bala,,,:/home/bala:/bin/bash

However, it is possible to display only the lines that contain the specified delimiter by using the -s option.

The following example doesn't display any output, as the cut command didn't find any line that has | (pipe) as a delimiter in the /etc/passwd file.

$ grep "/bin/bash" /etc/passwd | cut -d'|' -s -f1
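
Because the contents of /etc/passwd differ from system to system, the effect of -s is easier to see on fixed input. The sketch below uses three made-up lines, one of which has no colon, and compares the output with and without -s.

```shell
# Hypothetical input: the middle line contains no ':' delimiter.
sample='root:x:0:0:root:/root:/bin/bash
no delimiter on this line
bala:x:1000:1000:bala,,,:/home/bala:/bin/bash'

# Without -s, a line that lacks the delimiter is printed whole.
without_s=$(printf '%s\n' "$sample" | cut -d':' -f1)

# With -s, lines that lack the delimiter are suppressed.
with_s=$(printf '%s\n' "$sample" | cut -d':' -s -f1)

printf 'without -s:\n%s\n\nwith -s:\n%s\n' "$without_s" "$with_s"
```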

7. Select All Fields Except the Specified Fields

To complement the selected field list, use the option --complement.

The following example displays all the fields from the /etc/passwd file except field 7.

$ grep "/bin/bash" /etc/passwd | cut -d':' --complement -s -f7
root:x:0:0:root:/root
bala:x:1000:1000:bala,,,:/home/bala

8. Change Output Delimiter for Display

By default, the output delimiter is the same as the input delimiter that we specify with the cut -d option.

To change the output delimiter, use the option --output-delimiter as shown below. In this example, the input delimiter is : (colon), but the output delimiter is # (hash).

$ grep "/bin/bash" /etc/passwd | cut -d':'  -s -f1,6,7 --output-delimiter='#'
root#/root#/bin/bash
bala#/home/bala#/bin/bash

9. Change Output Delimiter to Newline

In this example, each field of the cut command output is displayed on a separate line. We still use --output-delimiter, but the value is $'\n', which makes a newline the output delimiter.

$ grep bala /etc/passwd | cut -d':' -f1,6,7 --output-delimiter=$'\n'
bala
/home/bala
/bin/bash

10. Combine Cut with Other Unix Command Output

The power of the cut command becomes clear when you combine it with the stdout of some other Unix command.

Once you master the basic usage of the cut command explained above, you can use it to solve a lot of your text manipulation requirements.

The following example shows how to extract only the useful information from the ps command output. The output of ps is first filtered with grep, and sed squeezes each run of whitespace into a single space, before the result is passed to cut. Here we use the -d and -f options explained in the above examples.

$ ps axu | grep python | sed 's/\s\+/ /g' | cut -d' ' -f2,11-
2231 /usr/bin/python /usr/lib/unity-lens-video/unity-lens-video
2311 /usr/bin/python /usr/lib/unity-scope-video-remote/unity-scope-video-remote
2414 /usr/bin/python /usr/lib/ubuntuone-client/ubuntuone-syncdaemon
2463 /usr/bin/python /usr/lib/system-service/system-service-d
3274 grep --color=auto python

UNIX SED COMMAND

Sed is a Stream Editor used for modifying the files in unix (or linux). Whenever you want to make changes to the file automatically, sed comes in handy to do this. Most people never learn its power; they just simply use sed to replace text. You can do many things apart from replacing text with sed. Here I will describe the features of sed with examples.

Consider the below text file as an input.

>cat file.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
unixlinux which one you choose.

Sed Command Examples

1. Replacing or substituting string

Sed command is mostly used to replace the text in a file. The below simple sed command replaces the word "unix" with "linux" in the file.

>sed 's/unix/linux/' file.txt
linux is great os. unix is opensource. unix is free os.
learn operating system.
linuxlinux which one you choose.

Here the "s" specifies the substitution operation. The "/" are delimiters. The "unix" is the search pattern and the "linux" is the replacement string.

By default, the sed command replaces the first occurrence of the pattern in each line and it won't replace the second, third...occurrence in the line.

2. Replacing the nth occurrence of a pattern in a line.

Append a number flag (1, 2, and so on) after the final delimiter to replace only that occurrence of the pattern in a line. The below command replaces the second occurrence of the word "unix" with "linux" in a line.

>sed 's/unix/linux/2' file.txt
unix is great os. linux is opensource. unix is free os.
learn operating system.
unixlinux which one you choose.

3. Replacing all the occurrences of the pattern in a line.

The substitute flag g (global replacement) tells the sed command to replace all the occurrences of the string in the line.

>sed 's/unix/linux/g' file.txt
linux is great os. linux is opensource. linux is free os.
learn operating system.
linuxlinux which one you choose.

4. Replacing from nth occurrence to all occurrences in a line.

Combine a number flag with g to replace every match from the nth occurrence onward (GNU sed supports this combination). The following sed command replaces the third, fourth, fifth... occurrence of "unix" with "linux" in a line.

>sed 's/unix/linux/3g' file.txt
unix is great os. unix is opensource. linux is free os.
learn operating system.
unixlinux which one you choose.

5. Changing the slash (/) delimiter

You can use any delimiter other than the slash. For example, suppose you want to replace the "http://" part of a web url with "www":

>sed 's/http:\/\//www/' file.txt

In this case the url contains the delimiter character that we used, so you have to escape each slash with a backslash character, otherwise the substitution won't work.

Using too many backslashes makes the sed command hard to read. In this case we can change the delimiter to another character, as shown in the below example.

>sed 's_http://_www_' file.txt
>sed 's|http://|www|' file.txt
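
To confirm that the escaped-slash form and the alternate-delimiter form are equivalent, the sketch below runs both on a made-up line containing a url. The sample text is hypothetical, and the replacement is written as "www." (with a dot) so that the result reads as a plausible host name.

```shell
# Hypothetical input line containing a url.
url_line='visit http://example.com today'

# Slash delimiter: every slash inside the pattern must be escaped.
escaped=$(printf '%s\n' "$url_line" | sed 's/http:\/\//www./')

# Pipe delimiter: the slashes in the pattern need no escaping.
piped=$(printf '%s\n' "$url_line" | sed 's|http://|www.|')

printf '%s\n%s\n' "$escaped" "$piped"
```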

6. Using & as the matched string

There might be cases where you want to search for the pattern and replace that pattern by adding some extra characters to it. In such cases & comes in handy. The & represents the matched string.

>sed 's/unix/{&}/' file.txt
{unix} is great os. unix is opensource. unix is free os.
learn operating system.
{unix}linux which one you choose.

>sed 's/unix/{&&}/' file.txt
{unixunix} is great os. unix is opensource. unix is free os.
learn operating system.
{unixunix}linux which one you choose.

7. Using \1,\2 and so on to \9

The first pair of parentheses in the pattern is referenced as \1, the second as \2, and so on. \1, \2 and so on can be used in the replacement string to reuse the matched parts of the source string. As an example, if you want to replace the word "unix" in a line with the doubled word "unixunix", use the sed command as below.

>sed 's/\(unix\)/\1\1/' file.txt
unixunix is great os. unix is opensource. unix is free os.
learn operating system.
unixunixlinux which one you choose.

The parentheses need to be escaped with the backslash character. As another example, if you want to swap the words in "unixlinux" to get "linuxunix", the sed command is

>sed 's/\(unix\)\(linux\)/\2\1/' file.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
linuxunix which one you choose.

Another example is reversing the order of the first three characters in a line

>sed 's/^\(.\)\(.\)\(.\)/\3\2\1/' file.txt
inux is great os. unix is opensource. unix is free os.
aelrn operating system.
inuxlinux which one you choose.

8. Duplicating the replaced line with /p flag

The /p print flag prints the replaced line twice on the terminal. If a line does not have the search pattern and is not replaced, then the /p prints that line only once.

>sed 's/unix/linux/p' file.txt
linux is great os. unix is opensource. unix is free os.
linux is great os. unix is opensource. unix is free os.
learn operating system.
linuxlinux which one you choose.
linuxlinux which one you choose.

9. Printing only the replaced lines

Use the -n option along with the /p print flag to display only the replaced lines. The -n option suppresses sed's automatic printing of every line, so only the lines printed by the /p flag appear, once each.

>sed -n 's/unix/linux/p' file.txt
linux is great os. unix is opensource. unix is free os.
linuxlinux which one you choose.

If you use -n alone without /p, then the sed does not print anything.

10. Running multiple sed commands.

You can run multiple sed commands by piping the output of one sed command as input to another sed command.

>sed 's/unix/linux/' file.txt| sed 's/os/system/'
linux is great system. unix is opensource. unix is free os.
learn operating system.
linuxlinux which one you chosysteme.

Sed provides -e option to run multiple sed commands in a single sed command. The above output can be achieved in a single sed command as shown below.

>sed -e 's/unix/linux/' -e 's/os/system/' file.txt
linux is great system. unix is opensource. unix is free os.
learn operating system.
linuxlinux which one you chosysteme.

11. Replacing string on a specific line number.

You can restrict the sed command to replace the string on a specific line number. An example is

>sed '3 s/unix/linux/' file.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
linuxlinux which one you choose.

The above sed command replaces the string only on the third line.

12. Replacing string on a range of lines.

You can specify a range of line numbers to the sed command for replacing a string.

>sed '1,3 s/unix/linux/' file.txt
linux is great os. unix is opensource. unix is free os.
learn operating system.
linuxlinux which one you choose.

Here the sed command replaces the text on lines 1 through 3. Another example is

>sed '2,$ s/unix/linux/' file.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
linuxlinux which one you choose.

Here $ indicates the last line in the file. So the sed command replaces the text from second line to last line in the file.

13. Replace on lines which match a pattern.

You can give the sed command a pattern to match against each line. Only on lines where that pattern matches does sed look for the string to be replaced and, if it finds it, replace it.

>sed '/linux/ s/unix/centos/' file.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
centoslinux which one you choose.

Here the sed command first looks for the lines which contain the pattern "linux" and then replaces the word "unix" with "centos" on those lines.

14. Deleting lines.

You can delete lines from a file by specifying a line number or a range of line numbers.

>sed '2 d' file.txt
>sed '5,$ d' file.txt
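
No output is shown for the delete commands above, so here is a small sketch that applies the d command to the three-line sample file from this post, recreated in a temporary location. The second command uses the range 2,$ rather than 5,$ only because the sample file has three lines.

```shell
# Recreate the sample file from this post in a temporary location.
sed_file=$(mktemp)
cat > "$sed_file" <<'EOF'
unix is great os. unix is opensource. unix is free os.
learn operating system.
unixlinux which one you choose.
EOF

# Delete only line 2; lines 1 and 3 remain.
without_second=$(sed '2 d' "$sed_file")

# Delete from line 2 through the last line ($); only line 1 remains.
first_only=$(sed '2,$ d' "$sed_file")

printf '%s\n---\n%s\n' "$without_second" "$first_only"
rm -f "$sed_file"
```

Neither command changes the file itself; sed writes the result to stdout.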

15. Duplicating lines

You can make the sed command print each line of a file two times.

>sed 'p' file.txt

16. Sed as grep command

You can make the sed command work like the grep command.

>grep 'unix' file.txt
>sed -n '/unix/ p' file.txt

Here the sed command looks for the pattern "unix" in each line of a file and prints those lines that contain the pattern.

You can also make the sed command work like grep -v, just by negating the match with NOT (!).

>grep -v 'unix' file.txt
>sed -n '/unix/ !p' file.txt

The ! here inverts the pattern match.
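
As a quick check of the equivalence, the sketch below runs grep and sed side by side on the sample file from this post (recreated in a temporary location) and stores each result so the pairs can be compared.

```shell
# Recreate the sample file from this post in a temporary location.
sed_file=$(mktemp)
cat > "$sed_file" <<'EOF'
unix is great os. unix is opensource. unix is free os.
learn operating system.
unixlinux which one you choose.
EOF

# Lines that contain "unix": grep versus sed -n with the p command.
matches_grep=$(grep 'unix' "$sed_file")
matches_sed=$(sed -n '/unix/p' "$sed_file")

# Lines that do not contain "unix": grep -v versus the negated address.
rest_grep=$(grep -v 'unix' "$sed_file")
rest_sed=$(sed -n '/unix/!p' "$sed_file")

printf 'matching:\n%s\n\nnot matching:\n%s\n' "$matches_sed" "$rest_sed"
rm -f "$sed_file"
```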

17. Add a line after a match.

The sed command can add a new line after a pattern match is found. The "a" command to sed tells it to add a new line after a match is found.

>sed '/unix/ a "Add a new line"' file.txt
unix is great os. unix is opensource. unix is free os.
"Add a new line"
learn operating system.
unixlinux which one you choose.
"Add a new line"

18. Add a line before a match

The sed command can add a new line before a pattern match is found. The "i" command to sed tells it to add a new line before a match is found.

>sed '/unix/ i "Add a new line"' file.txt
"Add a new line"
unix is great os. unix is opensource. unix is free os.
learn operating system.
"Add a new line"
unixlinux which one you choose.

19. Change a line

The sed command can be used to replace an entire line with a new line. The "c" command to sed tells it to change the line.

>sed '/unix/ c "Change line"' file.txt
"Change line"
learn operating system.
"Change line"

20. Transform like tr command

The sed command can be used to convert lower case letters to upper case letters by using the transform "y" command.

>sed 'y/ul/UL/' file.txt
Unix is great os. Unix is opensoUrce. Unix is free os.
Learn operating system.
UnixLinUx which one yoU choose.

Here the sed command transforms the letters "u" and "l" into their uppercase forms "U" and "L".
