DWH/ETL Concepts: February 2017

Sunday, 26 February 2017

UNIX CUT COMMAND

Linux command cut is used for text processing. You can use this command to extract portion of text from a file by selecting columns.

This tutorial provides few practical examples of cut command that you can use in your day to day command line activities.

For most of the example, we’ll be using the following test file.

$ cat test.txt
cat command for file oriented operations.
cp command for copy files or directories.
ls command to list out files and directories with its attributes.

1. Select Column of Characters

To extract only a desired column from a file use -c option. The following example displays 2nd character from each line of a file test.txt

$ cut -c2 test.txt
a
p
s

As seen above, the characters a, p, s are the second character from each line of the test.txt file.

2. Select Column of Characters using Range

Range of characters can also be extracted from a file by specifying start and end position delimited with -. The following example extracts first 3 characters of each line from a file called test.txt

$ cut -c1-3 test.txt
cat
cp
ls

3. Select Column of Characters using either Start or End Position

Either start position or end position can be passed to cut command with -c option.

The following specifies only the start position before the ‘-‘. This example extracts from 3rd character to end of each line from test.txt file.

$ cut -c3- test.txt
t command for file oriented operations.
 command for copy files or directories.
 command to list out files and directories with its attributes.

The following specifies only the end position after the ‘-‘. This example extracts 8 characters from the beginning of each line from test.txt file.

$ cut -c-8 test.txt
cat comm
cp comma
ls comma

The entire line would get printed when you don’t specify a number before or after the ‘-‘ as shown below.

$ cut -c- test.txt
cat command for file oriented operations.
cp command for copy files or directories.
ls command to list out files and directories with its attributes.

4. Select a Specific Field from a File

Instead of selecting x number of characters, if you like to extract a whole field, you can combine option -f and -d. The option -f specifies which field you want to extract, and the option -d specifies what is the field delimiter that is used in the input file.

The following example displays only first field of each lines from /etc/passwd file using the field delimiter : (colon). In this case, the 1st field is the username. The file

$ cut -d':' -f1 /etc/passwd
root
daemon
bin
sys
sync
games
bala

5. Select Multiple Fields from a File

You can also extract more than one fields from a file or stdout. Below example displays username and home directory of users who has the login shell as “/bin/bash”.

$ grep "/bin/bash" /etc/passwd | cut -d':' -f1,6
root:/root
bala:/home/bala

To display the range of fields specify start field and end field as shown below. In this example, we are selecting field 1 through 4, 6 and 7

$ grep "/bin/bash" /etc/passwd | cut -d':' -f1-4,6,7
root:x:0:0:/root:/bin/bash
bala:x:1000:1000:/home/bala:/bin/bash

6. Select Fields Only When a Line Contains the Delimiter

In our /etc/passwd example, if you pass a different delimiter other than : (colon), cut will just display the whole line.

In the following example, we’ve specified the delimiter as | (pipe), and cut command simply displays the whole line, even when it doesn’t find any line that has | (pipe) as delimiter.

$ grep "/bin/bash" /etc/passwd | cut -d'|'  -f1
root:x:0:0:root:/root:/bin/bash
bala:x:1000:1000:bala,,,:/home/bala:/bin/bash

But, it is possible to filter and display only the lines that contains the specified delimiter using -s option.

The following example doesn’t display any output, as the cut command didn’t find any lines that has | (pipe) as delimiter in the /etc/passwd file.

$ grep "/bin/bash" /etc/passwd | cut -d'|' -s -f1

7. Select All Fields Except the Specified Fields

In order to complement the selection field list use option –complement.

The following example displays all the fields from /etc/passwd file except field 7

$ grep "/bin/bash" /etc/passwd | cut -d':' --complement -s -f7
root:x:0:0:root:/root
bala:x:1000:1000:bala,,,:/home/bala

8. Change Output Delimiter for Display

By default the output delimiter is same as input delimiter that we specify in the cut -d option.

To change the output delimiter use the option –output-delimiter as shown below. In this example, the input delimiter is : (colon), but the output delimiter is # (hash).

$ grep "/bin/bash" /etc/passwd | cut -d':'  -s -f1,6,7 --output-delimiter='#'
root#/root#/bin/bash
bala#/home/bala#/bin/bash

9. Change Output Delimiter to Newline

In this example, each and every field of the cut command output is displayed in a separate line. We still used –output-delimiter, but the value is $’\n’ which indicates that we should add a newline as the output delimiter.

$ grep bala /etc/passwd | cut -d':' -f1,6,7 --output-delimiter=$'\n'
bala
/home/bala
/bin/bash

10. Combine Cut with Other Unix Command Output

The power of cut command can be realized when you combine it with the stdout of some other Unix command.

Once you master the basic usage of cut command that we’ve explained above, you can wisely use cut command to solve lot of your text manipulation requirements.

The following example indicates how you can extract only useful information from the ps command output. We also showed how we’ve filtered the output of ps command using grep and sed before the final output was given to cut command. Here, we’ve used cut option -d and -f which we’ve explained in the above examples.

$ ps axu | grep python | sed 's/\s\+/ /g' | cut -d' ' -f2,11-
2231 /usr/bin/python /usr/lib/unity-lens-video/unity-lens-video
2311 /usr/bin/python /usr/lib/unity-scope-video-remote/unity-scope-video-remote
2414 /usr/bin/python /usr/lib/ubuntuone-client/ubuntuone-syncdaemon
2463 /usr/bin/python /usr/lib/system-service/system-service-d
3274 grep --color=auto python

UNIX SED COMMAND

Sed is a Stream Editor used for modifying the files in unix (or linux). Whenever you want to make changes to the file automatically, sed comes in handy to do this. Most people never learn its power; they just simply use sed to replace text. You can do many things apart from replacing text with sed. Here I will describe the features of sed with examples.

Consider the below text file as an input.

>cat file.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
unixlinux which one you choose.

Sed Command Examples

1. Replacing or substituting string

Sed command is mostly used to replace the text in a file. The below simple sed command replaces the word "unix" with "linux" in the file.

>sed 's/unix/linux/' file.txt
linux is great os. unix is opensource. unix is free os.
learn operating system.
linuxlinux which one you choose.

Here the "s" specifies the substitution operation. The "/" are delimiters. The "unix" is the search pattern and the "linux" is the replacement string.

By default, the sed command replaces the first occurrence of the pattern in each line and it won't replace the second, third...occurrence in the line.

2. Replacing the nth occurrence of a pattern in a line.

Use the /1, /2 etc flags to replace the first, second occurrence of a pattern in a line. The below command replaces the second occurrence of the word "unix" with "linux" in a line.

>sed 's/unix/linux/2' file.txt
unix is great os. linux is opensource. unix is free os.
learn operating system.
unixlinux which one you choose.

3. Replacing all the occurrence of the pattern in a line.

The substitute flag /g (global replacement) specifies the sed command to replace all the occurrences of the string in the line.

>sed 's/unix/linux/g' file.txt
linux is great os. linux is opensource. linux is free os.
learn operating system.
linuxlinux which one you choose.

4. Replacing from nth occurrence to all occurrences in a line.

Use the combination of /1, /2 etc and /g to replace all the patterns from the nth occurrence of a pattern in a line. The following sed command replaces the third, fourth, fifth... "unix" word with "linux" word in a line.

>sed 's/unix/linux/3g' file.txt
unix is great os. unix is opensource. linux is free os.
learn operating system.
unixlinux which one you choose.

5. Changing the slash (/) delimiter

You can use any delimiter other than the slash. As an example if you want to change the web url to another url as

>sed 's/http:\/\//www/' file.txt

In this case the url consists the delimiter character which we used. In that case you have to escape the slash with backslash character, otherwise the substitution won't work.

Using too many backslashes makes the sed command look awkward. In this case we can change the delimiter to another character as shown in the below example.

>sed 's_http://_www_' file.txt
>sed 's|http://|www|' file.txt

6. Using & as the matched string

There might be cases where you want to search for the pattern and replace that pattern by adding some extra characters to it. In such cases & comes in handy. The & represents the matched string.

>sed 's/unix/{&}/' file.txt
{unix} is great os. unix is opensource. unix is free os.
learn operating system.
{unix}linux which one you choose.

>sed 's/unix/{&&}/' file.txt
{unixunix} is great os. unix is opensource. unix is free os.
learn operating system.
{unixunix}linux which one you choose.

7. Using \1,\2 and so on to \9

The first pair of parenthesis specified in the pattern represents the \1, the second represents the \2 and so on. The \1,\2 can be used in the replacement string to make changes to the source string. As an example, if you want to replace the word "unix" in a line with twice as the word like "unixunix" use the sed command as below.

>sed 's/\(unix\)/\1\1/' file.txt
unixunix is great os. unix is opensource. unix is free os.
learn operating system.
unixunixlinux which one you choose.

The parenthesis needs to be escaped with the backslash character. Another example is if you want to switch the words "unixlinux" as "linuxunix", the sed command is

>sed 's/\(unix\)\(linux\)/\2\1/' file.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
linuxunix which one you choose.

Another example is switching the first three characters in a line

>sed 's/^\(.\)\(.\)\(.\)/\3\2\1/' file.txt
inux is great os. unix is opensource. unix is free os.
aelrn operating system.
inuxlinux which one you choose.

8. Duplicating the replaced line with /p flag

The /p print flag prints the replaced line twice on the terminal. If a line does not have the search pattern and is not replaced, then the /p prints that line only once.

>sed 's/unix/linux/p' file.txt
linux is great os. unix is opensource. unix is free os.
linux is great os. unix is opensource. unix is free os.
learn operating system.
linuxlinux which one you choose.
linuxlinux which one you choose.

9. Printing only the replaced lines

Use the -n option along with the /p print flag to display only the replaced lines. Here the -n option suppresses the duplicate rows generated by the /p flag and prints the replaced lines only one time.

>sed -n 's/unix/linux/p' file.txt
linux is great os. unix is opensource. unix is free os.
linuxlinux which one you choose.

If you use -n alone without /p, then the sed does not print anything.

10. Running multiple sed commands.

You can run multiple sed commands by piping the output of one sed command as input to another sed command.

>sed 's/unix/linux/' file.txt| sed 's/os/system/'
linux is great system. unix is opensource. unix is free os.
learn operating system.
linuxlinux which one you chosysteme.

Sed provides -e option to run multiple sed commands in a single sed command. The above output can be achieved in a single sed command as shown below.

>sed -e 's/unix/linux/' -e 's/os/system/' file.txt
linux is great system. unix is opensource. unix is free os.
learn operating system.
linuxlinux which one you chosysteme.

11. Replacing string on a specific line number.

You can restrict the sed command to replace the string on a specific line number. An example is

>sed '3 s/unix/linux/' file.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
linuxlinux which one you choose.

The above sed command replaces the string only on the third line.

12. Replacing string on a range of lines.

You can specify a range of line numbers to the sed command for replacing a string.

>sed '1,3 s/unix/linux/' file.txt
linux is great os. unix is opensource. unix is free os.
learn operating system.
linuxlinux which one you choose.

Here the sed command replaces the lines with range from 1 to 3. Another example is

>sed '2,$ s/unix/linux/' file.txt
linux is great os. unix is opensource. unix is free os.
learn operating system.
linuxlinux which one you choose.

Here $ indicates the last line in the file. So the sed command replaces the text from second line to last line in the file.

13. Replace on a lines which matches a pattern.

You can specify a pattern to the sed command to match in a line. If the pattern match occurs, then only the sed command looks for the string to be replaced and if it finds, then the sed command replaces the string.

>sed '/linux/ s/unix/centos/' file.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
centoslinux which one you choose.

Here the sed command first looks for the lines which has the pattern "linux" and then replaces the word "unix" with "centos".

14. Deleting lines.

You can delete the lines a file by specifying the line number or a range or numbers.

>sed '2 d' file.txt
>sed '5,$ d' file.txt

15. Duplicating lines

You can make the sed command to print each line of a file two times.

>sed 'p' file.txt

16. Sed as grep command

You can make sed command to work as similar to grep command.

>grep 'unix' file.txt
>sed -n '/unix/ p' file.txt

Here the sed command looks for the pattern "unix" in each line of a file and prints those lines that has the pattern.

You can also make the sed command to work as grep -v, just by using the reversing the sed with NOT (!).

>grep -v 'unix' file.txt
>sed -n '/unix/ !p' file.txt

The ! here inverts the pattern match.

17. Add a line after a match.

The sed command can add a new line after a pattern match is found. The "a" command to sed tells it to add a new line after a match is found.

>sed '/unix/ a "Add a new line"' file.txt
unix is great os. unix is opensource. unix is free os.
"Add a new line"
learn operating system.
unixlinux which one you choose.
"Add a new line"

18. Add a line before a match

The sed command can add a new line before a pattern match is found. The "i" command to sed tells it to add a new line before a match is found.

>sed '/unix/ i "Add a new line"' file.txt
"Add a new line"
unix is great os. unix is opensource. unix is free os.
learn operating system.
"Add a new line"
unixlinux which one you choose.

19. Change a line

The sed command can be used to replace an entire line with a new line. The "c" command to sed tells it to change the line.

>sed '/unix/ c "Change line"' file.txt
"Change line"
learn operating system.
"Change line"

20. Transform like tr command

The sed command can be used to convert the lower case letters to upper case letters by using the transform "y" option.

>sed 'y/ul/UL/' file.txt
Unix is great os. Unix is opensoUrce. Unix is free os.
Learn operating system.
UnixLinUx which one yoU choose.

Here the sed command transforms the alphabets "ul" into their uppercase format "UL"

Saturday, 25 February 2017

ANALYTICAL FUNCTIONS SQL

This article provides a clear, thorough concept of analytic functions and its various options by a series of simple yet concept building examples. The article is intended for SQL coders, who for might be not be using analytic functions due to unfamiliarity with its cryptic syntax or uncertainty about its logic of operation. Often I see that people tend to reinvent the feature provided by analytic functions by native join and sub-query SQL. This article assumes familiarity with basic Oracle SQL, sub-query, join and group function from the reader. Based on that familiarity, it builds the concept of analytic functions through a series of examples.

It is true that whatever an analytic function does can be done by native SQL, with join and sub-queries. But the same routine done by analytic function is always faster, or at least as fast, when compared to native SQL. Moreover, I am not considering here the amount of time that is spent in coding the native SQLs, testing, debugging and tuning them.

The general syntax of analytic function is:

Function(arg1,..., argn) OVER ( [PARTITION BY <...>] [ORDER BY <....>] [<window_clause>] )

<window_clause> is like "ROW <?>" or "RANK <?>"
All the keywords will be dealt in details as we walk through the examples. The script for creating the schema (SCOTT) on which the example queries of this article are run can be obtained in ORACLE_HOME/sqlplus/demo/demobld.sql of any standard Oracle installation.

How are analytic functions different from group or aggregate functions?

QUERY - 1

SELECT deptno,

COUNT(*) DEPT_COUNT

FROM emp

WHERE deptno IN (20, 30)

GROUP BY deptno;

DEPTNO DEPT_COUNT

---------------------- ----------------------

20 5

30 6

2 rows selected

Consider the Query-1 and its result. Query-1 returns departments and their employee count. Most importantly it groups the records into departments in accordance with the GROUP BY clause. As such any non-"group by" column is not allowed in the select clause.

QUERY - 2

SELECT empno, deptno,

COUNT(*) OVER (PARTITION BY

deptno) DEPT_COUNT

FROM emp

WHERE deptno IN (20, 30);

EMPNO DEPTNO DEPT_COUNT

---------- ---------- ----------

7369 20 5

7566 20 5

7788 20 5

7902 20 5

7876 20 5

7499 30 6

7900 30 6

7844 30 6

7698 30 6

7654 30 6

7521 30 6

11 rows selected.

Now consider the analytic function query (Query-2) and its result. Note the repeating values of DEPT_COUNT column.

This brings out the main difference between aggregate and analytic functions. Though analytic functions give aggregate result they do not group the result set. They return the group value multiple times with each record. As such any other non-"group by" column or expression can be present in the select clause, for example, the column EMPNO in Query-2.

Analytic functions are computed after all joins, WHERE clause, GROUP BY and HAVING are computed on the query. The main ORDER BY clause of the query operates after the analytic functions. So analytic functions can only appear in the select list and in the main ORDER BY clause of the query.

In absence of any PARTITION or <window_clause> inside the OVER( ) portion, the function acts on entire record set returned by the where clause. Note the results of Query-3 and compare it with the result of aggregate function query Query-4.

QUERY - 3

SELECT empno, deptno,

COUNT(*) OVER ( ) CNT

FROM emp

WHERE deptno IN (10, 20)

ORDER BY 2, 1;

EMPNO DEPTNO CNT

---------- ---------- ----------

7782 10 8

7839 10 8

7934 10 8

7369 20 8

7566 20 8

7788 20 8

7876 20 8

7902 20 8

QUERY - 4

SELECT COUNT(*) FROM emp WHERE deptno IN (10, 20);

COUNT(*)

----------

ROWID and ROWNUM

Rownum is a pseudo column which is generated for query data set at runtime.

Rowid is the physical address of the row and hence definition suggest rowid for a row will never change but row num will always change.

Rownum is the temp num assigned for the return rows per statement query
It will change as per the query statement.
It can be used with "=1" or "< 10" id number .

Where as Rowid is the id assigned by the Oracle while creating record in database. It is permanent & we can query with "=" sign & it will not change
or depend on query.

Some more Points about Rowid and Rownum :

1. Rowid gives address of rows or records. Rownum gives count of records.

2. Rowid is permanently stored in database. Rownum is not stored in database permanently

3. Rowid is automatically assigned with every inserted into a table. Rownum is an dynamic value automatically
retrieved along with select statement output.

4. It is only for display purpose.

5. ROWNUM is a pseudocolumn returning a sequential number along with the rows retrieved, whereas rowid (also a pseudo column) contains the actual physical address of the data block containing the row.

6.ROWID uniquely identifies where a row resides on disk. The information in a ROWID gives Oracle everything it needs to find your row, the disk number, the cylinder, block and offset into the block. It actually the physical address.
ROWID can be used to fetch a row from database.
ROWID is permanent.

7. ROWNUM is a "pseudo-column", ROWNUM is used to restrict the number of rows in a query.
ROWNUM is temporary.

8. ROWID is a semi-physical value specifying the location of a row (in terms of data file, extent, etcetera). ROWID is not very useful in day-to-day programming, as it is not a constant value (it changes with table re-organizations, imports, and things like that). Therefore it requires caution in its use.

9. ROWNUM is a logical value for the sequence of row retrieval in the resultset. This is the position before sorting, so rownum is not always the same as the position in the resultset presented to you. This also requires some caution.

VIEWS and MATERIALIZED VIEWS

What is View in database

Views are logical virtual table created by “select query” but the result is not stored anywhere in the disk and every time we need to fire the query when we need data, so always we get updated or latest data

from original tables. Performance of the view depend upon our select query. If we want to improve the performance of view we should avoid to use join statement in our query or if we need multiple joins between table always try to use index based column for joining as we know index based columns are faster than non index based column. View allow to store definition of the query in the database itself.

What is Materialized View in database

Materialized views are also logical view of our data driven by select query but the result of the query will get stored in the table or disk, also definition of the query will also store in the database .When we see the performance of Materialized view it is better than normal View because the data of materialized view will stored in table and table may be indexed so faster for joining also joining is done at the time of materialized views refresh time so no need to every time fire join statement as in case of view.

Difference between View vs Materialized View in database

Based upon on our understanding of View and Materialized View, Let’s see, some short difference between them :

1) First difference between View and materialized view is that, In Views query result is not stored in the disk or database but Materialized view allow to store query result in disk or table.

2) Another difference between View vs materialized view is that, when we create view using any table, rowid of view is same as original table but in case of Materialized view rowid is different.

3) One more difference between View and materialized view in database is that, In case of View we always get latest data but in case of Materialized view we need to refresh the view for getting latest data.

4) Performance of View is less than Materialized view.

5) This is continuation of first difference between View and Materialized View, In case of view its only the logical view of table no separate copy of table but in case of Materialized view we get physically separate copy of table

6) Last difference between View vs Materialized View is that, In case of Materialized view we need extra trigger or some automatic method so that we can keep MV refreshed, this is not required for views in database.

When to Use View vs Materialized View in SQL

Mostly in application we use views because they are more feasible, only logical representation of table data no extra space needed. We easily get replica of data and we can perform our operation on that data without affecting actual table data but when we see performance which is crucial for large application they use materialized view where Query Response time matters so Materialized views are used mostly with data ware housing or business intelligence application.

That’s all on difference between View and materialized View in database or SQL. I suggest always prepare this question in good detail and if you can get some hands on practice like creating Views, getting data from Views then try that as well.

Friday, 24 February 2017

INDEX

Indexes help us retrieve data from tables quicker. Let's use an example to illustrate this point: Say we are interested in reading about how to grow peppers in a gardening book. Instead of reading the book from the beginning until we find a section on peppers, it is much quicker for us to go to the index section at the end of the book, locate which pages contain information on peppers, and then go to these pages directly. Going to the index first saves us time and is by far a more efficient method for locating the information we need.

The same principle applies for retrieving data from a database table. Without an index, the database system reads through the entire table (this process is called a 'table scan') to locate the desired information. With the proper index in place, the database system can then first go through the index to find out where to retrieve the data, and then go to these locations directly to get the needed data. This is much faster.

Therefore, it is often desirable to create indexes on tables. An index can cover one or more columns. The syntax for creating a table index is shown in the CREATE INDEX section. Below we discuss some general strategies when building and using an index:

1. Build index on columns of integer type

Integers take less space to store, which means the query will be faster. If the column you want to build an index for is not of type integer, consider creating a surrogate integer key (or simply a surrogate column of type integer) which maps one-to-one to the column you want to build the index for.

2. Keep index as narrow as possible

Narrower indexes take less space, require less time to process, which in turn means the query will run faster.

3. Column order is important

For indexes covering multiple columns, the order of the columns in the index is important. The best practice is to use the column with the lowest cardinality first, and the column with the highest cardinality last. Recall cardinality means the number of distinct values for that column. So, if SELECT DISTINCT (COLUMN1) FROM TABLE_NAME; returns 5, that means the cardinality for COLUMN1 is 5.

4. Make sure the column you are building an index for is declared NOT NULL

This can decrease the size of the index, which in turn will speed up the query.

5. Build an index only when necessary

The advantages of using an index do not come without a cost. Indexes take up disk space, and they will cause INSERT, UPDATE, and DELETE operations to go slower, since each time one of these operations is carried out, not only does the database system need to update the values in the table, but it also needs to update the indexes. There are no hard rules on how to build indexes. Deciding what indexes to build is often this is an iterative process over time.

As mentioned in the Index overview page, a table index helps SQL statements run faster. The syntax for creating an index is:

CREATE INDEX "index_name" ON "table_name" (column_name);

Note that an index can only cover one table. We cannot build an index that covers multiple tables.

Let's assume that we have the following table,

Table Customer

First_Name char50,

Last_name char50,

Address char 50,

City char50,

Country char25,

Birth_date datetime;

and we want to create an index on the column Last_Name, we would type in,

CREATE INDEX IDX_CUSTOMER_LAST_NAME

ON Customer (Last_Name);

If we want to create an index on both City and Country, we would type in,

CREATE INDEX IDX_CUSTOMER_LOCATION

ON Customer (City, Country);

A simple index is an index on a single column, while a composite index is an index on two or more columns. In the examples above, IDX_CUSTOMER_LAST_NAME is a simple index because there is only one column, while IDX_CUSTOMER_LOCATION is a composite index because there are two columns.

There is no strict rule on how to name an index. The generally accepted method is to place a prefix, such as "IDX_", before an index name to avoid confusion with other database objects. It is also a good idea to provide information on which table and column(s) the index is used on.

Please note that the exact syntax for CREATE INDEX may be different for different databases. You should consult with your database reference manual for the precise syntax.

Pages

Sunday, 26 February 2017

UNIX CUT COMMAND

1. Select Column of Characters

2. Select Column of Characters using Range

3. Select Column of Characters using either Start or End Position

4. Select a Specific Field from a File

5. Select Multiple Fields from a File

6. Select Fields Only When a Line Contains the Delimiter

7. Select All Fields Except the Specified Fields

8. Change Output Delimiter for Display

9. Change Output Delimiter to Newline

10. Combine Cut with Other Unix Command Output

UNIX SED COMMAND

Sed Command Examples

Saturday, 25 February 2017

ANALYTICAL FUNCTIONS SQL

How are analytic functions different from group or aggregate functions?

ROWID and ROWNUM

VIEWS and MATERIALIZED VIEWS

Friday, 24 February 2017

INDEX