Data understanding and preparation - basic work with datasets

In my previous four articles, I worked on a single variable of a dataset. I have shown example code in T-SQL, R, and Python languages. I always used the same dataset. Therefore, you might have gotten the impression that in R and in Python, you can operate on a dataset the same way like you operate on an SQL Server table. However, there is a big difference between an SQL Server table and Python or R data frame.

In this article, will do a bit more formal introduction to R and Python data frames. I will show how to make basic operations on data frames, like filter them, make a projection, join and bind them, and sort them. For the sake of completeness, I am starting with T-SQL. Of course, the first part is really just a very brief recapitulation of the basic SELECT statement.

Core T-SQL SELECT statement elements

The simplest query to retrieve the data you can write includes the SELECT and the FROM clauses. In the select clause, you can use the star character, literally SELECT *, to denote that you need all columns from a table in the result set.

Better than using SELECT * is to explicitly list only the columns you need. This means you are returning only a projection on the table. A projection means you filter the columns. Of course, you can filter also the rows with the WHERE clause.

In a relational database, you typically have data spread in multiple tables. Each table represents a set of entities of the same kind, like customers, or products, or orders. In order to get result sets meaningful for the business your database supports, you most of the time need to retrieve data from multiple tables in the same query. You need to join two or more tables based on some conditions. The most frequent kind of a join is the inner join. Rows returned are those for which the condition in the join predicate for the two tables joined evaluates to true. Note that in a relational database, you have three-valued logic, because there is always a possibility that a piece of data is unknown. You mark the unknown with the NULL keyword. A predicate can thus evaluate to true, false or NULL. For an inner join, the order of the tables involved in the join is not important.

In the query where you join multiple tables, you should use table aliases. If a column’s name is unique across all tables in the query, then you can use it without table name. You can shorten the two-part column names by using table aliases. You specify table aliases in the FROM clause. Once you specify table aliases, you must always use the aliases; you can’t refer to the original table names in that query anymore. Please note that a column name might be unique in the query at the moment when you write the query. However, later somebody could add a column with the same name in another table involved in the query. If the column name is not preceded by an alias or by the table name, you would get an error when executing the query because of the ambiguous column name. In order to make the code more stable and more readable, you should always use table aliases for each column in the query. You can specify column aliases as well.

A table in SQL Server represents a set. In a set, the order of the elements is not defined. Therefore, you cannot refer to a row or a column in a table by position. Also, the result of the SELECT statement does not guarantee any specific ordering. If you want to return the data in a specific order, you need to use the ORDER BY clause. The following query shows these basic SELECT elements.

USE AdventureWorksDW2016;

-- Basic SELECT statement elements

SELECT c.CustomerKey, c.FirstName, c.LastName,

g.City, g.StateProvinceName AS State

FROM dbo.DimCustomer AS c

INNER JOIN dbo.DimGeography AS g

ON c.GeographyKey = g.GeographyKey

WHERE g.EnglishCountryRegionName = N'Australia'

ORDER BY c.LastName DESC, c.FirstName DESC;

The following figure shows the partial results, limited to the first six rows only. Please note the order of the rows and other elements of the basic SELECT statement implemented.

R data frame

In R, you operate on matrices, not on tables. A matrix is a two-dimensional array of values of the same type. The elements are ordered, and you can refer to them by position.

The most important data structure in R is a data frame. Most of the time, you analyze data stored in a data frame. Data frames are matrices where each variable can be of a different type. Remember, a variable is stored in a column, and all values of a single variable must be of the same type. Data frames are very similar to SQL Server tables. However, they are still matrices, meaning that you can refer to the elements by position and that they are ordered.

Most of the times, you get a data frame from your data source, for example from a SQL Server database. You can also enter the data manually, or read it from many other sources, including text files, Excel, and many more. The following code reads the data from SQL Server AdventureWorksDW2016 demo database, the dbo.vTargetMail view.

library(RODBC)

# Connecting and reading the data

con <- odbcConnect("AWDW", uid = "RUser", pwd = "Pa$$w0rd")

TM <- as.data.frame(sqlQuery(con,

"SELECT CustomerKey, MaritalStatus, Gender, Age

FROM dbo.vTargetMail;"),

stringsAsFactors = TRUE)

close(con)

The following code shows how you can access the data in a data frame by the position or mixed by the position and column names. Please note that the index of the first row or column is 1, and that the boundaries are included – you get three rows and two columns.

1 2	TM[1:3, 1:2]; TM[1:3, c("MaritalStatus", "Gender")];

The result of the previous two rows is the same:

	CustomerKey	MaritalStatus
1	11000	M
2	11001	S
3	11002	M

You can refer to columns similarly like you refer to them in T-SQL, where you use table.column dotation; in R, the dollar sign replaces the dot, and therefore you refer to columns with dataframe$column. The following code shows how to refer to the columns. It does the crosstabulation of the MaritalStatus and Gender columns and shows this in a bar chart.

# $ Notation

mg <- table(TM$MaritalStatus, TM$Gender);

barplot(mg,

main = 'Marital status and gender',

xlab = 'MaritalStatus', ylab = 'Gender',

col = c("blue", "yellow"),

beside = TRUE,

legend = TRUE)

Here is the bar chart.

For a projection, you simply select appropriate columns, like the following code shows. It creates two new data frames with subset of the columns only.

# Projections

cols1 <- c("CustomerKey", "MaritalStatus")

TM1 <- TM[cols1]

cols2 <- c("CustomerKey", "Gender")

TM2 <- TM[cols2]

TM1[1:3, 1:2]

TM2[1:3, 1:2]

Here are the first three rows of the two new data frames.

	CustomerKey	MaritalStatus
1	11000	M
2	11001	S
3	11002	M
	CustomerKey	Gender
1	11000	M
2	11001	M
3	11002	M

You can merge two data frames using some column values from a column that appears in both of them. This is very similar to T-SQL join. The following code performs the merge of the two previously created data frames.

# Merge datasets

TM3 <- merge(TM1, TM2, by = "CustomerKey")

TM3[1:3, 1:3]

The results are:

	CustomerKey	MaritalStatus	Gender
1	11000	M	M
2	11001	S	M
3	11002	M	M

However, data frames are ordered. You can rely on the order, unless you reorder a data frame, of course. Since I created the two projection data frames TM1 and TM2 from the original TM data frame without reordering them, they maintain the original order. Instead of joining them, I can simply bind the columns, row by row, by the cbind() function, like the following code shows.

# Binding datasets

TM4 <- cbind(TM1, TM2)

TM4[1:3, 1:4]

Note the results:

	CustomerKey	MaritalStatus	CustomerKey.1	Gender
1	11000	M	11000	M
2	11001	S	11001	M
3	11002	M	11002	M

The CustomerKey column is listed twice, and the second time it is automatically renamed.

You can filter a data frame by using a predicate when you refer to the index of rows or columns. The following code creates a two new data frames with the same columns, CustomerKey and MaritalStatus. The first data frame is filtered to include only two customers with the lowest CustomerKey, and the second includes only two customers with the highest CustomerKey values. Both data frames have the same columns. You can bind such data frames by rows with the rbind() function. This is a similar process like the UNION ALL clause does in T-SQL.

# Filtering and row binding data

TM1 <- TM[TM$CustomerKey < 11002, cols1]

TM2 <- TM[TM$CustomerKey > 29481, cols1]

TM5 <- rbind(TM1, TM2)

TM5

Here is the content of complete TM5 data fame.

	CustomerKey	MaritalStatus
1	11000	M
2	11001	S
18483	29482	M
18484	29483	M

Finally, you can reorder a data frame by using the order() function in the index reference. The following code creates a reordered data frame by sorting the rows over the Age column. Note the minus sign in the order() function – it means sort descending.

# Sort

TMSortedByAge <- TM[order(-TM$Age), c("CustomerKey", "Age")]

TMSortedByAge[1:5, 1:2]

So here are the reordered first five rows.

	CustomerKey	Age
1726	12725	99
5456	16455	98
3842	14841	97
3993	14992	97
7035	18034	97

Python Pandas data frame

In Python, there is also the data frame object, like in R. However, it is not part of the basic engine like in R. It is defined in the pandas library. You can communicate with SQL Server through the pandas data frames. But before getting there, you need first to refer to the numpy library, which brings efficient work with matrices to Python. The following code does the necessary imports and reads the data from SQL Server.

import numpy as np

import pandas as pd

import pyodbc

import matplotlib.pyplot as plt

# Connecting and reading the data

con = pyodbc.connect('DSN=AWDW;UID=RUser;PWD=Pa$$w0rd')

query = """SELECT CustomerKey, MaritalStatus, Gender, Age

FROM dbo.vTargetMail;"""

TM = pd.read_sql(query, con)

The pandas data frame has many built methods for which you need separate functions in R. For example, you can use the pandas crosstab() function to cross tabulate the data like the R table() function. However, you don’t need a separate function to plot the data; you ca use the pandas data frame plot() function, like the following code shows.

# A graph of marital status and gender

pd.crosstab(TM.MaritalStatus + TM.Gender,

columns = 'Count',

rownames = 'X').plot(kind = 'barh',

legend = False,

title = 'Marital status and gender',

fontsize = 12)

plt.show()

Not that the result of the pandas crosstab() function is a new data frame, and I am calling the plot() method of this data frame in order to produce this time a horizontal bar chart, like you can see in the following figure.

You make a projection of a data frame by selecting a subset of columns listed by their names in an array, like the following code shows.

# Projections

TM1 = TM[["CustomerKey", "MaritalStatus"]]

TM2 = TM[["CustomerKey", "Gender"]]

I created two new data frames in the previous code. Because a data frame is a matrix in Python as well, I can refer to the elements by their positional index with the iloc(), or index locate method:

# Positional access

TM1.iloc[0:3, 0:2]

TM2.iloc[0:3, 0:2]

Note that the index is zero-based. But there is another interesting difference from R. Please observe the results of the previous code.

	CustomerKey	MaritalStatus
0	11000	M
1	11001	S
2	11002	M
	CustomerKey	Gender
0	11000	M
1	11001	M
2	11002	M

Not that when you refer to the elements of a data frame by the index position, the upper boundary is not included. For example, the fourth row (index value 3) is not included in any of the results.

With the loc() method you can locate the elements based on a predicate for the rows and columns. For example, the following code selects Only rows where Age is greater than 97 and lists the three columns included in the results explicitly in an array.

1 2	# Filter and projection TM.loc[TM.Age > 97, ["CustomerKey", "Age", "Gender"]]

Here are the results.

	CustomerKey	Age	Gender
1725	12725	99	F
5455	16455	98	F

In order to join two data frames, you can use the merge() pandas function, and specify the column you want to use for the join. This is similar to R merge() function. You can see this process in the following code.

# Joining data frames

TM3 = pd.merge(TM1, TM2, on = "CustomerKey")

TM3.iloc[0:3, 0:3]

The first three rows of the merged data frame are shown below.

	CustomerKey	MaritalStatus	Gender
0	11000	M	M
1	11001	S	M
2	11002	M	M

Finally, you can reorder a data frame with help of the sort() method, like shown in the following code.

# Sort

TMSortedByAge = TM.sort(["Age"], ascending = False)

TMSortedByAge.iloc[0:5, 0:4]

And here is the last result in this article.

	CustomerKey	MaritalStatus	Gender	Age
1725	12725	M	F	99
5455	16455	M	F	98
3841	14841	M	M	97
7034	18034	M	M	97
3992	14992	M	M	97

Conclusion

In this article, you learned how to do the basic operations on a whole dataset. For the next article, I plan to show how you can do more advanced operations, like grouping and aggregating data in t-SQL, R, and Python.

Introduction to data science, data understanding and preparation

Data science in SQL Server: Data understanding and transformation – ordinal variables and dummies

Data science in SQL Server: Data analysis and transformation – binning a continuous variable

Data science in SQL Server: Data analysis and transformation – Information entropy of a discrete variable

Data understanding and preparation – basic work with datasets

Data science in SQL Server: Data analysis and transformation – grouping and aggregating data I

Data science in SQL Server: Data analysis and transformation – grouping and aggregating data II

Interview questions and answers about data science, data understanding and preparation

References

See more

To boost SQL coding productivity, check out these SQL tools for SSMS and Visual Studio including T-SQL formatting, refactoring, auto-complete, text and data search, snippets and auto-replacements, SQL code and object comparison, multi-db script comparison, object decryption and more

Author
Recent Posts

Dejan Sarka

Dejan Sarka, MCT and Data Platform MVP, is an independent trainer and consultant that focuses on development of database & business intelligence applications.Besides projects, he spends about half of the time on training and mentoring. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan Sarka is the main author or coauthor of sixteen books about databases and SQL Server. He also developed many courses and seminars for Microsoft, SolidQ and Pluralsight.

View all posts by Dejan Sarka

SQLShack

Data understanding and preparation – basic work with datasets

Core T-SQL SELECT statement elements

R data frame

Python Pandas data frame

Conclusion

Table of contents

References

See more

Core T-SQL SELECT statement elements

R data frame

Python Pandas data frame

Conclusion

Table of contents

References

See more

Related posts: