In this article, we are going to use SQL Server to query data that resides in Amazon S3 buckets with the help of Amazon Athena. SQL Server is one of the most popular relational database management systems developed by Microsoft. SQL Server can be installed on-premises or on popular cloud services like Azure or AWS. For the purpose of this demonstration, we are going to use SQL Server that has been installed on-premises. Also, as a pre-requisite for this tutorial, we would need to have an AWS account valid and some knowledge about Amazon Athena.
What is Amazon Athena
Amazon Athena is an interactive query service provided by Amazon that can be used to connect to S3 and run ANSI SQL queries. Imagine you have a CSV file that contains data in tabular format. This CSV file cannot be read by any SQL engine without being imported into the database server directly. In such a scenario, AWS Athena provides a service that can be used to first create a schema of the CSV file, and then use general ANSI SQL scripts to query the data stored in those CSV files. This is also possible in case there is more than one CSV file. The only condition in such a case would be that all the CSV files maintain a similar column structure.
Let us now head over to the AWS console and upload a small CSV file that can be queried by Athena. Navigate to https://console.aws.amazon.com and search for S3. Create an S3 bucket of your choice and then create a directory under it. Under this directory, you can upload your CSV file as follows.
Figure 1 – Uploading the CSV file
Once the CSV file has been uploaded, the next step is to move to the Athena service to set up the schema and query service. Search for the Athena service and then create an environment that can read the CSV files from the S3 bucket. You can follow this article, Getting started with Amazon Athena and S3, where I have explained in a step-by-step manner how to create an instance in Athena and query data from an S3 bucket. Once you have created an instance, you are free to query the data using standard SQL queries as follows.
Figure 2 – Creating an AWS Athena instance and querying it
As of now, the part of setting up the AWS Athena service and the CSV files in S3 has been over. In the next section, we will look into how to connect SQL Server to this Athena service and run similar queries from the SQL Server instance.
Also, in order to query the Athena service from SQL Server, we need to generate the Access Key and the Secret Key from the AWS console. This can be done by navigating to the IAM service on the console and then generating the security credentials from the Users tab. If you are using an AWS Account that is handled by an administrator, you would need to get the credentials from the Administrator.
Setting up the ODBC Driver for Amazon Athena
Now that we have our Athena service up and running, we can set up our SQL Server instance and configure it to query the Athena. For this purpose, we need to first create a linked server in SQL Server and then we will use the OPENQUERY service to query data stored in the S3 buckets.
In order to set up the linked server, we would first need to download and install the connector driver for SQL Server to connect to the AWS Athena service. This can be easily downloaded from the official page of Athena. Navigate to https://docs.aws.amazon.com/athena/latest/ug/connect-with-odbc.html and then select an appropriate version for your computer. In my case, I am going to proceed with the ODBC 1.1.12 for Windows 64-bit version. Download the setup and install it on your machine.
Figure 3 – Installing AWS Athena ODBC Connector
Proceed with the installation as usual and once done, you can set up the ODBC connector. Search for ODBC Drivers from the Start Menu and ODBC Data Source Administrator (64-bit) will open up on the screen. For the System DSN, the Simba Athena connector will already be present. However, in this case, we would need to install the ODBC driver for the user. Under User DSN, click on Add and the new data source dialog box will open up.
Figure 4 – Creating the new data source using the ODBC Driver
As you can see in the figure above, we need to select the Simba Athena ODBC Driver and click on Finish to start. In order to configure the ODBC driver, the following information will be required that will allow the ODBC connector to establish a connection to the Athena service running under the specified AWS account.
Figure 5 – Simba Athena ODBC Driver Setup
You can use the following details to set up the ODBC Driver.
- Data Source Name – This is the name of the ODBC driver that is going to be created. We will refer to this in the following sections
- Description – This is an optional field. You can add a description of your choice to refer to the ODBC driver
- AWS Region – This specifies the region in which the Athena service is running
- Catalog – The database server catalog of the Athena service
- Schema – The name of the database in the Athena service
- Workgroup – The name of the workgroup
- S3 Output Locations – This is a directory in S3 where Athena can store its output results
Once all this information is provided, the next step is to authenticate the driver and connect to the AWS account. Click on Authentication Options and provide the AWS credentials as obtained earlier.
Figure 6 – Authenticating the ODBC Connector with AWS
Click OK and test the connection if successful.
Figure 7 – Testing the connection ODBC Driver
As you can see in the figure above, a connection has been successfully established with the Athena service that is running under the specified AWS Account. If there are any issues while connecting to the service, then you can try to regenerate the security credentials and retry connecting with the new ones.
Setting up the Linked Server in SQL Server Management Studio
Once the ODBC driver is set up correctly, the next step would be to create a linked server in SQL Server that can use the ODBC driver to connect to the Amazon Athena service. In order to create a linked server, you can run the following script in SQL Server Management Studio. The following script will add the linked server and also configure it for use.
Figure 8 – Setting up the Linked Server in SQL Server Management Studio
You can find the script here.
As soon as you run the script, you can refresh the database connection on the left panel and expand the Linked Servers tab to find the AWS_Athena server that has been created.
Figure 9 – Linked Server created
Right-click on the Linked Server and click on Test Connection to test if the driver can connect to the Amazon Athena service.
Figure 10 – Connection Successful
Once the connection to the Linked Server is successful, the next step is to expand the objects under the linked server and browse the tables that were originally created in Amazon Athena.
Figure 11 – Browsing the Linked Server
Alternatively, you can also start querying the service from SQL Server Management Studio now.
Figure 12 – Querying the Athena Service from SQL Server
Conclusion
In this article, we have discussed what Amazon Athena is, and how to connect the SQL Server database to Athena and extract data from S3 buckets stored in AWS. This way of interacting SQL Server with data in S3 is a great advantage while dealing with data that is continuously growing. Amazon S3 can easily act as a Data Lake that can store data in a structured or semi-structured format. This data can then be queried using Amazon Athena which in turn can be linked to a SQL Server instance. This gives users of SQL Server a simple connection to Amazon Athena over an ODBC driver to fetch results over data stored in an S3 bucket.
- Getting started with PostgreSQL on Docker - August 12, 2022
- Getting started with Spatial Data in PostgreSQL - January 13, 2022
- An overview of Power BI Incremental Refresh - December 6, 2021