Power BI Desktop is a useful reporting and analytical tool to represent data in various formats. These presentations help us to quickly understand information and circulate it to stakeholders in a visual fashion.
We can connect Power BI Desktop to disparate data sources including multiple relational and non-relation databases, Cloud, SharePoint, and Microsoft Exchange etc. In the real world, we use various file formats to represent our documents.
In Power BI, we can get data from below file formats
- Excel
- Text\CSV
- XML
- JSON
- SharePoint Folder
Although popular as an export format, you cannot easily modify a PDF document, and it can contain the text, images, tables, and charts in PDF document.
In Power BI Desktop, we cannot get data from PDF documents directly. Also, we do not have any option to use PDF as a data source. But suppose we want to extract a table from the PDF file and prepare visualizations on it. We can import data from PDF with following methods.
- Convert the PDF file to a Microsoft Word or HTML document and copy the table in an appropriate data source such as Microsoft Excel. It is a lengthy process and if we have to do the same process repeatedly, it quickly becomes cumbersome
- We can also use programming languages such as R to extract the required data from a PDF file. It requires extensive technical knowledge of a machine language such as R to work with these PDF files. If the format of the PDF files changes or we need to import a table from another PDF file, we require a change in the programming code. So this again creates organizational friction to contact developers and ask for the code
In both approaches, we require extra overhead to prepare PDF reports in Power BI. In this article, we will show how import data from a PDF file easily and without any coding.
Power BI PDF Connector
The Power BI Desktop contains many preview features. These features are not enabled by default. We can use the PDF Connector preview feature in Power BI to use a PDF file as a data source.
Launch the Power BI and go to the File menu. In the File menu, go to Options and settings and click on Options.
It opens various Power BI configuration options. We can configure various options such as Data load, Power BI Query, R scripting, security, Report settings. We need to click on the Preview features menu option to use the PDF connector in Power BI Desktop data source.
It opens the various preview features in the Power BI Desktop.
Note: This preview list is updated regularly. Some of the preview features might move to general availability. We will also get new features with the new release of Power BI Desktop.
In the preview feature list, enable Get Data from PDF files. We can also read the documentation from the Learn more. It opens a web page for Microsoft Docs. You should have an active internet connection to go through these documentations.
Click on the checkbox in front of Get data from PDF files and click Ok. We need to restart Power BI to enable the preview feature. Close Power BI Desktop and re-open it.
Once we relaunch Power BI Desktop, go to Get Data->More again from the menu bar. In the following screenshot, you can get a new option under ‘File’. We get the option ‘PDF (Beta) to use as a data source.
Click on PDF (Beta) connector and connect. We again get a warning message that this PDF connector is still under development. Since it is still a preview version, it might have a few bugs that will be resolved before the general availability of this feature.
Click on Continue. If we do not want this message, then you can put a check on Don’t warn me again for this connector.
In this article, we are going to import a sample PDF file that contains the Monthly sales analysis. We can see the text, images along with data table in this PDF file. We need to import table from the following PDF file. We can prepare Power BI Reports using this data.
Once you click on Connect and provide path of this PDF file from a local directory.
It connects to PDF file and opens a Navigator. It lists out the table from PDF and a list of all pages. If we select the page, it shows the complete page content in the Navigator preview window.
We need to get data from this table, therefore, put a check in front of the table, and it shows the table on the right-hand side page.
You can see a table from our PDF file. In this preview, we need to make a few changes to display the table contents in a proper format. We can make the changes as well in data imported from the PDF file. Click on Edit.
It opens a Power Query Editor. Power Query Editor helps to make the changes in table format, add or remove any column, row, split column etc.
In this Power Query Editor, you can see that the fourth row contains the column names. We do not want to have the top three rows in our table. Therefore, we need to remove these three rows from the top.
We need to open the list of options for Remove Rows. In the options list, click on Remove Top Rows. It opens a pop-up box to specify the number of rows from the top we want to remove. We want to remove the top three rows from the top, therefore, enter the value and click OK.
It removes the number of rows specified from the top (in our case three rows) as shown in below image.
Now the top row in this table contains the column list. Currently, we do not have any column name defined for the table. We want the top row to represent the column names, therefore, click on the Use First Row as Headers as shown in the image above. It uses the top row as a header of the table. Once we clicked on the Use First Rows as Headers, we can see the column name in our data table.
We might do further filtering in our data. We do not want any NULL values to be there in a table for the location column. Click on the location column, remove the checkbox from the NULL value, and click OK.
It removes Null values from the table, and you can see data in the tabular format now. You can match this data with the PDF file to have validation.
We need to click on the Close & Apply from the menu bar to save the changes. It makes all required changes and we can see table fields in the FIELDS section. It identifies the data type accurately, and we can use this data for the reporting purpose easily.
We can choose the required fields from this data set and create a Power BI Visualization.
We can view data as well from the ‘Data’ tab on the left-hand side.
Conclusion
In this article, we explored the quick and easy way to get the data from a PDF data source. It is an exciting enhancement to Power BI Desktop functionality.
Table of contents
- Understanding PostgreSQL SUBSTRING function - September 21, 2024
- How to install PostgreSQL on Ubuntu - July 13, 2023
- How to use the CROSSTAB function in PostgreSQL - February 17, 2023