This article gives an overview of DIFFERENCE and SOUNDEX SQL Server built-in system functions. It explains how to use those functions and how do they work.
Introduction
Soundex is a phonetic algorithm developed by Robert C. Russell and Margaret King Odell in the early 1900s. This algorithm is used to index names as they are pronounced in English. The main goal of such an algorithm is to encode homophones to the same representation to be matched even if there are some slight spelling differences. As an example, consider the names “Smith” and “Smyth”, or “Mohamad” and “Mouhammad”. Soundex mainly encodes consonants and only encodes a vowel if it is the first letter of the name.
Being one of the most popular phonetic algorithms, Soundex was implemented in multiple database engines such as Oracle, SQL Server, MySQL, SQLite, and PostgreSQL.
In the following sections, we will illustrate the Soundex SQL Server functions.
SOUNDEX SQL Server built-in function
In SQL Server, SOUNDEX() is a scalar function that takes a string value as input and returns a four-character string. As mentioned before, the returned string by the Soundex SQL function is generated based on the way the input string is spoken in English.
- The first character of the code is the first character of the input string, converted to upper case.
- The remaining characters of the code are numbers that represent the letters in the expression.
- Vowels and the “H” and “W” characters are ignored except if they are the first letter of the input string.
- If the input string length is less than four, the Soundex function adds additional zeros to the returned value.
Soundex SQL function is mainly used as a fuzzy matching technique for data integration purposes. It is a collation-sensitive function.
As mentioned in the official documentation, before SQL Server 2012, the Soundex SQL function only applied a subset of the Soundex algorithm rules. In SQL Server 2012, more rules were implemented. Besides, the following restrictions will be applied once using a database compatibility level greater than or equal to 110 (SQL Server 2012):
- A heap (table without a clustered index) that contains a persisted computed column defined with SOUNDEX cannot be queried until the heap is rebuilt by running the statement ALTER TABLE <table> REBUILD.
- CHECK constraints defined with SOUNDEX are disabled upon upgrade. To enable the constraint, we should run the statement ALTER TABLE <table> WITH CHECK CHECK CONSTRAINT ALL.
- Indexes (including indexed views) that contain a persisted computed column defined with SOUNDEX cannot be queried until the index is rebuilt by running the statement ALTER INDEX ALL ON <object> REBUILD.
Examples
To illustrate how the Soundex SQL functions work, we will run some examples. First, let us run the following query:
1 2 |
SELECT SOUNDEXSELECT SOUNDEX('H') as 'H', SOUNDEX('Ha') as ' Ha', SOUNDEX('Hd') as 'Hd', SOUNDEX('Had') as 'Had', SOUNDEX('Hadi') as 'Hadi' |
The image below shows that the Soundex function returns the same value (H000) for both “H” and “Ha” input strings. As we can see, the “a” is ignored, and three additional zeros were added to the result.
Besides, “Hd”, “Had”, and “Hadi” produce the same output (H300) since the “a” and “I” vowels as ignored, and two additional zeros are added.
Figure 1 – Soundex function example
Let us try to pass multiple words to the Soundex function:
1 2 3 |
SELECT SOUNDEX('Hadi') as 'Hadi', SOUNDEX('HadiFadlallah') as 'HadiFadlallah', SOUNDEX('Hadi Fadlallah') as 'Hadi Fadlallah' |
The image below shows that “Hadi” and “Hadi Fadlallah” produce the same output value (H300) and characters after the space is ignored. While removing the space from the name will change the output value to “H313”. Besides, the output value of “HadiFadlallah” is the same as the output value of the first four consonants (“Hdf”).
Figure 2 – Another example of the Soundex function
DIFFERENCE
DIFFERENCE is a built-in scalar function used to measure the similarity of two strings using the Soundex SQL function. First, SOUNDEX() is applied to each input, and then a similarity check is done over these results. This function returns an integer value between 0 and 4. When this value is closer to 4, then inputs are very similar.
Examples
To illustrate how the Difference() SQL functions work, we will run some examples. First, let us run the following query:
1 2 3 4 |
SELECT DIFFERENCE('H','Ha') as 'H_Ha', DIFFERENCE('Ha','Had') as 'Ha_Had', DIFFERENCE('Hd','Had') as 'Hd_Had', DIFFERENCE('Had','Hadi') as 'Had_Hadi' |
The image below shows that values having the same Soundex output such as “H” and “Ha” or “Hd”, “Had”, and “Hadi” has a similarity score of 4. While “Ha” and “Had” produce a lower similarity score.
Figure 3 – Difference SQL function example
Besides, the following example shows that adding spaces will make the Difference() function ignore all characters to the right.
1 2 |
SELECT DIFFERENCE('Hadi','Hadi Fadlallah') as 'Hadi_Hadi Fadlallah', DIFFERENCE('Hadi','HadiFadlallah') as 'Hadi_HadiFadlallah' |
As we can note, “Hadi” and “Hadi Fadlallah” have a similarity score of 4, while removing the space from “Hadi Fadlallah” will decrease the score to 2.
Figure 4 – Second example of the Difference SQL function
Now, let us try to compare two of the “MSSQL” word with a similar word like “MYSQL” and another word like “POSTGRES”:
1 2 |
SELECT DIFFERENCE('MYSQL', 'MSSQL') as 'MYSQL_MSSQL', DIFFERENCE('MSSQL', 'POSTGRES') as 'MSSQL_POSTGRES' |
The image below shows how the similarity score is high for both “MYSQL” and “MSSQL” since only one character differs between both words, while “MSSQL” and “POSTGRES” have a very low score since they are very different.
Figure 5 – Third example of the Difference SQL function
Summary
This article briefly explained the Soundex algorithm and how it is implemented in SQL Server. Besides, it illustrated another built-in function named Difference which uses the Soundex SQL function to calculate the similarity between two different input strings.
- An overview of SQL Server monitoring tools - December 12, 2023
- Different methods for monitoring MongoDB databases - June 14, 2023
- Learn SQL: Insert multiple rows commands - March 6, 2023