SQL: What is collation? Why is collation important?

Key Points

SQL collation specifies how to sort and compare character data based on language or region norms.
Choosing the right collation helps prevent language-specific sorting bugs.
Collation can be set at the instance, database, column, or expression level.
Windows collation is recommended for new development.
The COLLATE clause applies a specific collation within a query.

SQL: What is collation?

Collation is a rule that specifies how to sort and compare strings of character data. It is based on the norms of languages or regions.

For instance, in Swedish, the letters å, ä, and ö are treated as separate letters and are sorted after "z" in the alphabet, whereas German dictionary order sorts "ä" together with "a". This means that "Ärlig" would appear after "Zebra" in a Swedish collation, but before it in a general Latin (English) collation.
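To see language-specific ordering in practice, the following T-SQL sketch sorts the same three words under a Swedish collation and under a general Latin collation. The sample words and the alias v(Word) are made up for illustration; the two collation names are standard SQL Server Windows collations.

```sql
-- Sort the same words under two different collations.
-- The word list and the alias v(Word) are illustrative only.
SELECT v.Word
FROM (VALUES (N'Apple'), (N'Zebra'), (N'Ärlig')) AS v(Word)
ORDER BY v.Word COLLATE Finnish_Swedish_CI_AS;
-- Swedish rules place "Ä" after "Z": Apple, Zebra, Ärlig

SELECT v.Word
FROM (VALUES (N'Apple'), (N'Zebra'), (N'Ärlig')) AS v(Word)
ORDER BY v.Word COLLATE Latin1_General_CI_AS;
-- General Latin rules sort "Ä" with "A": Apple, Ärlig, Zebra
```

Only the ORDER BY collation differs between the two queries, so any change in row order comes from the collation alone.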

How does SQL Server set collation?

In SQL Server, you can set collations for:

instance,
database,
column, and
expression.

Setting SQL collation during SQL Server installation

During installation, you specify the default collation for the instance (setup suggests one based on the operating system locale). When you create a database, you can specify its default collation; otherwise it uses the instance's default collation. When you define a character column, you can specify its collation as well, while character variables and parameters take the default collation of the database.

Columns without a specified collation use the database's default collation.
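The levels described above can be sketched in T-SQL as follows. This is a minimal example, not a recommended setup: the database name CollationDemo and the table dbo.Customer are hypothetical, creating a database requires the appropriate permissions, and GO is the batch separator used by tools such as SSMS and sqlcmd rather than a T-SQL statement.

```sql
-- Inspect the instance-level and current database-level defaults.
SELECT SERVERPROPERTY('Collation') AS InstanceCollation;
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS DatabaseCollation;
GO

-- Database level: override the instance default (hypothetical database name).
CREATE DATABASE CollationDemo COLLATE Latin1_General_100_CI_AS;
GO

USE CollationDemo;
GO

-- Column level: Name gets its own collation, City inherits the database default.
CREATE TABLE dbo.Customer (
    CustomerId int NOT NULL PRIMARY KEY,
    Name nvarchar(100) COLLATE Latin1_General_100_CS_AS NOT NULL,
    City nvarchar(100) NOT NULL
);
GO

-- Confirm which collation each character column ended up with.
SELECT name, collation_name
FROM sys.columns
WHERE object_id = OBJECT_ID(N'dbo.Customer');
```

The last query should show the explicit collation for Name and the database default for City; the non-character column CustomerId reports NULL.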

Why is collation important?

SQL collation is important because it defines the linguistic rules for comparing strings and sorting text values in one place. Queries do not need the explicit case-insensitive comparison or custom sorting logic that is often required in programming languages.

SQL Server has collations for most world languages and lets you turn case, accent, and Kana sensitivity on or off. Picking a suitable collation up front avoids time spent fixing language-specific sorting bugs later, which makes it especially valuable for applications that serve a global market.
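As a rough illustration of these sensitivity options, the T-SQL sketch below compares the same pairs of strings under collations that differ only in one flag. The sample strings are arbitrary; the collation names are standard SQL Server Windows collations.

```sql
-- Each CASE compares two literals under an explicitly chosen collation.
SELECT
    -- Case: CI treats "resume" and "Resume" as equal, CS does not.
    CASE WHEN N'resume' = N'Resume' COLLATE Latin1_General_100_CI_AS
         THEN 'equal' ELSE 'different' END AS CaseInsensitive,   -- equal
    CASE WHEN N'resume' = N'Resume' COLLATE Latin1_General_100_CS_AS
         THEN 'equal' ELSE 'different' END AS CaseSensitive,     -- different
    -- Accent: AI ignores the accents in "résumé", AS does not.
    CASE WHEN N'resume' = N'résumé' COLLATE Latin1_General_100_CI_AI
         THEN 'equal' ELSE 'different' END AS AccentInsensitive, -- equal
    CASE WHEN N'resume' = N'résumé' COLLATE Latin1_General_100_CI_AS
         THEN 'equal' ELSE 'different' END AS AccentSensitive,   -- different
    -- Kana: without KS, Hiragana and Katakana forms compare as equal.
    CASE WHEN N'あ' = N'ア' COLLATE Japanese_CI_AS
         THEN 'equal' ELSE 'different' END AS KanaInsensitive,   -- equal
    CASE WHEN N'あ' = N'ア' COLLATE Japanese_CI_AS_KS
         THEN 'equal' ELSE 'different' END AS KanaSensitive;     -- different
```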

What's the SQL Collation format?

The SQL Collation format consists of several components including sorting rules, uppercase preference, code page, case sensitivity, and accent sensitivity.

The format follows the pattern SQL_SortRules[_Pref]_CPCodepage_<ComparisonStyle>

Element	Description
SQL_collation_name	A single string that specifies the collation name for a SQL Server collation.
SQL_SortRules	A string identifying the alphabet or language whose sorting rules are applied when dictionary sorting is specified.
_Pref	Specifies uppercase preference. Even if comparison is case-insensitive, the uppercase version of a letter sorts before the lowercase version, when there is no other distinction.
_CPCodepage	Specifies a one- to four-digit number that identifies the code page used by the collation.
_ComparisonStyle	Specifies whether the collation is case-sensitive, accent-sensitive, or binary.

Example

Windows collation names are built from similar parts. For example, the Windows collation name "Latin1_General_100_CI_AS_SC" is composed of several parts that specify the rules used for comparing and sorting character data in SQL Server.

"Latin1_General" identifies the sorting rules. In this case, it refers to the general Latin rules used for Western European languages such as English, French, and Spanish; for non-Unicode (char and varchar) data, these collations use code page 1252.
"100" refers to the version of the sorting rules used by this collation. In this case, it refers to version 100 of the Windows collation rules, which were introduced with SQL Server 2008.
"CI" specifies that the collation is case-insensitive, meaning that upper-case and lower-case letters are treated as equivalent for sorting and comparison purposes.
"AS" specifies that the collation is accent-sensitive, meaning that accented characters are treated as distinct from their unaccented counterparts for sorting and comparison purposes.
"SC" specifies that the collation supports supplementary characters, that is, Unicode characters outside the Basic Multilingual Plane, which UTF-16 represents as surrogate pairs.
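The parts of a collation name can also be inspected from within SQL Server. The sketch below uses the built-in COLLATIONPROPERTY function and the sys.fn_helpcollations() catalog function; the two collation names are chosen only as examples, and the exact values returned depend on the collation.

```sql
-- Break a collation down into its documented properties.
SELECT
    COLLATIONPROPERTY('Latin1_General_100_CI_AS_SC', 'CodePage')        AS CodePage,
    COLLATIONPROPERTY('Latin1_General_100_CI_AS_SC', 'LCID')            AS LCID,
    COLLATIONPROPERTY('Latin1_General_100_CI_AS_SC', 'ComparisonStyle') AS ComparisonStyle,
    COLLATIONPROPERTY('Latin1_General_100_CI_AS_SC', 'Version')         AS CollationVersion;

-- Read SQL Server's own description of a Windows and a SQL collation.
SELECT name, description
FROM sys.fn_helpcollations()
WHERE name IN (N'Latin1_General_100_CI_AS_SC', N'SQL_Latin1_General_CP1_CI_AS');
```

The description column spells out the sensitivity flags in words, which is a convenient way to double-check how a name should be read.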

SQL Collation vs Windows Collation

Windows collations use sorting rules based on an associated Windows system locale, and they apply the same comparison rules to Unicode and non-Unicode data. SQL collations have their own sorting rules, which can differ between Unicode and non-Unicode data. Windows collations cover a wider range of languages and cultures, while SQL collations are limited and mainly kept for backward compatibility. It's recommended to use Windows collations for new development.
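One way to get a feel for the relative size of the two families is to count the collations an instance supports. The query below is a simple sketch that classifies collations by name prefix only: everything starting with "SQL_" is counted as a SQL collation and everything else, including binary collations, as a Windows collation. The counts vary by SQL Server version.

```sql
-- Count supported collations per family, classified by name prefix.
-- [_] escapes the underscore, which is otherwise a LIKE wildcard.
SELECT
    CASE WHEN name LIKE N'SQL[_]%' THEN 'SQL collation'
         ELSE 'Windows collation' END AS CollationFamily,
    COUNT(*) AS CollationCount
FROM sys.fn_helpcollations()
GROUP BY
    CASE WHEN name LIKE N'SQL[_]%' THEN 'SQL collation'
         ELSE 'Windows collation' END;
```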

What is the SQL COLLATE clause?

The SQL COLLATE clause applies a specific collation to an expression in a query. It can also be used in a column or database definition.

For example, suppose you have a case-sensitive database and you want to perform a case-insensitive comparison on a column called "Name". To do this, you can use the COLLATE clause in your query. Here's an example:

sql

SELECT * FROM table_name WHERE Name COLLATE SQL_Latin1_General_CP1_CI_AS = 'john';

In this query, the COLLATE clause is used to specify the SQL_Latin1_General_CP1_CI_AS collation, which is case-insensitive.

This allows the query to match both "John" and "john" in the "Name" column, even though the database is case-sensitive.
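The behaviour can be reproduced without a case-sensitive database by giving a column a case-sensitive collation. The following is a small self-contained sketch; the temporary table #People and its rows are invented for the demonstration.

```sql
-- Temporary table whose Name column is explicitly case-sensitive.
CREATE TABLE #People (
    Name varchar(50) COLLATE SQL_Latin1_General_CP1_CS_AS NOT NULL
);

INSERT INTO #People (Name)
VALUES ('John'), ('john'), ('JOHN'), ('Jane');

-- Uses the column's case-sensitive collation: matches only 'john'.
SELECT Name
FROM #People
WHERE Name = 'john';

-- COLLATE overrides the collation for this comparison only:
-- matches 'John', 'john', and 'JOHN'.
SELECT Name
FROM #People
WHERE Name COLLATE SQL_Latin1_General_CP1_CI_AS = 'john';

DROP TABLE #People;
```

Note that applying COLLATE to a column inside a WHERE clause may keep SQL Server from seeking on an index over that column, so for frequent case-insensitive lookups it can be worth considering a different column collation instead.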

What is width-sensitive collation?

Width-sensitive (_WS) collation distinguishes between the half-width and full-width forms of the same character, which are traditionally stored as single-byte and double-byte characters.

When using width-sensitive (_WS) collation, a character that can be represented both as a half-width (single-byte) and a full-width (double-byte) character is compared and sorted as two different characters.

With width-insensitive collation, SQL Server considers the half-width and full-width representations of the same character to be identical.

For example, width-insensitive collation treats E=mc² and E=mc2 as equal, while width-sensitive collation treats them as different.
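A quick way to observe width sensitivity is to compare ordinary (half-width) Latin letters with their full-width counterparts. In the sketch below, the second literal uses the full-width characters U+FF21 to U+FF23; the strings are arbitrary examples.

```sql
-- Compare half-width "ABC" with full-width "ＡＢＣ" under both settings.
SELECT
    CASE WHEN N'ABC' = N'ＡＢＣ' COLLATE Latin1_General_100_CI_AS
         THEN 'equal' ELSE 'different' END AS WidthInsensitive,  -- equal
    CASE WHEN N'ABC' = N'ＡＢＣ' COLLATE Latin1_General_100_CI_AS_WS
         THEN 'equal' ELSE 'different' END AS WidthSensitive;    -- different
```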

Published on: Mar 14, 2023