Last year, I wrote a post about alphabetizing strings in Python. It explained why sorting is case-sensitive by default and then shared a method for forcing case-insensitive ordering anyway. It also briefly mentioned how to handle non-English characters but didn't dive into the specifics.

In general, the information from that post transfers comfortably to other programming languages as well. But SQL, the rebel, plays by its own set of rules. That's because SQL is specifically designed to communicate with relational databases, where the order in which data is stored under-the-hood plays a key component in sorting mechanics.

Sorting at the storage layer

In SQL, a character set is the set of symbols that can be used to express strings of data, while collation is the set of rules used to compare and sort those character strings.

For example, older versions of MySQL use the latin1 character set by default. This set includes Latin letters like A, B, and C but excludes letters like the Œ from French or the Ő in Hungarian. The default collation is latin1_swedish_ci, a case-insensitive sorting that properly handles the ordering of rarer letters found in some European languages. If desired, a database administrator could easily set the collation to latin1_general_ci or utf8_general_ci instead.

Fun fact: Why would the default collation be named latin1_swedish_ci instead of latin1_general_ci? Some argue that it's because latin1_swedish_ci better handles certain European-specific letters, but others believe that it's because the original developer named it after his company's primary base in Sweden. That latter reason may seem unfounded, but the same developer also named MySQL, MariaDB, and MySQL-Max after his three children: My, Maria, and Max. So who knows?

These days, the preferred default is the utf8mb4 character set paired with the accent-insensitive, case-insensitive utf8mb4_0900_ai_ci collation. This collation treats strings like RESUME or résumé as resume when comparing or sorting. As for the difference between Unicode and Latin character sets, that deserves a post of its own. Read through Joel Spolsky's fantastic post on character sets to learn all about it.

Sorting at the query layer

So when do the character sets and collations matter?

It turns out that, whenever a SQL query contains an ORDER BY clause, it's using the database server's collation to sort the data. A typical query looks like this:

  SELECT first_name,
         last_name
    FROM employees
ORDER BY last_name

When the underlying character set and collation aren't set properly, the ORDER BY clause won't work as expected. But even if set up properly, what happens if the server's collation is case-insensitive when you need a case-sensitive sort? This is when the COLLATE keyword comes into play:

  SELECT first_name,
         last_name
    FROM employees
ORDER BY last_name COLLATE utf8mb4_0900_as_cs

The above query uses an accent-sensitive, case-sensitive collation to sort the result set! It doesn't change how the data is stored underneath.

And just for demonstration purposes, here are queries that highlight the accent-sensitivity and case-sensitivity in action:

mysql> SELECT 'a' < 'A';
+-----------+
| 'a' < 'A' |
+-----------+
| 0         |
+-----------+
1 row in set (0.06 sec)

mysql> SELECT 'a' < 'A' COLLATE utf8mb4_0900_as_cs;
+--------------------------------------+
| 'a' < 'A' COLLATE utf8mb4_0900_as_cs |
+--------------------------------------+
| 1                                    |
+--------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT STRCMP('aa', 'å');
+-------------------+
| STRCMP('aa', 'å') |
+-------------------+
| 1                 |
+-------------------+
1 row in set (0.00 sec)

mysql> SELECT STRCMP('aa', 'å' COLLATE utf8mb4_da_0900_ai_ci);
+--------------------------------------------------+
| STRCMP('aa', 'å' COLLATE utf8mb4_da_0900_ai_ci) |
+--------------------------------------------------+
| 0                                                |
+--------------------------------------------------+
1 row in set (0.00 sec)

What's so tricky?

None of this content is difficult to grasp, but that's because collations already handle the many intricacies for you. Unicode collations know to sort "Œ" by expanding it to "OE" while Danish collations know to reverse the sorting order for letters with accents. And in some collations, "CH" is a digraph that comes after "C" – "chalk" would come alphabetically after "czar" on those servers!

The full document detailing the sorting algorithm is quite large. I can't imagine how many people it took to properly codify alphabetizing strings across the various languages and how many more people it took to implement it. I wonder how many exceptions have been missed and how many iterations of the algorithm will come in the future. But mostly, I marvel. Alphabetization is hard.