DEV Community

Cover image for Alphanumeric / Natural sort in MySQL - 30 years and we still can't do this? 😀
GrahamTheDev
GrahamTheDev

Posted on

Alphanumeric / Natural sort in MySQL - 30 years and we still can't do this? 😀

Our QA team are absolute monsters, they're monsters I tell you!

No matter how much I try to bribe them with drinks at our next company gathering to stop finding bugs in my code, they keep coming up with unreasonable requests and "expected behaviours"...ewwwww, gross!

Don't believe me?

Recently they asked for a list to be ordered alphanumerically.

Can you believe that? Monsters I tell you!

I mean, it isn't like an end user would expect a list to be sorted:

test 1
test 2
test 12
Enter fullscreen mode Exit fullscreen mode

...why can't it just be like MySQL thinks it should be sorted:

test 1
test 12
test 2
Enter fullscreen mode Exit fullscreen mode

Makes perfect sense right? Right?

Anyway lets not get too deep into who is right and who is wrong here (they're right FYI), I thought I better try and fix it.

I mean...how hard can it be? (foreshadowing)

The short answer

I know you might be busy, so the short answer is sort it in JS / PHP / whatever. Don't try and do it with MySQL.

Like seriously, just return your data set and then use Intl.Collator if you are using Node, or natsort() in PHP.

And you know what, that would have been fine. It would have solved the problem and I could have moved on with my day. QA would have been happy, list would be sorted, all good!

However, if you are like me then when your CTO reviews your PR and says:

I take that to heart!

I mean, he is so right!

Why should we have to sort things in code, MySQL should be able to do that for us!

What ensued was over 4 hours of fiddling and swearing to try and get MySQL to bend to my will.

Did I manage it? No...but kinda.

What I ended up with works well enough for user input in our specific scenario, and is much better than other solutions I found, but has it's limitations.

But we are getting ahead of ourselves, I want to show you what is it like trying to solve this in MySQL!

The Data for testing

First thing is first, let's start with the test data I used, just so you can see if I missed anything:

The Test Data

Test Data
INSERT INTO test_data (data_value) VALUES
('2 test'),
('12 test'),
('1 test'),
('20 test'),
('10 test'),
('1test'),
('2test'),
('test 1'),
('my data 2020-01-01'),
('my data 2020-02-01'),
('01-02-24 data'),
('12 test'),
('4a test'),
('my 2020-01-01 data 2020-01-01'),
('my 2020-02-01 data 2020-01-01'),
('my 2020-02-01 data 202-01-01'),
('my 2020-02-01 data 20-01-01'),
('my 2020-02-01 data 1-01-01'),
('my 2020-02-01 data 2-01-01'),
('my 2020-02-01 data 12-01-01'),
('my 2020-02-01 data 01-01-01'),
('my 2020-01-01 data 2020-02-01'),
('my 2020-01-01 data 2021-01-01'),
('my 2020-01-01 data 2120-01-01'),
('my 2120-01-01 data 2020-01-01'),
('4b test'),
('my test'),
('my 12 magic test'),
('my magic 12 test'),
('cheese and test 12'),
('42-a-1'),
('40-a-1'),
('40a'),
('FoClSy4727'),
('Pthw068bf'),
('6bfS'),
('HOFAp_Yx7920'),
('25hWTX'),
('dnjLlW1'),
('RHrIt72402eaLr'),
('cIhb42WFNQ'),
('9244uVCpGa'),
('yDKrkCp7960'),
('GeGIrPM-H86'),
('wrOae537LGCT'),
('WffSPaBA318'),
('kQ33596c'),
('3uEKHmHePf'),
('796h-eYWy'),
('833HufIZAS'),
('utjtV03Xns'),
('dlCSh87811'),
('13IUkOxEVl'),
('VHCok55901XYVk'),
('2RnSVwq'),
('AwtwQdn09'),
('gvSV6z'),
('uxWLO039hb'),
('vTg946');
Enter fullscreen mode Exit fullscreen mode

Some nonsense with numbers in various places throughout the strings, seemed reasonable enough to me.

It is probably not a perfect data set, but good enough for what I was after!

So once I had some data I did what every developer does, asked the Google God for an answer:

Googling for natural sort in MySQL

Seems like the logical first step right?

MySQL has been around for 30 years, and alphanumeric sorting is something you would think people need to do often, so it must have some answers out there already right?

Short answer...it was exhausting. Every single article, Stack Overflow post etc. spitting out the same nonsense that doesn't work or is hyper specific to a particular data format.

I wanted something generic, so I could just sort alphanumerically everywhere!

So let me walk you through the pain I endured so you don't have to, and so you can decide if my answer is any better...or if I have just added to the pain!

Let's try some of the generic sort methods people suggested:

Test 1: the "+0" method

SELECT 
    * 
FROM 
    test_data 
ORDER BY 
    `data_value`+0 ASC
Enter fullscreen mode Exit fullscreen mode

The idea being that it tries to force MySQL to do numeric ordering over lexicographical ordering.

This would work fine if we had numbers or words in each row (the numbers would indeed sort correctly and then the words).

However because we have mixed data, it just spat out a load of nonsense:

Test 1 results

id data_value
44 GeGIrPM-H86
25 my 2120-01-01 data 2020-01-01
27 my test
28 my 12 magic test
29 my magic 12 test
34 FoClSy4727

Full Results for test 1
id data_value
44 GeGIrPM-H86
25 my 2120-01-01 data 2020-01-01
27 my test
28 my 12 magic test
29 my magic 12 test
34 FoClSy4727
35 Pthw068bf
37 HOFAp_Yx7920
39 dnjLlW1
40 RHrIt72402eaLr
41 cIhb42WFNQ
43 yDKrkCp7960
30 cheese and test 12
45 wrOae537LGCT
46 WffSPaBA318
47 kQ33596c
51 utjtV03Xns
52 dlCSh87811
54 VHCok55901XYVk
56 AwtwQdn09
57 gvSV6z
58 uxWLO039hb
59 vTg946
17 my 2020-02-01 data 20-01-01
8 test 1
9 my data 2020-01-01
10 my data 2020-02-01
14 my 2020-01-01 data 2020-01-01
15 my 2020-02-01 data 2020-01-01
16 my 2020-02-01 data 202-01-01
18 my 2020-02-01 data 1-01-01
19 my 2020-02-01 data 2-01-01
21 my 2020-02-01 data 01-01-01
23 my 2020-01-01 data 2021-01-01
22 my 2020-01-01 data 2020-02-01
20 my 2020-02-01 data 12-01-01
24 my 2020-01-01 data 2120-01-01
6 1test
11 01-02-24 data
3 1 test
55 2RnSVwq
1 2 test
7 2test
48 3uEKHmHePf
13 4a test
26 4b test
36 6bfS
5 10 test
2 12 test
12 12 test
53 13IUkOxEVl
4 20 test
38 25hWTX
33 40a
32 40-a-1
31 42-a-1
49 796h-eYWy
50 833HufIZAS
42 9244uVCpGa

Ok so that didn't work (and I didn't really expect it to).

Let's try something else:

Test 2: The "length" hack

SELECT 
    alphanumeric, integer
FROM 
    sorting_test
ORDER BY
    LENGTH(alphanumeric), alphanumeric
Enter fullscreen mode Exit fullscreen mode

This is another interesting approach.

obviously "12" is longer than "1" so it would work there.

It also works if you have a standard format for data, like the "test 1, test 12, test 2" I mentioned at the beginning.

However, for our data it returns more nonsense:

Test 2 results

id data_value
33 40a
36 6bfS
6 1test
7 2test
3 1 test
1 2 test
38 25hWTX
32 40-a-1
31 42-a-1
57 gvSV6z
8 test 1
59 vTg946
5 10 test

Full Results for test 2
id data_value
33 40a
36 6bfS
6 1test
7 2test
3 1 test
1 2 test
38 25hWTX
32 40-a-1
31 42-a-1
57 gvSV6z
8 test 1
59 vTg946
5 10 test
2 12 test
12 12 test
4 20 test
55 2RnSVwq
13 4a test
26 4b test
39 dnjLlW1
27 my test
47 kQ33596c
49 796h-eYWy
56 AwtwQdn09
35 Pthw068bf
53 13IUkOxEVl
48 3uEKHmHePf
50 833HufIZAS
42 9244uVCpGa
41 cIhb42WFNQ
52 dlCSh87811
34 FoClSy4727
51 utjtV03Xns
58 uxWLO039hb
44 GeGIrPM-H86
46 WffSPaBA318
43 yDKrkCp7960
37 HOFAp_Yx7920
45 wrOae537LGCT
11 01-02-24 data
40 RHrIt72402eaLr
54 VHCok55901XYVk
28 my 12 magic test
29 my magic 12 test
30 cheese and test 12
9 my data 2020-01-01
10 my data 2020-02-01
18 my 2020-02-01 data 1-01-01
19 my 2020-02-01 data 2-01-01
21 my 2020-02-01 data 01-01-01
20 my 2020-02-01 data 12-01-01
17 my 2020-02-01 data 20-01-01
16 my 2020-02-01 data 202-01-01
14 my 2020-01-01 data 2020-01-01
22 my 2020-01-01 data 2020-02-01
23 my 2020-01-01 data 2021-01-01
24 my 2020-01-01 data 2120-01-01
15 my 2020-02-01 data 2020-01-01
25 my 2120-01-01 data 2020-01-01

And that also makes sense, all it is doing is ordering on length first so:

"a1" would come before "1potato" by length.

Test 3: The Hardouken!

Just take a look at this beauty:

SELECT 
    * 
FROM 
    test_data 
ORDER BY 
  REPLACE(
    REPLACE(
      REPLACE(
        REPLACE(
          REPLACE(
            REPLACE(
              REPLACE(
                REPLACE(
                  REPLACE(
                    REPLACE(
                      data_value, 
                      '0', 'Δ€'), 
                    '1', 'ā'), 
                  '2', 'Δ‚'), 
                '3', 'Δƒ'), 
              '4', 'Δ„'), 
            '5', 'Δ…'), 
          '6', 'Δ†'), 
        '7', 'Δ‡'), 
      '8', 'Ĉ'), 
    '9', 'Δ‰') 
  COLLATE utf8_bin;

Enter fullscreen mode Exit fullscreen mode

The meme is real!

lots of nested if statements that indent further and further to form a triangle shape. Ryu from Stree fighter is firing a Hardouken in the middle

But despite that, this one pretty much works (albeit slightly differently to how I would have done it).

This one is smart in that it replaces each number with a character that will be sorted based on their binary representation in Unicode.

Test 3 results

id data_value
56 AwtwQdn09
34 FoClSy4727
44 GeGIrPM-H86
37 HOFAp_Yx7920
35 Pthw068bf
40 RHrIt72402eaLr
54 VHCok55901XYVk
46 WffSPaBA318
41 cIhb42WFNQ
30 cheese and test 12

Full Results for test 3
id data_value
56 AwtwQdn09
34 FoClSy4727
44 GeGIrPM-H86
37 HOFAp_Yx7920
35 Pthw068bf
40 RHrIt72402eaLr
54 VHCok55901XYVk
46 WffSPaBA318
41 cIhb42WFNQ
30 cheese and test 12
52 dlCSh87811
39 dnjLlW1
57 gvSV6z
47 kQ33596c
9 my data 2020-01-01
10 my data 2020-02-01
29 my magic 12 test
27 my test
28 my 12 magic test
14 my 2020-01-01 data 2020-01-01
22 my 2020-01-01 data 2020-02-01
23 my 2020-01-01 data 2021-01-01
24 my 2020-01-01 data 2120-01-01
21 my 2020-02-01 data 01-01-01
18 my 2020-02-01 data 1-01-01
20 my 2020-02-01 data 12-01-01
19 my 2020-02-01 data 2-01-01
17 my 2020-02-01 data 20-01-01
16 my 2020-02-01 data 202-01-01
15 my 2020-02-01 data 2020-01-01
25 my 2120-01-01 data 2020-01-01
8 test 1
51 utjtV03Xns
58 uxWLO039hb
59 vTg946
45 wrOae537LGCT
43 yDKrkCp7960
11 01-02-24 data
3 1 test
6 1test
5 10 test
2 12 test
12 12 test
53 13IUkOxEVl
1 2 test
55 2RnSVwq
7 2test
4 20 test
38 25hWTX
48 3uEKHmHePf
13 4a test
26 4b test
32 40-a-1
33 40a
31 42-a-1
36 6bfS
49 796h-eYWy
50 833HufIZAS
42 9244uVCpGa

You know what, I would use this if it wasn't for one thing. It would mean that we could no longer accept 'Δ€' etc as input.

Unfortunately our software will be used internationally, so we can't make that call.

Additionally if you do decide to use this, it is worth noting your database column must use COLLATE 'utf8_bin' on it as well or you may get errors.

Chat GPT Time

When googling fails, chat GPT can always help right?

I am not going to show all the results, but let's just say the following are just a few of the "did not work" ones it produced!

Test 4: GPT fever dream

SELECT
    data_value,
    -- Replace numbers with 'a' and letters with 'b' dynamically
    (
        SELECT GROUP_CONCAT(
            CASE
                WHEN c REGEXP '[0-9]' THEN 'a' -- Replace digits with 'a'
                WHEN c REGEXP '[a-zA-Z]' THEN 'b' -- Replace letters with 'b'
                ELSE c -- Leave other characters as-is
            END
            ORDER BY seq
        ) SEPARATOR ''
        FROM (
            SELECT SUBSTRING(data_value, seq, 1) AS c, seq
            FROM (
                SELECT data_value, seq
                FROM test_data
                JOIN (
                    SELECT n AS seq
                    FROM (SELECT 1 n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10) numbers
                    WHERE n <= 255 -- Handle up to 255 characters in a string
                ) seq_table
                ON seq <= CHAR_LENGTH(data_value)
            ) char_table
        ) char_map
    ) AS sort_key
FROM test_data
ORDER BY sort_key;
Enter fullscreen mode Exit fullscreen mode

It didn't run, and I couldn't quite work out why, but isn't it impressive looking!

What it seemed to be trying to do was replace all numbers with "a" and all letters with "b" to form strings like "aabaaa, aabba, abbaa" for comparison.

I would have probably spent more time trying to fix this if it wasn't flawed.

"11ab1" would produce "aabba"
"111ab1" would produce "aaaba".

So with the way MySQL sorts, 111ab1 would have been above 11ab1 anyway as "aaa" comes before "aab".

Nice idea, but sadly not quite it.

Test 5: GPT nearly nailed it!

This one surprised me.

I had seen it elsewhere already (as we know GPT is great at just spitting things out that it has already seen and isn't actually creative), but the explanation it gave made more sense!

SELECT 
  * 
FROM 
  test_data
ORDER BY 
  CAST(data_value AS UNSIGNED), data_value ASC;
Enter fullscreen mode Exit fullscreen mode

Ahhhh nearly. I really thought this was the answer.

But sadly it fails on "Test 1, Test 12, Test 2". It is good for sorting if the number comes first though!

My Turn

I still couldn't get what I wanted after several attempts. But I was now much closer to an answer after seeing a few techniques.

After a bit of noodle scratching and crying I thought I had solved it!

It works...kinda!

Here is what I came up with:

CREATE TABLE test_data (
    id INT AUTO_INCREMENT PRIMARY KEY, 
    data_value VARCHAR(255) NOT NULL, 
    transformed_column VARCHAR(255) AS (
        CASE
            WHEN REGEXP_SUBSTR(data_value, '[0-9]+') IS NOT NULL THEN
                REGEXP_REPLACE(
                    data_value,
                    '[0-9]+',
                    LPAD(REGEXP_SUBSTR(data_value, '[0-9]+'), 8, '0')
                )
            ELSE data_value -- Or another default value, e.g., 'No Numbers'
        END
    ) STORED
);
Enter fullscreen mode Exit fullscreen mode

Then we can query it like so:

SELECT 
  *
FROM 
  test_data
ORDER BY 
  transformed_column, data_value;
Enter fullscreen mode Exit fullscreen mode

And this gives us nearly the results we would expect.

Test 6 results

data_value transformed_column
1 test 00000001 test
01-02-24 data 00000001-00000001-00000001 data
1test 00000001test
2 test 00000002 test
2RnSVwq 00000002RnSVwq
2test 00000002test
3uEKHmHePf 00000003uEKHmHePf
4a test 00000004a test

Full Results for test 6
data_value transformed_column
1 test 00000001 test
01-02-24 data 00000001-00000001-00000001 data
1test 00000001test
2 test 00000002 test
2RnSVwq 00000002RnSVwq
2test 00000002test
3uEKHmHePf 00000003uEKHmHePf
4a test 00000004a test
4b test 00000004b test
6bfS 00000006bfS
10 test 00000010 test
12 test 00000012 test
12 test 00000012 test
13IUkOxEVl 00000013IUkOxEVl
20 test 00000020 test
25hWTX 00000025hWTX
40-a-1 00000040-a-00000040
40a 00000040a
42-a-1 00000042-a-00000042
796h-eYWy 00000796h-eYWy
833HufIZAS 00000833HufIZAS
9244uVCpGa 00009244uVCpGa
AB-34Y67846 AB-00000034Y00000034
AwtwQdn09 AwtwQdn00000009
cheese and test 12 cheese and test 00000012
cIhb42WFNQ cIhb00000042WFNQ
dlCSh87811 dlCSh00087811
dnjLlW1 dnjLlW00000001
E5-RMT893Y9 E00000005-RMT00000005Y00000005
EV-489RY3DA EV-00000489RY00000489DA
FoClSy4727 FoClSy00004727
GeGIrPM-H86 GeGIrPM-H00000086
gvSV6z gvSV00000006z
HOFAp_Yx7920 HOFAp_Yx00007920
kQ33596c kQ00033596c
my 12 magic test my 00000012 magic test
my 2020-01-01 data 2020-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-01-01 data 2020-02-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-01-01 data 2021-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-01-01 data 2120-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-02-01 data 01-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-02-01 data 1-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-02-01 data 12-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-02-01 data 2-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-02-01 data 20-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-02-01 data 202-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-02-01 data 2020-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2120-01-01 data 2020-01-01 my 00002120-00002120-00002120 data 00002120-00002120-00002120
my data 2020-01-01 my data 00002020-00002020-00002020
my data 2020-02-01 my data 00002020-00002020-00002020
my magic 12 test my magic 00000012 test
my test my test
Pthw068bf Pthw00000068bf
RHrIt72402eaLr RHrIt00072402eaLr
test 1 test 00000001
utjtV03Xns utjtV00000003Xns
uxWLO039hb uxWLO00000039hb
VHCok55901XYVk VHCok00055901XYVk
vTg946 vTg00000946
WffSPaBA318 WffSPaBA00000318
wrOae537LGCT wrOae00000537LGCT
yDKrkCp7960 yDKrkCp00007960

It breaks on the rows:

data_value transformed_column
my 2020-02-01 data 1-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-02-01 data 12-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-02-01 data 2-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020

As you can see, but overall I think it will suit our needs for now!

This was based on a LPAD example I saw. But it had the same flaw that it only worked on Strings with numbers at the beginning.

So I took that principle and applied it to all number series.

Also this now uses a generated column so that it is fast on reads (although we could soon make it part of the query if it didn't hurt read performance too much or we were more bothered about write performance).

The big issue is the fact that this works through "brute force" by working around the fact that MySQL sees "11" as smaller than "2" due to checking one character at a time (the dumbed down version of what happens) by making all numbers 8 digits long.

It also replaces all numbers in the transformed_column with the first number it finds and as far as I can tell, other than nesting about 20 statements and using coalesce or something, there is no way to avoid this behaviour I could find.

Conclusion

I am not really sure? MySQL should join the 21st century and provide an alphanumeric sort option like other DBs?

While we wish for things to happen I think the short answer in the scope of this article is that I still haven't solved this.

But with that being said, I feel like test 6 is much closer to the answer than anything I found out in the wild and may be useful to someone someday?

Obviously if you are expecting numbers longer than 8 digits, you may want to adjust the LPAD value as that is one limitation.

Also it isn't perfect so you can't rely on it for scenarios other than sorting user input so they can find things easily.

I think at this point my best bet is to throw it out into the world and see if some smarter people than me can solve this thing properly! (I am probably needing some much deeper knowledge of sorting in MySQL to really solve it!)

If someone can solve the problem I have where I can't change every number to 8 digits long, then this would work even better! (so "my 2020-02-01 data 2-01-01" would become "my 00002020-00000002-00000001 data 00000002-00000001-00000001").

Or maybe there is a literal one liner that I missed in all my research that solves this? (I will only cry a little bit if there is, I promise!).

Anyway, that is the best I have, but would love to see something better!

Finally here is a DB Fiddle to play with

Here is a DB fiddle of the generated column version to start from if you think you can spot a quick win / improvement:

https://www.db-fiddle.com/f/o2ohcGVAgHZQg4teg1s9jW/1034

Thanks for reading, catch you all soon!

Top comments (1)

Collapse
 
grahamthedev profile image
GrahamTheDev

Guarantee someone shows me how to do this in one line and I end up deleting the post or putting "here is the answer, ignore the rest of the article" at the top! haha