DEV Community

Jeremy Friesen for The DEV Team

Posted on • Originally published at takeonrules.com on

Adding Reproducible Randomization to SQL Queries

Or How I Spent an Afternoon Building Out More Creases for Extensibility in Forem’s Feed Query

In DEV’s relevance feed, we utilize the Articles::Feeds::VariantQuery to perform the correct query for the current DEV feed experiment variants. For those curious, you can look to at the ./config/field_test.yml at github.com/forem/forem.

For each variant, we choose a order by lever. we had two order by levers:

  • relevancy_score_and_publication_date
  • final_order_by_random_weighted_to_score

The one I want to write about is the final_order_by_random_weighted_to_score. It had the following SQL fragment: RANDOM() ^ (1.0 / greatest(articles.score, 0.1)) DESC. Let’s ignore everything except the RANDOM().
I wrote the following Github issue to enumerate the problem: As a person who wants to explore different sort orders I want a bit of control over fully random numbers.
The consequence of having this RANDOM() is time you run the query, you get a new random number. The impact is that depending on the use of RANDOM() when you refresh the page you could see significantly different results.

Yet there is utility in randomization. What to do?

A BASIC Sidebar

In junior high, my dad got a new computer for the family. I started writing some BASIC programs to simulate a simple siege game I had created. The castle defenders attempted to ward off the assailants. It involved lots of randomization. Every time I ran the simulation, I got the exact same result.

What I didn’t realize was that I needed to first seed the randomizer with a random seed. Without that randomized seed, my version of BASIC used the same seed. This meant that when I called the RANDOM function in BASIC, the sequence was always the same.

My dice were always going to return the sequence of results. But this was well before Google and household internet. So I remained stumped until a year later when I stumbled upon a BASIC manual and had an aha moment.

Back to the Query at Hand

I had two task at hand:

  • Create a result set that had a repeatable sequence of random numbers.
  • “Join” that result set into the existing Articles::Feeds::VariantQuery implementation.

When working with SQL, I always strive to build from the inside out.

Create a Result Set That Had a Repeatable Sequence of Random Numbers

To explore this, I set about writing the simplest query I could make. After some fits and starts, I settled on the following query.

WITH seeder AS (SELECT setseed(0.5))
SELECT *, RANDOM() AS randomized_value
FROM generate_series(1,10), seeder

Enter fullscreen mode Exit fullscreen mode

The above query is logically equivalent to the query below.

SELECT *, RANDOM() AS randomized_value
FROM generate_series(1,10),
  (SELECT setseed(0.5)) as seeder

Enter fullscreen mode Exit fullscreen mode

Both have the result set of:

generate_series setseed randomized_value
1 empty string 0.2499104186659835
2 empty string 0.520017612227381
3 empty string 0.4611753978720401
4 empty string 0.22773722382948947
5 empty string 0.18489966987045392
6 empty string 0.06851542705898694
7 empty string 0.12457768622157062
8 empty string 0.2790778552363484
9 empty string 0.14696090303700515
10 empty string 0.6336384228449994

The above queries may warrant a bit of discussion. First, it is rare for me to SELECT FROM two “tables” without a join condition. The generate_series and setseed column come from the generate_series(1,10) and seeder “tables.” The randomized_value is the result of the RANDOM() function.

Now, each time I run it with setseed(0.5) I get the same randomized_value. If I change the 0.5 to 0.75 I get different values, but again each time I run the query with the seed value, I get the same results.

I had my proof of concept. Next came the hard part. Integrating it into the complicated query.

“Join” That Result Set into the Existing Articles::Feeds::Variantquery Implementation

This next step took a bit. The primary challenge was in thinking about the result sets.

Conceptually there are two queries happening for the feed:

Inner Query
Get a subset of relevant articles for the given user.
Outer Query
Order that subset of relevant articles for the given user.

To achieve the desired goal, I needed to add the randomized_value to the list of “columns” for the inner query of relevant articles. With that, we could then do the sorting.
My implementation constraint is ensuring that the the outer query would be an ActiveRecord::Relation; In sticking to that constraint, I could then use ActiveRecord’s query interface to chain includes, order, where statements. Rather powerful.
My solution was to insert a middle query. Again, the inner query is responsible for retrieving the relevant subset. Then the middle query is responsible for appending the randomized_value column to the relevant subset. And finally the outer query could then sort using this new randomized_value instead of RANDOM().

Conclusion

When I started writing this post, I thought I might dump a bunch of SQL. But I did some grooming to highlight the difference.

The original query without the seeded randomizer

SELECT "articles".*
FROM "articles"
INNER JOIN (
      SELECT articles.id, 1 AS relevancy_score
      FROM articles
      LEFT OUTER JOIN user_blocks
       ON user_blocks.blocked_id = articles.user_id
       AND user_blocks.blocked_id IS NULL
       AND user_blocks.blocker_id = 322
      WHERE articles.published = true
        AND articles.published_at > '2022-05-19 20:12:37.777414'
        AND articles.published_at < '2022-06-03 20:12:37.777493'
      GROUP BY articles.id, articles.published_at
      ORDER BY relevancy_score DESC,
           articles.published_at DESC
      LIMIT 50
) AS article_relevancies
  ON articles.id = article_relevancies.id
ORDER BY RANDOM() ^ (1.0 / greatest(articles.score, 0.1)) DESC

Enter fullscreen mode Exit fullscreen mode

The updated query using the seeded randomizer

SELECT "articles".*
FROM "articles"
INNER JOIN (
      WITH seeder AS (SELECT setseed(0.5))
      SELECT inner_article_relevancies.id,
         inner_article_relevancies.relevancy_score,
         RANDOM() AS randomized_value
      FROM seeder,
       (SELECT articles.id, 1 AS relevancy_score
        FROM articles
        LEFT OUTER JOIN user_blocks
        ON user_blocks.blocked_id = articles.user_id
           AND user_blocks.blocked_id IS NULL
           AND user_blocks.blocker_id = 322
         WHERE articles.published = true
         AND articles.published_at > '2022-05-19 20:12:37.777414'
         AND articles.published_at < '2022-06-03 20:12:37.777493'
         GROUP BY articles.id, articles.published_at
         ORDER BY relevancy_score DESC,
         articles.published_at DESC
         LIMIT 50) AS inner_article_relevancies
) AS article_relevancies
  ON articles.id = article_relevancies.id
ORDER BY article_relevancies.randomized_value ^
           (1.0 / greatest(articles.score, 0.1)) DESC

Enter fullscreen mode Exit fullscreen mode

The main difference between the two? The randomization moves from the outer query to the middle query.

And for those curious, below is the corresponding pull request that incorporates this logic into the feed.

Adding ability to "seed" feed's SQL randomization #17827

What type of PR is this? (check all applicable)

  • [x] Feature

Description

The commit includes three changes that work to allow us to "seed" the randomization. When using the same seed, and randomizer should/must return the same sequence of numbers.

First, I added the "seed" parameter to the variant query. If you want the same sequence of random numbers from query to query, pass the same seed. Otherwise, we'll use a Ruby generated random seed.

Second, I added a virtual column to the virtual article_relevancies table. If you provide a seed, and call the query twice, the first row in each of article_relevancies will have the same randomized_value, likewise the second row, etc.

Third, I amended the existing order_by_lever that had a RANDOM() postgresql function call. I replaced that with the randomized_value which is computed based on the provided seed.

Fundamentally the idea is to allow for randomness but introduce possible repeatability; a hard thing considering that some of the inner selection criteria is not fixed (e.g. number of reactions can change from moment to moment).

I wrote a complimentary blog post to further explain what's happening.

Related Tickets & Documents

Related to forem/forem#17826

QA Instructions, Screenshots, Recordings

Please read the above description and complimentary blog post and ask clarifying questions.

UI accessibility concerns?

Added/updated tests?

  • [x] No, and this is why: the tests already verify the changed behavior.

[Forem core team only] How will this change be communicated?

  • [x] I will share this change internally with the appropriate teams

Top comments (2)

Collapse
 
manalijagtap49 profile image
MANALI NITIN JAGTAP

Nice!!!

Collapse
 
bobbyiliev profile image
Bobby Iliev

Very interesting!