Ok, you've all most likely heard it: Twitter went open-source. That's amazing. Curious as I am, I wanted to dive into their repository.
Looking through their issues list, I was laughing out loud. Check this:
GitHub users are making fun of the whole release, turning the issues list into a joke section.
As an engineer on Twitter's dev team, however, I would be really annoyed. Differentiating between issues from trolls and non-trolls is now a new to-do on their list. So let's try to help them. I'm going to show a first, very simple version of a classifier for identifying troll issues in the Twitter repo. Of course, I'm sharing the work on GitHub as well. Here's the repo.
Getting the data
I've scraped the issues with a simple Python script, which I also shared in the repo:
import requests
import json

PAT = "add-your-PAT-here"  # see https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token

owner = "twitter"
repo = "the-algorithm"
url = f"https://api.github.com/repos/{owner}/{repo}/issues"
headers = {"Authorization": f"Bearer {PAT}"}

# page through the issues endpoint until there is no "next" link left
all_issues = []
while url:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        issues = response.json()
        all_issues.extend(issues)
        if "next" in response.links:
            url = response.links["next"]["url"]
        else:
            url = None
    else:
        print(f"Failed to retrieve issues (status code {response.status_code}): {response.text}")
        break

# keep only the attributes we care about
issues_reduced = []
for issue in all_issues:
    issue_reduced = {
        "title": issue["title"],
        "body": issue["body"],
        "html_url": issue["html_url"],
        "reactions_laugh": issue["reactions"]["laugh"],
        "reactions_hooray": issue["reactions"]["hooray"],
        "reactions_confused": issue["reactions"]["confused"],
        "reactions_heart": issue["reactions"]["heart"],
        "reactions_rocket": issue["reactions"]["rocket"],
        "reactions_eyes": issue["reactions"]["eyes"],
    }
    issues_reduced.append(issue_reduced)

with open("twitter-issues.json", "w") as f:
    json.dump(issues_reduced, f)

print(f"Retrieved {len(all_issues)} issues and saved to twitter-issues.json")
Of course, these days, I didn't write the code for this myself; ChatGPT did. But you all already knew that.
I decided to reduce the downloaded data a bit, because much of the content didn't seem relevant to me. Instead, I just wanted the URL to the issue, the title and body, and some potentially interesting metadata in the form of the reactions.
An example of this looks as follows:
{
    "title": "adding Documentation",
    "body": null,
    "html_url": "https://github.com/twitter/the-algorithm/pull/838",
    "reactions_laugh": 0,
    "reactions_hooray": 0,
    "reactions_confused": 0,
    "reactions_heart": 0,
    "reactions_rocket": 0,
    "reactions_eyes": 0
}
Building the classifier
With the data downloaded, I started refinery on my local machine. With refinery, I'm able to label a bit of data and build some heuristics to quickly test whether my idea works. It's open-source under Apache 2.0, so you can just grab it and follow along.
Simply upload the twitter-issues.json file we just created:
For the title and body attributes, I added two distilbert-base-uncased embeddings directly from Hugging Face.
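If you haven't worked with embeddings before: conceptually, each title and body gets mapped to a fixed-size vector by the transformer, and downstream heuristics work on those vectors. refinery computes this for you, but here's a rough sketch of the idea with the Hugging Face transformers library (the mean pooling shown is my assumption for illustration, not necessarily what refinery does internally):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(texts):
    # tokenize a batch of issue titles or bodies
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # mean-pool the token embeddings, ignoring padding tokens
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)

title_embeddings = embed(["adding Documentation", "Open source the rest of the code"])
print(title_embeddings.shape)  # (2, 768)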
After that, I set up three labeling tasks, of which for now only the Seriousness task is relevant.
Diving into the data, I labeled a few examples to see what the data looks like and to get some reference labels for the automations I want to build.
I realized that quite often, people are searching for jobs in the issues. So I started building my first heuristic, in which I use a lookup list I created to search for occurrences of job-related terms (see the sketch below). Later, I'm going to combine this via weak supervision with other heuristics to power my classifier.
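For illustration, the core of such a lookup-list heuristic boils down to something like this (a hedged sketch: the terms and label name are made up, and refinery's actual labeling-function signature may differ):

# a few illustrative job-related terms; in refinery these live in the lookup list
JOB_TERMS = {"hiring", "job", "resume", "cv", "recruiter", "apply"}

def job_search_heuristic(issue):
    text = f"{issue['title']} {issue['body'] or ''}".lower()
    if any(term in text for term in JOB_TERMS):
        return "troll"
    return None  # abstain, so weak supervision can defer to other heuristics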
For reference, this is what the lookup list looks like. Terms are automatically added while labeling spans (which is also why I had three labeling tasks: one for classification and two for span labeling), but I could also have uploaded a CSV file of terms.
Since I had also already labeled a bit of data, I created a few active learners:
With weak supervision, I can easily combine this active learner with my previous job search classifier without having to worry about conflicts, overlaps, and the like.
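Conceptually, an active learner here is just a lightweight classifier trained on the embeddings of the few records labeled so far; refinery wires this up for you. A hedged sketch of the idea with scikit-learn, using toy data in place of the real distilbert embeddings and made-up label names:

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy stand-ins for the distilbert embeddings and my manual reference labels
labeled_embeddings = np.random.rand(20, 768)
labels = np.random.choice(["serious", "troll"], size=20)
unlabeled_embeddings = np.random.rand(100, 768)

clf = LogisticRegression(max_iter=1000)
clf.fit(labeled_embeddings, labels)

# only confident predictions should count as heuristic votes
probas = clf.predict_proba(unlabeled_embeddings)
confident = probas.max(axis=1) > 0.8
predictions = clf.predict(unlabeled_embeddings)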
I also noticed a couple of issues containing just a link to play chess online:
So I added a heuristic for detecting links via spaCy.
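Roughly, that heuristic does the following (a hedged sketch: spaCy's like_url token flag does the heavy lifting, and the label name is again made up):

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def link_heuristic(issue):
    doc = nlp(issue["body"] or "")
    if any(token.like_url for token in doc):
        return "troll"
    return None  # abstain if no link is found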
Of course, I also wanted to create a GPT-based classifier, since this is publicly available data. However, GPT seems to be down while I'm initially building this :(
After circa 20 minutes of labeling and working with the data, this is what my heuristics tab looked like:
So there are mainly active learners, some lookup lists, and regular-expression-like heuristics. I will add GPT in the comments section as soon as I can access it again :)
Now, I weakly supervised the results:
You can see that the automation already nicely fits the distribution of trolls vs. non-trolls.
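Under the hood, weak supervision merges the votes of all heuristics into one label plus a confidence per record. refinery's model is more sophisticated, but a naive precision-weighted vote captures the idea (a hedged sketch; the precision estimates would come from the reference labels):

from collections import defaultdict

def combine_votes(votes, precisions):
    # votes: {heuristic_name: label or None}, where None means "abstain"
    # precisions: estimated precision of each heuristic on the labeled data
    scores = defaultdict(float)
    for name, label in votes.items():
        if label is not None:
            scores[label] += precisions.get(name, 0.5)
    if not scores:
        return None, 0.0  # no heuristic fired
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())

label, confidence = combine_votes(
    {"job_search_heuristic": "troll", "link_heuristic": None},
    {"job_search_heuristic": 0.9, "link_heuristic": 0.8},
)
print(label, confidence)  # troll 1.0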
I also noticed a strong difference in confidence:
So I headed over to the data browser and configured a confidence filter so that I only see records with above 80% confidence.
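Outside the UI, that filter is trivial (a tiny sketch, assuming each record carries the weakly supervised label and confidence from the step above; the values are made up):

records = [
    {"html_url": "https://github.com/twitter/the-algorithm/issues/1", "label": "troll", "confidence": 0.92},
    {"html_url": "https://github.com/twitter/the-algorithm/issues/2", "label": "serious", "confidence": 0.55},
]
high_confidence = [r for r in records if r["confidence"] > 0.8]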
Notice that in here, we could also filter by single heuristic hits, e.g. to find records where different heuristics vote for different labels:
In the dashboard, I now filter for the high-confidence records and see that our classifier is already performing quite well (note, this isn't even using GPT yet!):
Next steps
I exported the project snapshot and labeled examples into the public repository (twitter_default_all.json.zip), so you can play with the bit of labeled data yourself. I'll continue on this topic over the next days, and we'll add a YouTube video to this article for a version 2 of the classifier. There certainly are further attributes we can look into, such as taking the length of the body into account (I already saw that shorter bodies typically are troll-like); see the sketch below.
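Such a length heuristic could be as simple as this (a hedged sketch; the 50-character threshold is a made-up starting point, not something I've validated):

def short_body_heuristic(issue):
    body = issue["body"] or ""
    if len(body) < 50:  # illustrative threshold
        return "troll"
    return None  # abstain on longer bodies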
Also, keep in mind that this is an excellent way to benchmark how much power GPT can add for your use case. Simply add it as a heuristic, try a few different prompts, and play with including it in or excluding it from the weak supervision procedure. For instance, here, I excluded GPT:
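For when GPT is back up, such a heuristic could look roughly like this (a hedged sketch using the openai package's chat completion API as of the time of writing; the prompt and label mapping are illustrative, and refinery's built-in GPT integration may differ):

import openai

openai.api_key = "add-your-key-here"

def gpt_heuristic(issue):
    prompt = (
        "Is the following GitHub issue a serious issue or a troll post? "
        "Answer with exactly one word: serious or troll.\n\n"
        f"Title: {issue['title']}\nBody: {issue['body'] or ''}"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic answers make for a more stable heuristic
    )
    answer = response["choices"][0]["message"]["content"].strip().lower()
    return "troll" if "troll" in answer else "serious"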
I'm really thrilled about Twitter going open-source with their algorithm, and I'm sure it will add a lot of benefits. What you can already tell is that, due to the nature of Twitter's community, issues are often written by trolls. So detecting them will be important for Twitter's dev team. Maybe this post can be of help for that :)
Top comments (8)
I have some doubts about this being the version they run nowadays, so I doubt there's gonna be activity on the repo from Twitter. Or they have the actual repo somewhere internally. So they probably don't care about the issues on GitHub.
I completely understand that doubt. Of course, I don't know either. We'll see in the next weeks what the activity on this repo looks like. Though it would be really sad to hear that the repo doesn't resemble the one in production, as one of the stated motives of the release was transparency in how news is shared.
Awesome showcase of the tool :). I will definitely keep an eye on refinery. I might need something like that in the not-so-distant future :).
Also, I was more surprised that people were surprised about the fact that people were trolling.
In any case, great write-up!
so long
Thanks, that means a lot! :)
If you're using it, feel free to join our Discord if you have any questions or suggestions: discord.com/invite/qf4rGCEphW
Or open an issue: github.com/code-kern-ai/refinery
The team tries to respond as fast as possible.
lol, there were 239 issues.
In the time it took you to do all of the above, you could've been a dev at Twitter reading through the list and just closing the dumb ones.
It's really not as big a problem as you make it out to be.
Also, they were pretty damn funny :)
However, the tool you put together is nice for dev leads/managers to show their bosses so they can keep their jobs.
239 open issues, not 239 issues created in total. By the #-id on the issues, you can tell that by day 2, there had already been 1,483 issues. It took me less than 2 hours to do the above (asking ChatGPT to set up the GH scraper and setting up a PAT took 20 minutes; the rest was exploration, a bit of labeling, and off-the-shelf heuristics).
But on the other hand, it has also been launch day, so I guess the number of new issues will go down. So realistically, it is something that could be done manually, but it is not time well spent. And from the total number of issues far exceeding the number of open ones, you can already tell that dumb issues have been closed by the team.
And true, some of them were funny, but more than 70% of opened issues being troll issues is a problematic ratio. So I wanted to look into it and share my insights :)