In this article, I will try to develop a few solutions to stop useless PRs in Open Source.
🤔 Understanding the Problem
I recently came across a post by Arpit Bhayani. I am sure you have seen the controversy that recently happened:
Many PRs are being made to Open Source projects with useless changes.
In this blog, we will go from a basic approach all the way to a Machine Learning approach to reduce this issue.
I am Pratik Singh, a Senior Software Developer at Nasdaq. A major part of my job is building and maintaining CI/CD Pipelines. Let me share a few solutions to fix this.
✨ Possible solutions
This is a problem that doesn't come up in a company. Treat this article as more of a free space to discuss your ideas in the comments.
I will start with basic implementations and we will move up the ladder. Let's Go!
1. KISS Approach 😉
Keep It Simple Stupid!
The very first approach to this issue is to restrict user access. GitHub has a built-in feature to limit interactions.
Go to: GitHub Repo Settings -> Moderation Options -> Interaction Limits
This will help to stop newcomers from making useless PRs!
But what if the user is a prior contributor?
2. Not all can edit Docs!
We are now moving to different types of CI Jobs to tackle this problem. The idea is to have a set of Users that are allowed to make changes to the .md files (or any file for that matter), and to make this job fail the entire CI pipeline for everyone else!
Kubernetes has a set of people who maintain the Docs. I know not all repos can do that, but you can certainly assign a few people to it!
The Job would look something like this:
name: Checking for authorized Doc changes
on:
  pull_request:
    paths:
      - '**/*.md'
jobs:
  restrict_md_changes:
    runs-on: ubuntu-latest
    steps:
      - name: Check PR author
        id: check_author
        env:
          # GitHub login of the account that opened the PR.
          # (git log needs a checkout step and returns the git author name, not the login.)
          AUTHOR: ${{ github.event.pull_request.user.login }}
        run: |
          # List of allowed authors (replace with your own)
          ALLOWED_AUTHORS="kitarp29 user1 user2"
          # Check if the author is allowed
          if [[ ! $ALLOWED_AUTHORS =~ (^| )$AUTHOR($| ) ]]; then
            echo "Unauthorized change by $AUTHOR. Only specific accounts are allowed."
            echo "If you see a problem in the Docs, please raise an Issue"
            exit 1
          fi
You can see it working on one of my pet projects: Here
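The "or any file for that matter" part can be sketched outside of shell too. Below is a minimal Python sketch, assuming the CI hands you the PR author's login and the list of changed paths. The `PROTECTED_PATHS` table and the `unauthorized_files` helper are hypothetical names used only for illustration, not part of any GitHub API.

```python
from fnmatch import fnmatch

# Hypothetical rules: glob pattern -> logins allowed to touch matching files.
PROTECTED_PATHS = {
    "*.md": {"kitarp29", "user1", "user2"},
}


def unauthorized_files(author: str, changed_files: list[str]) -> list[str]:
    """Return the changed files that `author` is not allowed to modify."""
    blocked = []
    for path in changed_files:
        for pattern, allowed in PROTECTED_PATHS.items():
            # fnmatch's "*" also matches "/", so "*.md" covers nested docs.
            if fnmatch(path, pattern) and author not in allowed:
                blocked.append(path)
                break
    return blocked


if __name__ == "__main__":
    files = ["README.md", "src/main.py", "docs/guide.md"]
    print(unauthorized_files("stranger", files))
    print(unauthorized_files("kitarp29", files))
```

Adding another pattern to the table (for example a config directory) extends the same check to other files without touching the logic.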
3. PR should have an Assigned Issue
The ideal way to do Open Source is:
- Create an Issue
- Get it assigned to you
- Build it
- Make a PR to solve the issue
Why not enforce this? This CI job will ensure that the PR raised by the user has an Issue related to it, and that the Issue is assigned to them.
I understand you will need to declare some requirements in CONTRIBUTING.md for this (for example, that the PR title must contain the Issue number). But the CI would look something like this:
name: Check PR Issue Assignment
on:
  pull_request:
    types:
      - opened
      - synchronize
jobs:
  check-issue:
    runs-on: ubuntu-latest
    steps:
      - name: Check if PR has an issue
        id: check-issue
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # Pass untrusted PR data through env vars instead of inlining it into the script
          PR_TITLE: ${{ github.event.pull_request.title }}
          PR_AUTHOR: ${{ github.event.pull_request.user.login }}
        run: |
          # Extract the first issue number from the PR title
          ISSUE_NUMBER=$(echo "$PR_TITLE" | grep -oE '#[0-9]+' | head -n 1 | sed 's/#//')
          if [ -z "$ISSUE_NUMBER" ]; then
            echo "No issue found in the PR title."
            exit 1
          fi
          # Get the issue details
          ISSUE_DETAILS=$(curl -s -H "Authorization: Bearer $GITHUB_TOKEN" "https://api.github.com/repos/${{ github.repository }}/issues/$ISSUE_NUMBER")
          ISSUE_ASSIGNEE=$(echo "$ISSUE_DETAILS" | jq -r '.assignee.login')
          # Check if the issue is assigned to the PR author
          if [ "$ISSUE_ASSIGNEE" != "$PR_AUTHOR" ]; then
            echo "Issue #$ISSUE_NUMBER is not assigned to $PR_AUTHOR."
            exit 1
          fi
Every GitHub Actions run gets its own GITHUB_TOKEN, so there is no extra charge here.
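If the shell one-liners feel fragile, the same two checks can be sketched in Python. This is a minimal sketch under a few assumptions: the PR title carries the Issue as `#<number>`, and `issue` is the already-parsed JSON of the Issue as returned by the GitHub REST API, which includes an `assignees` list. The helper names `extract_issue_number` and `is_assigned_to` are made up for this example, and the sample data is invented.

```python
import re
from typing import Optional


def extract_issue_number(pr_title: str) -> Optional[int]:
    """Return the first '#123'-style issue number in the title, or None."""
    match = re.search(r"#(\d+)", pr_title)
    return int(match.group(1)) if match else None


def is_assigned_to(issue: dict, login: str) -> bool:
    """True if `login` appears among the issue's assignees."""
    assignees = issue.get("assignees") or []
    return any(a.get("login") == login for a in assignees)


if __name__ == "__main__":
    # Invented sample data shaped like the relevant part of an Issue response.
    sample_issue = {"number": 42, "assignees": [{"login": "user1"}]}
    print(extract_issue_number("Fix typo in README (#42)"))  # -> 42
    print(is_assigned_to(sample_issue, "user1"))  # -> True
    print(is_assigned_to(sample_issue, "user2"))  # -> False
```

Checking the whole `assignees` list, rather than a single assignee, also covers Issues that are assigned to more than one person.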
3. gh-cli approach
If you are still here, I am sure you are intrigued by the idea. So let's dig deeper from here.
Check out gh-cli. It's mostly overkill when the GitHub UI is so good. But if you add it to a GitHub Actions runner, you can automate almost every aspect of being a maintainer with it. You can report such spam users as well!
I will not give an exact job here as this idea needs to be tailor-made.
4. Initial Idea
For me, the first idea was this:
Irrespective of all the comments on it, I still think it would be an easy fix for the problem.
The CI Job could look something like this:
name: Check PR Markdown Changes
on:
  pull_request:
    types:
      - opened
      - synchronize
    paths:
      - '**/*.md' # Include all .md files
jobs:
  check-md-changes:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Full history, so the base commit is available to git diff
      - name: Count lines changed in Markdown files
        id: count-lines
        env:
          # github.event.before is not set when a PR is opened, so use the PR's base and head
          BASE_SHA: ${{ github.event.pull_request.base.sha }}
          HEAD_SHA: ${{ github.event.pull_request.head.sha }}
        run: |
          # --numstat prints "added<TAB>deleted<TAB>path" per file; sum added + deleted
          LINES_CHANGED=$(git diff --numstat "$BASE_SHA...$HEAD_SHA" -- '*.md' | awk '{ total += $1 + $2 } END { print total + 0 }')
          echo "lines_changed=$LINES_CHANGED" >> "$GITHUB_OUTPUT"
      - name: Fail if too few lines changed
        run: |
          if [ "${{ steps.count-lines.outputs.lines_changed }}" -lt 50 ]; then
            echo "Lines changed in Markdown files: ${{ steps.count-lines.outputs.lines_changed }}"
            exit 1
          fi
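The counting step is the part that is easiest to get wrong, so here is a minimal Python sketch of it. It assumes you feed it the text printed by `git diff --numstat`, which lists one file per line as added count, deleted count, and path, separated by tabs (binary files show `-` instead of numbers). The function names are hypothetical, and the limit of 50 simply mirrors the job above.

```python
MIN_LINES = 50  # same limit as the CI job; tune it for your repo


def count_markdown_lines(numstat: str) -> int:
    """Sum added + deleted lines for .md files in `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip anything that is not "added<TAB>deleted<TAB>path"
        added, deleted, path = parts
        if not path.endswith(".md") or added == "-":
            continue  # ignore non-Markdown and binary entries
        total += int(added) + int(deleted)
    return total


def is_too_small(numstat: str, minimum: int = MIN_LINES) -> bool:
    """True when the Markdown changes fall below the minimum size."""
    return count_markdown_lines(numstat) < minimum


if __name__ == "__main__":
    # Invented sample output for illustration.
    sample = "10\t2\tREADME.md\n3\t1\tsrc/app.py\n5\t0\tdocs/guide.md\n"
    print(count_markdown_lines(sample))  # -> 17
    print(is_too_small(sample))  # -> True
```

Counting from `--numstat` avoids including diff headers and context lines in the total, which a plain line count of the diff output would do.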
Yes, I know you will raise the point of "False Positives" and other concerns. I will address them towards the end.
5. Machine Learning Approach
The moment you all have been reading for!
Can Machine Learning be used to fix this? Yes
Is it an overkill? Also Yes 😂
But again, whether this is "Overkill" depends on the cost it incurs versus the magnitude of the problem it solves.
We know some models can do this. We can run them within the CI runner or maybe create a microservice for it 😂
You can take the original .md file and the new one, send both as string inputs to your Python script, and get the results back.
For the Python code, you can take any of these three approaches:
- ndiff: A very basic approach. It is not Machine Learning, but it can be used here.
- External Service: With the rise of AI, Azure, Google, and various other providers offer APIs for this. You can subscribe, and your CI will talk to the service to check whether the changes are semantically the same or not. You can check for spelling and grammar mistakes as well.
- Using a Machine Learning Model: For such a use case, the BERT model seems to be the perfect fit. I have worked with this model at scale and can vouch for its accuracy.
This is a sample CI Job template for all three:
name: Markdown Similarity Check
on:
  pull_request:
    paths:
      - '**/*.md' # Only trigger on changes to Markdown files
jobs:
  similarity-check:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Full history, so both versions of each file can be read
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x' # Choose the appropriate Python version
      - name: Install dependencies
        run: pip install -r requirements.txt # Add any required dependencies
      - name: Get old and new Markdown content
        env:
          # github.event.before is not set when a PR is opened, so use the PR's base and head
          BASE_SHA: ${{ github.event.pull_request.base.sha }}
          HEAD_SHA: ${{ github.event.pull_request.head.sha }}
        run: |
          # List Markdown files modified by the PR (added or deleted files have no old/new pair)
          git diff --name-only --diff-filter=M "$BASE_SHA" "$HEAD_SHA" -- '*.md' > changed_files.txt
          # Compare the old and new version of each changed Markdown file
          while read -r file; do
            old_content=$(git show "$BASE_SHA:$file")
            new_content=$(git show "$HEAD_SHA:$file")
            python calculate_similarity.py "$old_content" "$new_content"
          done < changed_files.txt
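The job calls a `calculate_similarity.py` script that takes the old and new content as two arguments. Here is a minimal sketch of what it could look like for the basic ndiff option, using only Python's standard `difflib` module. The `TRIVIAL_THRESHOLD` value is an assumption picked for illustration, not a tested number, and a BERT or external-service version would only need to replace the `similarity` function.

```python
import difflib
import sys

# Hypothetical cut-off: at or above this ratio the edit is treated as trivial.
TRIVIAL_THRESHOLD = 0.98


def similarity(old: str, new: str) -> float:
    """Return a 0..1 ratio of how alike the two texts are (1.0 = identical)."""
    return difflib.SequenceMatcher(None, old, new).ratio()


def changed_lines(old: str, new: str) -> list[str]:
    """Return the ndiff lines that were added ('+ ') or removed ('- ')."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


def main(old: str, new: str) -> int:
    """Print a short report; a non-zero return value fails the CI step."""
    ratio = similarity(old, new)
    print(f"Similarity: {ratio:.3f}, changed lines: {len(changed_lines(old, new))}")
    return 1 if ratio >= TRIVIAL_THRESHOLD else 0


# Run as: python calculate_similarity.py "<old content>" "<new content>"
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

The script reports through its exit code, so the workflow step fails on its own as soon as one file looks like a trivial edit.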
My Take on the Issue
I understand that desperation and misdirection can create bad things in the world. Students, please understand that Open Source Devs are generally polite and take the extra step to help you out.
But I am no politician set to change the world.
I am a Developer, I prefer Code over talk!
These ideas can help to reduce the magnitude of the issue.
Coming back to the "False Positives": I agree there will be some that come up. But the problem mostly shows up in huge repos. Such repos don't change docs frequently; they have releases. If one has to fix any doc, create an Issue for it first!
There are some existing solutions around it: Check Here
If you liked this content, you can follow me here or on Twitter at kitarp29 for more!
Thanks for reading my article :)