Pratik Singh

Posted on Feb 8, 2024

How to stop useless PRs on Open Source!

#opensource #cicd #devops #codenewbie

In this article, I will try to develop a couple of solutions to stop useless PRs in Open Source.

🤔 Understanding the Problem

I recently came across a post by Arpit Bhayani. I am sure you might have seen this controversy that recently happened:

There are many PRs made to Open Source with useless changes.

In this blog, we will work our way up from the most basic fixes to a Machine Learning approach to reduce this issue.

I am Pratik Singh, a Senior Software Developer at Nasdaq. A major part of my job is building and maintaining CI/CD Pipelines. Let me share a few solutions to fix this.

✨ Possible solutions

This is a problem that doesn't come up in a company. Treat this article as more of a free space to discuss your ideas in the comments.
I will start with basic implementations and we will move up the ladder. Let's Go!

1. KISS Approach 😉

Keep It Simple Stupid!
The very first approach to this issue is to restrict which users can interact with the repo. GitHub has a built-in feature to limit access.
Go to: GitHub Repo Settings -> Moderation Options -> Interaction Limits

This will help to stop newcomers from making useless PRs!

But what if the user is a prior contributor?

2. Not all can edit Docs!

From here on, we move to different types of CI Jobs to tackle this problem. The idea is to have a set of users who are allowed to make changes to the .md files (or any file, for that matter), and to make this job fail the entire CI pipeline for everyone else!

Kubernetes has a set of people who write the Docs. I know not every repo can do that, but you can certainly assign a few people to it!

The Job would look something like this:

name: Checking for authorized Doc changes

on:
  pull_request:
    paths:
      - '**/*.md' 
jobs:
  restrict_md_changes:
    runs-on: ubuntu-latest

    steps:
      - name: Check PR author
        id: check_author
        env:
          # GitHub login of the account that opened the PR. The git author
          # name (%an) is free text and cannot be read without a checkout step.
          AUTHOR: ${{ github.event.pull_request.user.login }}
        run: |
          # List of allowed GitHub logins (replace with your own)
          ALLOWED_AUTHORS="kitarp29 user1 user2"

          # Check if the author is allowed
          if [[ ! $ALLOWED_AUTHORS =~ (^| )$AUTHOR($| ) ]]; then
            echo "Unauthorized commit by $AUTHOR. Only specific accounts are allowed."
            echo "If you see a problem in the Docs, please raise an Issue"
            exit 1
          fi

You can see it working on one of my pet projects: Here

3. PR should have an Assigned Issue

The ideal way to do Open Source is:

Create an Issue
Get it assigned to you
Build it
Make a PR to solve the issue

Why not enforce this? This CI job ensures that the PR raised by the user references an Issue, and that the Issue is assigned to them.

I understand there will be some requirements you need to declare in CONTRIBUTING.md for this. But the CI would look something like this:

name: Check PR Issue Assignment

on:
  pull_request:
    types:
      - opened
      - synchronize

jobs:
  check-issue:
    runs-on: ubuntu-latest

    steps:
      - name: Check if PR has an issue
        id: check-issue
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # Pass user-controlled PR fields through env vars instead of
          # inlining them in the script, to avoid shell injection
          PR_TITLE: ${{ github.event.pull_request.title }}
          PR_AUTHOR: ${{ github.event.pull_request.user.login }}
        run: |
          # Extract the first issue number from the PR title
          ISSUE_NUMBER=$(echo "$PR_TITLE" | grep -oE '#[0-9]+' | head -n 1 | sed 's/#//')
          if [ -z "$ISSUE_NUMBER" ]; then
            echo "No issue found in the PR title."
            exit 1
          fi

          # Get the issue details (authenticated, to avoid the anonymous rate limit)
          ISSUE_DETAILS=$(curl -s -H "Authorization: Bearer $GITHUB_TOKEN" \
            "https://api.github.com/repos/${{ github.repository }}/issues/$ISSUE_NUMBER")
          ISSUE_ASSIGNEE=$(echo "$ISSUE_DETAILS" | jq -r '.assignee.login')

          # Check if the issue is assigned to the PR author
          if [ "$ISSUE_ASSIGNEE" != "$PR_AUTHOR" ]; then
            echo "Issue #$ISSUE_NUMBER is not assigned to $PR_AUTHOR."
            exit 1
          fi

Every GitHub Actions run gets its own GITHUB_TOKEN, so there is no extra cost here.
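If shell feels cramped for this, the same check can be sketched in Python. This is only a minimal sketch: the function names and the sample data are my own placeholders, and it assumes the issue JSON has already been fetched from the GitHub API. One difference from the job above is that it looks at the issue's `assignees` list, so an issue with several assignees also passes.

```python
import re

# Matches references such as "#42" anywhere in a PR title
ISSUE_REF = re.compile(r"#(\d+)")


def extract_issue_number(pr_title):
    """Return the first '#123'-style issue number in the title, or None."""
    match = ISSUE_REF.search(pr_title)
    return int(match.group(1)) if match else None


def is_assigned_to(issue, login):
    """Check whether `login` is among the issue's assignees.

    `issue` is the decoded JSON of a GitHub issue. Its `assignees` field is
    a list of user objects, each carrying a `login`. Logins are compared
    case-insensitively.
    """
    assignees = issue.get("assignees") or []
    return any(user.get("login", "").lower() == login.lower() for user in assignees)


if __name__ == "__main__":
    # Hypothetical sample data standing in for the real API response
    title = "Fix broken install link (#42)"
    issue = {"number": 42, "assignees": [{"login": "kitarp29"}, {"login": "user1"}]}

    print(extract_issue_number(title))
    print(is_assigned_to(issue, "user1"))
    print(is_assigned_to(issue, "someone-else"))
```

In a real job you would read the PR title and author from environment variables and exit with a non-zero status when either check fails.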

3. gh-cli approach

If you are still here, I am sure you are intrigued by the idea. So let's dig deeper from here.

Check out gh-cli. It's mostly overkill when the GitHub UI is so good, but if you add it to a GitHub Actions runner, you can automate almost every aspect of being a maintainer. You can report such spam users as well!

I will not give an exact job here as this idea needs to be tailor-made.
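Still, to give a feel for it, here is a rough sketch of driving gh from a small Python helper. Treat all of it as an assumption to adapt: the function names, the "spam" label (which would have to exist in your repo) and the comment text are placeholders of mine. The only gh subcommands used are `gh pr edit --add-label` and `gh pr close --comment`. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def build_triage_commands(pr_number, label, comment):
    """Return the gh commands that label a PR and close it with a comment."""
    pr = str(pr_number)
    return [
        ["gh", "pr", "edit", pr, "--add-label", label],
        ["gh", "pr", "close", pr, "--comment", comment],
    ]


def triage(pr_number, label="spam", comment="Closing: no linked issue.", dry_run=True):
    """Print the commands (dry run) or execute them with gh."""
    for command in build_triage_commands(pr_number, label, comment):
        if dry_run:
            print(shlex.join(command))
        else:
            # gh must be installed and authenticated, e.g. via GH_TOKEN in CI
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical PR number, dry run only
    triage(17)
```

Passing the commands as lists instead of one shell string means the comment text never goes through shell quoting.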

4. Initial Idea

For me, the first idea was this:

Irrespective of all the comments on it, I still think it would be an easy fix for the problem.

The CI Job could look something like this:

name: Check PR Markdown Changes

on:
  pull_request:
    types:
      - opened
      - synchronize
    paths:
      - '**/*.md' # Include all .md files

jobs:
  check-md-changes:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history, so both ends of the PR diff are available

      - name: Count lines changed in Markdown files
        id: count-lines
        env:
          # github.event.before is not set when a PR is opened, so diff the PR's base and head
          BASE_SHA: ${{ github.event.pull_request.base.sha }}
          HEAD_SHA: ${{ github.event.pull_request.head.sha }}
        run: |
          # --numstat prints "added<TAB>deleted<TAB>path" per file; sum both columns
          LINES_CHANGED=$(git diff --numstat "$BASE_SHA"..."$HEAD_SHA" -- '*.md' | awk '{ total += $1 + $2 } END { print total + 0 }')
          echo "lines_changed=$LINES_CHANGED" >> "$GITHUB_OUTPUT"

      - name: Fail if too few lines changed
        env:
          LINES_CHANGED: ${{ steps.count-lines.outputs.lines_changed }}
        run: |
          echo "Lines changed in Markdown files: $LINES_CHANGED"
          if [ "$LINES_CHANGED" -lt 50 ]; then
            echo "Markdown changes are below the 50-line minimum."
            exit 1
          fi

Yes, I know you will raise the point of "False Positives" and other objections. I will address them towards the end.

5. Machine Learning Approach

The moment you all have been reading for!
Can Machine Learning be used to fix this? Yes

Is it an overkill? Also Yes 😂
But again, whether it counts as "Overkill" depends on the cost it incurs versus the magnitude of the problem it solves.

We know some models can do this. We can run them within the CI runner or maybe create a microservice for it 😂

You can take the original .md file and the new one, send both as string inputs to your Python script, and get the results back.

For the Python code, you can take any of these three approaches:

ndiff: A very basic approach. It is not Machine Learning, but it can be used here.
External Service: With the rise of AI, Azure, Google, and various other services have APIs for this. You can subscribe, and your CI will talk to the service to check whether the changes are semantically the same or not. You can check for spelling and grammar mistakes as well.
Using a Machine Learning Model: For such a use case, a BERT model seems to be the perfect fit. I have worked with this model at scale and can vouch for its accuracy.

This is a sample CI Job template for all of the three:

name: Markdown Similarity Check

on:
  pull_request:
    paths:
      - '**/*.md' # Only trigger on changes to Markdown files

jobs:
  similarity-check:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history, so both the base and head commits are available

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x' # Choose the appropriate Python version

      - name: Install dependencies
        run: pip install -r requirements.txt # Add any required dependencies

      - name: Get old and new Markdown content
        env:
          # github.event.before is not set when a PR is opened, so use the PR's base and head
          BASE_SHA: ${{ github.event.pull_request.base.sha }}
          HEAD_SHA: ${{ github.event.pull_request.head.sha }}
        run: |
          # List the Markdown files changed by the PR
          git diff --name-only "$BASE_SHA"..."$HEAD_SHA" -- '*.md' > changed_files.txt
          # Compare the old and new version of each changed file
          while read -r file; do
            # A file added or deleted by the PR has no content on one side
            old_content=$(git show "$BASE_SHA:$file" 2>/dev/null || true)
            new_content=$(git show "$HEAD_SHA:$file" 2>/dev/null || true)
            python calculate_similarity.py "$old_content" "$new_content"
          done < changed_files.txt
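The job above calls calculate_similarity.py but leaves its contents open. For the first approach (ndiff), a minimal sketch using only Python's standard difflib could look like the following. The 0.99 threshold and the "exit 1 when the two versions are nearly identical" convention are my assumptions, not a tested policy. Keep in mind that difflib compares characters, not meaning, so it cannot stand in for the external-service or BERT options.

```python
import difflib
import sys

# Assumed cut-off: at or above this ratio the change is treated as trivial
THRESHOLD = 0.99


def similarity(old, new):
    """Return a 0..1 ratio of how alike the two texts are, character by character."""
    return difflib.SequenceMatcher(None, old, new, autojunk=False).ratio()


def changed_lines(old, new):
    """Return only the added ('+ ') and removed ('- ') lines reported by ndiff."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


def main(old, new):
    """Print a short report and return 1 for a near-identical change, else 0."""
    score = similarity(old, new)
    print(f"similarity: {score:.3f}")
    for line in changed_lines(old, new):
        print(line)
    return 1 if score >= THRESHOLD else 0


# Invoked by the job as: python calculate_similarity.py "$old_content" "$new_content"
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

Because the workflow step runs under bash with errors enabled, a non-zero exit from the script fails the job on the first file that trips the threshold. For very large files it would be safer to pass file paths instead of the full content as arguments.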

My Take on the Issue

I understand that desperation and misdirection can create bad things in the world. Students, please understand that Open Source Devs are generally polite and take the extra step to help you out.

But I am no politician set to change the world.
I am a Developer, I prefer Code over talk!
These ideas can help to reduce the magnitude of the issue.

Coming back to the "False Positives": I agree that some will come up. But the problem mostly shows up in huge repos. Such repos don't change docs frequently; they have releases. If you have to fix a doc, create an issue for it first!
There are some existing solutions around it: Check Here

If you liked this content, you can follow me here or on Twitter at kitarp29 for more!

Thanks for reading my article :)