Lambdas, Loops, and Dota2 Feels

Introduction

As a long-time fan of Dota2, I've especially enjoyed following the career of Artour Babaev, known in the community as Arteezy. A skilled carry player, Arteezy has competed for teams like Evil Geniuses (EG), Shopify Rebellion, and Team Secret. He is a polarizing figure—simultaneously adored and criticized—though, at times, it feels like the negative sentiment surrounding him outweighs the positive.

To explore this perception, I built a Reddit comment insights pipeline that scrapes and analyzes Reddit comments, focusing on sentiment analysis. This pipeline consists of two main components: ingestion and generation. The ingestion pipeline is responsible for scraping the comments, while the generation pipeline performs sentiment analysis on the gathered data.

Link: https://arteezyonlyfans.com/ (I swear it's SFW)

Background

I started this project with a rough plan, deciding that the front end would be a single-page application (SPA) built with React, while AWS would handle the back end. Initially, I didn't settle on specific technologies, which contributed to a slower pace. I began scraping Reddit comments in July 2023 but only actively worked on the project starting in early 2024.

Previously, I experimented with the Reddit API using PRAW, creating a bot to respond to comments mentioning 'Arteezy' and 'washed up.' However, the bot was banned. Since I am most comfortable working in JavaScript and Go, I was hesitant to use PRAW, a Python library. Thankfully, I found Snoostorm, which, though less feature-rich than PRAW, provided the basic functionality needed to read submitted comments.

Initially, the scraper ran locally, saving comments to a JSON file for about a week before I transitioned to S3. I then deployed the scraper to AWS Lambda, continuing to store the data in S3. This approach was cost-effective: a single JSON file containing 70 comments takes up only 158 KB. Running the scraper every 10 minutes for a year would cost approximately $2.20 in S3 storage. However, this setup produced numerous duplicate comments because there was no mechanism to check for previously scraped ones.
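As a rough illustration of the storage step, writing one batch of comments to S3 from the Lambda is a single PutObject call. This is a minimal sketch assuming the AWS SDK for JavaScript v3; the bucket name and key layout are hypothetical, not the project's actual naming scheme:

```typescript
// Minimal sketch: persist a batch of scraped comments as one JSON object in S3.
// Bucket name and key layout are assumptions, not the exact ones used in the project.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

interface RedditComment {
  id: string;
  author: string;
  body: string;
  createdUtc: number;
}

export async function storeCommentBatch(comments: RedditComment[]): Promise<void> {
  const key = `comments/dota2/${Date.now()}.json`; // hypothetical key scheme
  await s3.send(
    new PutObjectCommand({
      Bucket: "reddit-comment-archive", // hypothetical bucket name
      Key: key,
      Body: JSON.stringify(comments),
      ContentType: "application/json",
    })
  );
}
```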

I chose to deploy the scraper as a Lambda function for cost-effectiveness and scalability. At the time, the DotA2 subreddit ranked 487th on the top community page, influencing decisions about the Lambda scraper’s trigger frequency and polling interval. The Lambda was scheduled to run every minute, polling every second and requesting up to 50 comments. However, these settings occasionally caused Reddit's servers to return internal server errors, likely due to overload.
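For reference, this is roughly what that configuration looks like with snoowrap plus Snoostorm's CommentStream. It's a minimal sketch based on the libraries' documented usage rather than code from the project; the credentials and user agent are placeholders, and the option names should be checked against the installed version:

```typescript
// Minimal sketch: stream new r/DotA2 comments with snoowrap + snoostorm.
// The limit/pollTime options mirror the settings described above.
import Snoowrap from "snoowrap";
import { CommentStream } from "snoostorm";

const client = new Snoowrap({
  userAgent: "dota2-comment-scraper", // hypothetical user agent
  clientId: process.env.REDDIT_CLIENT_ID!,
  clientSecret: process.env.REDDIT_CLIENT_SECRET!,
  refreshToken: process.env.REDDIT_REFRESH_TOKEN!,
});

// Request up to 50 comments, polling Reddit once per second.
const stream = new CommentStream(client, {
  subreddit: "DotA2",
  limit: 50,
  pollTime: 1000,
});

stream.on("item", (comment) => {
  // In the Lambda, comments are collected into batches here and written to S3.
  console.log(comment.id, comment.body);
});
```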

I observed frequent duplicates within a 5-minute window and across consecutive runs. I initially suspected a bug, but later realized it was the library's expected behavior. The library keeps an internal set of processed comments for each run, but that set is discarded when the Lambda function completes. Because the Lambda is stateless between executions, it retains no memory of previously processed comments, so the same comments are fetched again in subsequent runs. If only one comment was submitted during a low-traffic period, Reddit's servers would return that same comment on every run. This was particularly noticeable at night in European time zones, when traffic is low.

To reduce these duplicates, I changed the Lambda's schedule from every minute to every 5 minutes, which lowered the frequency of duplicate retrievals. While this setup isn't perfect, since it was tuned by trial and error, it has worked well enough for my purposes. On average, it takes about 22 seconds to scrape and store 70 unique comments, which is the limit per JSON file. During high-traffic periods like The International (TI), roster shuffles, or other major events, I increase the Lambda function's frequency to handle the higher volume of incoming comments.

I also considered running the scraper on an EC2 instance, which would involve writing a Dockerfile and setting up permissions for S3 access, making it more complex to set up. EC2 becomes cost-effective primarily for high-traffic subreddits, such as r/funny or r/gaming, where continuous scraping might justify the additional complexity.

Comment Search and Storage

The next challenge was determining how to search and filter specific comments. I needed a database with full-text search capabilities. Since I had used OpenSearch in the past for similar tasks, it was a natural fit for this project. I also wanted to keep the solution within the AWS ecosystem for ease of integration.

OpenSearch Service is expensive to run, so I opted for a t3.small.search instance with an attached 10 GiB EBS volume and a single data node (yes, I know this isn't good, but I don't want to pay more). Because of this constraint, I designed the pipeline for a low read workload, since I didn't want to deal with the cluster becoming unavailable. For example, the final sentiment analysis results, which include the full documents, are saved as JSON files and stored in S3.

Initial Population of the OpenSearch Cluster

Once the cluster and scraper were set up, the next step was populating the OpenSearch cluster with data. Since the scraped comments were already stored in an S3 bucket, I considered two approaches.

The first approach was to write a one-off script that would read each file from S3, parse the comments, and save them to OpenSearch. However, this would require running the script locally due to AWS Lambda's 15-minute execution limit, which isn't suitable for large-scale processing. Given that the bucket might eventually hold millions of JSON objects, this approach would take hours to complete, and handling network errors during the process would necessitate retries, complicating the operation. Additionally, I wanted a solution flexible enough to support repopulating other databases, like DynamoDB or Aurora.

The second approach was more efficient: using two separate queues, each with an accompanying Lambda function—one for storing filenames and another for processing the comments read from the files. This design enables parallel processing, significantly speeding up the indexing process while also adding robustness. By implementing dead-letter queues, we can handle errors without interrupting the entire reseeding process, an advantage the first approach lacked. Furthermore, if I decide to use a staging table in the future, the comments producer can publish to an SNS topic or another queue that will populate another table.
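To make the first half of that design concrete, here is a minimal sketch of the filename producer: it walks the bucket and pushes object keys onto the first queue in SQS-sized batches. This assumes the AWS SDK for JavaScript v3, and the bucket and queue names are hypothetical:

```typescript
// Minimal sketch: list every JSON file in the comment bucket and enqueue the
// keys in batches of 10 (the SQS batch limit). Names are hypothetical.
import { S3Client, paginateListObjectsV2 } from "@aws-sdk/client-s3";
import { SQSClient, SendMessageBatchCommand } from "@aws-sdk/client-sqs";

const s3 = new S3Client({});
const sqs = new SQSClient({});

const BUCKET = "reddit-comment-archive"; // hypothetical bucket name
const FILENAME_QUEUE_URL = process.env.FILENAME_QUEUE_URL!;

export async function enqueueFilenames(): Promise<void> {
  const paginator = paginateListObjectsV2({ client: s3 }, { Bucket: BUCKET });
  let buffer: string[] = [];

  for await (const page of paginator) {
    for (const object of page.Contents ?? []) {
      if (object.Key) buffer.push(object.Key);
      if (buffer.length === 10) {
        await flush(buffer);
        buffer = [];
      }
    }
  }
  if (buffer.length > 0) await flush(buffer);
}

async function flush(keys: string[]): Promise<void> {
  await sqs.send(
    new SendMessageBatchCommand({
      QueueUrl: FILENAME_QUEUE_URL,
      Entries: keys.map((key, i) => ({ Id: String(i), MessageBody: key })),
    })
  );
}
```

A second Lambda then consumes these keys, reads each file, and publishes the parsed comments to the comment queue that feeds the indexer.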

With approximately 430,000 files in the S3 bucket—each containing around 70 JSON objects—I optimized the process by batching the comments into arrays rather than sending individual SQS messages for each comment. This reduced the number of SQS requests, saving me around $11.30 in operational costs. While I could have implemented a reducer to remove duplicate comments, I chose to skip this step since OpenSearch charges are based on data storage and instance usage rather than per request. Instead, I configured the comment indexer to always use the index operation, which either creates a new document or updates an existing one, ensuring that the most recent data replaces older versions.
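To illustrate the indexing side, here is a minimal sketch of that "always index" behavior using the opensearch-js bulk API. Because each document's _id is the Reddit comment ID, reprocessing a duplicate overwrites the existing document instead of creating a new one. The field names and client setup (auth is omitted) are assumptions:

```typescript
// Minimal sketch: bulk-index a batch of comments, using the comment ID as the
// document _id so duplicates overwrite rather than accumulate.
import { Client } from "@opensearch-project/opensearch";

// Auth/signing config for the managed cluster is omitted here.
const client = new Client({ node: process.env.OPENSEARCH_ENDPOINT! });

interface RedditComment {
  id: string;
  author: string;
  body: string;
  createdUtc: number;
}

export async function indexComments(comments: RedditComment[]): Promise<void> {
  // The bulk body alternates an action line with its document.
  const body = comments.flatMap((c) => [
    { index: { _index: "subreddit-dota2", _id: c.id } },
    { author: c.author, body: c.body, created_utc: c.createdUtc },
  ]);

  const response = await client.bulk({ body });
  if (response.body.errors) {
    throw new Error("Some comments failed to index");
  }
}
```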

Of course, this approach only works at a small scale like this one; once the bucket reaches a million files, I would need to add a deduplication step to avoid inflated data transfer and storage.

OpenSearch reseed pipeline

Handling Edited Comments

What happens if we always use the index operation and an edited version of a comment is processed before the original? The answer: not much. I chose not to address this issue because the subreddit doesn't see high traffic, and most edits are minor typo corrections. These minor changes are unlikely to have a significant impact on the sentiment analysis results.

Design

Comment Ingestion Pipeline


  • reddit-listener: Periodically polls Reddit servers for new Dota2 comments using the Snoostorm library. This could be extended to other subreddits as well.
  • comment-indexer: Converts Reddit comments into OpenSearch documents.
  • comment-search-cluster: An OpenSearch service that stores all the scraped comments.

The Reddit listener (scraper) needed only slight modifications to store new incoming comments directly, which avoids the previously considered script-based approach and keeps the ingestion process simple and efficient.

Comment Insight Generation

I'll admit, I may have overcomplicated the insight generation pipeline. A simpler solution would have been to use AWS Glue, a fully managed ETL service, but I wanted to design the pipeline without relying on that particular service.


  • opensearch-exporter: This component exports comments from OpenSearch for sentiment analysis. It is designed to export data on a monthly basis but is capable of supporting yearly exports as well. The exporter filters documents created in the previous month and extracts comments containing specific keywords.

These are the keywords I used to track mentions of Arteezy.


"keywords": [
     "sprouteezy",
      "artz",
      "arteezy",
      "artour",
      "clifteezy",
      "artorito",
      "arturito",
      "babaev",
      "rtz",
      "choketeezy",
      "artdoor"
]

While the exporter is currently set up to track mentions of Arteezy, it is flexible enough to support other topics as well. For example, I could start a sentiment analysis for another professional player or topic by simply changing the array of filters to the relevant keywords.

"keywords": ["7.36", "patch 7.36", "crownfall"]
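Under the hood, the export boils down to a single filtered search against the cluster: restrict by creation date to the requested month and match any of the keywords. Here is a minimal sketch using the opensearch-js client; the field names (body, created_utc) are assumptions about the document mapping, not the project's actual schema:

```typescript
// Minimal sketch: fetch one month's comments that mention any of the keywords.
// Field names (body, created_utc) are assumptions about the document mapping.
import { Client } from "@opensearch-project/opensearch";

const client = new Client({ node: process.env.OPENSEARCH_ENDPOINT! });

export async function exportComments(keywords: string[], from: string, to: string) {
  const response = await client.search({
    index: "subreddit-dota2",
    body: {
      size: 10000, // larger exports would need search_after or a scroll
      query: {
        bool: {
          // Any keyword in the comment body...
          must: [{ match: { body: keywords.join(" ") } }],
          // ...created within the requested time window.
          filter: [{ range: { created_utc: { gte: from, lt: to } } }],
        },
      },
    },
  });
  return response.body.hits.hits.map((hit: any) => hit._source);
}

// Example usage for a December 2023 export:
// exportComments(["arteezy", "rtz", "babaev"], "2023-12-01", "2024-01-01");
```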
  • generate-insight: Responsible for generating insights such as sentiment analysis. It consumes a payload from the insight-queue, which contains all the information needed to generate an insight. Jobs successfully submitted to AWS Comprehend are saved in a DynamoDB table with the following structure:
{
  "jobName": "arteezy-dota2-monthly-sentiment",
  "topicName": "arteezy",
  "periodicity": "monthly",
  "timeInterval": "may",
  "year": 2024,
  "type": "sentiment",
  "dataSourceKey": "arteezy/comment-dota2/monthly/may"
}
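For context, submitting one of these jobs comes down to a StartSentimentDetectionJob call followed by a DynamoDB write. Here is a minimal sketch using the AWS SDK for JavaScript v3; the bucket, role, and table names are hypothetical:

```typescript
// Minimal sketch: start an async Comprehend sentiment job over the exported
// comments and record the submission in DynamoDB. Names are hypothetical.
import {
  ComprehendClient,
  StartSentimentDetectionJobCommand,
} from "@aws-sdk/client-comprehend";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

const comprehend = new ComprehendClient({});
const dynamo = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export async function startSentimentJob(dataSourceKey: string): Promise<void> {
  const job = await comprehend.send(
    new StartSentimentDetectionJobCommand({
      JobName: "arteezy-dota2-monthly-sentiment",
      LanguageCode: "en",
      DataAccessRoleArn: process.env.COMPREHEND_ROLE_ARN!, // role Comprehend assumes to read/write S3
      InputDataConfig: {
        S3Uri: `s3://insight-exports/${dataSourceKey}/`, // hypothetical bucket
        InputFormat: "ONE_DOC_PER_FILE", // one comment per .txt file
      },
      OutputDataConfig: {
        S3Uri: "s3://insight-exports/comprehend-output/", // hypothetical bucket
      },
    })
  );

  await dynamo.send(
    new PutCommand({
      TableName: "insight-jobs", // hypothetical table name
      Item: {
        jobName: "arteezy-dota2-monthly-sentiment",
        jobId: job.JobId,
        topicName: "arteezy",
        periodicity: "monthly",
        timeInterval: "may",
        year: 2024,
        type: "sentiment",
        dataSourceKey,
      },
    })
  );
}
```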
  • insight-job-poller: Periodically checks AWS Comprehend for updates on sentiment jobs and retrieves the results of completed jobs. AWS Comprehend returns compressed output files (in JSONL format), which the pipeline processes.
{
  "File": "juayd50.txt",
  "Sentiment": "NEUTRAL",
  "SentimentScore": {
    "Mixed": 5.2448463065957185e-06,
    "Negative": 0.00011798929335782304,
    "Neutral": 0.9973727464675903,
    "Positive": 0.002504012081772089
  }
}
{
  "File": "juq4uln.txt",
  "Sentiment": "NEGATIVE",
  "SentimentScore": {
    "Mixed": 0.025231868028640747,
    "Negative": 0.7300700545310974,
    "Neutral": 0.23479878902435303,
    "Positive": 0.009899277240037918
  }
}
{
  "File": "juq7yer.txt",
  "Sentiment": "NEUTRAL",
  "SentimentScore": {
    "Mixed": 0.0026910887099802494,
    "Negative": 0.1895877569913864,
    "Neutral": 0.7168869972229004,
    "Positive": 0.009899277240037918
  }
}

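The polling step itself is a DescribeSentimentDetectionJob call per pending job; once the status flips to COMPLETED, the job's output location points at the compressed results shown above. A minimal sketch, again assuming the AWS SDK for JavaScript v3:

```typescript
// Minimal sketch: check whether a Comprehend sentiment job has finished and,
// if so, return the S3 URI of its compressed JSONL output.
import {
  ComprehendClient,
  DescribeSentimentDetectionJobCommand,
} from "@aws-sdk/client-comprehend";

const comprehend = new ComprehendClient({});

export async function getCompletedOutputUri(jobId: string): Promise<string | null> {
  const { SentimentDetectionJobProperties: props } = await comprehend.send(
    new DescribeSentimentDetectionJobCommand({ JobId: jobId })
  );

  if (props?.JobStatus !== "COMPLETED") {
    return null; // still IN_PROGRESS, or FAILED/STOPPED; handled elsewhere
  }
  // The archive at this URI is extracted and parsed line by line downstream.
  return props.OutputDataConfig?.S3Uri ?? null;
}
```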
  • insight-result-mapper: Fetches the corresponding Reddit comment for each sentiment result, categorizes it by sentiment type, and stores the final result in S3. This avoids querying OpenSearch repeatedly when a user visits the insights page or changes the filter. Since the comments are static (i.e., they don’t change after analysis), serving them as a static JSON file makes sense.

While this approach limits dynamic features—such as filtering sentiments by specific users or querying the highest sentiment score across all months—it was acceptable for the project's MVP. Adding these features would require a relational database, but that wasn't a priority for this phase.

Generating sentiments for previous months and years

As mentioned earlier, I began scraping data in July 2023, but the sentiment analysis pipeline was not implemented until this year. My goal was to create a simple, reproducible method to backfill sentiment analysis for previous months and years. To accomplish this, I identified the key requirements for generating sentiment insights: exporting comments from OpenSearch and storing them in S3. This process required knowing the topic name, relevant keywords, index name, and the specific year and month for analysis.

With these inputs, I built a configuration that streamlined the export process. The well-structured configuration not only made the process clear and easy to follow but also served as a blueprint for writing the code, reducing the need for refactoring. Additionally, by adjusting just a few parameters, I can easily reuse the export job for different months, making the system highly flexible and reproducible.

Here’s an example of the configuration used for exporting sentiment data:

{
  "topicName": "arteezy",
  "parentFolderPrefix": "arteezy", // where the exported comments will be stored
  "keywords": ["arteezy", "rtz", "babaev"],
  "indexName": "subreddit-dota2",
  "periodicity": "monthly",
  "month": "december",
  "year": 2023
}

For this export task, I used the dispatcher pattern. The dispatcher Lambda runs once a month and triggers another Lambda, called the "performer," which processes the configuration (shown above) and queues sentiment analysis jobs. While this pattern could be seen as an anti-pattern in some cases—due to issues like tight coupling between the two Lambdas, potential invocation limits, and increased costs—I found this tradeoff acceptable. The current pipeline only generates sentiment analysis for a single topic per month, making the workload manageable. Once scalability becomes a concern or additional topics are added, I plan to transition to a more event-driven design using SQS or SNS to decouple the components, handle higher workloads, and improve fault tolerance.
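Here is a minimal sketch of that dispatcher: it builds (or loads) the export configuration and asynchronously invokes the performer. The function name is hypothetical, and the post doesn't specify how the configuration is actually stored:

```typescript
// Minimal sketch of the dispatcher Lambda: fire-and-forget invoke of the
// "performer" with the monthly export configuration. Names are hypothetical.
import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({});

export async function handler(): Promise<void> {
  const config = {
    topicName: "arteezy",
    parentFolderPrefix: "arteezy",
    keywords: ["arteezy", "rtz", "babaev"],
    indexName: "subreddit-dota2",
    periodicity: "monthly",
    month: "december",
    year: 2023,
  };

  await lambda.send(
    new InvokeCommand({
      FunctionName: "opensearch-exporter-performer", // hypothetical performer name
      InvocationType: "Event", // async: the dispatcher doesn't wait for the result
      Payload: Buffer.from(JSON.stringify(config)),
    })
  );
}
```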

The same design principles were applied when building the insight generation process: flexibility and scalability were maintained using the dispatcher pattern and a well-structured configuration.

Comment Insight Generation with Dispatcher Pattern

Costs

On average, I pay about 33.83 USD per month. Here's the breakdown:

| Service | Price/month (USD) |
| --- | --- |
| OpenSearch Service | 28.13 |
| Simple Storage Service | 2.20 |
| Simple Queue Service | 1.88 |
| Key Management Service | 1.00 |
| Secrets Manager | 0.40 |
| DynamoDB | 0.01 |
| Data Transfer | 0.21 |
| **Total** | **33.83** |

What do I think about the Sentiment Results?

When I started this project, I had no prior experience with AWS Comprehend or AI in general. I assumed that Comprehend would be smart enough to recognize that the focus of the sentiment analysis was Arteezy, given the frequent mentions of his name. However, I quickly learned that this is not how Comprehend works—it analyzes each piece of text independently, without considering any prior context.

AWS Comprehend is a good enough solution if you just want to know the general sentiment of a given text. But if you want a more customized or specific sentiment analysis targeting a particular topic, you'll need to use a different product or write your own model.


If you've made it this far, thank you for reading. I hope you've learned a thing or two along the way. I also want to express my deep gratitude to the amazing engineers I've had the privilege of working with over the past five years: BM, DJ, TL, AP, DB, and Ed. The knowledge and experience you’ve generously shared with me have been instrumental in enabling me to build something like this.
