At Alertpix we allow streamers to receive donations from their audience via Pix Instant Payment and show an alert on the live stream.
Being an early-stage startup, we can easily stay close to our streamers. As usual, I was present when our first incident occurred. A donation with a single emoji made a streamer richer, and chaos erupted in our backend.
The live stream
RBQK, one of our initial streamers, had just started their livestream. After a few minutes, the first donation alert appeared.
While everything seemed fine, and the donation goals updated correctly after the first donation, a Discord message and a Rollbar email gave me chills.
A 500 error occurred after the second donation. I waited a few seconds to see if the alert would appear on the live stream.
Realizing it wouldn't show up, I started to feel anxious, and so did the donor.
Logs and monitoring
Since day one, we were committed to having robust monitoring. However, as we never had any incidents in the payment process, we didn't realize that the logs in this flow were not as comprehensive as intended.
The error log was not helpful:
TypeError: Cannot read properties of null (reading '0')
    at Filter.clean (/app/node_modules/bad-words/lib/badwords.js:58:41)
Obviously, we had the stack trace, and it pointed to the part of the code that verifies if the message for the streamer has sensitive words and applies a filter to it before creating the transaction and dispatching the donation alert:
const isSensitive = fastify.badWords.isProfane(charge.comment)
const redactedComment = fastify.badWords.clean(charge.comment)
This code uses the bad-words library as a Fastify plugin.
However, this made no sense to me. I became nervous as I saw the error reports and emails repeating every few minutes or so.
Webhooks and no conditional updates
As I received more of these alerts, I realized the payment processor was retrying the route until it received a 200 response.
I checked the database and found that the charge was marked as paid, but no transaction was registered. This meant that the payment processing part was executing every time the route was called, and we couldn't show alerts because no transaction was registered.
However, I calmed down because we can't process a transaction if the charge has already been processed, right?
Actually, the biggest mistake was in this part of the code:
export async function paymentReceived(data: Data, fastify: Fastify) {
  const charge = await ChargeModel.findById(data.id)
  if (!charge) {
    return { error: 'Could not find related Charge' }
  }
  // ...
}
I felt so dumb when I realized that a second check, one that stops processing when the charge status is already paid, should have been there but wasn't. So I rushed and added it:
export async function paymentReceived(data: Data, fastify: Fastify) {
  const charge = await ChargeModel.findById(data.id)
  if (!charge) {
    return { error: 'Could not find related Charge' }
  }
  if (charge.status === 'paid') {
    return {
      data: {}
    }
  }
  // ...
}
I committed to the main branch and waited. Bingo! The errors stopped.
An 🥰 emoji and no more chaos
With the chaos over, I investigated the donation in more depth to understand why the bad-words check had failed.
It seemed perfect: it was paid, the payment provider showed the transaction on their side, and it had all the required info:
- amount
- user name
- comment
However, the comment was just this: 🥰.
Yes, the string was only an emoji.
So I rushed to my console and ran:
const BadWords = require('bad-words')
const filter = new BadWords()
filter.clean("🥰")
And there it was! The console screamed the same error as before.
As it was late at night, I sent a message to the streamer in their DM and on the Twitch chat, apologizing, and called it a day.
At 6 AM the next day, I found the issue. The clean method in bad-words splits the string, replaces each word that is profane, and then joins the pieces back together using the separator matched by its split regex. When the string has nothing for that regex to match, such as a lone emoji, the regex returns null, and we cannot access index 0 of null.
clean(string) {
  return string.split(this.splitRegex).map((word) => {
    return this.isProfane(word) ? this.replaceWord(word) : word;
  }).join(this.splitRegex.exec(string)[0]);
}
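The failure can be reproduced without the library at all. The sketch below assumes the split pattern is a plain word boundary (`/\b/`), which I believe is the library's default but have not confirmed for every version; `safeSeparator` is a hypothetical helper showing one null-safe way to pick the join separator, not the library's actual fix.

```javascript
// Assumption: the split pattern is a word boundary, as in the library's default.
const splitRegex = /\b/;

// A word boundary only exists next to a word character ([A-Za-z0-9_]).
const withWords = splitRegex.exec('hello world'); // match object, matched text is ''
const emojiOnly = splitRegex.exec('🥰');           // null: no word characters at all

// Hypothetical null-safe variant of the join step: fall back to an empty
// separator when the pattern matches nothing.
function safeSeparator(text) {
  const match = splitRegex.exec(text);
  return match ? match[0] : '';
}

console.log(emojiOnly);            // null
console.log(safeSeparator('🥰'));  // ''
```

Indexing `emojiOnly[0]` directly is what raises the `TypeError` from the error log.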
It was clear to me: copy the code, fix it myself, and open a PR. So I wrote my own implementation of the library in 5 minutes, ran it against every comment we had in the database, and compared each comment's isProfane status with the result from my implementation.
The code turned out to be simple:
// badWords is assumed to be an array of profane words (the word list)
const filter = {
  isProfane(text) {
    return Boolean(badWords.find((word) => {
      // Escape non-word characters so the word is matched literally
      const wordExp = new RegExp(`\\b${word.replace(/(\W)/g, '\\$1')}\\b`, 'gi');
      return wordExp.test(text)
    }))
  },
  clean(text) {
    return text
      .split(' ')
      .map((word) => this.isProfane(word) ? this.replaceWord(word) : word)
      .join(' ')
  },
  replaceWord(string) {
    return '*'.repeat(string.length)
  }
}
Now I can use it just like before. The clean method can be written in a single line of code. I even added isProfane and replaceWord, which are simple to code as well.
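To give an idea of the kind of check I mean, here is a minimal, self-contained sketch of the replacement filter run against a few sample comments. The word list and the sample comments are hypothetical placeholders, not our real data, and the filter is a compact restatement of the code above rather than the exact production version.

```javascript
// Hypothetical word list; the real one would come from the library's dictionary.
const badWords = ['darn', 'heck'];

const filter = {
  isProfane(text) {
    return badWords.some((word) => {
      // Escape non-word characters so the word is matched literally.
      const escaped = word.replace(/(\W)/g, '\\$1');
      return new RegExp(`\\b${escaped}\\b`, 'i').test(text);
    });
  },
  replaceWord(word) {
    return '*'.repeat(word.length);
  },
  clean(text) {
    // Splitting and joining on a fixed separator never depends on a regex match,
    // so an emoji-only or empty comment cannot produce a null separator.
    return text
      .split(' ')
      .map((word) => (this.isProfane(word) ? this.replaceWord(word) : word))
      .join(' ');
  },
};

// Hypothetical sample comments, including the emoji-only case from the incident.
const samples = ['🥰', '', 'what the heck', 'great stream!'];
const cleaned = samples.map((comment) => filter.clean(comment));
console.log(cleaned);
```

The emoji-only and empty comments pass through unchanged instead of throwing.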
Once I was done, I deployed to production, and a week later I still have not seen any problems with a single emoji. Of course, we still had to normalize the database.
No longer rich
Right away I ran the script to get the streamer's wallet balance based on the transaction history. The user had 100 bucks more because the webhook was called many times within an hour.
So I normalized the wallet balance, created the transaction for the processed charge, and added the charge amount to the wallet balance.
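The reconciliation idea is simple enough to sketch. This is not the actual script: the field names, the amounts (in cents), and the `balanceFromHistory` helper are all hypothetical, and it only illustrates recomputing a balance from the ledger and measuring the drift against the stored value.

```javascript
// Hypothetical ledger entries for one streamer, amounts in cents.
const transactions = [
  { chargeId: 'a', amount: 500 },
  { chargeId: 'b', amount: 1000 },
];

// Hypothetical stored balance, inflated by repeated webhook processing.
const storedBalance = 11500;

// The ledger is the source of truth: the balance is the sum of its entries.
function balanceFromHistory(entries) {
  return entries.reduce((sum, entry) => sum + entry.amount, 0);
}

const expectedBalance = balanceFromHistory(transactions);
const drift = storedBalance - expectedBalance;

console.log({ expectedBalance, drift });
```

A non-zero drift is the amount to correct before adding back any charge that was paid but never recorded as a transaction.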
The future and lessons learned
The primary part of running an MVP is to do the best you can with only a few features in hand. That is why I don't think about scaling until we need to. Otherwise, we lose time to market and end up with a well-optimized server for a hundred thousand concurrent users before we can get our first paying customer.
For sure, we shouldn't scale before we need to, but dealing with other people's money is dangerous. We know that, and we had only one active streamer. However, what if we had five? Would we lose 500 bucks? How would we be able to handle such issues? So from now on, we have implemented the following measures to prevent incidents like that:
- Better logging: We log more information to Rollbar, providing us with a small payload that shows what happened and what we need to know.
- Conditional writes: After more testing, we are now confident that an entity cannot be modified unless its current status allows it.
- Event-based: We've shifted most of the payment receiving processes to use queues. This way, if something goes wrong, we can fix the issue and resume from where it stopped.
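The conditional-write idea from the list above can be sketched with an in-memory stand-in for the database. Everything here is an assumption for illustration: the `charges` map, the `wallet` object, and the `markPaidIfPending` helper are hypothetical, and in a real database the check and the status change would need to be one atomic operation (for example, a single update filtered on the current status).

```javascript
// In-memory stand-in for the charges collection (hypothetical data, cents).
const charges = new Map([['charge-1', { status: 'pending', amount: 100 }]]);
const wallet = { balance: 0 };

// Conditional write: flip the status only if the charge is still pending.
// Returns true only for the one call that performs the transition.
function markPaidIfPending(chargeId) {
  const charge = charges.get(chargeId);
  if (!charge || charge.status !== 'pending') return false;
  charge.status = 'paid';
  return true;
}

function paymentReceived(chargeId) {
  if (!markPaidIfPending(chargeId)) {
    // Duplicate or unknown charge: acknowledge without side effects.
    return { data: {}, processed: false };
  }
  wallet.balance += charges.get(chargeId).amount;
  return { data: {}, processed: true };
}

// Simulate the payment processor retrying the same webhook three times.
const results = ['charge-1', 'charge-1', 'charge-1'].map(paymentReceived);
console.log(results.map((r) => r.processed), wallet.balance);
```

Only the first call credits the wallet; the retries are acknowledged and ignored, which is the behavior the missing status check should have guaranteed.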