Carrie
Five Free WAF Performance Comparison

Testing the Effectiveness of WAF Protection

Attack defense is the core capability of a WAF. This article describes how to test the effectiveness of WAF protection.

To ensure the fairness of the test results, all targets, testing tools, and test samples mentioned in this article are open-source projects.

Testing Metrics

The test results are based on four main metrics:

  • Detection Rate: Reflects how comprehensively the WAF detects attacks. Attacks that get through are counted as "false negatives".
  • False Positive Rate: Reflects how much the WAF interferes with normal traffic. Normal requests that are wrongly intercepted are counted as "false positives".
  • Accuracy: A composite of the detection rate and false positive rate, balancing false negatives against false positives.
  • Detection Time: Reflects the WAF's performance; the longer the detection time, the worse the performance.

Detection time can be measured directly with tools. The other three metrics map onto the standard binary-classification counts from statistics:

  • TP (True Positives): The number of attack samples intercepted.
  • TN (True Negatives): The number of normal samples correctly allowed.
  • FN (False Negatives): The number of attack samples allowed, i.e., missed detections.
  • FP (False Positives): The number of normal requests intercepted, i.e., false alarms.

The formulas for the three metrics are:

  • Detection Rate = TP / (TP + FN)
  • False Positive Rate = FP / (TP + FP)
  • Accuracy = (TP + TN) / (TP + TN + FP + FN)

Note that the false positive rate as defined here is the share of intercepted requests that were actually benign (i.e., 1 − precision), not the textbook FP / (FP + TN).
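The formulas can be wired up directly as a sanity check; plugging in a WAF's four counts reproduces its reported percentages (the function name is mine, not from any of the tools used later):

```python
def waf_metrics(tp, tn, fp, fn):
    """Compute the three metrics exactly as defined above."""
    detection_rate = tp / (tp + fn)
    # As defined in this article: share of intercepted requests that were benign.
    false_positive_rate = fp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return detection_rate, false_positive_rate, accuracy

# Example with SafeLine's counts from the results below:
dr, fpr, acc = waf_metrics(tp=426, tn=33056, fp=38, fn=149)
print(f"{dr:.2%} {fpr:.2%} {acc:.2%}")  # 74.09% 8.19% 99.44%
```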

To reduce the impact of randomness and minimize error, I break "Detection Time" down into two metrics: "90% Average Time" and "99% Average Time".
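The article does not spell out how these two numbers are computed; one plausible reading is a tail-trimmed average, i.e., average the fastest 90% (or 99%) of requests so a handful of slow outliers cannot skew the result. A minimal sketch under that assumption:

```python
def tail_trimmed_average(times_ms, keep_fraction):
    """Average of the fastest `keep_fraction` of samples.

    Assumed interpretation of "90% Average Time": sort the per-request
    timings, drop the slowest 10%, and average the rest.
    """
    n = max(1, int(len(times_ms) * keep_fraction))
    fastest = sorted(times_ms)[:n]
    return sum(fastest) / n
```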

Test Samples

  • Data Source: All test data comes from my own browser.
  • Packet Capture Method: Use Burp Suite as a proxy: point the browser's global proxy at Burp, export the capture as an XML file, then use a Python script to split it into individual requests.
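The splitting step can be sketched in a few lines. This assumes the usual layout of a Burp "Save items" XML export, where each `<item>` holds a base64-encoded `<request>` element (the function name is my own):

```python
import base64
import xml.etree.ElementTree as ET

def parse_burp_export(xml_text):
    """Yield raw HTTP requests from a Burp Suite XML export."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        req = item.find("request")
        if req is None or req.text is None:
            continue
        raw = req.text
        # Burp marks base64-encoded bodies with base64="true".
        if req.get("base64") == "true":
            raw = base64.b64decode(raw).decode("utf-8", errors="replace")
        yield raw
```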

Based on past experience, the ratio of normal traffic to attack traffic for exposed services on the internet is usually around 100:1. We will use this ratio for sample allocation.

  • White Samples: Browsing Weibo, Zhihu, Bilibili, and various forums, collecting a total of 60,707 HTTP requests, totaling 2.7 GB (this process took 5 hours).
  • Black Samples: To ensure thorough testing, I collected black samples using four different methods, totaling 600 HTTP requests (this process took 5 hours).

The black sample collection methods are:

  1. Simple Generic Attack Traffic: Deploy a DVWA target machine and attack each generic vulnerability example.
  2. Common Attack Traffic: Use all attack payloads provided on the PortSwigger website.
  3. Targeted Vulnerability Traffic: Deploy a VulHub target machine and attack each classic vulnerability using default POCs.
  4. Countermeasure Attack Traffic: Raise DVWA's security level and attack it again at the medium and high settings.

Testing Method

With the test metrics and samples defined, we now need three things: a WAF, a target machine to receive the traffic, and testing tools.

  • WAF: All WAFs use initial configurations without any adjustments.
  • Target Machine: Use Nginx, configured to return a 200 status for any request as follows:
location / {
    return 200 'hello WAF!';
    default_type text/plain;
}
  • Testing Tools: The requirements for the testing tool are:
    • Parse Burp's export results.
    • Reassemble the HTTP requests.
    • Remove the Cookie header to ensure data can be open-sourced.
    • Modify the Host header field to ensure the target machine can receive the traffic correctly.
    • Determine if the request was intercepted by the WAF based on whether a 200 status was returned.
    • Mix black and white samples and send the requests evenly.
    • Automatically calculate the above "testing metrics".
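The checklist above maps almost line-for-line onto code. A minimal sketch of the rewrite-and-send step (my own illustration, not taken from either tool): strip the Cookie header, rewrite Host, send the raw request, and treat any non-200 response as an interception.

```python
import socket

def prepare_request(raw, host):
    """Rewrite a captured raw HTTP request: drop Cookie, set Host."""
    lines = raw.split("\r\n")
    out = [lines[0]]  # keep the request line as-is
    for line in lines[1:]:
        name = line.split(":", 1)[0].strip().lower()
        if name == "cookie":
            continue  # strip cookies so the dataset can be open-sourced
        if name == "host":
            line = "Host: " + host  # retarget at the machine behind the WAF
        out.append(line)
    return "\r\n".join(out).encode()

def is_blocked(raw, host="127.0.0.1", port=80, timeout=5.0):
    """Send the request through the WAF; a non-200 status means intercepted."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(prepare_request(raw, host))
        status_line = s.recv(4096).split(b"\r\n", 1)[0]
    return status_line.split()[1] != b"200"
```

Since the Nginx target answers 200 to everything, any other status can only have come from the WAF.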

I found two open-source WAF testing tools that look solid and cover most of the requirements. Combined, with a few details filled in, they handle everything needed:

  • gotestwaf: An open-source WAF testing tool from Wallarm.
  • blazehttp: An open-source WAF testing tool from Chaitin Tech.

Start Testing

SafeLine Community Edition

  • TP: 426
  • TN: 33,056
  • FP: 38
  • FN: 149
  • Total Samples: 33,669
  • Success: 33,669
  • Errors: 0
  • Detection Rate: 74.09%
  • False Positive Rate: 8.19%
  • Accuracy: 99.44%
  • 90% Average Time: 0.73 ms
  • 99% Average Time: 0.89 ms

Coraza

  • TP: 404
  • TN: 27,912
  • FP: 5,182
  • FN: 171
  • Total Samples: 33,669
  • Success: 33,669
  • Errors: 0
  • Detection Rate: 70.26%
  • False Positive Rate: 92.77%
  • Accuracy: 84.10%
  • 90% Average Time: 3.09 ms
  • 99% Average Time: 5.10 ms

ModSecurity

  • TP: 400
  • TN: 25,713
  • FP: 7,381
  • FN: 175
  • Total Samples: 33,669
  • Success: 33,669
  • Errors: 0
  • Detection Rate: 69.57%
  • False Positive Rate: 94.86%
  • Accuracy: 77.56%
  • 90% Average Time: 1.36 ms
  • 99% Average Time: 1.71 ms

Nginx-Lua-WAF

  • TP: 213
  • TN: 32,619
  • FP: 475
  • FN: 362
  • Total Samples: 33,669
  • Success: 33,669
  • Errors: 0
  • Detection Rate: 37.04%
  • False Positive Rate: 69.04%
  • Accuracy: 97.51%
  • 90% Average Time: 0.41 ms
  • 99% Average Time: 0.49 ms

SuperWAF

  • TP: 138
  • TN: 33,048
  • FP: 46
  • FN: 437
  • Total Samples: 33,669
  • Success: 33,669
  • Errors: 0
  • Detection Rate: 24.00%
  • False Positive Rate: 25.00%
  • Accuracy: 98.57%
  • 90% Average Time: 0.34 ms
  • 99% Average Time: 0.41 ms

Comparison Table

| WAF           | Detection Rate | False Positive Rate | Accuracy | 90% Avg Time | 99% Avg Time |
| ------------- | -------------- | ------------------- | -------- | ------------ | ------------ |
| SafeLine CE   | 74.09%         | 8.19%               | 99.44%   | 0.73 ms      | 0.89 ms      |
| Coraza        | 70.26%         | 92.77%              | 84.10%   | 3.09 ms      | 5.10 ms      |
| ModSecurity   | 69.57%         | 94.86%              | 77.56%   | 1.36 ms      | 1.71 ms      |
| Nginx-Lua-WAF | 37.04%         | 69.04%              | 97.51%   | 0.41 ms      | 0.49 ms      |
| SuperWAF      | 24.00%         | 25.00%              | 98.57%   | 0.34 ms      | 0.41 ms      |

The SafeLine Community Edition performed the best overall, with the fewest false positives and false negatives.

Conclusion

To ensure fairness and impartiality, all testing tools and data used in this article are open-sourced and available at:

https://gitee.com/kxlxbb/testwaf

Different test samples and methods may lead to significant differences in test results. It is necessary to select appropriate test samples and methods based on the actual situation.

The results of this test are for reference only and should not be used as the sole standard for evaluating products, tools, algorithms, or models.
