I have a dev Cassandra cluster that other developers and I use for testing our websites. It has around 50-60 keyspaces. We had an issue with this cluster: constantly high CPU usage even with no reads or writes. The cluster has 3 nodes with 4 CPU cores assigned to each, and every node sat at 400% CPU almost all the time, even when completely idle. I contacted our internal Cassandra expert about this and was told that such load was normal given the number of keyspaces. As more developers started using the cluster, Cassandra got noticeably worse. When I tried to deploy my website (which uses Cassandra as storage), I realized that 2 out of 3 nodes were down. The same thing had already happened the day before. I decided to get to the bottom of why Cassandra was so slow, whatever it took.
I came across an article on Cassandra tuning: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
The page refers to https://github.com/aragozin/jvm-tools, a set of really powerful tools for analyzing Java process stats.
I downloaded the binary from https://mavenlibs.com/jar/file/org.gridkit.jvmtool/sjk (since the original link was unavailable) and ran its ttop command against the local JMX port (7199), sorted by CPU and limited to the top 30 threads:
wget https://repo1.maven.org/maven2/org/gridkit/jvmtool/sjk/0.21/sjk-0.21.jar
java -jar sjk-0.21.jar ttop -s localhost:7199 -o CPU -n 30
It gave me a clue about what was loading Cassandra:
2023-07-28T18:43:01.041-0500 Process summary
process cpu=106.16%
application cpu=105.76% (user=104.62% sys=1.15%)
other: cpu=0.39%
thread count: 95
heap allocation rate 14mb/s
[001422] user=30.05% sys= 0.09% alloc= 5407kb/s - prometheus-http-1-3
[000542] user=20.31% sys= 0.02% alloc= 5424kb/s - prometheus-http-1-2
[000078] user=19.28% sys= 0.11% alloc= 659kb/s - prometheus-http-1-1
[002413] user=18.87% sys= 0.09% alloc= 1311kb/s - prometheus-http-1-5
[002050] user=12.92% sys= 0.02% alloc= 15kb/s - prometheus-http-1-4
[004127] user= 0.92% sys= 0.37% alloc= 872kb/s - RMI TCP Connection(64)-172.26.34.166
[000081] user= 0.82% sys= 0.12% alloc= 393kb/s - read-hotness-tracker:1
[000047] user= 0.31% sys= 0.20% alloc= 3160b/s - ScheduledFastTasks:1
[000030] user= 0.31% sys= 0.03% alloc= 1039kb/s - OptionalTasks:1
[003238] user= 0.21% sys= 0.12% alloc= 24kb/s - Native-Transport-Requests-1
[003268] user= 0.10% sys= 0.06% alloc= 15kb/s - Thread-4
[000034] user= 0.10% sys=-0.02% alloc= 0b/s - LocalPool-Cleaner
[000029] user= 0.00% sys= 0.05% alloc= 1560b/s - PERIODIC-COMMIT-LOG-SYNCER
[000015] user= 0.00% sys= 0.05% alloc= 3122b/s - ScheduledTasks:1
[000033] user= 0.00% sys= 0.04% alloc= 0b/s - Reference-Reaper
[004128] user= 0.10% sys=-0.07% alloc= 3643b/s - JMX server connection timeout 4128
[003224] user= 0.00% sys= 0.03% alloc= 11kb/s - GossipStage:1
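With many threads in the list, it can help to add up CPU per thread group rather than eyeball individual rows. The following is a minimal sketch, assuming the ttop output was saved to a file and has the line layout shown above; `summarize_ttop` is a hypothetical helper name, not something sjk provides.

```shell
# summarize_ttop: hypothetical helper, not part of sjk.
# Reads saved ttop output (files or stdin) and sums the user CPU column
# per thread group, where a group is the thread name with trailing
# counters such as "-1-3" or ":1" stripped.
summarize_ttop() {
  awk '
    /^\[[0-9]+\] user=/ {
      # The thread name is everything after the " - " separator.
      name = substr($0, index($0, " - ") + 3)
      sub(/[-:. 0-9]+$/, "", name)

      # Pull out the number that follows "user=" (it may be space-padded).
      if (match($0, /user= *-?[0-9.]+/)) {
        value = substr($0, RSTART + 5, RLENGTH - 5)
        gsub(/ /, "", value)
        cpu[name] += value
      }
    }
    END {
      for (name in cpu) printf "%.2f%% %s\n", cpu[name], name
    }
  ' "$@" | sort -rn
}
```

For example, `summarize_ttop ttop.txt` (where `ttop.txt` is an assumed file name holding one captured sample) prints one line per thread group, highest CPU first. If the file contains several samples, the totals add up across all of them, so trim it to a single sample first.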
It was the Prometheus metrics collector: the five prometheus-http threads accounted for nearly all of the CPU usage. It looks like it was misconfigured somehow. I disabled it in /etc/cassandra/conf/cassandra-env.sh, and CPU load finally went from 400% to 0%.
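The post does not show the line that was changed, so the following is only a sketch of one way such a collector might be switched off. It assumes the exporter is attached through a `-javaagent:` option whose jar path contains "prometheus" (as with the Prometheus JMX exporter); `disable_prometheus_agent` is a hypothetical helper name, and the actual option in your file may look different.

```shell
# disable_prometheus_agent: hypothetical helper for illustration only.
# Assumption: the collector is loaded by a line in cassandra-env.sh that
# contains "-javaagent:" followed by a path with "prometheus" in it.
# The original post does not show the real line.
disable_prometheus_agent() {
  conf="$1"

  # Comment out matching lines that are not already comments.
  # sed keeps a copy of the previous file contents as "$conf.bak".
  sed -i.bak \
    -e '/^[[:space:]]*#/!s/^\(.*-javaagent:.*prometheus.*\)$/# \1/' \
    "$conf"

  # Show what the javaagent lines look like now, for review.
  grep -n -e '-javaagent:' "$conf" || true
}
```

It would be called as `disable_prometheus_agent /etc/cassandra/conf/cassandra-env.sh`. Because cassandra-env.sh is read when the node starts, Cassandra has to be restarted on each node before the change takes effect.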