Cassandra cluster shows "G1 Old Generation GC in" pauses, even after removing nodes

xv8emn3q  posted 10 months ago in Cassandra

This is the situation at the company I work for, and it is costing me sleep, because I am up against a wall and out of options. I have a production cluster of 51 Cassandra 3.11.9 nodes under heavy load (600 to 800 GB per node) with a very annoying problem: for unknown reasons, the machines start performing long GC pauses, which really hurts my application's response times. Since the consistency level on the client side is ONE, this causes latency spikes.
Here are some examples:

28 | CHANGED | rc=0 >>
INFO  [Service Thread] 2023-07-14 01:56:45,583 GCInspector.java:285 - G1 Old Generation GC in 12039ms.  G1 Old Gen: 54648149088 -> 12146746552; G1 Survivor Space: 134217728 -> 0;
INFO  [Service Thread] 2023-07-14 02:14:24,394 GCInspector.java:285 - G1 Old Generation GC in 57918ms.  G1 Old Gen: 67780410096 -> 59704216816; Metaspace: 61436792 -> 61302392
INFO  [Service Thread] 2023-07-14 02:15:44,506 GCInspector.java:285 - G1 Old Generation GC in 64576ms.  G1 Old Gen: 67971190408 -> 64736391536;
INFO  [Service Thread] 2023-07-14 02:17:06,520 GCInspector.java:285 - G1 Old Generation GC in 66242ms.  G1 Old Gen: 68043573704 -> 66792790424;
INFO  [Service Thread] 2023-07-14 02:21:31,210 GCInspector.java:285 - G1 Old Generation GC in 257268ms.  G1 Old Gen: 68046631936 -> 67703054448;

254 | CHANGED | rc=0 >>
INFO  [Service Thread] 2023-07-14 02:36:26,170 GCInspector.java:285 - G1 Old Generation GC in 46654ms.  G1 Old Gen: 133621345752 -> 49403423024; Metaspace: 67436096 -> 67339688
INFO  [Service Thread] 2023-07-14 02:38:58,627 GCInspector.java:285 - G1 Old Generation GC in 89392ms.  G1 Old Gen: 133594285096 -> 103157948104;
INFO  [Service Thread] 2023-07-14 02:40:59,754 GCInspector.java:285 - G1 Old Generation GC in 93345ms.  G1 Old Gen: 135071359720 -> 105377369048; G1 Survivor Space: 33554432 -> 0;
INFO  [Service Thread] 2023-07-14 02:43:29,171 GCInspector.java:285 - G1 Old Generation GC in 106174ms.  G1 Old Gen: 133812654600 -> 119264140552; G1 Survivor Space: 234881024 -> 0;
INFO  [Service Thread] 2023-07-14 02:45:36,900 GCInspector.java:285 - G1 Old Generation GC in 95625ms.  G1 Old Gen: 135225564784 -> 99943593104;
INFO  [Service Thread] 2023-07-14 02:46:53,820 GCInspector.java:285 - G1 Old Generation GC in 55875ms.  G1 Old Gen: 133359614104 -> 60924511688; G1 Survivor Space: 872415232 -> 0;
INFO  [Service Thread] 2023-07-14 02:48:22,803 GCInspector.java:285 - G1 Old Generation GC in 38493ms.  G1 Old Gen: 133978126912 -> 36277631424;
INFO  [Service Thread] 2023-07-14 02:50:11,320 GCInspector.java:285 - G1 Old Generation GC in 34789ms.  G1 Old Gen: 134004918888 -> 35377344368;

250 | CHANGED | rc=0 >>
INFO  [Service Thread] 2023-07-14 00:18:52,262 GCInspector.java:285 - G1 Old Generation GC in 96017ms.  G1 Old Gen: 73628910144 -> 59159105432; Metaspace: 58018496 -> 57907432
INFO  [Service Thread] 2023-07-14 00:46:41,400 GCInspector.java:285 - G1 Old Generation GC in 30177ms.  G1 Old Gen: 41448088568 -> 24094354384; G1 Survivor Space: 67108864 -> 0;
INFO  [Service Thread] 2023-07-14 02:18:34,910 GCInspector.java:285 - G1 Old Generation GC in 40940ms.  G1 Old Gen: 74016882928 -> 27759131352; Metaspace: 57315192 -> 57128720
INFO  [Service Thread] 2023-07-14 02:36:02,256 GCInspector.java:285 - G1 Old Generation GC in 57658ms.  G1 Old Gen: 73488401080 -> 40838191112; Metaspace: 54701984 -> 54651552
INFO  [Service Thread] 2023-07-14 02:37:47,374 GCInspector.java:285 - G1 Old Generation GC in 87036ms.  G1 Old Gen: 73498188264 -> 65920831896;
INFO  [Service Thread] 2023-07-14 02:39:58,921 GCInspector.java:285 - G1 Old Generation GC in 111435ms.  G1 Old Gen: 73496794000 -> 70079092144;

Over the past few months I have tried several things, such as:

  • Increasing the instance type (and with it the JVM heap), but the errors only took longer to appear; they still happened anyway.
  • Replacing the nodes where the problem appeared, but the new nodes started showing the same issue.
  • Using G1GC with different JVM vendors, such as Azul.

At the moment, the following options are not available to me:

  • Changing the connections or the data model, since those depend on other teams.
  • Upgrading Cassandra to version 4; the application would need to be updated first.

Right now, the only resource I have is running "disablebinary" on the affected nodes to avoid the spikes, but that is not a good solution.
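For reference, that workaround is driven through nodetool (assuming the stock Cassandra tooling is on the PATH of each node):

```
# Stop accepting client (CQL / native protocol) connections on the hot node:
nodetool disablebinary
# ...once GC pressure subsides, resume serving clients:
nodetool enablebinary
```

Gossip and inter-node traffic keep running, so the node stays in the ring while it is not taking client requests.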
JVM

INFO  [main] 2023-07-17 18:40:11,668 CassandraDaemon.java:507 - JVM Arguments: [-javaagent:/opt/simility/include/exporters/jmxexporter/jmx_prometheus_javaagent-0.12.0.jar=7070:/opt/simility/include/exporters/jmxexporter/cassandra.yml, -ea, -javaagent:/opt/simility/include/cassandra/lib/jamm-0.3.0.jar, -XX:+UseThreadPriorities, -XX:ThreadPriorityPolicy=42, -Xms64G, -Xmx64G, -XX:+ExitOnOutOfMemoryError, -Xss256k, -XX:StringTableSize=1000003, -XX:+UseG1GC, -XX:G1RSetUpdatingPauseTimePercent=5, -XX:MaxGCPauseMillis=500, -Djava.net.preferIPv4Stack=true, -Dcassandra.config=file:///opt/simility/conf/cassandra/cassandra.yaml, -Djava.rmi.server.hostname=172.33.135.28, -Dcom.sun.management.jmxremote.port=7199, -Dcom.sun.management.jmxremote.rmi.port=7199, -Dcom.sun.management.jmxremote.ssl=false, -Dcom.sun.management.jmxremote.authenticate=false, -Dcassandra.libjemalloc=/lib64/libjemalloc.so.1, -Dlogback.configurationFile=logback.xml, -Dcassandra.config=file:///opt/simility/conf/cassandra/cassandra.yaml, -Dcassandra.logdir=/opt/simility/log/cassandra, -Dcassandra.storagedir=/opt/simility/include/cassandra/data]


Thank you!
A few things are explained in the last message.

eblbsuwk  1#

The first major cause of this behavior is humongous allocations. I wrote an article about it here:
https://medium.com/@stevenlacerda/identifying-and-fixing-humongous-allocations-in-cassandra-bf46444cec41
You can determine whether the problem is humongous allocations by adding the following to your JVM settings:

-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps 
-XX:+PrintHeapAtGC 
-XX:+PrintTenuringDistribution 
-XX:+PrintGCApplicationStoppedTime 
-XX:+PrintPromotionFailure 
-XX:PrintFLSStatistics=1
-Xloggc:<file-path>
-XX:+PrintAdaptiveSizePolicy
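The flags above are the legacy (JDK 8 and earlier) GC-logging options, which match Cassandra 3.11's usual runtime. If you ever run on JDK 9 or later, those flags were removed; a rough unified-logging equivalent (the file path is an example) would be:

```
-Xlog:gc*,gc+heap=debug,gc+age=trace,safepoint:file=/path/to/gc.log:time,uptime,level,tags
```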

Once you have the GC logs, you can look for humongous allocations in them:

14999:  6153.047: [G1Ergonomics (Concurrent Cycles) do not request concurrent cycle initiation, reason: still doing mixed collections, occupancy: 18253611008 bytes, allocation request: 4299032 bytes, threshold: 9663676380 bytes (45.00 %), source: concurrent humongous allocation]


That is an allocation of about 4.1 MiB. In this example the heap region size is 8 MiB, so a humongous allocation is anything larger than half the region size. The region size is calculated as:
heap size in MiB / 2048 = X MiB, rounded down to the nearest power of two (2, 4, 8, 16, 32, where 32 is the maximum).
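As a sanity check, the rule above can be sketched in a few lines of Python (a simplified model of G1's default heuristic, not the JVM's exact code):

```python
def g1_region_size_mib(heap_mib: int) -> int:
    """Approximate G1's default region size: heap_mib / 2048,
    rounded down to a power of two, capped at 32 MiB."""
    target = heap_mib // 2048
    size = 1
    while size * 2 <= target and size * 2 <= 32:
        size *= 2
    return size

def humongous_threshold_mib(heap_mib: int) -> float:
    # An allocation is humongous when it reaches half a region.
    return g1_region_size_mib(heap_mib) / 2

# The 64 GiB heap from the question (-Xmx64G) -> 32 MiB regions:
print(g1_region_size_mib(64 * 1024))   # 32
# A 20 GiB heap (like the log excerpt below) -> 8 MiB regions:
print(g1_region_size_mib(20 * 1024))   # 8
```

With 8 MiB regions, anything at or above 4 MiB is humongous, which is exactly why the ~4.1 MiB request above qualifies.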
You can also see the heap region size in the GC log:

14.414: [G1Ergonomics (Concurrent Cycles) request concurrent cycle initiation, reason: requested by GC cause, GC cause: Metadata GC Threshold]
{Heap before GC invocations=0 (full 0):
 garbage-first heap   total 20971520K, used 212992K [0x00000002c0000000, 0x00000002c0805000, 0x00000007c0000000)
  region size 8192K, 27 young (221184K), 0 survivors (0K)


In this case, it is 8192K (8 MiB).
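Rather than eyeballing the log, the ergonomics lines can be scanned programmatically. A small sketch (the sample line is the one quoted above; the regex and helper name are mine):

```python
import re

SAMPLE = ("14999:  6153.047: [G1Ergonomics (Concurrent Cycles) do not request "
          "concurrent cycle initiation, reason: still doing mixed collections, "
          "occupancy: 18253611008 bytes, allocation request: 4299032 bytes, "
          "threshold: 9663676380 bytes (45.00 %), source: concurrent humongous allocation]")

# Matches ergonomics lines attributed to humongous allocations.
REQUEST = re.compile(r"allocation request: (\d+) bytes.*humongous allocation")

def humongous_requests(lines):
    """Return the byte sizes of humongous allocation requests in a GC log."""
    return [int(m.group(1)) for line in lines
            if (m := REQUEST.search(line))]

sizes = humongous_requests([SAMPLE])
print(sizes)              # [4299032]
print(max(sizes) / 2**20) # ~4.1 MiB, i.e. humongous for any region size <= 8 MiB
```

Feeding it the whole gc.log (one line per list element) gives you the distribution of humongous request sizes, which tells you whether a bigger region size would actually absorb them.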
If you do see humongous allocations, what you can do is increase the heap size (the -Xmx parameter), or increase the region size:

-XX:G1HeapRegionSize=32M


32M is the maximum.
Ideally this is not the solution; the real solution is to lower the size of your mutations, but this should give you some breathing room. If it does not, then you may have a situation where a single object is causing your headaches. I have seen that before. To identify that problem, you need a heap dump taken just before the problem occurs, which is not easy. In the ideal case you would set HeapDumpOnOutOfMemoryError, the node would OOM, and you would then have a heap dump you can analyze with Eclipse MAT.
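In jvm options form, the heap-dump setup mentioned above would look like this (the dump path is an example; note the node currently sets -XX:+ExitOnOutOfMemoryError, so the process will exit after writing the dump):

```
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/cassandra/heapdump.hprof
```

Make sure the target filesystem has room for a dump roughly the size of the live heap.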
