I have been working on large-scale distributed infrastructure for a few years now, and I have reached the point where I should put serious effort into writing down what I learn before it fades from memory.
A week or two ago, I investigated a "memory leak" in a service that talks to two clusters of distributed services to retrieve information. The service runs on the JVM and is written in Java. We took several heap dumps over the course of 3-5 days, using
  jmap -dump:live,file=<file_name> <pid>
and loaded them into YourKit to see what changed the most between heaps over time.
The heap analysis pointed to a growing amount of byte[] data, increasing at roughly 0.3 GB per day. These byte arrays came from Netty.
YourKit revealed two key traits. First, the reference chain showed that each byte[] was referenced by a DynamicChannelBuffer, which was in turn referenced by a LengthFieldBasedFrameDecoder. This is the class that "deframes" the incoming bytes prior to deserialization. Second, I noticed that in the larger heap dumps the byte[] instances tended to be longer, whereas in the smaller dumps their sizes were distributed toward the lower end of the range.
These two pieces of information helped me find the cause of this "leak". Our cluster client has to maintain hundreds of connections to the service cluster. To reduce the cost of establishing network connections, we reuse them. Each connection has a DynamicChannelBuffer associated with it, which grows by doubling its capacity when the incoming data is large, and which does not shrink again afterwards. Since our data sizes span a wide range and can be unpredictable, over time every channel eventually receives a large payload and its buffer adjusts to fit. Because of the size of the clusters, the combined memory held by these DynamicChannelBuffers is non-trivial: a couple of gigabytes in one JVM.
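To make the growth pattern concrete, here is a toy model of a grow-only buffer. It is not Netty's code, just a minimal sketch of the doubling behaviour described above; the class name, the initial capacity of 256 bytes, the 5 MB payload, and the 500 connections are all made-up numbers for illustration.

```java
import java.util.Arrays;

/** Toy model (not Netty code) of a per-connection buffer that doubles to fit data and never shrinks. */
final class GrowOnlyBufferSketch {
    private byte[] data;

    GrowOnlyBufferSketch(int initialCapacity) {
        if (initialCapacity <= 0) throw new IllegalArgumentException("initialCapacity must be positive");
        data = new byte[initialCapacity];
    }

    /** Makes room for a message of the given size by doubling the capacity as often as needed. */
    void receive(int messageSize) {
        if (messageSize < 0) throw new IllegalArgumentException("messageSize must not be negative");
        long capacity = data.length;              // long, so doubling cannot overflow
        while (capacity < messageSize) capacity *= 2;
        if (capacity > Integer.MAX_VALUE) throw new IllegalArgumentException("message too large");
        if (capacity != data.length) data = Arrays.copyOf(data, (int) capacity);
        // Note: there is deliberately no shrinking path, mirroring the behaviour described in the post.
    }

    int capacity() {
        return data.length;
    }

    public static void main(String[] args) {
        GrowOnlyBufferSketch buffer = new GrowOnlyBufferSketch(256); // hypothetical initial size
        buffer.receive(100);            // small message: capacity stays at 256
        buffer.receive(5_000_000);      // one large message: capacity doubles up to 8388608
        buffer.receive(100);            // small again: capacity stays at 8388608
        System.out.println(buffer.capacity());

        int connections = 500;          // hypothetical connection count
        long totalMiB = (long) connections * buffer.capacity() / (1024 * 1024);
        System.out.println(totalMiB + " MiB retained across " + connections + " connections");
    }
}
```

A single large response is enough to pin the capacity of one connection's buffer for as long as the connection lives, which is why the heap grows slowly and never comes back down.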
A simple fix in this case is to close the network channels every now and then, thus discarding the old channels and their associated DynamicChannelBuffers. The thing to watch is the rate at which channels are closed. Closing and reopening them too frequently makes the client chatty and can lead to a longer connection backlog on the Netty server side (the Netty server has a default backlog setting, which may need to be adjusted).
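Here is a minimal sketch of that idea, assuming a hypothetical PooledChannel type that records when it was created. It is not tied to Netty. The cap on closes per sweep is my own assumption, added to spread reconnects out so the server's accept backlog is not hit all at once.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of age-based channel recycling with a cap on how many channels are closed per sweep. */
final class ChannelRecyclerSketch {

    /** Hypothetical stand-in for a pooled network channel. */
    static final class PooledChannel {
        final long createdAtMillis;
        boolean open = true;

        PooledChannel(long createdAtMillis) {
            this.createdAtMillis = createdAtMillis;
        }

        void close() {
            open = false; // a real channel would release its socket and buffers here
        }
    }

    private final long maxAgeMillis;
    private final int maxClosesPerSweep;

    ChannelRecyclerSketch(long maxAgeMillis, int maxClosesPerSweep) {
        this.maxAgeMillis = maxAgeMillis;
        this.maxClosesPerSweep = maxClosesPerSweep;
    }

    /** Closes at most maxClosesPerSweep open channels that have reached maxAgeMillis; returns the number closed. */
    int sweep(List<PooledChannel> channels, long nowMillis) {
        int closed = 0;
        for (PooledChannel channel : channels) {
            if (closed >= maxClosesPerSweep) break;
            if (channel.open && nowMillis - channel.createdAtMillis >= maxAgeMillis) {
                channel.close();
                closed++;
            }
        }
        return closed;
    }

    public static void main(String[] args) {
        List<PooledChannel> channels = new ArrayList<>();
        for (int i = 0; i < 5; i++) channels.add(new PooledChannel(0L));

        ChannelRecyclerSketch recycler = new ChannelRecyclerSketch(1_000L, 2);
        System.out.println(recycler.sweep(channels, 500L));   // nothing is old enough yet
        System.out.println(recycler.sweep(channels, 1_000L)); // closes two of the five
    }
}
```

In a real client, something like a ScheduledExecutorService could call sweep at a fixed interval, and the pool would lazily reopen closed channels on demand.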
Another fix would be to periodically detach the FrameDecoder from the channel and replace it with a new one. This approach, though, may be tied too closely to a specific version of Netty's internals.
In general, when a client connects to large clusters, how efficiently it uses JVM memory to buffer received data is a common problem. Some choose to keep these buffers short-lived so they die in the young generation; others pool the buffers to reduce pressure on the garbage collector. Either way, whenever your system involves a client talking to a large cluster, this is an area you should remember to examine carefully.
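As an illustration of the pooling option, here is a minimal sketch of a bounded pool of fixed-size buffers. The class and its parameters are hypothetical and not taken from Netty or from our service. The point is that the pool refuses to retain oversized or surplus buffers, so one large payload cannot pin memory indefinitely.

```java
import java.util.ArrayDeque;

/** Sketch of a bounded pool of fixed-size byte[] buffers; anything it declines to keep is left to the GC. */
final class FixedSizeBufferPoolSketch {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    private final int bufferSize;
    private final int maxPooled;

    FixedSizeBufferPoolSketch(int bufferSize, int maxPooled) {
        this.bufferSize = bufferSize;
        this.maxPooled = maxPooled;
    }

    /** Returns a pooled buffer if one is available, otherwise allocates a new one. Contents are not zeroed. */
    synchronized byte[] acquire() {
        byte[] buffer = free.pollFirst();
        return buffer != null ? buffer : new byte[bufferSize];
    }

    /** Keeps the buffer only if it has the standard size and the pool is below its cap. */
    synchronized void release(byte[] buffer) {
        if (buffer.length == bufferSize && free.size() < maxPooled) {
            free.addFirst(buffer);
        }
    }

    synchronized int pooledCount() {
        return free.size();
    }

    public static void main(String[] args) {
        FixedSizeBufferPoolSketch pool = new FixedSizeBufferPoolSketch(1024, 2);
        byte[] first = pool.acquire();
        pool.release(first);
        System.out.println(pool.acquire() == first); // the same array is handed out again
        pool.release(new byte[4096]);                // wrong size: not retained
        System.out.println(pool.pooledCount());
    }
}
```

Payloads larger than the standard buffer size would need a separate, short-lived allocation in this scheme, which is the trade-off between the two approaches mentioned above.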
 