Summary
Recently, while working on the realcomp project, I've been greatly disturbed by how much more memory the Java version uses than the C++ version. With approximately 30,000 active connections, the Java version was using just under 5.3GB, whereas the C++ version was using only about 300MB.
So what, in a basic server that does nothing more than receive 512-byte messages and echo them back to the client, can possibly require so much memory?
A Short Explanation of the Purpose and Benefit of Buffered IO
If you're not entirely clear on what the purpose and benefit of using buffered IO is, here is a short explanation.
Generally, it is considered an expensive operation to read from or write to an external source (in this case, a socket). If there is 1KB of data to be read from the socket and you read it in chunks of, let's say, 100 bytes at a time, then you will be performing 11 read operations. Each such read operation incurs a context switch from user mode to kernel mode, because the socket buffer is maintained by the kernel. What happens when you use buffered IO, through something like the BufferedReader class, is that the first time you read 100 bytes from the socket, the BufferedReader actually retrieves the entire 1KB of data from the socket and stores it in its own buffer, so that subsequent read operations do not require accessing the socket buffer, or the context switch to kernel mode.
As long as the buffering mechanism (e.g. BufferedReader) is able to retrieve more data from the socket than the application is actually requesting at that moment, then there should be a performance benefit. Exactly how much of a benefit depends, of course, on how much overhead is avoided by not having to make additional system calls.
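As a minimal sketch of that idea (the class and method names here are just for illustration, not taken from realcomp), wrapping a socket's stream in a BufferedReader looks roughly like this:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;

class BufferedReadExample {
    // Reads a connection in 100-char chunks. The first read() pulls a large
    // block from the socket into BufferedReader's internal buffer; most of the
    // following read() calls are then served from that buffer without another
    // trip into the kernel.
    static void readInSmallChunks(Socket socket) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(socket.getInputStream()));
        char[] chunk = new char[100];
        int charsRead;
        while ((charsRead = reader.read(chunk, 0, chunk.length)) != -1) {
            // Handle 'charsRead' characters from 'chunk' here.
        }
    }
}
```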
Now, let's proceed!
Investigation
My first assumption was that the high memory usage was due to having 30,000 threads running. Except that at process startup we pre-start a pool of 30,000 threads, in order to avoid the overhead of spawning them when connections are received, and after the server has started up, but before any connections start coming in, the memory usage with all 30,000 threads is only around 642MB. The threads alone don't come close to explaining 5.3GB.
Purely coincidentally, while profiling the server code to try to figure out what exactly was taking so much CPU time, I happened to notice that 82% of the runtime was being spent in BufferedReader.read(). I spent some time investigating that method and the ones it calls, but nothing obvious jumped out at me. Nevertheless, to try to improve the performance of the Java version, I thought I'd try reading/writing directly from/to the socket, rather than going through the BufferedReader and BufferedWriter classes. Perhaps they were, internally, doing something unnecessary.
Within the realcomp method HttpConnectionHandler.Run(), I replaced the use of BufferedReader and BufferedWriter with InputStream and OutputStream, respectively. Unfortunately, what I found was that the CPU usage, under the same test, went from a median of 53.8% to 63.0%. That struck me as extremely peculiar, because InputStream and OutputStream work with binary data, whereas BufferedReader and BufferedWriter work with character data. That means InputStream and OutputStream do not perform any decoding or transformation on the data being received or sent, and that fact alone should have resulted in less overhead. What troubled me even more is that the client sends a single 512-byte message, which the server then reads in a single call to the read() method of the given "reader" class, specifying a length argument of 512. In other words, we should not actually have been utilizing the buffering provided by BufferedReader/BufferedWriter at all: if we call BufferedReader.read() with a length argument of 512, and there are only 512 or fewer bytes to be read from the socket, then there is nothing more for BufferedReader to read from the socket and store in its internal buffer.
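I'm not reproducing the actual realcomp code here, but the unbuffered variant looks roughly like the following sketch (the echoOnce() helper and its structure are hypothetical):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

class UnbufferedEchoSketch {
    private static final int MESSAGE_SIZE = 512;

    // Read up to 512 bytes straight from the socket's InputStream and write
    // them straight back out, with no buffering layer in between. Note that
    // read() may return fewer than 512 bytes; a production handler would loop
    // until the whole message has arrived.
    static void echoOnce(Socket socket) throws IOException {
        InputStream in = socket.getInputStream();
        OutputStream out = socket.getOutputStream();

        byte[] message = new byte[MESSAGE_SIZE];
        int bytesRead = in.read(message, 0, MESSAGE_SIZE);
        if (bytesRead > 0) {
            out.write(message, 0, bytesRead);
            out.flush();
        }
    }
}
```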
At this point, I should remind you that since BufferedReader/BufferedWriter work with character data, the actual buffer size, in bytes, will be at least twice the size you specify. That is because the char type in Java is 2 bytes (16 bits).
In consideration of the foregoing, we should not have seen a performance difference between reading from the socket using buffered IO and using direct, unbuffered IO. In fact, it would seem, using non-buffered binary reads should have yielded better results than using buffered, character based reads. If nothing else, there are fewer internal (to the standard Java library) operations being performed.
Needless to say (though I will, anyway), I was intrigued and compelled by my desire to know everything, to investigate further.
But first, and more critical to the topic of this post, I happened to notice another significant effect on the runtime results after switching from BufferedReader to InputStream. The memory usage, after running for 5 minutes with 22,500 to 30,000 active connections, went from about 5.3GB down to only 1.6GB. Eureka! Was it possible I had found the cause of the memory hogging?
Up to that point I had been creating the BufferedReader and BufferedWriter without specifying a buffer size. I foolishly assumed the wonderful engineers at Oracle and Sun would have used a reasonable buffer size, or better yet, that the buffer size would expand and shrink based on recent load. I was surprised to find that the default buffer size is 8,192 (remember, that's characters, so each buffer is actually at least 16,384 bytes, and there are separate reader and writer buffers, so really we're looking at 32,768 bytes for each connection)! With that default buffer size, our server was eating up 5.3GB of system memory.
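As a rough back-of-the-envelope check (counting only the raw char arrays backing the buffers, not object headers, padding, or the rest of the heap growth), the default buffers alone account for a sizable chunk of memory at 30,000 connections:

```java
class DefaultBufferCost {
    public static void main(String[] args) {
        // Count only the raw char arrays backing the default buffers.
        long readerBuffer = 8_192L * 2;                    // 8,192 chars x 2 bytes = 16,384 bytes
        long writerBuffer = 8_192L * 2;                    // another 16,384 bytes
        long perConnection = readerBuffer + writerBuffer;  // 32,768 bytes per connection
        long total = perConnection * 30_000;               // ~983 MB across 30,000 connections

        System.out.println(perConnection + " bytes per connection, " + total + " bytes total");
    }
}
```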
I played around with various buffer sizes and reran the tests after each change. Here are the results:
| Configuration | Memory Usage (MB) | Median CPU Usage | Average CPU Usage |
|---|---|---|---|
| BufferedReader / BufferedWriter (256 char buffer) | 3,763 | 61.48% | 61.05% |
| BufferedReader / BufferedWriter (512 char buffer) | 3,801 | 50.35% | 51.45% |
| BufferedReader / BufferedWriter (4096 char buffer) | 4,606 | 52.15% | 56.93% |
| BufferedReader / BufferedWriter (default 8192 char buffer) | 5,289 | 53.79% | 58.46% |
| BufferedReader / BufferedWriter (16384 char buffer) | 7,213 | 52.55% | 58.03% |
| BufferedInputStream / BufferedOutputStream (512 byte buffer) | 1,790 | 53.54% | 58.95% |
| BufferedInputStream / BufferedOutputStream (8192 byte buffer) | 3,708 | 55.99% | 57.48% |
| InputStream / OutputStream | 1,639 | 62.95% | 67.64% |
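For reference, the variants in the table differ only in how the streams around the socket are constructed. The realcomp code isn't reproduced here, but the buffer sizes would be specified through the standard two-argument constructors, roughly like this:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.Socket;

class StreamVariants {
    // Character-based, buffered: the size is in chars, so 512 here means 1,024 bytes.
    static BufferedReader charReader(Socket s, int bufferChars) throws IOException {
        return new BufferedReader(new InputStreamReader(s.getInputStream()), bufferChars);
    }

    static BufferedWriter charWriter(Socket s, int bufferChars) throws IOException {
        return new BufferedWriter(new OutputStreamWriter(s.getOutputStream()), bufferChars);
    }

    // Byte-based, buffered: the size is in bytes.
    static BufferedInputStream byteReader(Socket s, int bufferBytes) throws IOException {
        return new BufferedInputStream(s.getInputStream(), bufferBytes);
    }

    static BufferedOutputStream byteWriter(Socket s, int bufferBytes) throws IOException {
        return new BufferedOutputStream(s.getOutputStream(), bufferBytes);
    }
}
```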
As might be expected with BufferedReader, when we specify a buffer smaller than the amount of data to be read each time, the CPU usage goes up (61.48%). And when we specify a buffer size exactly matching the amount of data in the socket buffer and being requested (i.e. 512 chars), we get the best CPU performance (50.35%). What's interesting is that specifying a buffer size larger than the amount of data actually read also results in lower performance. I would expect that to waste space, but not to adversely affect CPU performance.
What really baffles me about these numbers, though, is that BufferedInputStream actually performs worse than BufferedReader. This is vexing because it seems counter-intuitive that handling character data, which requires additional processing, would use less CPU time than handling binary data, which is just read and passed on.
You may also notice the significant memory usage difference between BufferedReader and BufferedInputStream when specifying a buffer of 512. This can be attributed to the fact that BufferedInputStream's buffer is measured in bytes, whereas BufferedReader's is measured in chars (which are 2 bytes each).
So What! Memory is Cheap!
I'm sure there are going to be some people out there who will say, "But memory is cheap. It's no big deal to just add more memory to the server." That is true. Memory is fairly inexpensive now.
Main system memory, anyway. Now, the memory that's used for your processor cache - that's a different story. And the more memory your application is actively referencing, the more frequently it is going to evict cache lines, which means more cache misses, which means more loads from slow system memory, which means much higher CPU utilization...and a much slower application.
The CPU cache is shared by all processes running on the server, so if your program is causing excessive cache line evictions then the entire server will be affected.
Also, you cannot upgrade the processor cache without upgrading the entire processor - and that probably won't be cheap. If the server is a multi-processor system (as any decently powerful server should be), then you may have to upgrade all of the processors at the same time.
And finally, if you're unfortunate enough to be developing a high performance, scalable system in Java then you have absolutely no control over the placement of your data structures in memory, so trying to improve CPU cache performance is going to be futile anyway.
Conclusions
It would seem that, in order to get the best performance in terms of scalability and handling large numbers of concurrent users, we would want to set the buffer size of whatever buffering mechanism we use (e.g. BufferedReader, BufferedWriter) as close as possible to the size of the messages the server will be receiving, without going below that size. Unfortunately, in a real-world system that would be almost impossible to do reliably, because messages will likely come in many different sizes. One possible approach might be for the server to regularly monitor the sizes of the messages being received and, each time a new BufferedReader is created in response to a new connection, use that information to decide how large to make the buffer (a rough sketch of that idea follows). But then we're heading down the road of writing sophisticated code just to get around problems with the Java language/libraries/platform. If you go down the road of introducing sophisticated designs to appease the language, then you might as well just use C++ in the first place and not have to deal with it.
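Just to make that idea concrete (and to show the kind of extra machinery it would take), here is a hypothetical sketch of such a monitor; none of this exists in realcomp:

```java
import java.util.concurrent.atomic.AtomicInteger;

class AdaptiveBufferSizer {
    // Largest message seen recently, in chars, seeded with a starting guess.
    private final AtomicInteger largestRecentMessage = new AtomicInteger(512);

    // Connection handlers would call this after each message is read.
    void recordMessageSize(int chars) {
        largestRecentMessage.accumulateAndGet(chars, Math::max);
    }

    // Called when a new connection arrives, so the new BufferedReader's buffer
    // is never smaller than the messages actually being received.
    int suggestedBufferChars() {
        return largestRecentMessage.get();
    }
}
```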
I really think the only reasonable approach to something like this is to put the buffer size into the application's runtime configuration repository (be it a config file, an LDAP server, or a database). That way, the system administrator or operations department can set the buffer sizes for optimal performance based on what they encounter in production. One thing we cannot do as software engineers is assume that an appropriate buffer size today will still be an appropriate buffer size tomorrow, or next year.
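A minimal sketch of that approach, assuming a plain properties file and an invented property name ("server.reader.buffer.chars" and its default are purely illustrative), might look like this:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;
import java.util.Properties;

class ConfiguredBuffers {
    private final int readerBufferChars;

    ConfiguredBuffers(String configPath) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(configPath)) {
            props.load(in);
        }
        // The property name and its default of 512 are invented for this
        // sketch; the point is that operations tunes them, not the code.
        readerBufferChars = Integer.parseInt(
                props.getProperty("server.reader.buffer.chars", "512"));
    }

    BufferedReader newReader(Socket socket) throws IOException {
        return new BufferedReader(
                new InputStreamReader(socket.getInputStream()), readerBufferChars);
    }
}
```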
The Proof
I never believe anything unless I can review the code and the results myself, firsthand, in my own environment. And I would never expect anyone else to, either (though sadly, that's just not the case nowadays; people will believe anything as long as enough other people believe it).
So, here is the project/source code that was used for the tests: realcomp.tgz.
The specific area of interest is the HttpConnectionHandler.Run() method, which can be found in /realcomp/realcomp_java_server/src/solaronyx/realcomp/server/HttpConnectionHandler.java. Or, if you use NetBeans, you can load the project and save yourself a lot of time.
And here is the output from running 'top' on the JVM while maintaining 22,500 to 30,000 active connections:
- Using InputStream / OutputStream
- Using BufferedInputStream / BufferedOutputStream with 512 byte buffer
- Using BufferedInputStream / BufferedOutputStream with 8192 byte buffer (the default)
- Using BufferedReader / BufferedWriter with 256 character buffer
- Using BufferedReader / BufferedWriter with 512 character buffer
- Using BufferedReader / BufferedWriter with 4096 character buffer
- Using BufferedReader / BufferedWriter with 8192 character buffer (the default)
- Using BufferedReader / BufferedWriter with 16384 character buffer