For some time, I’ve been aware that the performance of Errbit is not great. But that problem has so far taken a back seat to Errbit’s other priorities. For v0.5.0, I want to at least make a dent here and actually do something to improve this situation.
But before getting started, I felt it was a good idea to put together some kind of bench test, for two reasons. First, I want a record of where this began so I can see how much progress we make over time. Second, and more importantly, I need a decision-making framework. If I have to sacrifice some functionality for performance, I want a handle on how much relative performance is at stake.
My goal is not to create a perfect or objective or complete measurement of Errbit’s behavior, but simply to get a sense of how Errbit’s performance characteristics change as errors are added to its database and as the code changes over the course of this performance improvement project.
Testing Methodology
To start, I created errbit_loader with a pair of load tests for the two most common Errbit operations: creating and finding errors. You start the test by specifying the number of errors to create and search for in each block (count), and the number of parallel requests to keep open at any one time (concurrency). In this round of testing, I used count=1000 and concurrency=8. In each test, errbit_loader runs the create batch, followed by the search batch, and then repeats this pair of tests until stopped.
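To make the count/concurrency mechanics concrete, here is a minimal sketch of how such a batch runner can be structured: a shared queue of jobs drained by a fixed pool of worker threads, collecting per-request timings. This is an illustration of the pattern, not errbit_loader's actual code, and the HTTP request itself is stubbed out as a block.

```ruby
require "thread"

# Run `count` jobs, keeping at most `concurrency` in flight at once,
# and return each job's wall-clock duration. The block stands in for
# the real HTTP request to Errbit.
def run_batch(count, concurrency, &job)
  queue = Queue.new
  count.times { |i| queue << i }
  timings = Queue.new

  workers = concurrency.times.map do
    Thread.new do
      loop do
        i = begin
              queue.pop(true)    # non-blocking pop; raises when empty
            rescue ThreadError
              break
            end
        started = Time.now
        job.call(i)
        timings << (Time.now - started)
      end
    end
  end
  workers.each(&:join)

  Array.new(timings.size) { timings.pop }
end
```

With count=1000 and concurrency=8, a create batch would look like `run_batch(1000, 8) { |i| post_error(i) }`, where `post_error` is a hypothetical helper that sends one error notice.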
90% of the errors created by errbit_loader are repeating, so they should be grouped with other like errors. The other 10% have unique messages, which should cause them not to be grouped with other errors.
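The 90/10 split can be produced with message generation along these lines. The exact messages errbit_loader sends are an assumption here; the point is that repeated messages group together while randomized ones each form their own group.

```ruby
require "securerandom"

# Nine out of ten errors reuse one of a small set of fixed messages
# (so Errbit groups them); every tenth gets a random suffix and
# should land in its own group.
def error_message(i)
  if i % 10 == 9
    "unique failure #{SecureRandom.hex(8)}"  # should not group
  else
    "recurring failure ##{i % 9}"            # groups with its kind
  end
end
```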
I know from experience that performance degrades tremendously with many millions of errors, but it took more than six hours just to create 100,000 errors in these load tests in my environment. For that reason, I have decided to stop at 100,000 for now and increase the total count after we make some progress.
Software and Hardware Information
For this round of testing, I am using Errbit v0.4.0, Ruby 2.1.2, and MongoDB 2.6.3. Errbit is using its default deployment strategy, which is single-threaded Unicorn with three worker processes and preload_app set to true. The test process, MongoDB, and Errbit itself are all running on the same physical machine, a Lenovo ThinkPad T430s.
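For reference, the deployment settings described above correspond roughly to a Unicorn configuration like the following. The listen port and timeout are illustrative, not taken from Errbit's shipped config.

```ruby
# config/unicorn.rb -- approximates the defaults described above.
worker_processes 3   # three single-threaded workers
preload_app true     # load the app once in the master, then fork

listen 8080          # illustrative; actual port may differ
timeout 30           # illustrative request timeout
```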
In just over six hours (6:02.59), I inserted 100,000 errors in blocks of 1,000 and completed 100,000 searches. There is a bit of noise in these data between 80k and 90k because I forgot the tests were running at one point and started using my laptop for other things. The results should still be usable, though, since the underlying trends are clearly visible.
This first chart is a total request time overview. Out of each block of 1,000 requests, I'm showing the 95th percentile, meaning five percent of requests in each block were slower than the plotted point. Time to insert errors appears to be fairly constant as the database grows, while search time seems to grow linearly.
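For clarity on how a per-block percentile is read: sort the block's request times and index in by rank. A minimal sketch (my own helper, not errbit_loader's code):

```ruby
# Return the value at the given percentile of a block of timings.
# With a block of 1,000 timings, percentile(times, 95) is the value
# that 5% of requests in the block exceeded.
def percentile(times, pct)
  sorted = times.sort
  index = ((pct / 100.0) * sorted.size).ceil - 1
  sorted[index]
end
```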
Here’s a closer look at the results for creates. This chart shows the 90th, 95th, and 99th percentiles.
And here’s the same view of search requests.
My hunch is we can do a lot better than this. Instinctively, it doesn’t seem like it should take anywhere near a full second to insert an error. I’m not completely unhappy with the results for search requests, but I think that’s only because there are still relatively few errors in the database. Once we speed up the inserts, we should be able to more easily test how performance degrades as we start getting into millions of errors.