Round One

One of the first things I did when tracking down performance issues was motivated by long experience with Rails apps. A very common issue I’ve seen with Rails applications comes down to simple misuse of ActiveRecord. Errbit doesn’t use ActiveRecord, but it does use Mongoid, which presents a rather similar interface for Mongo-backed applications.

After spending some time with Errbit and some cursory analysis, I suspected the performance problem was in the data layer (making unnecessary and expensive queries). So I set out to record all mongo queries made while inserting notices.

Here’s the list of queries I collected from one notice insertion:

namespace=errbit_development.apps selector={"api_key"=>"acae731a1bd8cd2b0c0946f2a952027b"} flags=[:slave_ok] limit=0 skip=0 project=nil | runtime: 31.7640ms
namespace=errbit_development.backtraces selector={"fingerprint"=>"a6e12c6962939b955c0bf325c0cff0936ae0e98f"} flags=[:slave_ok] limit=0 skip=0 project=nil | runtime: 2.1269ms
namespace=errbit_development.backtraces selector={"_id"=><BSON::ObjectId:0x70012658875760 data=55748bb163726f67e4000004>} flags=[:slave_ok] limit=0 skip=0 project=nil | runtime: 1.8985ms
namespace=errbit_development.errs selector={"fingerprint"=>"ece6d74c1056b16c50ed3ba2f03dfc0bf6e5e7d2"} flags=[:slave_ok] limit=0 skip=0 project=nil | runtime: 0.4275ms
namespace=errbit_development.$cmd selector={:insert=>"notices", :documents=>[{"_id"=><BSON::ObjectId:0x70012660942640 data=5574c1a963726f7f89000000>, "message"=>"Whatever the mind of man can conceive and believe, it can achieve", "error_class"=>"ArgumentError", "backtrace_id"=><BSON::ObjectId:... flags=[] limit=-1 skip=0 project=nil | runtime: 0.6309ms
namespace=errbit_development.problems selector={"_id"=><BSON::ObjectId:0x70012665027820 data=55748bb563726f67e400008c>} flags=[:slave_ok] limit=0 skip=0 project=nil | runtime: 0.3846ms
namespace=errbit_development.$cmd selector={:update=>"problems", :updates=>[{:q=>{"_id"=><BSON::ObjectId:0x70012665383440 data=55748bb563726f67e400008c>}, :u=>{"$inc"=>{"notices_count"=>1}}, :multi=>false, :upsert=>false}], :writeConcern=>{:w=>1}, :ordered=>true} flags=[] limit=-1 skip=0 project=nil | runtime: 0.4282ms
namespace=errbit_development.$cmd selector={:update=>"problems", :updates=>[{:q=>{"_id"=><BSON::ObjectId:0x70012665383440 data=55748bb563726f67e400008c>}, :u=>{"$set"=>{"messages"=>{"cd4c2e0346474d0b2eb7f1a630309239"=>{"value"=>"Whatever the mind of man can conceive and believe, it can achieve... flags=[] limit=-1 skip=0 project=nil | runtime: 0.5181ms
namespace=errbit_development.apps selector={"_id"=>"55748b8463726f67e4000001"} flags=[:slave_ok] limit=0 skip=0 project=nil | runtime: 0.3517ms

While working on these queries, I noticed that Durran Jordan had recently announced a new chapter of Mongoid development. Upgrading to the latest version seemed like a cheap, easy thing to try, so I did, hoping to see a free improvement in Errbit. If anything, the performance measurements were slightly worse than with Mongoid 4, but not enough to warrant rolling back, so I decided to stick with Mongoid 5 while continuing the investigation.

Looking at this list, I noticed that some of these queries are in fact unnecessary. Making fewer queries should improve performance, because each query blocks the running thread from doing any other work until the response comes back over the socket.

After a bunch of work, I was able to reduce the query count from nine to five:

namespace=errbit_development.apps selector={"api_key"=>"3777b363b3c03bc4e146f1e2523bf70b"} flags=[:slave_ok] limit=0 skip=0 project=nil | runtime: 0.4294ms
namespace=errbit_development.$cmd selector={:findandmodify=>"backtraces", :query=>{"fingerprint"=>"b8ff41db0f445d80a3a9c3ec91a2ecd9a06b1b56"}, :update=>{"$setOnInsert"=>{:fingerprint=>"b8ff41db0f445d80a3a9c3ec91a2ecd9a06b1b56", :lines=>[{"number"=>"425", "file"=>"[GEM_ROOT]/gems/activesupport-... flags=[:slave_ok] limit=-1 skip=0 project=nil | runtime: 2.0866ms
namespace=errbit_development.errs selector={"fingerprint"=>"155745cfe8d1e28f7485ce3236da5438f11d49df"} flags=[:slave_ok] limit=0 skip=0 project=nil | runtime: 0.4470ms
namespace=errbit_development.$cmd selector={:insert=>"notices", :documents=>[{"_id"=><BSON::ObjectId:0x69947441683300 data=557e0c9b63726f35570004c7>, "message"=>"Life is about making an impact, not making an income", "error_class"=>"ArgumentError", "request"=>{"url"=>"http://example.org/verify... flags=[] limit=-1 skip=0 project=nil | runtime: 0.3948ms
namespace=errbit_development.$cmd selector={:findandmodify=>"problems", :query=>{"_id"=><BSON::ObjectId:0x69947441715500 data=557dfdd763726f1d6d000031>}, :update=>{"$set"=>{"app_name"=>"test", "environment"=>"development", "error_class"=>"ArgumentError", "last_notice_at"=>Sun, 14 Jun 2015 23:2... flags=[:slave_ok] limit=-1 skip=0 project=nil | runtime: 0.6351ms

By making use of findandmodify, I was able to collapse the separate find-and-update operations on backtraces and problems from five queries into just two. And I shaved one more query off the end by caching the result of the app lookup.
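The app-record caching can be as simple as memoizing lookups by API key. This is a minimal sketch, not Errbit’s actual implementation; `AppCache` and the stubbed loader are hypothetical stand-ins for the real Mongoid query:

```ruby
# Hypothetical sketch: memoize app lookups by API key so repeated
# notices for the same app skip the apps query entirely.
class AppCache
  def initialize(&loader)
    @loader = loader  # stands in for the real query, e.g. App.where(api_key: key).first
    @cache  = {}
  end

  def fetch(api_key)
    @cache[api_key] ||= @loader.call(api_key)
  end
end

lookups = 0
cache = AppCache.new { |key| lookups += 1; { api_key: key, name: "test" } }

# Three notices for the same app trigger only one lookup.
3.times { cache.fetch("demo-key") }
puts lookups  # => 1
```

The trade-off is staleness: a cached record won’t reflect changes to the app until the cache is invalidated, which is tolerable here because app records rarely change during a burst of notices.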

Expecting this to make a big difference in performance, I took some new measurements. But in my next set of tests, I found no measurable impact.

Round Two

Undeterred, I kept playing with Errbit until I noticed something interesting I hadn’t seen before. While running my tests, I saw that the request log seemed to flow in bursts, even though my server was running a single process with a single thread. This burst-pause-burst behavior made me suspect GC as the cause, so I instrumented Errbit with gc_tracer to see if I could confirm it.

Here’s a look at the server log during this test where I print GC!!!!!! on every GC event:

$ bundle exec unicorn -c config/unicorn.default.rb > /dev/null
I, [2015-06-14T17:36:21.656538 #27248]  INFO -- : Refreshing Gem list
I, [2015-06-14T17:36:24.054633 #27248]  INFO -- : listening on addr=0.0.0.0:8080 fd=10
I, [2015-06-14T17:36:24.057564 #27248]  INFO -- : master process ready
I, [2015-06-14T17:36:24.058008 #27323]  INFO -- : worker=0 spawned pid=27323
I, [2015-06-14T17:36:24.060624 #27323]  INFO -- : worker=0 ready
127.0.0.1 - - [14/Jun/2015 17:36:29] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 1.0742
127.0.0.1 - - [14/Jun/2015 17:36:29] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0239
GC!!!!!!
127.0.0.1 - - [14/Jun/2015 17:36:29] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0910
127.0.0.1 - - [14/Jun/2015 17:36:29] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0273
127.0.0.1 - - [14/Jun/2015 17:36:29] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0400
127.0.0.1 - - [14/Jun/2015 17:36:29] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0267
127.0.0.1 - - [14/Jun/2015 17:36:29] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0253
127.0.0.1 - - [14/Jun/2015 17:36:29] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0266
127.0.0.1 - - [14/Jun/2015 17:36:30] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.7641
127.0.0.1 - - [14/Jun/2015 17:36:30] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0310
127.0.0.1 - - [14/Jun/2015 17:36:30] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0827
127.0.0.1 - - [14/Jun/2015 17:36:30] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0264
GC!!!!!!
127.0.0.1 - - [14/Jun/2015 17:36:30] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0952
127.0.0.1 - - [14/Jun/2015 17:36:30] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0313
127.0.0.1 - - [14/Jun/2015 17:36:30] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0236
127.0.0.1 - - [14/Jun/2015 17:36:30] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0262
127.0.0.1 - - [14/Jun/2015 17:36:31] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.7083
127.0.0.1 - - [14/Jun/2015 17:36:31] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0324
127.0.0.1 - - [14/Jun/2015 17:36:31] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.1000
127.0.0.1 - - [14/Jun/2015 17:36:31] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0381
127.0.0.1 - - [14/Jun/2015 17:36:31] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0689
127.0.0.1 - - [14/Jun/2015 17:36:31] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0357
GC!!!!!!
127.0.0.1 - - [14/Jun/2015 17:36:31] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0375
127.0.0.1 - - [14/Jun/2015 17:36:31] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.0266
127.0.0.1 - - [14/Jun/2015 17:36:32] "POST /notifier_api/v2/notices HTTP/1.1" 200 - 0.6870
...

The amount of time spent performing the actual GC is not negligible: we’re seeing one GC event for roughly every ten requests. We could tune Ruby’s GC parameters to change this behavior, but that comes at the expense of memory, and increasing the time between GC events will likely increase the time spent in each one, since sweeping a larger heap takes longer. Hitting this many garbage collection events on a regular basis indicates the process is steadily allocating chunks of memory. Since allocation itself takes time, profiling memory use will likely be my next priority.
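To see roughly how allocation pressure translates into GC events, here’s a stdlib-only sketch using `GC.stat`. `simulate_request` is a hypothetical stand-in that allocates the kind of garbage a notice insertion might; gc_tracer reports far more detail (timings, heap slots, and so on) than this simple counter:

```ruby
# Count GC events across a batch of simulated requests using only GC.stat.
# simulate_request is a hypothetical stand-in that churns out short-lived
# strings, arrays, and hashes, roughly like parsing a notice payload.
def simulate_request
  100.times.map do
    { "message" => "x" * 256, "backtrace" => Array.new(10) { "frame.rb:42" } }
  end
end

before = GC.stat(:count)
100.times { simulate_request }
gc_events = GC.stat(:count) - before
puts "GC events across 100 simulated requests: #{gc_events}"
```

Dividing requests served by the delta in `GC.stat(:count)` gives the same requests-per-GC figure the log markers show, without patching the server.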

It’s worth pointing out that what we’re seeing in these logs is not inherently bad. It’s merely indicative of a process that steadily allocates more garbage collectible memory. Whether or not this is too much depends entirely on the actual work that the process is doing. In this case, the process is basically recording one error notification per request and I’m inclined to think there’s a large amount of unnecessary memory allocation which contributes in a sizable way to Errbit’s performance issues.

Continue reading »

For some time, I’ve been aware that the performance of Errbit is not great. But that problem has so far taken a back seat to Errbit’s other priorities. For v0.5.0, I want to at least make a dent here and actually do something to improve this situation.

But before I get started, I felt it was a good idea to put together some kind of bench test, for two reasons. First, I want a record of where this began so I can see how much progress we make over time. Second, and more importantly, I need a decision-making framework. If I have to sacrifice some functionality for performance, I want a handle on how much relative performance is at stake.

My goal is not to create a perfect or objective or complete measurement of Errbit’s behavior, but simply to get a sense of how Errbit’s performance characteristics change as errors are added to its database and as the code changes over the course of this performance improvement project.

Testing Methodology

To start, I created errbit_loader with a pair of load tests for the two most common errbit uses, creating and finding errors. You start the test by specifying the number of errors to create and search for in each block (count), and the number of parallel requests to keep open at any one time (concurrency). In this round of testing, I used count=1000 and concurrency=8. In each test, errbit_loader runs the create batch, followed by the search batch and then repeats this pair of tests until stopped.
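The batch loop in errbit_loader has roughly this shape. The sketch below stubs out the actual HTTP request and only illustrates keeping `concurrency` workers busy until `count` requests have completed:

```ruby
# Sketch of a load-test batch: keep `concurrency` requests in flight
# until `count` requests complete. The request body is yielded to the
# caller; the real tool issues HTTP POSTs to Errbit's notice API.
def run_batch(count:, concurrency:)
  queue = Queue.new
  count.times { |i| queue << i }

  workers = Array.new(concurrency) do
    Thread.new do
      loop do
        begin
          i = queue.pop(true)  # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break
        end
        yield i                # perform one "request"
      end
    end
  end
  workers.each(&:join)
end

completed = Queue.new
run_batch(count: 100, concurrency: 8) { |i| completed << i }
puts completed.size  # => 100
```

A create batch and a search batch are each one call to a loop like this, alternated until the test is stopped.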

90% of the errors created by errbit_loader are repeats, so they should be grouped with other like errors. The other 10% have unique messages, which should prevent them from being grouped at all.
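To illustrate the mix, here’s a sketch of how such a 90/10 split can be generated. The messages and the SHA1-based fingerprint are simplifications of what errbit_loader and Errbit actually do:

```ruby
require 'digest'
require 'securerandom'

# Generate notices where every tenth one gets a unique message (so it
# forms its own group) and the rest draw from a small fixed pool of
# repeated messages. The fingerprint is a simplified grouping key.
def generate_notice(i)
  message =
    if i % 10 == 9
      "unique failure #{SecureRandom.hex(8)}"  # 10%: never groups
    else
      "repeated failure #{i % 3}"              # 90%: three repeated messages
    end
  { message: message,
    fingerprint: Digest::SHA1.hexdigest("ArgumentError|#{message}") }
end

notices = Array.new(1000) { |i| generate_notice(i) }
groups  = notices.group_by { |n| n[:fingerprint] }
puts groups.size  # => 103 (3 repeated groups + 100 unique ones)
```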

I know from experience that performance degrades tremendously with many millions of errors, but it took more than six hours just to create 100,000 errors in these load tests in my environment. For that reason, I have decided to stop at 100,000 for now and increase the total count after we make some progress.

Software and Hardware Information

For this round of testing, I am using Errbit v0.4.0, Ruby 2.1.2, and MongoDB 2.6.3. Errbit is using its default deployment strategy: single-threaded Unicorn with three worker processes and preload_app set to true. The test process, MongoDB, and Errbit itself are all running on the same physical machine, a Lenovo ThinkPad T430s.

In just over six hours (6:02.59), I inserted 100,000 errors in blocks of 1,000 and completed 100,000 searches. There is a bit of noise in the data between 80k and 90k because I forgot the tests were running at one point and started using my laptop for other things. The results should still be usable, though, since the underlying trends are clearly visible.

This first chart is a total request time overview. For each block of 1,000 tests, I’m showing the 95th percentile, meaning five percent of requests were slower than each point on the chart. Time to insert errors appears fairly constant as the database grows, while search time seems to grow linearly.

Here’s a closer look at the results for creates. This chart shows the 90th, 95th and 99th percentile.

And here’s the same view of search requests.
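For reference, a percentile point for a block can be computed with a simple nearest-rank method. The timings below are made up for illustration; the real values come from the load test logs:

```ruby
# Nearest-rank percentile: sort the block's timings and pick the sample
# at the requested rank. Good enough for charting trends across blocks.
def percentile(samples, pct)
  sorted = samples.sort
  rank = ((pct / 100.0) * (sorted.size - 1)).round
  sorted[rank]
end

# Hypothetical request timings (seconds) for one block.
timings = [0.02, 0.03, 0.03, 0.04, 0.05, 0.06, 0.08, 0.10, 0.40, 1.10]
puts percentile(timings, 95)  # => 1.1
```

With one point per 1,000-request block, outliers within a block don’t hide the overall trend, which is what these charts are meant to show.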

My hunch is we can do a lot better than this. Instinctively, it doesn’t seem like it should take anywhere near a full second to insert an error. I’m not completely unhappy with the results for search requests, but I think that’s only because there are still relatively few errors in the database. Once we speed up the inserts, we should be able to more easily test how performance degrades as we start getting into millions of errors.

Continue reading »

As you probably do not recall, this blog was not always running on Wordpress. In its first incarnation, it was actually running Drupal. And as of yesterday, we can add Wordpress to the list of former lithostech.com platforms.

[Jekyll homepage, Feb 2015]

This site is now running on Jekyll, and since it seems to be relatively unknown compared to Wordpress, I thought I’d take the opportunity to explore this change and explain the move. What follows could be read as a comparison of Wordpress and Jekyll as blogging platforms.

Theming

Wordpress has served me well. There are plenty of free, prebuilt themes ready to rock. Sadly, if you’re searching for Wordpress themes and you only consider ones that are mobile-friendly, SEO-friendly, easy to work on, and not full of bloat, you’ve already eliminated the vast majority of both free and paid themes available. If you want something visually appealing, you’re down to a small handful, which means your only real options are to extend one of those or write your own.

This is a bit daunting, but totally possible, especially starting from an example. Just read theme development docs and have a great time. You can <?php var_dump($foo); die(); ?> your way to a complete theme if you have the time, but unless you’re very familiar with this theming API, it’s going to take a while. It’s possible to do just about anything you can imagine, but I found theme authorship to be slow and painful and more than a little annoying.

Jekyll uses Liquid for templating, which is quite powerful and much more compact than raw PHP. After all, Jekyll is a much simpler tool than Wordpress, and for me that simplicity is a core strength. Have a look at this page’s layout to get an idea of how simple it is to write Jekyll themes.

Continue reading »

JavaScript dependency injection is all the rage these days. But after looking through all the options, I just haven’t found the perfect framework. That’s why I’m introducing my own framework to provide the best possible interface, helping you to inject exactly the dependency you need.

First, I want to introduce the problem we’re trying to solve. Let’s say you have a JavaScript function that has a dependency, but the client knows too much about the dependency. This is what smarty-pants engineers call “tight coupling”:

function Dependency1(say){
  alert(say);
}

function Client(){
  new Dependency1('hi'); // bad! tightly coupled interface
}

new Client();

Continue reading »

Suddenly this topic seems to be all around me, although I’m not sure whether it was always here and I just wasn’t paying attention. Either way, it’s here in a big way. The meritocracy discussion seems to be mainly about women in tech and, to a lesser extent, minorities.

At first it was an old colleague of mine on Twitter. Frankly, I was annoyed to see his many-times-a-day posts about feminism and how the whole world was conspiring to keep his daughter from enjoying science and math. It went on like this for years, and I eventually stopped following him because he never talked about anything else and I was tired of the rant.

What I didn’t understand at the time was that this was all part of a much larger discussion, though I did notice the growing trend about a year ago and started paying attention. Looking back at stories from the past few years shows just how much attention this is getting from major internet media outlets: TechCrunch, The Guardian, Quartz, The Atlantic, Wired, NPR, The Boston Globe. You can even watch the rise of the term ‘meritocracy’ on Google Trends as it begins in early 2009, likely in association with its so-called myth.

Recently, I also ran across an Indiegogo project called CODE: Debugging the Gender Gap, which promises to explore the lack of diversity in the tech world. At this point, it’s fully funded and I’m excited to see it when it’s ready.

Today, the discussion seems to be finding new ground after Microsoft CEO Satya Nadella caused a minor internet storm with some regrettable remarks about how women should behave regarding salaries, which he ended up having to take back entirely. And that’s great: the man ate his silly words, and hopefully we all learned something.

Continue reading »