At Capital Match, we run Haskell in production: we take GHC-compiled Haskell programs and run them on the machines that our nginx reverse proxy connects to. The programs listen on port 80, and the reverse proxy forwards requests to them. It’s a fairly standard setup.
We were also using Amazon’s t2 series of virtual machines. According to Amazon’s official documentation,
T2 instances are designed to provide moderate baseline performance and the capability to burst to significantly higher performance as required by your workload. They are intended for workloads that don’t use the full CPU often or consistently, but occasionally need to burst.
That sounds ideal for our purposes. Haskell is fast and the majority of our web requests take just a few milliseconds to complete, but there are a few complicated requests, such as when the admin is performing complex queries, that have a latency of several seconds, during which one or more CPU cores are fully utilized (yes, some of these complex queries are parallelized thanks to Haskell’s awesome parallelism support).
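As a hedged illustration of the kind of parallelism involved (the query itself is made up; `parMap` comes from the `parallel` package’s Control.Parallel.Strategies):

```haskell
import Control.Parallel.Strategies (parMap, rdeepseq)

-- A stand-in for one of our complex admin queries: score every row
-- of a dataset. parMap sparks each element's evaluation so the work
-- spreads across cores (compile with -threaded, run with +RTS -N).
scoreAll :: [Int] -> [Int]
scoreAll = parMap rdeepseq expensiveScore
  where
    expensiveScore x = sum [x * y | y <- [1 .. 10000 :: Int]]

main :: IO ()
main = print (sum (scoreAll [1 .. 100]))
```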
But in May we started to discover that our app would run very slowly even when it was barely under any load at all. CPU utilization would almost always stay at 100%, which frankly shouldn’t happen, and everything became a lot slower, as if the CPU itself had been throttled.
Very soon we realized that the culprit was CPU credit depletion. This is what Amazon’s CloudWatch had to say about one of the instances:
As you can see, when this instance was created sometime on May 5th, it had an initial balance of 120 CPU credits, exactly what Amazon’s documentation says it should have. The balance started to climb because the machine was largely idle, but soon thereafter the CPU credits started plummeting. They decreased at an almost uncannily constant rate, and eventually the balance would reach zero (not shown in the image above).
We spent quite some time puzzling over this. But in the end, what led us to the eventual resolution of the issue was simple. Our application has an API to obtain garbage collector statistics: it basically calls the getGCStats function from GHC.Stats (we were using GHC 8.0.1 at the time this happened) and sends the result back. For the purpose of resolving this issue, two of the fields were relevant: mutatorCpuSeconds and gcCpuSeconds. We found that the latter was almost 100x the former!
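A sketch of what that stats endpoint boils down to, using the GHC-8.0-era API (it only reports real numbers when the program is started with +RTS -T, and GCStats has since been superseded by GHC.Stats.getRTSStats in newer GHCs; the ratio helper is our own illustration, not part of the API):

```haskell
import Control.Monad (when)
import GHC.Stats (GCStats (..), getGCStats, getGCStatsEnabled)

-- Seconds of GC work per second of useful (mutator) work;
-- in our incident this ratio came out at almost 100.
gcToMutatorRatio :: Double -> Double -> Double
gcToMutatorRatio gc mut = gc / max 1e-9 mut

main :: IO ()
main = do
  enabled <- getGCStatsEnabled  -- True only when started with +RTS -T
  when enabled $ do
    stats <- getGCStats
    putStrLn $ "mutator CPU seconds: " ++ show (mutatorCpuSeconds stats)
    putStrLn $ "GC CPU seconds:      " ++ show (gcCpuSeconds stats)
    putStrLn $ "GC/mutator ratio:    "
      ++ show (gcToMutatorRatio (gcCpuSeconds stats) (mutatorCpuSeconds stats))
```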
In hindsight, checking GC stats is of course an obvious step, but in the heat of the problem we spent quite some time focusing on just our own code, including using various methods to determine whether certain requests were taking an abnormally long time. We did fix a few endpoints to be more efficient, but in the end that didn’t resolve the issue at all.
Figuring out that the GC was the issue was the hard part. The next step of the puzzle was to resolve it. Reading the GHC manual’s section on GC tuning and investigating the different options available quickly led us to the correct answer: idle garbage collection.
As the manual explains,
[…] a major GC is automatically performed if the runtime has been idle (no Haskell computation has been running) for a period of time. The amount of idle time which must pass before a GC is performed is set by the -I ⟨seconds⟩ option […]
[…] if the amount of live data in the heap is particularly large, then the idle GC can cause a significant delay, and too small an interval could adversely affect interactive responsiveness.
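The interval is controlled at run time via that same RTS flag, and an interval of zero disables idle GC entirely. A sketch of both ways to set it (the executable name app is hypothetical, and passing +RTS flags at run time requires the binary to be linked with -rtsopts):

```
# at run time: an interval of 0 means "never run the idle GC"
./app +RTS -I0

# or baked in at build time, e.g. via ghc-options in the .cabal file:
#   ghc-options: -rtsopts "-with-rtsopts=-I0"
```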
Searching the GHC Trac revealed that someone else had reported exactly the same problem seven years ago. It is remarkable how much the Haskell ecosystem has matured in the intervening years, yet some things haven’t changed at all. The bug reporter was using Happstack to run the backend for a webapp, which started to use a lot more CPU as the amount of live data in memory increased. Here at Capital Match, we also hold large amounts of live data in RAM. In the bug report, Happstack was updating an IORef with the current time every second (presumably for the Date header); nowadays we have much better solutions thanks to Michael Snoyman’s Control.AutoUpdate, but in our case there was something else in our Haskell app that was running periodically and therefore triggering periodic garbage collections. The fix was as simple as specifying an option in the RTS options.
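For the record, the Control.AutoUpdate approach to that once-a-second timestamp looks roughly like this (a sketch using the auto-update package; the point is that the refresh only happens while callers are actually asking, so an idle process stays truly idle):

```haskell
import Control.AutoUpdate
  (defaultUpdateSettings, mkAutoUpdate, updateAction, updateFreq)
import Data.Time.Clock (getCurrentTime)

main :: IO ()
main = do
  -- getTime returns a cached timestamp, refreshed at most once per
  -- second, and only on demand (no background thread waking the RTS).
  getTime <- mkAutoUpdate defaultUpdateSettings
    { updateAction = getCurrentTime
    , updateFreq   = 1000000  -- microseconds
    }
  getTime >>= print
```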
Upon reflection, I remembered reading this excellent blog post by Jim Fisher titled Low latency, large working set, and GHC’s garbage collector: pick two of three. The problem described in that post was very different (latency that wasn’t low enough), but one of its causes was the same: GC takes a long time to complete when the amount of live data in memory is large. In hindsight this is of course very obvious, but I feel it isn’t very widely known.
Nowadays, in GHC 8.2.1, we have compact regions. When the live data in memory hardly ever changes, we can use them to save the garbage collector lots of work. The GHC.Compact module comes with GHC, but there’s also a slightly more user-friendly compact package written by Edward Z. Yang. We are not yet using it in production, but we have plans to do so, and we are looking forward to unlocking some more performance improvements!
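A hedged sketch of how we expect to use compact regions (GHC 8.2’s GHC.Compact API; the lookup table here is made up for illustration):

```haskell
import qualified Data.Map.Strict as M
import GHC.Compact (compact, getCompact)

-- A hypothetical long-lived, rarely-changing lookup table.
buildTable :: Int -> M.Map Int String
buildTable n = M.fromList [(k, show k) | k <- [1 .. n]]

main :: IO ()
main = do
  -- Copying the table into a compact region means the GC sees it as
  -- one opaque block and never again traverses its internal structure.
  region <- compact (buildTable 100000)
  print (M.size (getCompact region))  -- prints 100000
```

Note that only fully evaluated, pointer-closed data (no functions, no mutable references) can live in a compact region, which fits the "large, mostly static working set" case described above.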