Grouping Crashes by Frame

Keeping track of crashes can be really tough. With only a few, you can just look at each one. But this one-by-one approach rapidly becomes infeasible, even with a modest number of reports. As the volume and variety of crashes increase, you need some kind of automated tool to categorize and group them. These groups help you direct your efforts towards the events that happen most frequently. Despite its importance, grouping really isn’t something that gets talked about very much.

The Blamed Frame

Grouping crashes requires finding one or more elements that will be consistent across crash events. We need something common to look for in each report. But what do we pick? Let’s start by looking at what nearly every service does, which is to use a single stack frame. Stacksift’s terminology for this approach is “deepest interesting frame”, but Apple refers to it as “crash point”.

Matching crashes against a single frame has a bunch of advantages. First, the stack trace of a crashing thread is often a useful place to start looking when understanding a crash. It’s common for large parts of a trace to contain OS functions. These can change from release to release, so using just one frame can help insulate us from this variation. Another big advantage is that in some cases, a crash cause can be reduced down to a single call site. Things like invariant failures (preconditions, forced-unwrapping) and thrown exceptions can be completely described by one frame.
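
To make the idea concrete, here is a minimal sketch of single-frame grouping in Python. It assumes each report already carries a blamed frame, and the report shape, symbol names, and the group_by_blamed_frame helper are all made up for illustration. It is not how any particular service does it.

```python
from collections import defaultdict

def group_by_blamed_frame(reports):
    """Bucket crash report ids by the (image, symbol) of their blamed frame."""
    groups = defaultdict(list)
    for report in reports:
        frame = report["blamed_frame"]
        groups[(frame["image"], frame["symbol"])].append(report["id"])
    return dict(groups)

# Hypothetical reports: two crashes share a blamed frame, one does not.
reports = [
    {"id": "crash-1", "blamed_frame": {"image": "MyApp", "symbol": "-[ImageLoader load]"}},
    {"id": "crash-2", "blamed_frame": {"image": "MyApp", "symbol": "-[ImageLoader load]"}},
    {"id": "crash-3", "blamed_frame": {"image": "MyApp", "symbol": "-[Settings save]"}},
]

groups = group_by_blamed_frame(reports)
for key, ids in groups.items():
    print(key, ids)
```

The grouping itself is trivial. All of the difficulty is hidden in how "blamed_frame" gets chosen in the first place.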

Choosing the Frame

Despite its near-ubiquitous use for crash aggregation, using a single blamed frame actually has a bunch of drawbacks. The first and most obvious is deciding which frame to choose. This can be extremely hard to do well, even for a human with access to the code involved. To help illustrate just how tricky this can be, I’ve selected a crash from a Stack Overflow question (slightly tweaked for clarity). It looks like a great example of an object over-release. You don’t need to study the trace too closely here, but this is it in its entirety.

0   libobjc.A.dylib                 objc_release + 16
1   CoreFoundation                  cow_cleanup + 168
2   CoreFoundation                  -[__NSDictionaryM dealloc] + 144
3   MyApp                           -[CKContent .cxx_destruct] (CKContent.m:56)
4   libobjc.A.dylib                 object_cxxDestructFromClass(objc_object*, objc_class*) + 112
5   libobjc.A.dylib                 objc_destructInstance + 88
6   libobjc.A.dylib                 _objc_rootDealloc + 52
7   MyApp                           -[CKTableViewCell .cxx_destruct] (CKTableViewCell.m:44)
8   libobjc.A.dylib                 object_cxxDestructFromClass(objc_object*, objc_class*) + 112
9   libobjc.A.dylib                 objc_destructInstance + 88
10  libobjc.A.dylib                 _objc_rootDealloc + 52
11  UIKitCore                       -[UIResponder dealloc] + 152
12  UIKitCore                       -[UIView dealloc] + 872
13  UIKitCore                       -[UITableViewCell dealloc] + 236
14  MyApp                           -[CKModel dealloc] (CKModel.m:345)
15  CoreFoundation                  -[__NSArrayM dealloc] + 228
16  UIKitCore                       -[UITableView .cxx_destruct] + 1524
17  libobjc.A.dylib                 object_cxxDestructFromClass(objc_object*, objc_class*) + 112
18  libobjc.A.dylib                 objc_destructInstance + 88
19  libobjc.A.dylib                 _objc_rootDealloc + 52
20  UIKitCore                       -[UIResponder dealloc] + 152
21  UIKitCore                       -[UIView dealloc] + 872
22  UIKitCore                       -[UIScrollView dealloc] + 852
23  UIKitCore                       -[UITableView dealloc] + 364
24  UIKitCore                       __destroy_helper_block_e8_32s40s + 24
25  libsystem_blocks.dylib          _Block_release + 148
26  Foundation                      -[_NSTimerBlockTarget dealloc] + 44
27  Foundation                      _timerRelease + 64
28  CoreFoundation                  __CFRunLoopDoTimer + 936
29  CoreFoundation                  __CFRunLoopDoTimers + 276
30  CoreFoundation                  __CFRunLoopRun + 1640
31  CoreFoundation                  CFRunLoopRunSpecific + 424
32  GraphicsServices                GSEventRunModal + 160
33  UIKitCore                       UIApplicationMain + 1932
34  MyApp                           main (main.m:101)
35  libdyld.dylib                   start + 4

There are a few things I want to call out about this trace. Most of the stack is made up of OS-supplied functions. The crashing function is deep within the low-level object lifecycle management flow, with a few compiler-generated functions sprinkled in. Some frames are symbol + byte offset, but some of the functions within the main executable contain file and line info.

So, which frame do you think is most reasonable to blame? Let’s step through the stack and try to decide.

0   libobjc.A.dylib                 objc_release + 16
1   CoreFoundation                  cow_cleanup + 168
2   CoreFoundation                  -[__NSDictionaryM dealloc] + 144
3   MyApp                           -[CKContent .cxx_destruct] (CKContent.m:56)

The first three frames are quite low-level, and related to the memory lifecycle flow. I think we can safely assume that this code is bug free, and only crashes when given bad input. Frame 3 is within the app, but .cxx_destruct is a compiler-generated function. Let’s keep going.

4   libobjc.A.dylib                 object_cxxDestructFromClass(objc_object*, objc_class*) + 112
5   libobjc.A.dylib                 objc_destructInstance + 88
6   libobjc.A.dylib                 _objc_rootDealloc + 52
7   MyApp                           -[CKTableViewCell .cxx_destruct] (CKTableViewCell.m:44)
8   libobjc.A.dylib                 object_cxxDestructFromClass(objc_object*, objc_class*) + 112
9   libobjc.A.dylib                 objc_destructInstance + 88
10  libobjc.A.dylib                 _objc_rootDealloc + 52
11  UIKitCore                       -[UIResponder dealloc] + 152
12  UIKitCore                       -[UIView dealloc] + 872
13  UIKitCore                       -[UITableViewCell dealloc] + 236
14  MyApp                           -[CKModel dealloc] (CKModel.m:345)

Ok, finally we get to Frame 14, which represents some actual code within the app - a custom dealloc method. But, just getting here requires some pretty sophisticated knowledge of OS library behavior. With a few heuristics, we can make an algorithm that selects this frame. Is that method to blame? It’s very unlikely. It’s probably just going through and releasing instance variables. But, this is almost certainly what a crash analysis system would select. It’s not unreasonable, just not a terribly good selection. Let’s continue and see what else we find.
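
Here is a rough sketch of the kind of heuristic that lands on frame 14: take the first frame that belongs to the app and isn’t compiler-generated. The APP_IMAGES and COMPILER_GENERATED sets and the pick_blamed_frame function are my own illustrative assumptions, and a real system would need far more rules than this.

```python
# Assumed configuration: which binary images count as "the app", and which
# symbol fragments mark compiler-generated functions. Both are illustrative.
APP_IMAGES = {"MyApp"}
COMPILER_GENERATED = (".cxx_destruct",)

def pick_blamed_frame(frames):
    """Return the index of the first app frame that isn't compiler-generated,
    or None if the trace has no such frame."""
    for index, (image, symbol) in enumerate(frames):
        if image not in APP_IMAGES:
            continue
        if any(marker in symbol for marker in COMPILER_GENERATED):
            continue
        return index
    return None

# Frames 0 through 14 of the example trace, as (image, symbol) pairs.
frames = [
    ("libobjc.A.dylib", "objc_release"),
    ("CoreFoundation", "cow_cleanup"),
    ("CoreFoundation", "-[__NSDictionaryM dealloc]"),
    ("MyApp", "-[CKContent .cxx_destruct]"),
    ("libobjc.A.dylib", "object_cxxDestructFromClass(objc_object*, objc_class*)"),
    ("libobjc.A.dylib", "objc_destructInstance"),
    ("libobjc.A.dylib", "_objc_rootDealloc"),
    ("MyApp", "-[CKTableViewCell .cxx_destruct]"),
    ("libobjc.A.dylib", "object_cxxDestructFromClass(objc_object*, objc_class*)"),
    ("libobjc.A.dylib", "objc_destructInstance"),
    ("libobjc.A.dylib", "_objc_rootDealloc"),
    ("UIKitCore", "-[UIResponder dealloc]"),
    ("UIKitCore", "-[UIView dealloc]"),
    ("UIKitCore", "-[UITableViewCell dealloc]"),
    ("MyApp", "-[CKModel dealloc]"),
]

blamed = pick_blamed_frame(frames)
print(blamed, frames[blamed][1])
```

Frames 3 and 7 are in the app but get skipped as compiler-generated, so the search stops at the custom dealloc. Nothing in this logic knows whether that method is actually at fault.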

15  CoreFoundation                  -[__NSArrayM dealloc] + 228
16  UIKitCore                       -[UITableView .cxx_destruct] + 1524
17  libobjc.A.dylib                 object_cxxDestructFromClass(objc_object*, objc_class*) + 112
18  libobjc.A.dylib                 objc_destructInstance + 88
19  libobjc.A.dylib                 _objc_rootDealloc + 52
20  UIKitCore                       -[UIResponder dealloc] + 152
21  UIKitCore                       -[UIView dealloc] + 872
22  UIKitCore                       -[UIScrollView dealloc] + 852
23  UIKitCore                       -[UITableView dealloc] + 364
24  UIKitCore                       __destroy_helper_block_e8_32s40s + 24
25  libsystem_blocks.dylib          _Block_release + 148
26  Foundation                      -[_NSTimerBlockTarget dealloc] + 44

We see now that this dealloc method was just invoked as part of an even longer string of deallocations. But, frame 26 has a truly interesting clue. This entire deallocation process was kicked off by a block-based NSTimer. Block capture is a common source of object lifecycle issues. And, we can see that it looks like it was a UITableView reference that was at the root of the graph. Not a smoking gun, necessarily, but something real to investigate as a possible reproduction step.

Selecting frame 26 would be an extremely difficult thing to get an analysis algorithm to do. It’s definitely possible, but it requires many heuristics, some of which would have to be based on function/OS behavior that could change over time. And, it would be very hard to determine how these kinds of heuristics would behave in the general case. In short, it can be done for this particular situation, but the results would be very hard to predict for others. And, even if we could overcome all that, is it really a better choice?

Is Choosing Poorly a Problem?

Stepping back, it’s fair to ask how serious a problem we have here. Say we do choose frame 14, as many reporting systems would. How bad is that? Well, this example crash we’re looking at is almost certainly a lifecycle management issue. These kinds of bugs are tricky, because the crashes they produce are typically non-deterministic. The bug makes a particular object pointer invalid, a problem known as a dangling pointer. Sometimes, you might crash in dealloc just like we did here. Sometimes, you might not crash. Sometimes, you might get particularly unlucky and that pointer might point to another totally unrelated object that is then sent a dealloc message. Heap corruption is no fun.

I say all this because there’s really no single frame that can describe what’s going on. You have a bug, the over-release, that can result in a potentially large number of possibly random symptoms. Boiling this situation down to one frame just doesn’t make sense. But, if you do, what will happen is a phenomenon I call under-grouping. There’s one underlying issue, but the resulting crashes will cause your system to produce more than one group. This can make it much harder to zero in on a root cause.

The reverse situation, over-grouping, can also occur. When you over-group, crashes that aren’t related are all lumped together. This is a challenging issue, and often is hard to notice. Over-grouping is a very common situation for utility functions and other code that is used by different systems.

Ultimately, both under- and over-grouping throw off the counting of crashes. Most developers prioritize their stability work based on crash counts. So, poor grouping makes it harder for you to decide where your debugging effort is needed.
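
A tiny made-up example shows how under-grouping can skew priorities. The crash list below is entirely hypothetical, and in real life you would not have the “cause” column at all. That’s the whole problem.

```python
from collections import Counter

# Hypothetical crashes as (blamed frame, true underlying cause).
# A single over-release bug surfaces under three different blamed frames.
crashes = [
    ("-[CKModel dealloc]", "over-release"),
    ("-[CKModel dealloc]", "over-release"),
    ("-[CKCache purge]", "over-release"),
    ("-[CKCache purge]", "over-release"),
    ("-[CKSession close]", "over-release"),
    ("-[CKSession close]", "over-release"),
    ("-[CKParser parse:]", "bad-input"),
    ("-[CKParser parse:]", "bad-input"),
    ("-[CKParser parse:]", "bad-input"),
]

by_frame = Counter(frame for frame, _ in crashes)
by_cause = Counter(cause for _, cause in crashes)

top_frame = by_frame.most_common(1)[0]
top_cause = by_cause.most_common(1)[0]
print("top group by frame:", top_frame)
print("top group by cause:", top_cause)
```

Ranked by frame, the parser crash looks like the top issue with 3 events. Ranked by cause, the over-release is twice as common, but it’s split across three groups of 2.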

The take-away is that frame-based grouping is a very reasonable thing to do, and can produce great grouping. But, it requires very carefully-chosen heuristics, and even then, doesn’t always work well for some common situations. Stacksift’s approach is to group this kind of crash into a specific object lifecycle management issue, as well as picking a frame to blame. This establishes a relationship between this particular effect and others that might produce completely different stack traces.

Frame Signatures

One additional interesting challenge of frame-based grouping is getting consistent matches across releases. Frames have function names (symbols), along with a byte-offset into the function. Sometimes, you also have file and line info. What we’re after is a consistent signature.

Symbol and line number isn’t a wonderful choice, because any code added within the file above this point will cause a mismatch. Function offset seems more desirable, because it is relative to the function start. However, it comes with two problems. The first is this value will not be consistent across CPU architectures. This is becoming less of an issue, but the bigger problem is you’re really at the whim of the compiler. A new Xcode build could introduce a small optimization tweak that could change offsets for half the functions in your code.

Luckily, we typically have access to the line where a function is declared. This lets us swap the byte offset for a line offset: the frame’s line number relative to the line where its function starts. Using line-relative signatures gives us a way of describing a frame that is more likely to remain constant across unrelated source and compiler changes. This is an easy way to combat an annoying manifestation of under-grouping.
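
As a quick sketch of a line-relative signature, assume -[CKModel dealloc] is declared at line 340 of CKModel.m. That declaration line, the signature format, and the line_signature helper are all invented for illustration. Only the 345 comes from the trace above.

```python
def line_signature(symbol, frame_line, function_start_line):
    """Describe a frame as its symbol plus a line offset from the
    function's declaration, rather than an absolute line number."""
    return f"{symbol}+L{frame_line - function_start_line}"

symbol = "-[CKModel dealloc]"

# Release 1: function assumed to start at line 340, frame at line 345.
old = line_signature(symbol, 345, 340)

# Release 2: 12 unrelated lines were added higher up in the file, so both
# the declaration and the frame shift down by 12.
new = line_signature(symbol, 357, 352)

print(old, new, old == new)

# An absolute (symbol, line) signature would not match across the releases.
absolute_matches = (symbol, 345) == (symbol, 357)
print(absolute_matches)
```

The offset only changes when lines are added or removed between the declaration and the frame’s line, which is a much smaller window than the whole file.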

Conclusion

If you’ve used a crash reporting service, you’ve used frame-based grouping. And there’s a reason it’s so common - it can work really well. It’s true that the heuristics needed to pick the right frame make it an imprecise approach. But, if you have a crash with a cause captured by the current stack, it is the optimal approach.

Unfortunately, many common types of crashes have a root cause that is not captured by the current stack. The captured crash is just an effect, separated in time, code location, or both. In these situations, grouping based on a frame isn’t ideal, and can work quite poorly. A very common example is watchdog timeouts captured via MetricKit. These are almost never attributable to a frame, and can result in a very large number of frame-based groups. These kinds of issues are exactly why Stacksift groups crashes using multiple factors. This increases the likelihood of surfacing your most common problems, and gives you ways to understand and navigate their relationships.

Grouping has a huge impact on how your crashes are sorted and prioritized. A system that uses more than just a single blamed frame can help you better understand and fix them.

If this sounds interesting, sign up!

Sep 22, 2021 - Matt Massicotte
