Demystifying the ANRs Puzzle

Context

Application Not Responding errors (ANRs) are a difficult puzzle to solve because of their non-deterministic nature, especially in codebases that have no quality gate to prevent regressions on the main thread. When an ANR shows up and developers are assigned to fix it, they are left with only Google Play Console and APMs such as Crashlytics, Instabug, and Bugsnag to get observability into it. A few factors make this puzzle even harder to solve:

  1. ANRs are mostly not reproducible.
  2. Stack traces on Google Play Console and APMs often don't make sense.
  3. It is unclear what to look for in Google Play Console and other APMs when debugging ANRs.
  4. It's very difficult to quantify your KRs or commit to a timeline when product asks when the ANRs will be fixed.

We will come back to these challenges in more detail and see how to overcome them. Let's first see how the system responds whenever there is an ANR.

System responding to ANRs

Whenever an ANR is raised, the system sends your process a SIGQUIT signal, a native signal that by default terminates a process and that the Android runtime uses as its cue to dump the ANR trace we know from the path /data/anr/traces.txt.

Before starting an app, the Android runtime registers a native signal handler for SIGQUIT. This handler is responsible for capturing the thread states, thread stack traces, VM info, and other relevant information we see in the traces.txt bug report. You can have a look at the runtime SIGQUIT handler here.

This traces.txt bug report is useful for debugging ANRs and gives a lot of information about thread state, thread stack traces, logs, VM info, etc. You can follow this documentation to understand how to read bug reports.

Note: You cannot programmatically obtain a stream of this traces.txt file from your process, because it is written to a system data directory that your process does not have permission to read.

So far we have seen how the runtime responds to an ANR and what it offers you for observability. Let's see how other APMs are doing it.

Different ways to collect ANRs

So far I have seen the following three approaches by which any library or APM can collect ANRs. Let's walk through these approaches while looking at their reliability and problems:

  1. Watchdog approach: Schedule a runnable on the main thread and check whether it executes within 5 seconds, which is the response threshold according to the documentation. The problem with this approach is that ANRs are very difficult to time: the threshold depends on the type of ANR, the state of the app, and sometimes even the device (OEM). Hence, it becomes very tricky to schedule these runnables, and you might end up with some false positives.
  2. Catching native SIGQUIT: As we discussed in the previous section, the system responds to an ANR by raising a native SIGQUIT signal. Some APM tools therefore install a custom native signal handler that hooks into SIGQUIT and acts as a listener to capture ANR reports containing thread stack traces, device metadata, etc.
  3. ApplicationExitInfo: This API reports the death of your application process and the reason it was killed. It exposes REASON_ANR as the reason when the system killed the application because of an ANR. There is even an API to read the trace dump as an input stream, which lets you extract metadata. (This API was added in API level 30.)
Fun fact: Crashlytics uses this ApplicationExitInfo API to provide the ANR reports on the Crashlytics console. You can see the FAQs for reference here and usage here.
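
To make the watchdog approach concrete, here is a minimal sketch in plain Java. It is not how any particular APM implements it: a single-thread executor stands in for the Android main thread (on a device you would post the ping to the main looper instead), and the names WatchdogSketch, respondsWithin, and fakeMain are made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class WatchdogSketch {

    /**
     * Queues a no-op "ping" on the watched thread and reports whether it ran
     * within timeoutMs. A missed ping means the thread was busy or blocked.
     */
    static boolean respondsWithin(ExecutorService watched, long timeoutMs) {
        CountDownLatch ping = new CountDownLatch(1);
        watched.execute(ping::countDown);
        try {
            return ping.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Simulates slow work on the watched thread. */
    static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Assumption: this executor plays the role of the main thread.
        ExecutorService fakeMain = Executors.newSingleThreadExecutor();
        try {
            System.out.println("idle thread responded: " + respondsWithin(fakeMain, 500));
            // Block the watched thread for longer than the watchdog timeout.
            fakeMain.execute(() -> sleepQuietly(2_000));
            System.out.println("busy thread responded: " + respondsWithin(fakeMain, 200));
        } finally {
            fakeMain.shutdownNow();
        }
    }
}
```

The timeout is a parameter on purpose: as noted above, no single value matches every type of ANR, so a fixed threshold is exactly where the false positives come from.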

Let's understand some of the common problems and challenges which developers face while chasing ANRs.

Challenges in solving ANRs

Apart from the challenges we discussed in the context section, I will highlight the most challenging aspects of dealing with ANRs, and later in the blog I will try to answer how to overcome them:

Problems with Play Console in reporting ANRs

We have all been in a situation where we have to solve an ANR and don't know what to look for in Play Console. Beyond observability, there are many issues that also affect planning against these ANRs. Some challenges I want to highlight are as follows:

  1. The grouping of ANRs on Play Console is not that good, because it groups strictly by the ANR title and the stack traces.
  2. ANRs are reported only for users who opt in to share diagnostic information for vitals, which means the actual ANR rate could be different.
  3. No APIs are available to create intelligence out of the Play Console data.
  4. It is hard to see the trend of a particular type of ANR. Because of the strict grouping discussed in the first point, one category of ANR can appear as multiple entries, which makes its impact difficult to visualize.
  5. Knowing which ANRs have stack traces from your project would help to narrow things down, but many times the stack traces come from platform code or are missing from the report altogether. This makes these ANRs even harder to solve.
  6. As a result, a lot of manual work is required if you want to create some intelligence or capture metadata around the ANRs.

Given the above, it is very challenging to commit to a fixed timeline for solving ANRs or to bringing them down by a fixed percentage.

Unreliable stacktraces in ANR reports

One thing you will see in common across all ANR reporting tools, including Play Console, Firebase, and even bug reports, is that the stack trace sometimes does not make sense. It points to platform code and might confuse you about the actual root cause.

The reason you get such a stack trace is that it is captured exactly at the moment the ANR is raised (hypothetically, let's say at the 5th second), while the system is handling the SIGQUIT signal. The actual root cause could have blocked the thread for a long time, but not long enough to land in the ANR report: it may already have finished by the time these ANR reporters capture the stack traces. You can find a note about this here in this documentation about reading bug reports.

So, don't forget that the stack traces you are observing could be innocent, and the actual root cause may not have made its way into the report.

So far, we have seen the challenges we face while looking at ANRs. Let's look at a workflow that has worked for me while studying ANRs.

Strategically thinking of solving ANRs

In the last section, we understood that we can't solve an ANR solely on the basis of its stack trace as we do with crashes, because the stack trace could be innocent. Although Play Console has its own challenges, let's see how you can use it as a starting point to solve your ANR, and then we will slowly explore other options as well:

  1. Look for the main thread state in the ANR, which can be any of the following: Native, Waiting, Blocked, WaitingForGcToComplete, etc. These are not the vanilla Java thread states that we have learned from books. The possible thread states are listed here with their descriptions: thread_state.h.
  2. Take the investigation forward on the basis of this state. For example, WaitingForGcToComplete means that the main thread could be blocked by the garbage collector, which indicates that the process might have a high memory footprint and might even be close to an OOM (Out of Memory error). In other words, there could be memory leaks or bad memory management in general during that session. Waiting, on the other hand, could be due to a lock or a blocked resource.
  3. Look in the ANR stack trace for valid stack frames that belong to your app, and for anything suspicious that could also explain the thread state shown on the console.
  4. Look for other ANR entries with the same type of title, because they could share the same cause and sometimes the capture fails as well.
  5. If the information above is enough to form a hypothesis, try reproducing the ANR locally, and go for a fix if your hypothesis seems strong or you are able to reproduce it.
  6. If the steps above have narrowed things down to an area or interaction of the app, you can also try turning on StrictMode for that particular interaction.
  7. There will still be ANRs that show no valuable information on Play Console. For these remaining ANRs, a better tool is Firebase. As we understood earlier, the stack trace alone might not help, so Firebase comes in handy for following the user trail and narrowing down the actual flow in which the user hit the ANR. You can add custom keys and values to your Crashlytics reports to capture user-based attributes like screen names, user ID, etc.
  8. This data helps you narrow things down further, and you can then cross-check in the developer's environment by turning on StrictMode.
Note: Reports on Play Console will not overlap with Firebase Crashlytics because the reporting is completely different. You can read more about this here.

Firebase is a better reporting tool than Play Console for ANRs because its grouping works on the basis of stack frames and is not as strict as Play Console's. Moreover, we can attach custom logs to ANR reports and use BigQuery to track the necessary trends in ANRs. The only drawback is that it works only on API 30+.
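
Steps 1 and 2 of the workflow above can be sketched as a tiny triage helper. This is only an illustration: the header format it matches ("main" prio=5 tid=1 Blocked) is my assumption about how a trace dump prints the main thread, the sample trace is invented, and the hints are a summary of the reasoning above rather than an official mapping. The class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MainThreadState {

    // Assumed header shape: "main" prio=5 tid=1 Blocked
    private static final Pattern HEADER =
            Pattern.compile("^\"main\"\\s.*\\btid=\\d+\\s+(\\w+)");

    /** Returns the state printed on the main thread's header line, if any. */
    static Optional<String> parse(String trace) {
        for (String line : trace.split("\n")) {
            Matcher m = HEADER.matcher(line.trim());
            if (m.find()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }

    /** Maps a runtime thread state to a first investigation hint. */
    static String hint(String state) {
        switch (state) {
            case "WaitingForGcToComplete":
                return "GC pressure: check memory footprint and leaks";
            case "Blocked":
                return "monitor contention: find the thread holding the lock";
            case "Waiting":
            case "TimedWaiting":
                return "waiting on a lock or another blocked resource";
            case "Native":
                return "inside native code: inspect the native frames";
            default:
                return "no specific hint: fall back to the user trail";
        }
    }

    public static void main(String[] args) {
        // Hypothetical trace excerpt, written for this example.
        String sample = "\"Signal Catcher\" daemon prio=5 tid=4 Runnable\n"
                + "\"main\" prio=5 tid=1 Blocked\n"
                + "  at com.example.Cache.get(Cache.java:42)";
        String state = parse(sample).orElse("unknown");
        System.out.println(state + " -> " + hint(state));
    }
}
```

A helper like this is only a starting point for forming a hypothesis; the state still has to be cross-checked against the stack frames from your own app, as step 3 describes.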

A curated list of ANRs on Play Console

Have you also been in a place where you could not understand an ANR title on Play Console, and it got you wondering what it actually means and whether it has anything to do with your ANR?

Here is a list of ANRs that I was able to reproduce, getting the same titles in the log console of Android Studio:

1. Broadcast of Intent { act=android.intent.action.TIME_TICK flg=0x50000014  (has extras) }

This occurs when a broadcast was sent but could not be processed because the main thread was blocked for some reason. The timeout seems to depend on the state of the application (background or foreground). To reproduce this, block the main thread and try to send the broadcast through ADB. The main thread could be blocked by things like GC, a lock, or the whole system being suspended because of a memory crunch.

The important thing to note here is that this broadcast is not necessarily the root cause. If you are lucky enough to get stack traces on the console, you can narrow down the investigation by relating them to the thread state. Otherwise, you can follow the user trail on Firebase by adding a custom log, as we discussed earlier.

2. Input dispatching timed out (Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago.  Wait queue length: 26.  Wait queue head age: 5921.5ms.)

This occurs when the main thread is blocked for about 5 seconds and input events arrive that it is not able to process.

3. Input dispatching timed out (Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.)

In this type of ANR, the window usually fails to get focus because a new window is slow to be created or an old window is slow to exit, which results in an ANR.

Reproducing: Acquire a lock on a background thread, then try to acquire it on the main thread, which blocks the main thread. Then send key events through ADB to navigate to the home screen of the launcher.
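
The lock part of this reproduction can be sketched in plain Java. This is a simulation under assumptions, not the Android setup itself: the thread named fake-main stands in for the app's main thread, and the class and method names are invented for the example. In a real app you would hold the lock from a background thread, touch it from the main thread, and then send the key events through ADB.

```java
import java.util.concurrent.CountDownLatch;

final class LockContentionRepro {

    /** Returns the state of a thread stuck behind a lock held by another thread. */
    static Thread.State stateOfThreadWaitingOnHeldLock() {
        final Object lock = new Object();
        final CountDownLatch held = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);

        // Background thread takes the lock and keeps it until told to let go.
        Thread background = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                awaitQuietly(release);
            }
        }, "background");

        // Stand-in for the main thread: it needs the same lock.
        Thread fakeMain = new Thread(() -> {
            synchronized (lock) {
                // Reached only once the background thread releases the lock.
            }
        }, "fake-main");

        background.start();
        awaitQuietly(held);
        fakeMain.start();

        // Poll until the stand-in thread is parked on the monitor (2 s cap).
        long deadline = System.nanoTime() + 2_000_000_000L;
        Thread.State observed = fakeMain.getState();
        while (observed != Thread.State.BLOCKED && System.nanoTime() < deadline) {
            Thread.yield();
            observed = fakeMain.getState();
        }

        release.countDown();
        joinQuietly(background);
        joinQuietly(fakeMain);
        return observed;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void joinQuietly(Thread thread) {
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("fake-main while lock is held: "
                + stateOfThreadWaitingOnHeldLock());
    }
}
```

While the lock is held, the stand-in thread should report the BLOCKED state, which is the plain-Java counterpart of the blocked main thread state you would look for in the ANR report.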

4. Context.startForegroundService() did not then call Service.startForeground(): ServiceRecord {}

The foreground service did not call startForeground() to post its notification, which lets the user know about the service, within the 5-second window the system allows after startForegroundService() is called.

Did you find this blog useful? How are you guys solving for ANRs? Reach out to me at @droid_singh and let me know.