Discovering iOS memory leaks: A case study with the Firefox app

Context

Before we start, let's quickly recap what a memory leak is in the context of iOS. Memory management on iOS is done by automatic reference counting (ARC). Under ARC, every strong reference to an object increments its retain count, which keeps the object in memory. The object is deallocated once its retain count drops to zero, which typically happens when the variables referencing it go out of scope. The scope here is the place where the variable was defined; for a variable declared inside a method, for example, the scope ends at the end of the method body. A memory leak occurs when an object's retain count never reaches zero even though the app no longer needs it, so its memory is never reclaimed.

Prioritizing Memory Leaks

Before we discuss detecting and fixing memory leaks, we must understand how to prioritize them. It can be very tempting to just start using the app from any point and fix every leak you see. The problem is that you don't know whether a given leak contributes to the OOM exceptions in production, or whether fixing it will have a meaningful impact. I usually follow these two strategies:

  1. Fixing production leaks: These are the potential leaks in the user journeys where we see OOM exceptions in production. This requires identifying the particular user journey through analytics or Firebase Crashlytics and then running it locally through a UI test to track down the leaks.
  2. Monitoring leaks on core flows: While we put out the fires caused by production leaks, it is important to have a bare minimum of tests over the core flows and to log every new memory leak, so that chasing production leaks doesn't turn into a cat-and-mouse game.

Selecting the app and core flows

For the scope of this article, we'll select an open-source app, the Firefox iOS app (GitHub), write UI tests around its core flows, and then track the leaks in those flows. Since we work at Firefox neither as a developer nor as a product manager, we'll ask our friend ChatGPT for the core flows of the app 😉. Here is the prompt I'm using to get these core flows:

You're an experienced product manager for the Mozilla Firefox iOS app, which has the following description from the app store:

      <---- App Store description for Mozilla here ---->

Can you list different core flows from the app? A core flow is the flow most used by the users and contributes to the company's revenue.

Here is the response I got, and it's not much of a surprise! These are definitely the most used features of any browser I use.

1. Enhanced Privacy Browsing:
* Activation and customization of Enhanced Tracking Protection.
* Usage of private browsing mode.

2. Cross-Device Browsing and Synchronization:
* Syncing bookmarks, history, tabs, and passwords across devices.
* Continuing a browsing session from one device to another.

3. Search and Navigation:
* Utilizing the search bar for web queries.
* Accessing frequently visited sites and search suggestions.

4. Tab Management:
* Opening, organizing, and managing multiple browser tabs.
* Using tab thumbnails and numbered tabs for easier navigation.

5. Personalized Homepage Experience:
* Customization of the homepage with recent searches, bookmarks, top sites, and Pocket recommendations.
* Changing wallpapers from the limited-edition collection.

6. Password Management:
* Saving, autofilling, and managing online passwords.
* Syncing passwords across different devices.

For the next step, we will select one of these flows and discuss tracking memory leaks in it. But first, let's discuss the tool we are going to use for detecting leaks.

Introducing: leaks tool 😎

leaks is a command-line tool on macOS that helps you track memory leaks in the memory space of a process. It categorizes its output into two parts:

  1. Root leaks: Memory buffers that are not referenced by any other buffer, possibly because of improper deallocation or a lost pointer.
  2. Root cycles: These are the famous retain cycles, where buffers depend on each other and are not referenced by any buffer outside the cycle.

This is a useful man page for the leaks command that also lists its various options and their descriptions.

The best way to use leaks is to set the MallocStackLogging flag in the environment when launching the app and then run the leaks command against the app's process. Let's see how it's done for the Mozilla app (with bundle identifier org.mozilla.ios.Fennec):

# Launching the app with MallocStackLogging
export SIMCTL_CHILD_MallocStackLogging=1 && xcrun simctl launch booted org.mozilla.ios.Fennec
# This should launch the app and print the bundle id followed by the pid:
# org.mozilla.ios.Fennec: 89617

# Dumping the memory leaks
leaks 89617

Launching with MallocStackLogging is very important here because it provides you with the stack trace of each allocation, which you can use to track down the leaked buffer.
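If you don't want to copy the pid by hand, the launch output can be parsed in a script. The following is a minimal sketch, assuming the output has the "bundle id: pid" shape shown above; extract_pid is a hypothetical helper name of my own, not something provided by simctl or leaks.

```shell
#!/bin/sh
# Sketch: pull the pid out of a line such as "org.mozilla.ios.Fennec: 89617".
# extract_pid is a hypothetical helper, not part of simctl or leaks.
extract_pid() {
    # Print the trailing number that follows the last ": ";
    # print nothing if the line does not have that shape.
    printf '%s\n' "$1" | sed -n 's/^.*: *\([0-9][0-9]*\)[[:space:]]*$/\1/p'
}

# Typical use on a Mac with a booted simulator (left commented out on purpose):
#   launch_output=$(SIMCTL_CHILD_MallocStackLogging=1 xcrun simctl launch booted org.mozilla.ios.Fennec)
#   pid=$(extract_pid "$launch_output")
#   leaks "$pid"
```

An empty result from extract_pid is a good signal that the launch failed, so a script can stop before calling leaks.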

💡
It's recommended to use a "profile" build for this purpose: one that is close to the release build but doesn't strip symbols. Doing this on a release build locally would require you to symbolicate the stack traces.

Now let's dive in and try to find memory leaks in one of the core flows of the Mozilla Firefox app.

Detecting Memory leaks

Because leaks is a command-line tool, in contrast to manually running Xcode Instruments, it has the huge advantage that it can be wired into a setup that works with your UI tests in CI/CD. This integration enables automated detection of memory leaks, both existing ones and regressions introduced while fixing them. Let's see how we can detect leaks for the enable/disable search suggestions feature in the Firefox app:

  1. Launching the app:
export SIMCTL_CHILD_MallocStackLogging=1 && xcrun simctl launch booted org.mozilla.ios.Fennec
  2. Writing instrumentation for the flow: We are going to use Maestro for this step, since it fits easily into a shell script later. This is the Maestro flow we use to assert that suggestions are shown and that disabling them hides them.
appId: org.mozilla.ios.Fennec
---
- tapOn: 
   text: "Skip"
   optional: true
- tapOn:
   text: "Skip"
   optional: true
- tapOn:
   text: "Skip"
   optional: true
- tapOn: "Search or enter address"
- inputText: "Wikipedia"
- assertNotVisible: "Empty list"
- tapOn:
    id: "urlBar-cancel"
- tapOn:
    id: "TabToolbar.menuButton"
- tapOn:
    id: "menu-Settings"
- tapOn:
    id: "Search"
- tapOn:
    text: "1"
    index: 0
- tapOn: "Settings"
- tapOn: "Done"
- tapOn: "Search or enter address"
- inputText: "Wikipedia"
- assertVisible: "Empty list"
  3. Executing the leaks command with the pid obtained in the first step:
leaks <pid> > ./suggestions_leak_report.txt
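The three steps above can be chained into one script for CI. This is only a sketch under a few assumptions: run_leak_check, FLOW_FILE and REPORT are names I made up, the Maestro flow is assumed to be saved as ./suggestions_flow.yaml, and the exit codes follow the leaks man page (0 for no leaks, 1 for leaks found, higher values for errors).

```shell
#!/bin/sh
# Sketch of a CI wrapper around the three steps above.
# run_leak_check, FLOW_FILE and REPORT are hypothetical names.
BUNDLE_ID="org.mozilla.ios.Fennec"
FLOW_FILE="${FLOW_FILE:-./suggestions_flow.yaml}"      # assumed location of the Maestro flow
REPORT="${REPORT:-./suggestions_leak_report.txt}"

run_leak_check() {
    # 1. Launch with malloc stack logging; output looks like "<bundle id>: <pid>".
    launch_output=$(SIMCTL_CHILD_MallocStackLogging=1 xcrun simctl launch booted "$BUNDLE_ID") || return 2
    pid=${launch_output##*: }

    # 2. Drive the UI flow against the running app.
    maestro test "$FLOW_FILE" || return 2

    # 3. Dump the leaks report; the function returns the exit status of leaks.
    leaks "$pid" > "$REPORT"
}

# Example call in a CI step (left commented out on purpose):
#   run_leak_check || echo "leaks reported or an error occurred, see $REPORT"
```

Returning the status of leaks lets the CI job decide whether a non-zero result should fail the build or only be logged while the existing leaks are being worked through.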

This execution surfaced 40+ leaks! Let's look at the report and see how to read it.

Reading leaks Reports

You might be surprised to see these 40+ entries in the report. In reality, the number of unique leaks may be lower, and some entries may be false positives.

  1. Getting unique leaks: The reporting is based on allocation stacks, so the same underlying cause can be reported as multiple entries. The best way to handle this is to group leaks by their stack traces: stacks that share similar code paths belong to the same group.
  2. Beware of false positives: Because of the way we are using the leaks tool with instrumentation here, some reported stacks might not be real leaks, especially among the ROOT LEAKS. They can still get reported because the scope of some methods and classes has not ended yet due to pending async blocks.
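As a rough first pass at the grouping described above, you can count how often each app-level frame recurs in the report. The sketch below is not full stack grouping; it assumes frame lines shaped like the stack trace shown later in this article (frame number, module, then symbol), and top_app_frames is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count how often each frame of a given module appears in a leaks
# report, most frequent first. top_app_frames is a hypothetical helper and
# assumes frame lines of the form "<n> <module>   <symbol> + <offset>  <file>:<line>".
top_app_frames() {
    report=$1
    module=$2
    # Keep only frame lines of the module, drop the frame number and module
    # columns, and normalize whitespace so identical frames compare equal.
    awk -v module="$module" '
        $1 ~ /^[0-9]+$/ && $2 == module { $1 = ""; $2 = ""; sub(/^ +/, ""); print }
    ' "$report" | sort | uniq -c | sort -rn
}

# Example call (left commented out on purpose):
#   top_app_frames ./suggestions_leak_report.txt org.mozilla.ios.Fennec
```

Frames that show up many times are good candidates for a single underlying cause, so they are a sensible place to start reading the full stacks.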

The best way to verify whether entries in this category are true ROOT LEAKS is to bring them under XCTest and check whether the deinit or deallocation code is called. This is an example of how Alamofire uses the leaks tool in an atExit block to detect memory leaks in XCTest.

💡
If you end up finding a false positive, you can exclude it from the report with the --exclude option. This option lets you specify the symbol you want to exclude from the report.

The leaks tool is still very useful for detecting and fixing ROOT CYCLES. Let's look at one example from the report and try to fix it.

ROOT CYCLE between Deferred<[T]> objects

The stack trace responsible for this retain cycle, as reported by the leaks tool, is the following. I have also attached the full leak trace here in case you want to have a look.

5 org.mozilla.ios.Fennec     closure #1 in closure #2 in ContentBlocker.init(logger:) + 188  ContentBlocker.swift:106
4 org.mozilla.ios.Fennec     ContentBlocker.compileListsNotInStore(completion:) + 368 ContentBlocker.swift:319
3 org.mozilla.ios.Shared     all<A>(_:) + 637  Deferred.swift:135
2 libswiftCore.dylib         swift_allocObject + 39
1 libswiftCore.dylib         swift_slowAlloc + 40
0 libsystem_malloc.dylib     _malloc_zone_malloc + 241 

ContentBlocker prepares an array of Deferred tasks that are combined by the all function from Deferred.swift here.

If you dive into the code of the all function, you will understand why this becomes a cycle:

public func all<T>(_ deferreds: [Deferred<T>]) -> Deferred<[T]> {
    if deferreds.count == 0 {
        return Deferred(value: [])
    }
    // 1
    let combined = Deferred<[T]>()
    var results: [T] = []
    results.reserveCapacity(deferreds.count)

    var block: ((T) -> ())!
    // 2
    block = { t in
        results.append(t)
        if results.count == deferreds.count {
            combined.fill(results)
        } else {
            deferreds[results.count].upon(block)
        }
    }
    deferreds[0].upon(block)

    return combined
}
  1. The combined variable is intended to collect and unify the results of all the input deferred elements.
  2. In the if-else condition, the block closure either fills combined with the results or, in the else branch, schedules itself again on the next deferred.

Since the closure strongly captures combined, and block also references itself in the else branch (and is passed to deferreds[0].upon(block), which is line 20 of the snippet above), this creates a retain cycle.

The fix is to capture combined weakly instead of strongly so that it can be deallocated. This can be done as:

block = { [weak combined] t in
   results.append(t)
   if results.count == deferreds.count {
       combined?.fill(results)
   } else {
       deferreds[results.count].upon(block)
   }
}

The commit for this fix is here in case you want to check it out. After making this fix, if you run the same test, you'll observe that this cycle is not reported anymore!

Did you find this tool useful, and are you interested in seeing more examples of fixing leaks? If yes, do share this and let me know; I would love to write another part showing more of the leaks found and their fixes. Reach out to me at @droid_singh.