How We Reduced Crash Rates by 30% on a 50M+ User OTT Streaming Platform
Last year, our iOS team reduced crash rates by 30% on a streaming platform serving 50M+ daily active users. It wasn't one heroic fix. It was a system.
Here's exactly how we did it — the framework, the tools, and the cultural shift that made stability a default, not an afterthought.
The Problem
When you're serving millions of concurrent users across dozens of device + OS combinations, crashes aren't just bugs. They're retention killers. A crash during a live cricket match means a user opens a competitor's app. They might not come back.
We were sitting at a crash-free rate that was "acceptable" by industry standards but not where we wanted it. The challenge wasn't finding crashes — our dashboards were full of them. The challenge was prioritizing the right ones.
Step 1: Categorize Before You Fix
We stopped treating crashes as a flat list sorted by occurrence count. Instead, we grouped them into four categories:
- Launch crashes — App doesn't open at all. Highest severity. Users uninstall.
- Playback crashes — App crashes during video streaming. Directly impacts retention and session time.
- Background crashes — App crashes in the background. Less visible but affects push notification delivery and background refresh.
- Edge-case crashes — Crashes in settings, profile, lesser-used features. Low frequency, low impact.
Each category got a different priority level and a dedicated owner. This alone changed how fast we moved — instead of the whole team looking at 200 crashes, four engineers each owned 50 with clear priority.
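The four categories and their ownership model can be sketched in a few lines of Swift. This is an illustrative sketch, not the team's actual code — the enum cases mirror the list above, and the owner names are placeholders:

```swift
// Hypothetical model of the four crash categories, ordered by triage priority.
enum CrashCategory: Int, Comparable {
    case launch = 0      // app doesn't open at all; users uninstall
    case playback = 1    // crash mid-stream; hits retention directly
    case background = 2  // affects push delivery and background refresh
    case edgeCase = 3    // settings, profile, lesser-used features

    // Lower raw value = higher priority.
    static func < (lhs: CrashCategory, rhs: CrashCategory) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

// Each category routes to a dedicated owner instead of the whole team.
func owner(for category: CrashCategory) -> String {
    switch category {
    case .launch:     return "owner-launch"
    case .playback:   return "owner-playback"
    case .background: return "owner-background"
    case .edgeCase:   return "owner-edge"
    }
}
```

The point of the `Comparable` conformance is that a sorted crash queue now falls out of the category itself, rather than someone eyeballing a flat list.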
Step 2: Build a Priority Matrix
Not all crashes are equal, even within the same category. We built a simple scoring model:
Impact Score = User Impact × Frequency × Feature Criticality
- User Impact: Does this crash lose the user's session? Does it corrupt data? Or is it recoverable?
- Frequency: How many unique users hit this per day?
- Feature Criticality: Is this in the core playback flow, or in a settings screen buried 4 levels deep?
A crash during video playback affecting 10,000 users/day scores much higher than a crash in the notification preferences screen affecting 50 users/day. Before this matrix, we were fixing crashes by count alone — which often meant spending a week on an obscure crash with high volume but zero user impact.
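The scoring model is simple enough to sketch directly. The 1–3 scales for impact and criticality are illustrative assumptions — the article only fixes the shape of the formula:

```swift
// A minimal sketch of the priority matrix. Field names and the 1–3
// scales are assumptions for illustration.
struct CrashReport {
    let userImpact: Double         // 3 = lost session / corrupted data, 1 = recoverable
    let dailyAffectedUsers: Double // unique users hitting it per day
    let featureCriticality: Double // 3 = core playback flow, 1 = buried settings screen
}

// Impact Score = User Impact × Frequency × Feature Criticality
func impactScore(_ crash: CrashReport) -> Double {
    crash.userImpact * crash.dailyAffectedUsers * crash.featureCriticality
}

let playbackCrash = CrashReport(userImpact: 3, dailyAffectedUsers: 10_000, featureCriticality: 3)
let settingsCrash = CrashReport(userImpact: 1, dailyAffectedUsers: 50, featureCriticality: 1)
// The playback crash scores 90,000 vs. 50 — it gets fixed first,
// regardless of where each sits in a raw occurrence-count ranking.
```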
Step 3: Instrument What You Can't Reproduce
Some crashes only appeared on specific device + OS combinations. iPhone 8 on iOS 15.7. iPad mini on iOS 16.1. You can't reproduce these on your M1 Mac running the latest simulator.
Our approach:
- Surgical breadcrumbs — Not blanket logging (that kills performance on a streaming app), but targeted diagnostic events around the code paths that were failing. We logged state transitions, memory pressure events, and network state changes only in the suspect areas.
- Xcode Instruments profiling — We used the Allocations and Leaks instruments to track memory patterns that led to crashes on lower-memory devices. The iPhone 8 with 2GB RAM was our canary.
- Custom crash metadata — We enriched crash reports with app state at the time of crash: was the user in foreground/background? Was a video playing? What was the network condition? This context turned "EXC_BAD_ACCESS in PlayerViewController" into "EXC_BAD_ACCESS when switching from WiFi to cellular during live stream playback."
- Device-specific test matrix — We maintained a physical device lab with the top 10 crash-producing device + OS combos and ran regression tests on them before every release.
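The breadcrumb and metadata pattern can be sketched as a small state holder. In practice these records map onto Crashlytics custom keys and logs; the type, field names, and cap here are illustrative assumptions:

```swift
// A hedged sketch of "surgical breadcrumbs" plus crash metadata.
// All names are hypothetical, not the platform's actual code.
struct CrashContext {
    var isInForeground = true
    var isVideoPlaying = false
    var networkPath = "wifi"        // "wifi", "cellular", "offline"
    var breadcrumbs: [String] = []  // bounded diagnostic trail

    // Record events only around suspect code paths — not blanket logging —
    // and cap the trail so overhead stays negligible during streaming.
    mutating func leaveBreadcrumb(_ event: String) {
        breadcrumbs.append(event)
        if breadcrumbs.count > 32 {
            breadcrumbs.removeFirst()
        }
    }
}

var context = CrashContext()
context.isVideoPlaying = true
context.leaveBreadcrumb("player.state: buffering -> playing")
context.networkPath = "cellular"
context.leaveBreadcrumb("network: wifi -> cellular during live stream")
// Attached to a crash report, this context turns a bare EXC_BAD_ACCESS
// into "crashed switching from WiFi to cellular during live playback."
```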
Step 4: Make Stability a Team Habit, Not a Sprint Goal
This was the cultural shift that made the numbers stick.
- Crash review in every retro — New crashes from the latest release got discussed before new features got prioritized. This made the cost of shipping unstable code visible to the whole team.
- Crash budget per release — We set a threshold: no release goes out if the crash-free rate drops below X%. This gave product managers a concrete trade-off when pushing for aggressive timelines.
- On-call rotation for crash spikes — When a new release caused a crash spike, one engineer was responsible for triage within 4 hours. Not the whole team. One person, with authority to request a hotfix.
- Celebrate stability wins — We tracked crash-free rate as a team metric and called out improvements in standups. Sounds small, but it shifted the team's identity from "feature builders" to "engineers who ship stable features."
The Tools
- Firebase Crashlytics — Primary crash reporting. Custom keys and logs for context.
- Xcode Instruments — Allocations, Leaks, Time Profiler for performance-related crashes.
- Custom dashboards — Built on top of Crashlytics data to show crash trends per category, per release, per device family.
- Automated alerts — Slack notifications when crash-free rate drops below threshold after a release.
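The crash budget and the alert threshold boil down to one comparison. A minimal sketch, assuming a 99.5% threshold (the article only says "X%") and a hypothetical message format:

```swift
// Illustrative release gate: no release ships if the crash-free rate
// drops below the budget. The threshold value is an assumption.
let crashFreeThreshold = 99.5

func shouldBlockRelease(crashFreeRate: Double) -> Bool {
    crashFreeRate < crashFreeThreshold
}

// Builds the alert text that would be posted to Slack on a spike;
// returns nil when the release is within budget.
func alertMessage(release: String, crashFreeRate: Double) -> String? {
    guard shouldBlockRelease(crashFreeRate: crashFreeRate) else { return nil }
    return "\(release): crash-free rate \(crashFreeRate)% is below \(crashFreeThreshold)% — on-call triage within 4h"
}
```

The value of encoding the budget as a hard gate is that the trade-off becomes mechanical: a product manager pushing a timeline is arguing with a number, not with an engineer.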
The Result
- 30% reduction in crash rate over 6 months
- Faster release confidence — the team stopped dreading release day
- Reduced hotfix frequency — from ~2 per month to less than 1 per quarter
- Cultural shift — stability became part of the team's identity, not a chore
Key Takeaway
Crash reduction at scale isn't about being a better debugger. It's about building a system: categorize, prioritize, instrument, and make stability a habit. No magic. Just discipline and the right framework.
The engineers who build for millions of users don't just write code that works — they write code that fails gracefully when things go wrong. And things always go wrong.
I write about mobile engineering, AI integration, and building products at scale. More at minhazpanara.com.