Harmful Content Detection
Harmful content detection at production scale is a multi-modal real-time challenge: catch policy violations across text, images, and video before they reach users, without generating false positives that restrict legitimate accounts. I'll work through business and ML objectives, system architecture, data and features, modeling, infrastructure, evaluation, and robustness.
Solution Walkthrough
Business Objective
The objective of this system is to minimize user exposure to harmful content (measured in views of violating posts) subject to a strict false-positive guardrail of 95% precision on automated removals. This framing is important because it shifts focus from simply detecting harmful content to actually preventing harm. A post that gets removed after 10 views causes less damage than one that accumulates 10 million views before we catch it, so our system needs to prioritize early detection of potentially viral harmful content.
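To make the two quantities in this objective concrete, here is a minimal sketch of how exposure and the precision guardrail could be computed from a log of reviewed posts. The `Post` record and its field names are hypothetical, and the sketch assumes a ground-truth label is available for each post (for example, from human review).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Post:
    # Hypothetical record; field names are illustrative assumptions.
    is_violating: bool        # ground-truth label, assumed to come from review
    auto_removed: bool        # True if the automated system removed the post
    views_before_action: int  # views accumulated before removal, or total views if never removed


def violating_views(posts: List[Post]) -> int:
    """Exposure to minimize: total views that landed on violating posts."""
    return sum(p.views_before_action for p in posts if p.is_violating)


def removal_precision(posts: List[Post]) -> float:
    """Fraction of automated removals that were truly violating."""
    removed = [p for p in posts if p.auto_removed]
    if not removed:
        return 1.0  # no automated removals, so the guardrail cannot be violated
    return sum(p.is_violating for p in removed) / len(removed)


def meets_guardrail(posts: List[Post], min_precision: float = 0.95) -> bool:
    """Check the 95% precision guardrail on automated removals."""
    return removal_precision(posts) >= min_precision
```

Note that the two metrics pull in opposite directions: removing more aggressively lowers `violating_views` but tends to lower `removal_precision`, which is why the guardrail is framed as a constraint rather than a second objective.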
The precision guardrail is critical here. Users who have legitimate content removed incorrectly will be frustrated, may stop posting, and could leave the platform entirely. At production scale, even if only 5% of automated removals are wrong, that can translate to millions of angry users. So we're balancing two opposing forces: aggressively catching harmful content to protect the majority of users, while being extremely conservative about automated removals to protect creators and maintain trust.
This objective naturally leads to a tiered enforcement strategy. For content we're 95%+ confident is harmful, we auto-remove. For content in the 70-95% range, we might demote it in feed ranking or flag it for human review. Below 70%, we monitor but don't act unless we get additional signals like user reports.
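The tiers above can be sketched as a simple mapping from a calibrated probability to an action. The 0.95 and 0.70 cut-offs come from the text; the `Action` names, the `user_reports` parameter, and the choice to escalate reported low-score posts to review are assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    AUTO_REMOVE = "auto_remove"
    DEMOTE_OR_REVIEW = "demote_or_review"
    MONITOR = "monitor"


AUTO_REMOVE_THRESHOLD = 0.95  # from the text: auto-remove at 95%+ confidence
REVIEW_THRESHOLD = 0.70       # from the text: demote or review in the 70-95% range


def choose_action(p_harmful: float, user_reports: int = 0) -> Action:
    """Map a calibrated harm probability to an enforcement tier."""
    if not 0.0 <= p_harmful <= 1.0:
        raise ValueError("p_harmful must be a probability in [0, 1]")
    if p_harmful >= AUTO_REMOVE_THRESHOLD:
        return Action.AUTO_REMOVE
    if p_harmful >= REVIEW_THRESHOLD:
        return Action.DEMOTE_OR_REVIEW
    # Below 70% we only monitor, unless additional signals arrive.
    # Assumption: any user report escalates the post to the review tier.
    if user_reports > 0:
        return Action.DEMOTE_OR_REVIEW
    return Action.MONITOR
```

For example, `choose_action(0.82)` falls in the middle tier, while `choose_action(0.40, user_reports=3)` is escalated only because of the reports.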
ML Objective
At its core, this is multi-modal binary classification: is this post harmful or not? But the reality is richer. We're dealing with text, images, and behavioral signals. We need calibrated probabilities that map to enforcement actions. And critically, we need predictions at multiple stages: immediately when content is posted, using just the content itself, and then updated predictions as behavioral data arrives (comments, reactions, shares, and reports).
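One way to connect model scores to the precision guardrail is to pick the auto-removal cut-off on held-out labeled data. The sketch below is an assumption about how that could be done, not a method stated in the walkthrough: it returns the lowest score threshold at which removals (posts scoring at or above the threshold) still meet the target precision, which maximizes recall under the constraint. The function name is hypothetical.

```python
from typing import Optional, Sequence


def lowest_threshold_for_precision(
    scores: Sequence[float],
    labels: Sequence[int],
    min_precision: float = 0.95,
) -> Optional[float]:
    """Lowest threshold t such that precision of {score >= t} is >= min_precision.

    labels are 1 for harmful and 0 for benign. Returns None if no threshold qualifies.
    """
    # Walk posts from highest to lowest score, tracking true positives so far.
    pairs = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best = None
    true_positives = 0
    for count, (score, label) in enumerate(pairs, start=1):
        true_positives += label
        # For tied scores, evaluate only once the whole tie group is included.
        if count < len(pairs) and pairs[count][0] == score:
            continue
        if true_positives / count >= min_precision:
            best = score
    return best
```

Precision is not monotonic in the threshold, so the function scans every candidate rather than stopping at the first failure. A threshold chosen this way is only as trustworthy as the held-out sample, so it would need re-estimation as content and adversaries shift.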
The temporal aspect is crucial. A post at T=0 has zero behavioral signals, but as it accumulates views, we gather more information. Someone posting "gross!" or "I didn't want to see that" in the comments is signal. The model needs to handle this naturally, making decisions with partial information early and refining as data arrives.
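As a minimal sketch of deciding with partial information, the snippet below starts from a content-only score at T=0 and adds behavioral evidence in log-odds space as it arrives. This is an illustrative stand-in rather than the walkthrough's model: the signal names and weights are hypothetical, and in practice they would be learned rather than hand-set.

```python
import math
from typing import Dict, Optional

# Hypothetical per-signal weights in log-odds units; a real system would learn these.
HYPOTHETICAL_WEIGHTS = {
    "reports_per_1k_views": 0.8,
    "negative_comment_rate": 2.0,
}


def logit(p: float) -> float:
    """Log-odds of p, clamped away from 0 and 1 to avoid infinities."""
    p = min(max(p, 1e-6), 1.0 - 1e-6)
    return math.log(p / (1.0 - p))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def updated_score(content_score: float, signals: Optional[Dict[str, float]] = None) -> float:
    """Refine the content-only score with whatever behavioral signals exist so far.

    With no signals (a post at T=0) this returns the content-only score unchanged.
    Unknown signal names are ignored.
    """
    z = logit(content_score)
    for name, value in (signals or {}).items():
        z += HYPOTHETICAL_WEIGHTS.get(name, 0.0) * value
    return sigmoid(z)
```

Calling `updated_score(0.6)` at posting time returns the content score as-is, and calling it again later with `{"reports_per_1k_views": 2.0}` pushes the estimate upward, so the same post can move between enforcement tiers as evidence accumulates.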