
Why FAANG Killed the Algo Round for ML Engineers (And What Replaced It)


Two years ago, I prepped a friend for a Meta ML engineer loop. He was sharp, fast, could solve LeetCode mediums in under twelve minutes. We did about 200 problems together over six weeks. He walked in expecting the coding round to be his strongest signal.

He didn't get the offer.

The debrief was strange. The coding round was "fine," not a red flag but not a strong positive either. What sank him was the ML system design round. The interviewer described an ambiguous production scenario: a recommender suddenly returning biased outputs, a hundred million users affected. What's your move in the next thirty minutes? My friend defaulted to architecture. He drew a clean diagram and talked about how he'd rebuild the pipeline. The interviewer kept pulling him back: "what would you do first?"

He didn't have an answer. Six weeks of LeetCode hadn't given him any practice reasoning about a real production system under time pressure.

I've spent most of the time since coaching ML engineers through these loops and comparing notes with engineers at FAANG companies on how the rubrics have shifted. What happened to my friend is now happening to a lot of strong engineers, and most of them don't understand why. They're preparing for an interview format that's been substantially replaced.

The contamination problem

The coding round still exists and you'll still get one. But its predictive weight has collapsed, and the cause is straightforward: the signal stopped working.

By late 2024, Claude Sonnet could solve roughly 80% of LeetCode mediums on the first try in under thirty seconds. GPT-4 wasn't far behind. By 2025, basically every engineer I work with was using these tools daily for actual work. The skill of producing a clean two-sum implementation in twelve minutes stopped predicting anything useful about on-the-job performance, because anyone with an API key could do it with minimal effort.

There was no reliable way for an interviewer to distinguish a candidate who had ground through the NeetCode 150 last weekend from one who was genuinely fluent at production engineering. Same output, same speed, same clean code.

Companies responded in two directions. Meta piloted AI-aware coding rounds in late 2025 using CoderPad with Claude and GPT-4o-mini available to candidates. You get the tools, and the interviewer watches how you use them: can you tell when the model is wrong, do you reach for it on architecture decisions or just boilerplate, do you verify the output or trust it without checking? OpenAI runs something similar with screen sharing and live narration. The scoring criterion shifted from "can you write code" to "can you work with AI the way a competent engineer would."

The other response was to pull back to in-person. In-person interview rounds went from 24% of all technical interviews in 2022 to 38% in 2025, driven largely by cheating concerns. Microsoft now runs split rounds for some ML positions, one session with AI tools and one without, scored separately.

The coding round as the main event is over. For ML engineers this is especially pronounced, since the work itself is already heavily AI-assisted and coding ability was always a weaker proxy for the skills that actually matter in ML roles.

What actually got harder

The weight moved to the rounds where the signal is still clean, and for ML engineers those rounds have gotten noticeably harder in the last two years. Model reasoning, evaluation strategy, and system failure modes are now the primary signal.

Reasoning under ambiguity. ML system design used to mean "design a recommendation system." Broad, predictable, a surface most candidates had rehearsed. Today it sounds more like: "Our ranking model is suddenly returning stale results for three percent of users in Southeast Asia. What's your first hour?" There's no clean answer. The interviewer is watching how you frame the problem, what you choose to ignore, which hypotheses you prioritize, and what you'd verify first. Candidates who fall back on a memorized template get filtered out. Candidates who slow down, ask clarifying questions, and reason through the problem explicitly tend to get strong-hire notes.

ML system design with real constraints. Not "design YouTube" in the abstract. The question is more like: design YouTube with a 100ms latency budget, a downstream team that already depends on your output format, a vendor relationship you inherited and can't change, and an A/B test that's been running three weeks with flat results. The constraints are part of the question now. The interviewer wants to see whether you can hold the actual complexity of a production decision in your head, not whether you can draw a tidy architecture diagram on a whiteboard.

AI-collaboration literacy. This one is newer, and it's the dimension most candidates are least prepared for. Can you describe how you'd build a system with AI tooling in the loop without sounding like it's 2022? Do you have a real opinion about when to verify an LLM output versus when it's safe to trust it? Anthropic and OpenAI score this explicitly. Google and Meta have started to. If you get through a 2026 ML loop without mentioning AI tooling unprompted, the interviewer may conclude you haven't kept up with how the work actually gets done.

Thinking out loud. The candidates who get strong-hire reports almost all reason audibly throughout the whole session. Not narrating what they're doing, but exposing their thought process: naming assumptions explicitly, flagging uncertainty, saying what they're considering before they decide to discard it. Candidates who silently work to a correct answer often score lower than candidates who reach a weaker answer with more visible reasoning. The interviewer is evaluating the process, not just the output.

What the model evaluation question actually tests

One question comes up in some form in almost every senior and staff ML loop right now: "How do you know your model is working?"

Most candidates answer with offline metrics. They describe accuracy or AUC on a held-out test set, maybe mention cross-validation. Interviewers hear this and write something like "shallow eval instincts" in their notes.

The stronger answer starts by separating offline evaluation from online evaluation and being explicit that they measure different things. Offline metrics tell you how your model performs on historical data under controlled conditions. Online metrics, usually surfaced through A/B tests, tell you what the model actually does to the system you care about in production. The gap between those two is where most ML systems run into their interesting failures, and candidates who don't acknowledge that gap come across as having only ever worked on toy problems.

From there, a staff-level answer gets into specifics: what's the north-star metric versus the proxy metric you're optimizing against, how you'd construct a golden evaluation set that doesn't rot as the data distribution shifts, and what you'd do if the online and offline signals disagree (which happens more often than candidates tend to expect).
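To make the "signals disagree" case concrete, here is a minimal sketch of how you might compare an offline metric change against an A/B test result. Everything in it is an assumption for illustration: the function names `two_proportion_z` and `compare_signals`, the conversion counts, and the 1.96 cutoff (a two-sided 5% test) are mine, not any company's rubric or tooling.

```python
import math


def two_proportion_z(conv_control, n_control, conv_treat, n_treat):
    """z-statistic for treatment minus control conversion rate (pooled standard error)."""
    p_control = conv_control / n_control
    p_treat = conv_treat / n_treat
    pooled = (conv_control + conv_treat) / (n_control + n_treat)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treat))
    if se == 0:
        raise ValueError("no variance in outcomes; z-statistic is undefined")
    return (p_treat - p_control) / se


def compare_signals(offline_delta, online_z, z_crit=1.96):
    """Say whether the offline metric change and the online test point the same way."""
    if abs(online_z) < z_crit:
        # The A/B test cannot separate the arms, so there is nothing to agree with yet.
        return "online inconclusive"
    offline_up = offline_delta > 0
    online_up = online_z > 0
    return "agree" if offline_up == online_up else "disagree"


# Hypothetical numbers: offline AUC improved by 0.01, and the A/B test moved
# conversion from 1,000/20,000 to 1,100/20,000.
z = two_proportion_z(1_000, 20_000, 1_100, 20_000)
print(round(z, 2), compare_signals(offline_delta=0.01, online_z=z))
```

The "disagree" branch is the one worth talking through in the room: it usually means the offline metric is a poor proxy for the north-star metric, or that the golden set no longer looks like live traffic.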

LLM-as-judge has become a common follow-up at AI-first companies. The question is usually something like: "Your team wants to use a large language model to evaluate output quality at scale. What are the failure modes?" Good candidates talk about the judge's own biases, position effects, verbosity preference, and the circular logic problem that shows up when the judge and the model share overlapping training data. Weaker candidates describe the setup without any critical analysis of it.
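One of those failure modes, position effects, is easy to probe. The sketch below assumes a hypothetical `judge(first, second)` callable that returns `"first"` or `"second"`; in practice that would wrap a model call, which I've replaced with two toy judges so the example runs on its own. The check is simply to show each pair in both orders and count how often the verdict follows the slot rather than the answer.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface: given two answers, return "first" or "second".
Judge = Callable[[str, str], str]


def position_flip_rate(judge: Judge, pairs: Sequence[Tuple[str, str]]) -> float:
    """Fraction of pairs where the verdict depends on presentation order."""
    if not pairs:
        raise ValueError("need at least one pair")
    flips = 0
    for a, b in pairs:
        forward = judge(a, b)   # a is shown first
        backward = judge(b, a)  # b is shown first
        # A consistent judge picks the same underlying answer both times, so the
        # slot it names must change. The same slot twice means the order decided it.
        if forward == backward:
            flips += 1
    return flips / len(pairs)


def always_first(first: str, second: str) -> str:
    """Toy judge with total position bias."""
    return "first"


def prefers_longer(first: str, second: str) -> str:
    """Toy judge with verbosity preference but no position bias."""
    return "first" if len(first) > len(second) else "second"


pairs = [("short answer", "a much longer and more detailed answer"),
         ("yes", "no, and here is why")]
print(position_flip_rate(always_first, pairs))
print(position_flip_rate(prefers_longer, pairs))
```

Note that the second toy judge passes the position check while still being biased toward length, which is the point: each failure mode needs its own probe.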

The evaluation question doesn't have a single correct answer. The interviewer is checking whether your mental model of the eval problem has the right complexity, or whether you've been treating evaluation as an afterthought that comes after model training is done.

Why the ML deep-dive is harder than it sounds

Many loops now include a round where the interviewer interrogates a specific ML system you've actually built. This sounds like good news for candidates with real experience, and it is, but it's also where a lot of experienced engineers lose points they didn't expect to lose.

The round typically starts with you describing a system, then quickly narrows to the decisions you made that were non-obvious. Why did you choose that architecture over the alternatives you considered? What was your data labeling strategy and what were its known weaknesses? How did you handle training-serving skew? What did your evaluation setup miss that you only discovered once the system was in production?

Candidates who've shipped real ML systems usually have reasonable answers to those questions. Where they tend to lose points is on the data side of the problem. Interviewers notice when candidates describe their model architecture in detail but gloss over feature engineering, labeling quality, and how they dealt with distribution shift over time. Those aren't peripheral concerns. They're usually where the interesting problems actually lived, and interviewers who run production systems know it.
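If you want something concrete to point at when distribution shift or training-serving skew comes up, one common choice is the population stability index (PSI) for a single feature. This is a minimal sketch under my own assumptions: the bucket edges, the sample values, and the helper names are hypothetical, and PSI is one option among several, not a method the interviewers I've talked to prescribe.

```python
import math


def bucket_fractions(values, edges, eps=1e-6):
    """Share of values in each bucket; eps keeps empty buckets out of log(0)."""
    if not values:
        raise ValueError("need at least one value")
    counts = [0] * (len(edges) + 1)
    for v in values:
        # Bucket index = number of edges strictly below v.
        counts[sum(v > e for e in edges)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(train_values, serving_values, edges):
    """Population stability index: sum over buckets of (s - t) * ln(s / t)."""
    train = bucket_fractions(train_values, edges)
    serving = bucket_fractions(serving_values, edges)
    return sum((s - t) * math.log(s / t) for t, s in zip(train, serving))


# Hypothetical feature values at training time versus in production.
train = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
serving = [0.6, 0.7, 0.8, 0.9, 0.9, 0.95, 0.7, 0.8]
edges = [0.25, 0.5, 0.75]

print(psi(train, train, edges))    # identical distributions
print(psi(train, serving, edges))  # serving values have drifted upward
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a large shift. Treat that threshold as a convention, not a guarantee, and be ready to say what you would do once the number crosses it.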

The practical prep implication is to write out the complete story of one or two systems before the loop: what the problem was, what the data looked like, what you tried that didn't work, how you knew when the model was good enough, and what failed in production that surprised you. The questions an interviewer asks during a deep-dive almost always come from that surface area.

A Meta E5 loop, then and now

What an E5 Meta ML loop (roughly L5 at Google) looked like in 2022 versus today:

2022: Two coding rounds, one ML system design, one ML breadth-and-depth round, one behavioral. Coding carried about 35-40% of the weight in committee. Behavioral was often treated as the lightest round.

2026: One coding round, often shorter, with pseudocode accepted for harder parts. One ML system design that's tightly constrained and scenario-based. One ML deep-dive on a system you've actually built, with the interviewer pushing on specific decisions. One behavioral round that's harder than it used to be, now probing for judgment, scope, conflict, and how you handle being wrong.

Coding accounts for maybe 15-20% of the weight now. System design and the deep-dive decide hire versus no-hire for most candidates. The behavioral round is no longer a formality.

If I were prepping today

A lot of the prep advice on the internet is outdated. "Do 500 LeetCode problems" was reasonable guidance in 2018, increasingly questionable by 2023, and counterproductive today. The diminishing returns past about 80-100 problems are steep. You're spending hours on a signal that accounts for maybe 15% of the decision instead of on signals that account for 40%.

Rough time allocation for a FAANG ML loop today:

  • 30% system design. Not whiteboarding practice but real-system reasoning. Read actual postmortems from Cloudflare, AWS, and GitHub. Walk through them as if you were in the room making the calls. Practice stating constraints before jumping to solutions.
  • 25% behavioral. Most ML engineers underprep here. Practice telling stories about non-obvious judgment calls from your own work. Practice being asked "what would you have done differently?" in a way that comes across as reflective rather than defensive.
  • 20% ML depth on systems you've shipped. Write out the full story of two systems before the loop, including the data decisions and the production failures.
  • 15% coding. Sixty to eighty problems covering the major patterns is enough. Past that point, the returns drop sharply.
  • 10% AI-collaboration practice. Run mock interviews where you reason out loud while using AI tools, and practice being explicit about when you'd verify output versus trust it.

This allocation looks off if you're calibrated to older advice. It's calibrated to the interview you're actually walking into.
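The percentages above turn into a schedule with a few lines of arithmetic. This is a small sketch, assuming a total prep budget of 120 hours; that figure and the `prep_hours` helper are mine, chosen only to make the numbers round.

```python
# Shares from the allocation above, as whole percentages.
ALLOCATION_PCT = {
    "system design": 30,
    "behavioral": 25,
    "ML depth on shipped systems": 20,
    "coding": 15,
    "AI-collaboration practice": 10,
}


def prep_hours(total_hours, allocation_pct=ALLOCATION_PCT):
    """Split a total prep budget across areas according to percentage shares."""
    if sum(allocation_pct.values()) != 100:
        raise ValueError("allocation must sum to 100 percent")
    return {area: total_hours * pct / 100 for area, pct in allocation_pct.items()}


# Hypothetical budget: 120 hours, for example 20 hours a week for six weeks.
for area, hours in prep_hours(120).items():
    print(f"{area}: {hours:.0f} h")
```

On a 120-hour budget that leaves 18 hours for coding, which is the part that feels wrong if you're used to the older advice.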

What's actually different about this bar

The modern ML loop isn't primarily testing what you know. It's testing how you reason when you don't have a clean answer in front of you.

LeetCode prep built pattern recognition: see problem, recognize pattern, execute the solution. That model worked when interviews were testing whether you could execute known patterns under time pressure. Current ML loops are testing something different. They're testing whether you can structure an ambiguous problem, hold incomplete information while working through it systematically, and explain your reasoning to someone who is actively looking for the gaps.

The candidates who get offers don't always have better answers. They slow down when they don't know something, name it explicitly, and work through it in the open rather than guessing confidently or going quiet. They propose alternatives and explain why they're choosing one. They say "I'm not sure about this part, but here's how I'd think through it" rather than projecting certainty they don't have.

My friend lost his loop in 2024 because he'd prepared for a version of the interview that no longer carries much weight. For most experienced ML engineers, the underlying skills are already there. Production ML work requires exactly this kind of reasoning. The gap is usually just practice making that reasoning visible to a stranger across a video call, under time pressure, on problems you haven't seen before.
