The Hiring Loop Is Broken: How AI Tools Destroyed the Coding Test — and What Engineering Leaders Are Using Instead

LeetCode is dead as a signal. Take-home assignments are solved in twenty minutes by any candidate with a Claude subscription. The technical hiring frameworks that engineering leaders spent years refining have been neutralized — and most organizations haven't replaced them with anything. Here's what the most rigorous engineering teams are actually doing now, and why the same verification crisis applies to every outsourcing relationship you manage.

The Assessment That Stopped Working

Somewhere in early 2025, a quiet consensus formed among engineering hiring managers: the take-home coding assignment had stopped being a reliable signal. Candidates who struggled through a video call were submitting polished, well-commented, architecturally sound solutions 48 hours later. The scores were excellent. The hiring decisions were increasingly wrong.

By the first quarter of 2026, the problem had escalated from a hiring manager's frustration into a structural crisis for engineering recruitment. A February 2026 survey by Hired — covering over 3,200 engineering candidates and 480 hiring managers — found that 74% of candidates report using AI assistance on technical assessments, up from 31% eighteen months earlier. More troublingly, 68% of hiring managers report being unable to reliably distinguish AI-assisted submissions from unassisted ones. The gap between what assessments are designed to measure and what they actually measure has become so wide that many engineering leaders are quietly treating technical assessments as near-worthless inputs to hiring decisions.

This is not a cheating problem. It is a design problem. The coding tests, take-home projects, and LeetCode-style algorithm challenges that have dominated technical hiring for the past decade were designed to evaluate a specific set of skills — the ability to recall algorithms, translate requirements into code, and produce working implementations within bounded time and context. Those are real skills. They are also, in 2026, skills that any reasonably capable AI assistant can demonstrate on behalf of the candidate who prompts it correctly.

The consequence is not just that bad candidates are getting through screens. It is that the signal-to-noise ratio of technical assessments has collapsed to the point where high-effort, rigorous evaluation is producing roughly the same hiring accuracy as no evaluation at all. Engineering leaders are spending significant engineering-manager hours on processes that are delivering outcomes barely better than random — and in some cases, actively selecting for AI-prompting skill over engineering judgment.

Why the Standard Fixes Aren't Working

The first-order response to the AI assessment problem — making tests harder, adding proctoring, restricting internet access — has largely failed. The evidence from organizations that have tried it is consistent enough to be instructive.

Harder tests select more aggressively against candidates from non-traditional backgrounds, candidates who are returning to work after career breaks, and candidates whose strengths lie in system design and pragmatic problem-solving rather than algorithmic recall. They also don't actually solve the AI problem — a sufficiently capable model can pass most 'hard' LeetCode-style challenges that humans find genuinely difficult. The signal improvement is marginal; the candidate pool narrowing is real.

Proctoring solutions — screen recording, webcam monitoring, keystroke analysis — have a documented bias problem and a documented efficacy problem. A 2025 HireVue internal study leaked to The Markup found that proctored assessments reduced candidate completion rates by 34% while failing to detect AI assistance in the majority of cases where it was used. Candidates who are sophisticated enough to use AI assistance are also sophisticated enough to use it in ways that don't trigger behavioral anomaly detection.

Live coding sessions with interviewers watching in real time have held up better than asynchronous assessments, because it's harder to invisibly prompt an LLM when you're on a video call. But live coding sessions have their own documented problems: they measure performance under artificial social pressure rather than actual engineering competence, they introduce interviewer bias at a higher rate than structured alternatives, and they favor candidates who naturally think aloud, a trait that research does not consistently link to job performance.

The organizations that have doubled down on traditional assessment methods, making them more elaborate and rigorous, have mostly succeeded in creating longer hiring processes with roughly the same quality outcomes. The signal that technical assessments were always providing — a rough filter for minimum competence at coding tasks — has been degraded to near-zero for AI-familiar candidates. Elaborating the filter does not restore the signal.

Key Takeaways

  • Harder tests don't defeat AI assistance — they just narrow the candidate pool with no corresponding accuracy gain
  • Proctoring solutions reduce candidate completion rates by ~34% while detecting AI assistance in a minority of actual cases
  • Live coding under observation measures stress performance, not engineering competence — and introduces higher interviewer bias
  • Elaborate versions of broken assessments produce longer hiring processes with similar quality outcomes

What Is Actually Working: The New Evaluation Stack

The engineering organizations that have rethought technical evaluation from first principles — rather than patching existing processes — are converging on a set of methods that share a common design insight: rather than trying to prevent AI use, design evaluations where AI assistance is either permitted explicitly or structurally irrelevant to the signal being measured.

Contextual problem framing. Instead of asking candidates to solve a problem, ask them to define one. Present a realistic system description — a legacy codebase narrative, an architecture diagram with obvious constraints, an incident postmortem — and ask the candidate to identify the most important engineering problems, articulate the tradeoffs in addressing them, and describe how they would approach diagnosis. This format measures precisely the judgment that senior engineering roles require and that AI models consistently struggle with: the ability to read organizational and technical context and identify what actually matters. A model given the same prompt will typically produce a comprehensive but context-insensitive analysis. A strong candidate will identify the specific tensions that make the situation interesting.

Architecture critique sessions. Present a candidate with a real (anonymized) architecture from a past project, complete with the genuine mistakes and tradeoffs that real systems accumulate, and ask them to identify weaknesses, propose improvements, and defend their reasoning in dialogue with the interviewer. The dialogue component is what matters: a model can identify architecture problems, but it cannot sustain a defense of a specific position under follow-up questioning from someone with domain context. The signal in these sessions comes from how candidates respond to pushback, not from their initial analysis.

AI-allowed work samples. A growing number of engineering teams have flipped the script entirely: explicitly inviting candidates to use any AI tools they want during a work sample, and evaluating the quality of the output and the candidate's ability to explain, extend, and validate it. This approach is counterintuitive but revealing. The delta between what different candidates produce when given identical access to AI tools is significant — and it maps well to actual performance differences. A senior engineer using AI produces a more robust, better-reasoned output than a junior engineer using the same tools. The gap in judgment doesn't disappear; it becomes visible at a higher level of abstraction.

Contribution portfolio analysis. For candidates with public engineering histories — open source contributions, GitHub activity, technical writing, conference talks — systematic analysis of their contribution record provides a signal that AI cannot retroactively fabricate. Not the volume of commits, but the nature of them: Do they fix the right bugs? Do their code reviews show genuine understanding? Do their pull request descriptions demonstrate reasoning about tradeoffs? This is time-intensive and not universally applicable, but for senior engineering roles, it is the most reliable signal available.

Key Takeaways

  • Contextual problem framing measures judgment that AI models consistently underperform on — identifying what matters in a real situation
  • Architecture critique sessions: the dialogue component defeats AI, because models can't sustain a position under domain-expert pushback
  • AI-allowed work samples reveal judgment differences at a higher level of abstraction — the senior/junior gap persists regardless of tool access
  • Contribution portfolio analysis provides a signal AI cannot retroactively fabricate — but is time-intensive and requires a public history

The Trial Engagement Model: Hiring for What You Can Observe

Perhaps the most significant structural shift in engineering hiring practice is the quiet expansion of paid trial engagements. Rather than evaluating candidates through simulated work, more engineering organizations are evaluating them through actual work — paid short-term projects (typically two to four weeks) on real but non-critical problems, conducted before a full employment offer.

This model has existed in software development for decades, particularly in agencies and consulting firms. What's new in 2026 is its migration into product engineering organizations that previously relied exclusively on interview-based evaluation. A March 2026 survey by Stack Overflow found that 28% of software companies with more than 100 engineers now use some form of paid trial engagement as a standard component of senior engineering hiring — up from 8% in 2023.

The appeal is obvious: you observe how candidates actually work, not how they perform under artificial assessment conditions. You see how they handle ambiguity, how they communicate with teammates, how they respond to feedback, how they make decisions when the requirements aren't fully specified. These are precisely the dimensions that conventional assessments measure poorly and that predict actual job performance reliably.

The objections are real but manageable. Trial engagements extend hiring timelines: the average two-to-four-week trial adds three to six weeks to a process that many candidates are running in parallel with other opportunities. The competitive risk is genuine: top candidates won't always accept a trial before an offer. Managing the IP and confidentiality implications of work performed by candidates who don't ultimately join requires explicit legal structure. And the evaluation criteria for a trial engagement are inherently more subjective than a scored coding test, which means that the bias risks that structured assessments were designed to mitigate resurface if trial evaluation isn't disciplined.

The organizations using this model effectively have addressed these objections through three practices: keeping trial projects genuinely bounded and contained (not full-scale production work, not unpaid), building structured evaluation rubrics for trial engagement observation that match the criteria used for performance evaluation of existing employees, and offering meaningful compensation that signals that the organization values the candidate's time and work.
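One way to keep trial evaluation disciplined is to write the rubric down as data, so every evaluator rates the same criteria and has to attach evidence to each rating. The sketch below is a minimal illustration in Python, not a method any surveyed organization is known to use. The criterion names are taken from the behaviors described above, but the weights, the 1-5 scale, and the names `CRITERIA_WEIGHTS`, `Observation`, and `trial_score` are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical criteria and weights, for illustration only. A real rubric
# would reuse the criteria from the team's own performance reviews.
CRITERIA_WEIGHTS = {
    "ambiguity_handling": 0.30,
    "communication": 0.25,
    "response_to_feedback": 0.20,
    "decision_quality": 0.25,
}


@dataclass(frozen=True)
class Observation:
    """One evaluator's 1-5 rating of one criterion, backed by written evidence."""
    evaluator: str
    criterion: str
    rating: int
    evidence: str

    def __post_init__(self):
        if self.criterion not in CRITERIA_WEIGHTS:
            raise ValueError(f"unknown criterion: {self.criterion}")
        if not 1 <= self.rating <= 5:
            raise ValueError("rating must be between 1 and 5")
        if not self.evidence.strip():
            raise ValueError("every rating needs a concrete observed example")


def trial_score(observations):
    """Weighted mean of the per-criterion average ratings (1-5 scale)."""
    by_criterion = {name: [] for name in CRITERIA_WEIGHTS}
    for obs in observations:
        by_criterion[obs.criterion].append(obs.rating)
    # Refuse to score a trial in which some criterion was never observed.
    missing = [name for name, ratings in by_criterion.items() if not ratings]
    if missing:
        raise ValueError(f"no observations for: {missing}")
    return sum(
        CRITERIA_WEIGHTS[name] * mean(ratings)
        for name, ratings in by_criterion.items()
    )


if __name__ == "__main__":
    # Made-up observations from two hypothetical evaluators.
    observations = [
        Observation("evaluator_a", "ambiguity_handling", 4, "Asked about data retention before designing the schema"),
        Observation("evaluator_b", "ambiguity_handling", 5, "Wrote down open questions and proposed defaults"),
        Observation("evaluator_a", "communication", 4, "Posted clear daily status notes"),
        Observation("evaluator_a", "response_to_feedback", 3, "Addressed review comments but needed a second round"),
        Observation("evaluator_b", "decision_quality", 4, "Chose the simpler queue design and explained why"),
    ]
    print(f"{trial_score(observations):.2f}")
```

Requiring an evidence string for every rating and rejecting trials with unobserved criteria are deliberate choices here: both push the evaluation toward recorded behavior rather than overall impression, which is where subjectivity bias tends to enter.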

Key Takeaways

  • 28% of software companies with 100+ engineers now use paid trial engagements for senior hiring — up from 8% in 2023
  • Trial engagements observe actual work behaviors: ambiguity handling, communication, decision-making under incomplete requirements
  • Risks are real: timeline extension, competitive disadvantage with in-demand candidates, IP management, and subjectivity bias
  • Effective implementations use bounded projects, structured evaluation rubrics, and meaningful compensation

The Outsourcing Verification Crisis Nobody Is Talking About

Engineering leaders who have updated their internal hiring practices often haven't carried the same logic through to their outsourced and nearshore team evaluation. This is a significant blind spot — because the same AI-era verification problem applies at least as acutely to evaluating external engineering partners.

The traditional approach to verifying an outsourcing partner's engineering capability follows a predictable pattern: technical interviews with candidate engineers (sometimes including coding assessments), review of portfolio projects and references, and assessment of the team's process maturity. All three of these components are now as compromised as the internal hiring processes they were modeled on.

Technical interviews with offshore and nearshore candidates have the same AI assistance problem as any other interview format, with an additional complication: the conditions that make in-person live coding interviews somewhat resistant to AI use (shared physical space, direct line of sight, the difficulty of covertly prompting a model while being watched) are weakened or absent in remote formats. A candidate taking a video-based technical assessment from home has more opportunity to use AI assistance undetected than an in-person candidate, not less.

Portfolio review is similarly degraded. AI tools have dramatically lowered the floor on what a portfolio project can look like without genuine engineering ability behind it. A portfolio built with significant AI assistance can display impressive surface-level quality — clean code, comprehensive test coverage, well-structured documentation — while the engineer behind it lacks the depth to architect new systems, debug non-trivial failures, or make sound technical tradeoffs under ambiguity.

The verification problem is particularly acute for critical outsourcing decisions — cases where engineering quality directly affects product reliability, security, or compliance. Signing a multi-year engagement with an outsourced team whose actual capability level you've misjudged is a different magnitude of error than making a bad hiring decision for a single employee. The cost of discovery — when the capability gap surfaces in production — is commensurately higher.

Key Takeaways

  • Outsourced team verification uses the same compromised methods as internal hiring — with fewer safeguards in remote assessment formats
  • Remote technical interviews are more susceptible to covert AI assistance than in-person formats, not less
  • AI-assisted portfolios can display impressive surface quality while masking lack of depth for novel problems
  • The cost of a capability misjudgment in outsourcing is higher than in individual hiring — multi-year impact, not a single bad hire

A Better Framework for Evaluating Engineering Partners in the AI Era

The same design principles that are making internal technical hiring more robust translate directly to outsourced team evaluation — with some modifications specific to the partnership context.

Structured problem definition sessions. Before committing to a partner engagement, conduct a working session where senior engineers from the candidate vendor team are presented with a realistic technical challenge from your domain — not a coding problem, but a system design or architecture problem with genuine ambiguity and competing constraints — and asked to work through it in dialogue with your engineering leadership. The goal is not to evaluate whether they get the 'right' answer, but to observe their reasoning process: how they scope the problem, what questions they ask, how they handle the constraints they're given, how they respond to pushback. This is a format where genuine senior engineering judgment is visible, and where AI assistance provides substantially less cover than in a structured coding assessment.

Reference architecture review. Ask the candidate team to present and defend a technical architecture from a completed project — ideally one with genuine complexity, constraints, and post-deployment learning. The defense conversation — where your engineers ask specific questions about the decisions made, the tradeoffs accepted, and what the team would change in hindsight — is where the signal lives. A team that genuinely built and operated the system can answer these questions in a way that reveals deep familiarity. A team that built it with significant AI assistance, or that has overstated their involvement, typically cannot sustain the specificity that detailed follow-up requires.

Bounded paid pilots. Apply the same trial engagement logic to outsourcing evaluation. A paid, bounded pilot engagement — typically four to eight weeks, on a real but non-critical project — provides direct observation of how the team actually works: their communication discipline, their approach to scope ambiguity, their code quality in production context, and their ability to make good decisions without being over-specified. This is more expensive than a traditional vendor evaluation, but the cost is small relative to the risk of a multi-year commitment to the wrong partner.

Ownership-signal evaluation. The most reliable long-term signal for outsourced team quality is not competence at bounded tasks but the disposition to take genuine ownership of problems. Senior-led teams with real ownership orientation ask questions you haven't thought to answer, flag problems you haven't noticed, and make architectural judgments without being instructed. AI tools cannot replicate this disposition, because it requires caring about outcomes rather than tasks. Building evaluation criteria and ongoing governance processes that reward ownership behavior, and structuring contracts to incentivize it, is the most durable upgrade to outsourcing evaluation available.

Key Takeaways

  • Structured problem-definition sessions reveal reasoning process and judgment — AI assistance provides less cover than in coding assessments
  • Reference architecture defense sessions require specificity that AI-assisted work or overstated involvement cannot sustain under questioning
  • Bounded paid pilots provide direct work observation — expensive relative to interviews, cheap relative to a multi-year commitment to the wrong team
  • Ownership disposition is the signal AI cannot replicate — structure evaluation and contracts to surface and reward it

What This Means for the Engineering Talent Market in 2026

The broader consequence of the assessment crisis is a revaluation of engineering skills that the market is still processing. When coding tests reliably differentiated candidates, the market premium for engineers who could pass them was real — LeetCode grinding was, for a decade, a documented path to compensation premium. When coding tests stop working, the premium migrates toward the qualities that remaining evaluation methods actually measure.

The discussion on r/ExperiencedDevs, r/cscareerquestions, and engineering forums on X has been vigorous on this point throughout early 2026, and the emerging consensus is directionally consistent: the premium is shifting from task execution skills — coding speed, algorithm recall, implementation throughput — toward judgment, communication, and system thinking. This was always the right set of skills to value in senior engineering. The measurement problem prevented the market from pricing it correctly. The AI era has not changed what makes a great engineer — it has changed what evaluation methods can reliably distinguish.

For engineering leaders managing talent in this environment, the practical implication is a recalibration of hiring criteria at every seniority level. Junior engineers are increasingly evaluated less on coding ability (which AI augments aggressively) and more on learning trajectory, intellectual curiosity, and the quality of their questions: signals that correlate with the rate at which they'll develop the judgment that AI augmentation makes more valuable. Senior engineers are evaluated more on their demonstrated ability to make high-quality decisions under ambiguity and to take ownership of outcomes, the capabilities that AI makes more important even as it makes individual coding tasks less differentiating.

This shift has one important implication for the outsourcing market specifically: it accelerates the divergence between volume-oriented outsourcing models and outcome-oriented senior-led engagement models. In a world where AI makes average coding throughput a commodity, the value of a partner relationship built on judgment, ownership, and outcome responsibility increases rather than decreases. The assessment frameworks that engineering leaders use to evaluate outsourcing partners should reflect that reality — and so should the contract structures, governance processes, and KPI frameworks that govern ongoing relationships.

Key Takeaways

  • The market premium is migrating from task execution (coding speed, algorithm recall) to judgment, communication, and system thinking
  • Junior engineers increasingly evaluated on learning trajectory and question quality; seniors on judgment under ambiguity and ownership
  • The shift accelerates divergence between volume-oriented outsourcing (commoditized by AI) and senior-led outcome-oriented engagement (more valuable)
  • Evaluation frameworks, contract structures, and KPIs for outsourcing should reflect judgment and ownership as the primary value drivers

The Bottom Line

The technical assessment was never a perfect filter. It was a rough, gameable proxy for engineering competence that worked well enough when the tools available for gaming it were limited. AI has removed that limitation. The response is not nostalgia for a measurement regime that worked tolerably, or increasingly elaborate versions of processes that AI has made unreliable. It is a genuine rethink of what evaluation is for: not filtering candidates who can execute bounded tasks, but identifying engineers who exercise judgment, take ownership, and build things that hold up in production.

The methods that reveal those qualities (contextual problem framing, architecture critique dialogues, AI-allowed work samples with explanation requirements, bounded paid pilots, contribution portfolio analysis) are more demanding than a scored coding test. They require engineering leadership time and discipline. They are less scalable than automated assessment pipelines. But they measure what actually matters, and in an environment where AI has made the easier measurements worthless, that distinction is everything.

For CTOs managing outsourced and nearshore engineering teams, the same logic applies with equal force. The vendor evaluation frameworks built for the pre-AI era are producing increasingly noisy signals. The organizations that update those frameworks, treating judgment, ownership, and contextual reasoning as the primary evaluation criteria rather than coding throughput and portfolio surface quality, will build better engineering partnerships. The ones that keep running the same assessments will keep being surprised when capable-looking teams underperform on work that actually matters.

Written by

Ivan

Senior Engineer · StepTo

Ivan is a senior full-stack engineer at StepTo with a focus on cloud-native architecture, DevOps automation, and engineering team dynamics. He covers the intersection of agentic AI tools and real-world software delivery — from how teams adopt AI coding assistants to the organizational shifts that follow.
