Measuring Coaching ROI Beyond Satisfaction Scores

Rachel Kim

Every L&D program at a mid-market company eventually faces the same reckoning: a budget review where a CFO or COO asks what the company is getting for its investment in leadership development. The standard answer — high NPS, good completion rates, positive participant feedback — rarely survives that conversation. Finance professionals recognize satisfaction scores as insufficient evidence of business impact, and they're right.

The problem isn't that HR teams don't want better metrics. The problem is that the connection between leadership development inputs and business outcomes is genuinely hard to establish, and most available measurement frameworks either stop too early in the causal chain (measuring learning rather than behavior change) or require causal inference methods that HR teams don't have the analytical capacity to execute.

I want to share the measurement model we've developed and refined with teams that are using Coachvyne over a sustained period. It's not a perfect model — no L&D measurement model is — but it produces evidence that holds up in finance conversations, which is the bar that matters.

Why satisfaction scores fail as ROI evidence

Post-session NPS and "would you recommend this program" scores have a specific structural problem as ROI evidence: they measure participant satisfaction with the experience, not change in the behavior the experience was designed to develop. These two things are related but far from identical.

A leader can find a coaching session highly valuable and engaging while not changing their behavior at all, because the session produced insight without practice, and insight alone rarely produces behavior change at the leadership level. Conversely, a session that a leader found uncomfortable or challenging because it surfaced a real gap can produce significant behavior change even if the NPS is mediocre. The discomfort that comes with learning is not the same thing as dissatisfaction, but a satisfaction survey can't tell the two apart.

The Kirkpatrick model, which most L&D professionals know, describes four levels of evaluation: reaction, learning, behavior, and results. Most programs measure Level 1 (reaction) and occasionally Level 2 (learning via a knowledge quiz or post-session assessment). Level 3 (behavior change) and Level 4 (business results) get skipped because they're harder and slower to measure. But Level 1 and Level 2 evidence doesn't satisfy a finance conversation that's asking about Level 4.

Layer 1: Dimension score movement as a measure of skill acquisition

The first layer of a meaningful ROI measurement model is tracking dimension score movement across sessions — not a single session, but the trajectory over 3–6 sessions covering a 60–90 day development period. This measures something closer to Level 3 behavior change than a post-session quiz does, because the dimension score is derived from observable behavior in a scenario, not from what the leader says they learned.

What constitutes meaningful movement? In our session data, we treat a 1.5+ point improvement in a targeted dimension — a dimension that was below 5.5 on the first session and was the focus of micro-drill work between sessions — as evidence of skill acquisition. Movement below 1.5 points in a targeted dimension is within the noise range of session-to-session variance and doesn't indicate reliable skill change.

For this layer to be useful as ROI evidence, you need two things: a baseline assessment before the development program starts (so you have a starting point), and a defined set of target dimensions for each participant (so you're measuring change in the dimensions that were actually developed, not looking for movement anywhere in the profile). Without both, you're measuring whether scores went up somewhere, which isn't the same as measuring whether your program produced the skill change it was designed to produce.
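As a rough illustration, the Layer 1 rule can be sketched in a few lines of Python. This is a minimal sketch, not how Coachvyne computes anything: the DimensionTrajectory structure, the function name, and the sample scores are all hypothetical, and only the 1.5-point and 5.5 thresholds come from the text above. It also simplifies "trajectory" to the gain from the baseline session to the most recent one.

```python
from dataclasses import dataclass

# Thresholds taken from the text; the score scale itself is not specified there.
MIN_MEANINGFUL_GAIN = 1.5   # smaller gains are treated as session-to-session noise
TARGET_BASELINE_MAX = 5.5   # a targeted dimension starts below this at baseline


@dataclass
class DimensionTrajectory:
    """Scores for one dimension across sessions (hypothetical structure)."""
    dimension: str
    scores: list[float]   # ordered by session; scores[0] is the baseline
    targeted: bool        # True if micro-drill work focused on this dimension


def shows_skill_acquisition(t: DimensionTrajectory) -> bool:
    """True only for targeted, low-baseline dimensions that gained 1.5+ points."""
    if not t.targeted or len(t.scores) < 2:
        return False  # no target defined, or no baseline to compare against
    baseline, latest = t.scores[0], t.scores[-1]
    if baseline >= TARGET_BASELINE_MAX:
        return False  # not a development target under the rule above
    return (latest - baseline) >= MIN_MEANINGFUL_GAIN


# Made-up scores, for illustration only.
profile = [
    DimensionTrajectory("Accountability Framing", [4.8, 5.1, 5.9, 6.4], targeted=True),
    DimensionTrajectory("Example Dimension B", [5.0, 5.4, 5.9], targeted=True),
    DimensionTrajectory("Example Dimension C", [6.0, 8.0], targeted=False),
]
acquired = [t.dimension for t in profile if shows_skill_acquisition(t)]
print(acquired)
```

Note that the untargeted dimension is ignored even though its score rose by two points. That is the point of defining targets up front: movement only counts where the program was aiming.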

Dimension score movement is still a proximal measure — it shows that the leader can perform better in a simulation context, not necessarily that the behavior transferred to real work situations. That's why it's the first layer rather than the only layer.

Layer 2: Manager-observed behavior change at the 60-day mark

The second layer is a targeted manager observation at 60 days after the development program starts. This is specifically designed to address the transfer question: is the behavior that improved in simulation also showing up in real work situations?

The key design detail here is specificity. A generic manager observation checklist ("demonstrates effective leadership in cross-functional situations — 1 to 5") produces noise. A targeted observation prompt that mirrors the specific behavior tracked in the dimension score produces signal. If a participant's development program targeted Accountability Framing, the manager observation prompt is: "In the last 30 days, when did [name] own a missed commitment or team failure without deflecting to context or external factors? Describe a specific instance."

The specificity forces the manager to recall a concrete behavioral instance rather than providing a holistic impression. If they can cite one, that's evidence of transfer. If they can't, that's a signal worth investigating — it might mean the behavior changed in simulation but didn't transfer, or it might mean the manager hasn't had sufficient observation of situations that would elicit the behavior.

We're not suggesting that manager observations alone are sufficient ROI evidence. They're subjective and susceptible to halo effects. But combined with dimension score movement, they provide a two-source confirmation that something real changed in how the leader operates — which is considerably more persuasive than a single data source.
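To make the two-source logic concrete, here is a small Python sketch that builds the targeted prompt and combines the Layer 1 and Layer 2 signals. The prompt wording for Accountability Framing is quoted from above; the function names, the status labels, and the participant name are hypothetical, and a real process would need a targeted prompt for each dimension being developed.

```python
# Prompt wording for Accountability Framing is quoted from the article.
# Other dimensions would need their own targeted prompts (not shown).
OBSERVATION_PROMPTS = {
    "Accountability Framing": (
        "In the last 30 days, when did {name} own a missed commitment or team "
        "failure without deflecting to context or external factors? "
        "Describe a specific instance."
    ),
}


def build_prompt(dimension: str, name: str) -> str:
    """Fill the participant's name into the targeted observation prompt."""
    return OBSERVATION_PROMPTS[dimension].format(name=name)


def transfer_signal(score_gain_met: bool, manager_cited_instance: bool) -> str:
    """Combine simulation evidence (Layer 1) with manager observation (Layer 2)."""
    if score_gain_met and manager_cited_instance:
        return "confirmed"          # two independent sources agree
    if score_gain_met:
        return "investigate"        # no transfer, or no chance to observe it
    if manager_cited_instance:
        return "observation-only"   # single subjective source
    return "no-evidence"


print(build_prompt("Accountability Framing", "Jordan"))
print(transfer_signal(score_gain_met=True, manager_cited_instance=False))
```

The "investigate" label deliberately does not say "failed to transfer", since a missing instance may only mean the manager hasn't seen a situation that would elicit the behavior.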

Layer 3: Promotion velocity and internal fill rate as program-level outcomes

The third layer requires the longest timeline but produces the strongest evidence for finance conversations. Promotion velocity — the average time from program entry to VP promotion for participants — and internal fill rate for VP roles are both measurable outcomes that connect to real business costs.

The finance case for faster VP pipeline development is straightforward. An external VP hire costs roughly $25,000–$45,000 in recruiting fees plus a meaningful premium in total compensation over an equivalent internal promotion. Time-to-contribution for an external hire runs 6–12 months longer than for an internal promotion because of the institutional knowledge gap. And external VP hires have higher early-tenure attrition than internal promotions — when an external VP doesn't work out in the first 18 months, the replacement cost compounds.

A program that accelerates internal VP promotions by an average of 6–8 months per participant, and increases the internal fill rate for VP roles by 15–20 percentage points, produces a cost avoidance figure that is real and computable. At a company that makes 4–6 VP hires per year, moving 2 of those from external to internal saves roughly $50K–$90K in recruiting fees alone at the $25,000–$45,000 per-hire range above, without counting the compensation premium, the productivity difference, or the attrition risk.
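The fee portion of that cost avoidance is simple enough to sketch in Python. The function name is hypothetical, and the sketch counts only the per-hire recruiting fee range quoted earlier ($25,000–$45,000). The compensation premium, ramp-up time, and early attrition are left out, so any other direct costs a team tracks would need to be added as further inputs.

```python
def recruiting_fee_avoidance(
    hires_moved_internal: int,
    fee_low: float = 25_000,    # low end of the per-hire fee range in the text
    fee_high: float = 45_000,   # high end of the per-hire fee range in the text
) -> tuple[float, float]:
    """Return the (low, high) recruiting fees avoided by filling roles internally."""
    return hires_moved_internal * fee_low, hires_moved_internal * fee_high


# Two VP roles per year moved from external hire to internal promotion.
low, high = recruiting_fee_avoidance(2)
print(f"Fees avoided: ${low:,.0f} to ${high:,.0f} per year")
```

Keeping the fee range as explicit parameters makes it easy to substitute the fees your own finance team recognizes, which tends to matter more in a budget review than the exact figure.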

The measurement challenge at this layer is attribution — how much of the promotion velocity improvement is due to the development program versus selection effects (you identified your best directors for the program in the first place)? Honest answer: you can't fully control for selection effects without a randomized controlled trial, which isn't feasible in most organizational contexts. What you can do is track the same cohort over time and compare their outcomes to a matched comparison group of directors who didn't participate in the program, controlling for tenure and initial performance rating.
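One way to build such a comparison group is a simple one-to-one match, sketched below in Python. Everything here is an assumption made for illustration: the Director fields, the rule of exact rating match plus tenure within one year, and the made-up records. The sketch also drops anyone not yet promoted from the average, which a real analysis would need to handle more carefully.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Director:
    """One director's record (hypothetical fields)."""
    name: str
    tenure_years: float
    initial_rating: int             # performance rating at program start
    months_to_vp: Optional[float]   # None if not promoted in the tracking window
    in_program: bool


def match_comparison_group(people, max_tenure_gap=1.0):
    """Pair each participant with the closest-tenure non-participant of equal rating."""
    participants = [p for p in people if p.in_program]
    pool = [p for p in people if not p.in_program]
    pairs = []
    for p in participants:
        candidates = [
            c for c in pool
            if c.initial_rating == p.initial_rating
            and abs(c.tenure_years - p.tenure_years) <= max_tenure_gap
        ]
        if not candidates:
            continue  # unmatched participants are left out of the comparison
        best = min(candidates, key=lambda c: abs(c.tenure_years - p.tenure_years))
        pool.remove(best)  # match without replacement
        pairs.append((p, best))
    return pairs


def mean_months_to_vp(group):
    """Average promotion time over members promoted in the window, else None."""
    times = [d.months_to_vp for d in group if d.months_to_vp is not None]
    return mean(times) if times else None


# Made-up records, for illustration only.
people = [
    Director("P1", 3.0, 4, 20.0, True),
    Director("P2", 5.0, 5, 18.0, True),
    Director("P3", 8.0, 3, None, True),   # no comparable peer, so unmatched
    Director("C1", 3.5, 4, 28.0, False),
    Director("C2", 4.5, 5, 24.0, False),
    Director("C3", 2.0, 4, 30.0, False),
]
pairs = match_comparison_group(people)
program_avg = mean_months_to_vp([p for p, _ in pairs])
comparison_avg = mean_months_to_vp([c for _, c in pairs])
print(program_avg, comparison_avg)
```

Matching narrows the selection-effect problem but does not remove it: two directors with the same tenure and rating can still differ in ways that influenced who was picked for the program.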

That comparison isn't rigorous causal inference, but it's good enough for a business case. Finance isn't demanding academic standards of evidence — they're asking whether the investment makes sense given the available evidence, and a well-constructed comparison produces evidence that's far stronger than an NPS score.

Layer 4: Retention of high-potential leaders at 18 months

The final layer is retention, and it deserves its own discussion because it often gets overlooked in ROI conversations. Leadership development has a well-documented effect on retention: leaders who are in structured development programs report higher engagement and are significantly less likely to accept external offers during the program period. The mechanism is straightforward — a leader who is visibly invested in and developing within their current organization has a higher bar for the external opportunity that would make them leave.

The retention ROI calculation mirrors the promotion velocity calculation: instead of counting the cost avoided by promoting internally, it counts the cost avoided by not having to replace a leader who leaves. VP-level attrition costs are typically estimated at 150–200% of annual total compensation, covering replacement cost, productivity loss during transition, and institutional knowledge loss. Against a figure of that size, a program that retains even one high-potential director who would otherwise have left at the 12-month mark can pay for itself on that outcome alone.
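The break-even arithmetic can be sketched in Python as follows. Only the 150–200% multiple comes from the text. The compensation and program cost figures are placeholders chosen to make the arithmetic visible, not benchmarks, and the function names are hypothetical.

```python
import math


def attrition_cost_range(annual_total_comp, low_multiple=1.5, high_multiple=2.0):
    """Estimated cost of losing one leader, using the 150-200% range from the text."""
    return annual_total_comp * low_multiple, annual_total_comp * high_multiple


def retained_leaders_to_break_even(program_cost, annual_total_comp, multiple=1.5):
    """Leaders the program must retain to cover its cost at the conservative multiple."""
    return math.ceil(program_cost / (annual_total_comp * multiple))


# Placeholder inputs (assumptions, not data from any real program).
comp = 300_000          # hypothetical annual total compensation
program_cost = 200_000  # hypothetical annual program budget

low, high = attrition_cost_range(comp)
print(f"Cost of one departure: ${low:,.0f} to ${high:,.0f}")
print("Retentions needed to break even:",
      retained_leaders_to_break_even(program_cost, comp))
```

Using the low end of the multiple for the break-even count is a deliberate choice: a conservative estimate is harder for a finance reviewer to dismiss.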

The measurement here is simpler: track 18-month retention rates for program participants versus a matched comparison group of directors who weren't in the program. If the retention rate difference is meaningful — 15+ percentage points — that's a quantifiable business outcome.
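A minimal Python sketch of that comparison is below. The cohort counts are invented for illustration, the function names are hypothetical, and the 15-point threshold is the one stated above. With cohorts this small a gap can easily be noise, so it is worth reporting the raw counts alongside the percentage.

```python
MEANINGFUL_GAP_PP = 15  # threshold from the text, in percentage points


def retention_rate_pct(retained: int, cohort_size: int) -> float:
    """Share of the cohort still employed at 18 months, as a percentage."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    return 100 * retained / cohort_size


def retention_gap_pp(program: tuple[int, int], comparison: tuple[int, int]) -> float:
    """Program retention minus comparison retention, in percentage points.

    Each argument is (retained_at_18_months, cohort_size).
    """
    return retention_rate_pct(*program) - retention_rate_pct(*comparison)


# Made-up cohorts: 18 of 20 participants retained vs. 14 of 20 matched peers.
gap = retention_gap_pp(program=(18, 20), comparison=(14, 20))
is_meaningful = gap >= MEANINGFUL_GAP_PP
print(f"Gap: {gap:.0f} percentage points, meaningful: {is_meaningful}")
```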

Assembling the measurement stack for a CFO conversation

These four layers — dimension score movement, manager-observed behavior change, promotion velocity and internal fill rate, and 18-month retention — don't all need to be available at the same time for a meaningful ROI case. In the first six months of a program, you'll have Layer 1 and Layer 2. By month 12–18, you'll have initial Layer 3 and Layer 4 signals. By month 24+, you'll have a full picture.

The mistake is waiting until you have all four layers before making the ROI case. Use what you have, be explicit about what each layer measures and what its limitations are, and show the trajectory. Finance teams are generally more comfortable with honest framing of methodological limitations than with a clean-looking model that papers over uncertainty.

The goal isn't a perfect ROI calculation. The goal is evidence that justifies continued investment and shows the direction of program impact. Four imperfect layers of measurement, assembled honestly, accomplish that better than one layer of perfect data that stops at satisfaction scores.