CASE STUDY
Emotion is Signal
Amazon insights — when customer sentiment begins to behave like an economic primitive
Author’s Note
The moment that stayed with me wasn’t the model output.
It was the pause that followed it.
We had just shown that customer sentiment — long treated as anecdote — behaved like a causal input. Ranked. Directional. Economically material. The science landed. The implications were
clear.
And then the conversation moved on.
That disconnect — between what could be proven and what organizations were prepared to absorb — became as instructive as the analysis itself. This case documents the work that led there, and the lessons that followed.
Large organizations are exceptionally good at measuring what they can already see.
They track cost, speed, selection, uptime, and utilization. They forecast demand, optimize pricing, and instrument nearly every operational surface. What they struggle to do—especially at scale—is treat human emotion as something other than anecdote.
At Amazon Grocery, customer emotion was everywhere: in post-transaction surveys, Net Promoter Score (NPS), customer complaints, and free-text feedback that teams skimmed but
trusted inconsistently. Emotion was discussed frequently, but acted on cautiously. It was considered important—but fragile. Useful—but political. Insightful—but rarely decisive.
By 2018, NPS had become a shared executive language across Amazon. It showed whether things were improving or deteriorating. It told leadership when customers were unhappy. What it did not reliably tell them was what to do next.
This gap was particularly striking given Amazon’s strength elsewhere. The company had pioneered sophisticated predictive systems that used vast numbers of inputs—demand signals, lead times, substitution patterns, seasonality—to place inventory in the right location before customers ever clicked “buy.” These systems worked because Amazon knew how to identify which inputs mattered, how they interacted, and how to operationalize them continuously.
Customer emotion, by contrast, resisted that same treatment. It existed outside known frameworks, lacked a trusted structure, and remained largely unmodeled as an input variable. And it showed.
As the NPS metric matured, executives increasingly left reviews aligned on the direction of the score—but uncertain about intervention. Should teams invest in fixing missing items, improving
delivery windows, expanding selection, lowering prices, or refining in-store experience? Each function arrived with descriptive trends, year-over-year comparisons, and operational anecdotes. The result was an expensive, manual, and surprisingly low-tech game of whack-a-mole for a FAANG company in 2021.
Emotion was visible—but not yet usable.
This case explores what happened when customer sentiment was treated not as context or commentary, but as signal: something that could be modeled causally, ranked by relative importance, and tied directly to economic outcomes. Not perfectly. Not exhaustively. But
rigorously enough to change how decisions could be made.
By the time this work began, Amazon Grocery had no shortage of data. The organization ran twice-yearly relationship NPS (R-NPS) and comparative NPS (C-NPS) reports at scale. These reports provided a reliable 30,000-foot view of customer loyalty across geographies and channels. They were widely circulated, deeply reviewed, and politically visible.
They were also increasingly insufficient.
Leadership feedback had converged around a common frustration:
“We see the story. We agree it matters. But we don’t know what to do differently.”
The cadence was too slow for a rapidly evolving grocery business that was opening new stores, expanding delivery models, and recalibrating its identity post-pandemic. Six-month snapshots could not keep pace with operational reality. More importantly, the reports were descriptive by design. They showed what moved, but not why—and certainly not which levers mattered most.
On the operational side, teams relied heavily on Transactional NPS (TNPS) and Rate-My-Experience (RME) data. These surveys generated enormous volume and high response rates, but analysis remained trend-based: week-over-week, month-over-month, year-over-year. Teams spent significant human capital chasing fluctuations without a clear understanding of causal drivers.
Without a way to isolate which customer experiences disproportionately influenced loyalty, prioritization defaulted to intuition, hierarchy, and the loudest narrative in the room. Emotion—despite being the thing customers were explicitly telling Amazon about—remained downstream of “harder” metrics like cost, speed, and selection.
What leadership needed was not another scorecard. They needed a way to translate customer feeling into decision-grade signal.
The mandate that launched this work was intentionally vague.
A senior leader wanted to explore whether science and prediction could strengthen Amazon’s NPS insights. There were no detailed requirements, no committed headcount, and no executive goal attached. The work began in late 2021, during peak pandemic strain, inside an understaffed organization where most teams were focused on execution and compliance rather than long-horizon innovation.
In Amazon terms, this meant there was no roadmap—and no protection.
From a conventional Amazon career perspective, raising a hand for a project like this made little sense. It crossed too many functions, lacked a clear owner, and carried no obvious path to near-term impact or promotion. Several peers were candid about this. More than once, I was met with confusion rather than encouragement: Why take on something this ambiguous, this political, this exposed?
They weren’t wrong.
I was asked to lead the work largely because of my background in finance and accounting, combined with prior experience running the in-stock organization for Amazon’s largest retail category. I had seen firsthand how difficult it was for teams to know which customer issues truly mattered—and how costly it was to chase everything at once. But expertise alone doesn’t explain why I said yes.
By that point, I was operating from a different set of incentives.
What had changed was not my ambition, but its orientation. I was no longer optimizing for visibility or clean ownership. I was drawn instead to problems that sat between systems — the ones without natural homes, but with outsized consequences if left unresolved.
That choice was not neutral. While I worked through ambiguity with no headcount, no roadmap, and no guaranteed outcome, I watched peers take on clearer scopes, accrue visible wins, earn headcount, and move faster through promotion cycles. There were moments when it felt like I was falling behind — not because the work lacked rigor, but because it resisted easy attribution.
I understood the trade. I knew this path carried political risk and personal cost. I also knew that if the work succeeded, it would matter in a way safer bets would not.
I worked inside Amazon’s Benchmarking organization, a strategic advisory group with no P&L and no production mandate, but significant political exposure. Success required influence without
authority—or funding—across data science, economics, operations, retail leadership, and multiple vice presidents.
There was real risk in pushing this work forward. Demonstrating that customer emotion had causal economic impact would force prioritization decisions some teams might prefer to avoid. It would also challenge the assumption that sentiment was too soft or subjective to compete with traditional operational metrics.
Still, the problem felt worth solving.
Emotion was already embedded in the system. Customers were telling Amazon what mattered to them every day. The open question was whether that signal could be treated with the same rigor as any other input—modeled carefully enough to earn trust without flattening the human experience it represented.
I knew this path would be slower and less certain than safer alternatives. I chose it anyway.
That decision set the direction for everything that followed.
The core hypothesis was simple, but nontrivial to prove.
Customer emotion was already being measured at scale. The question was whether it merely described past experience—or whether it could function as a causal input into future behavior. If
emotion truly mattered, it should do more than correlate with loyalty. It should predict it. And if it could predict loyalty, it should ultimately show up in revenue.
Three hypotheses guided the work:
1. Customer sentiment has a causal relationship with future spend and churn.
2. Not all emotional experiences matter equally—a small number disproportionately shape loyalty.
3. These effects can be isolated, ranked, and quantified economically, even in a complex system like grocery.
The goal was not to explain everything customers felt. It was to identify which emotional signals were decision-relevant—the ones that justified reallocating real resources.
If the hypotheses held, emotion could move from commentary to constraint.
The methodological challenge was credibility.
Sentiment analysis often fails not because the data is weak, but because the modeling is loose. To be useful inside Amazon, the approach had to withstand scrutiny from economists, data scientists, and finance partners alike. Correlation would not be enough.
Data Foundation
The model linked Transactional NPS (TNPS) and Rate-My-Experience (RME) survey responses to individual customer behavior over time, allowing sentiment to be evaluated alongside concrete outcomes such as spend and churn.
The feature set included over one hundred variables spanning customer-reported experiences, operational outcomes, channel context, and customer tenure.
Importantly, price and selection were not included in the initial model. This was not philosophical, but structural. Survey design at the time did not consistently capture these inputs. Rather than introduce biased proxies, the MVP focused on variables that were reliably present, while partnering with the grocery team to expand survey design in later iterations.
Modeling Approach
We used gradient-boosted decision trees (LightGBM) combined with double machine learning techniques to approximate causal effects. Separate models were trained for promoters and detractors, segmented by customer type and channel.
This allowed us to estimate the relative causal importance of individual experiences—something descriptive trend analysis could not do. Prior approaches required teams to manually interpret shifts and guess at causes. The causal framework replaced that guesswork with ranked drivers.
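To make the "double machine learning" step concrete, the core idea can be sketched as cross-fitted partialling-out: predict both the outcome (spend) and the treatment (a negative experience flag) from controls, then regress the out-of-fold residuals on each other. This is a minimal illustration on synthetic data, using scikit-learn's gradient boosting as a stand-in for LightGBM; all variable names and numbers here are invented for the sketch, not drawn from the actual Amazon model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def dml_effect(X, t, y, n_splits=2, seed=0):
    """Cross-fitted partialling-out (double ML): residualize the outcome y
    and the treatment t on controls X using out-of-fold predictions, then
    regress the y residuals on the t residuals to estimate the effect of t."""
    y_res = np.empty(len(y))
    t_res = np.empty(len(t))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, test in kf.split(X):
        m_y = GradientBoostingRegressor(random_state=seed).fit(X[train], y[train])
        m_t = GradientBoostingRegressor(random_state=seed).fit(X[train], t[train])
        y_res[test] = y[test] - m_y.predict(X[test])
        t_res[test] = t[test] - m_t.predict(X[test])
    return float(t_res @ y_res) / float(t_res @ t_res)

# Synthetic illustration: a "detractor" flag t is confounded by customer
# features X, and the true causal effect on monthly spend y is -8 units.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
t = (X[:, 0] + rng.normal(size=n) > 0).astype(float)  # confounded treatment
y = 100 + 10 * X[:, 0] - 8 * t + rng.normal(size=n)   # true effect: -8

naive = y[t == 1].mean() - y[t == 0].mean()  # biased upward by confounding
effect = dml_effect(X, t, y)                 # recovers roughly -8
print(naive, effect)
```

The naive detractor-vs-promoter comparison comes out positive here because detractors are correlated with a feature that raises spend, while the cross-fitted estimate lands near the true negative effect. That gap is why descriptive trend comparisons and causal driver rankings can tell very different stories.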
Economists reviewed assumptions and limitations throughout. Bias was discussed explicitly. Entitlement calculations were framed conservatively by design.
The ambition was not perfection. It was trust.
The results surprised even those inclined to believe emotion mattered.
Emotion Has Economic Consequences
Customers reporting negative emotional experiences exhibited two distinct economic effects. Among customers who continued to shop, detractors spent meaningfully less than comparable non-detractors. Separately, detractors were significantly more likely to churn altogether.
Focusing only on the direct spend effect among retained customers, detractors spent approximately 7–8% less per month, controlling for prior behavior. If sustained, this implies a similar 7–8% reduction in annual spend per customer, even before accounting for elevated churn risk.
This framing was intentionally conservative. The model did not monetize churn-driven lifetime value loss, nor did it incorporate sentiment spillover effects. Grocery is a high-frequency, high-trust category, and negative experiences are often shared within households and social circles. A single detractor may influence multiple potential customers who never appear in the dataset. That secondary impact—widely understood to be powerful—was deliberately excluded.
The estimate reflected only what could be defended quantitatively: the minimum economic cost of unresolved negative emotion.
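The conservative entitlement framing reduces to simple arithmetic: a sustained monthly spend gap scales linearly into an annual per-customer floor, before churn or word-of-mouth effects. A sketch with purely hypothetical numbers (none of these figures come from the study):

```python
# All inputs below are hypothetical placeholders for illustration only.
avg_monthly_spend = 200.0       # hypothetical average spend ($/month)
detractor_gap = 0.075           # midpoint of the observed 7-8% reduction
retained_detractors = 100_000   # hypothetical cohort of retained detractors

# A sustained monthly gap scales linearly into an annual per-customer loss;
# churn and spillover are deliberately excluded, as in the study's framing.
annual_loss_per_customer = avg_monthly_spend * detractor_gap * 12  # ~$180/yr
cohort_floor = annual_loss_per_customer * retained_detractors      # ~$18M/yr
print(annual_loss_per_customer, cohort_floor)
```

Even with modest placeholder inputs, the defensible minimum is material, which is what made the conservative framing persuasive: everything excluded could only make the true number larger.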
Emotion Is Structured, Not Diffuse
A small number of experiences explained a disproportionate share of loyalty impact:
— Missing items
— Expired or rotten items
— Items packaged with care
— Wrong items
— Damaged items
These drivers accounted for roughly 16–45% of modeled loyalty impact, depending on segment. The model also delivered cost avoidance: once causal drivers were ranked across a field of 100+ variables, the descriptive TNPS analyses that had been consuming analyst cycles across multiple teams became unnecessary overhead.
This directly contradicted the prevailing operational posture. Teams had been chasing dozens of metrics simultaneously, guided by descriptive fluctuations and local narratives. The causal analysis revealed that much of that effort was misallocated.
Emotion was not evenly distributed across experiences. It was concentrated.
Early Emotion Is Fragile
New customers were significantly more emotionally sensitive than repeat customers. Early negative experiences had a larger and more durable impact on future behavior.
Emotion, in this context, was not just signal—it was risk. And risk, once surfaced, could be managed.
The original goal was narrow: to understand whether sentiment could be modeled causally within grocery.
Once the prototype succeeded, a new question emerged: how quickly could the model retrain once the pipeline existed?
The answer was striking. Subsequent training cycles could compress from months to weeks — and in some cases, days.
At that point, the work no longer resembled a bespoke analysis. It began to behave like a repeatable pattern. The underlying structure—sentiment inputs linked to behavioral outcomes, ranked causally—was not grocery-specific. Similar datasets existed elsewhere across Amazon.
The language for this did not yet exist. No one was talking about primitives. But instinctively, it was clear the insight extended beyond a single use case.
Early conversations with science partners and AWS counterparts focused on feasibility, not productization. The conclusion was consistent: the limiting factor was no longer technical.
The findings landed. Leadership was surprised by the strength of the causal signal. The work influenced survey redesign and informed how economics teams thought about pricing and selection.
But moving from insight to production required something the project never had: a clear owner with aligned incentives.
No single executive wanted to absorb the long-term cost required to operationalize the model. The work sat between organizations, none of which owned the end-to-end outcome. Executive turnover compounded the problem. Macroeconomic tightening raised the bar for cross-cutting investments.
In hindsight, an incentive mismatch had been present from the beginning.
I reported to an executive optimized for compliance and execution against known goals. The work required sponsorship from someone oriented toward growth and revenue creation. The misalignment was structural — but decisive.
A peer had warned me early on. Grocery, he said, was prone to executive turnover. Long-horizon bets were fragile there. I heard the warning. I discounted it.
I believed strong data would speak for itself.
It didn’t.
Even excellent insight cannot overcome misaligned incentives at the top.
Despite the growing pains, I am glad I did this work.
It was my first deep exposure to applied data science and causal modeling. More importantly, it clarified something I had long suspected: human emotion is not noise. It is compressed
information about trust and risk, and when treated with care, it can be modeled without stripping it of meaning.
The work also sharpened my understanding of organizations. Solving the right problem is necessary but insufficient. Incentive alignment and executive ownership matter as much as analytical rigor.
This project changed how I choose problems. I began following questions that cut across systems, even when they were politically inconvenient.
Once emotion could be treated as signal here, it became impossible not to notice where else it was being ignored.
This work began as an attempt to improve a single metric. It ended by reframing how customer sentiment itself could be understood.
Emotion is information—directional, economically consequential, and often overlooked. When linked carefully to behavior, sentiment can inform prioritization, resource allocation, and product design, especially in high-frequency categories where small differences compound.
The limiting factor is rarely data. It is whether organizations are willing—and structurally able—to act on what customers are
already telling them.
This case does not argue that sentiment models replace judgment. It argues something narrower and more durable: emotion belongs in the decision-making stack.
At Amazon Grocery, treating emotion as signal clarified which experiences mattered most, where early risk concentrated, and how sentiment translated into economic outcomes. It also exposed a
familiar organizational truth: insight travels only as far as incentives allow.
The work succeeded analytically.
It stalled institutionally.
That outcome was not a failure of modeling, but a constraint of structure.
Emotion, when respected, is not something to be filtered out.
It is one of the most powerful signals we have.
I. Opening Frame — Emotion as Signal
II. The Problem — When Metrics Stop Driving Decisions
III. Protagonist & Mandate — Working in the Whitespace
IV. Hypothesis — Emotion as Causal Input
V. Methodology — Modeling Emotion with Rigor
VI. Findings — What Emotion Revealed
VII. Discovering Scale — When Insight Suggested Reusability
VIII. Organizational Outcome — Incentives Matter More than Insight
IX. Reflection — Judgment Earned
X. Transferable Insight — Emotion as Signal
XI. What This Case Study Ultimately Shows
Closing Signal