
Counterfactuals, LLC

Tyler Hanson


Product A/B Testing (for Startups) is Broken

The power of a properly executed A/B test is frequently wasted on binary validation rather than open-minded learning

2024-01-03

Introduction

Over the last ten years or so, A/B testing has become table stakes for product teams at tech companies; any organization worth its salt is committed to evaluating every product change or new feature with an A/B test to determine whether it has a positive impact on key success metrics. In many cases, these testing strategies are implemented with care by a dedicated data analytics team and are quite rigorous. Unfortunately, in my experience as a data professional, the prevailing culture and incentive structures at most tech companies mean that these efforts are often wasted and don't actually provide any benefit to product development.

A/B testing is frequently used for post-hoc validation of an existing roadmap, rather than an opportunity to learn

A/B tests are so appealing, in part, because they provide the ultimate form of quantitative validation: if the random subset of users exposed to a new feature is more likely to take a preferred action than the control group, and that difference is statistically significant, then we can be confident that the feature was a success. This level of certainty is a stark contrast to the ambiguity inherent in the observational analyses more commonly performed by data analysts. If the right testing infrastructure is in place, an A/B test can also involve a lot less work than an observational analysis.
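To make that concrete, here's a minimal sketch of the kind of significance check that sits behind such a verdict - a two-proportion z-test on conversion rates. The counts and variable names are made up purely for illustration:

```python
# Minimal sketch of the significance check behind a conversion-rate A/B test.
# The counts below are made up for illustration.
from scipy.stats import norm

control_users, control_conversions = 10_000, 500      # 5.0% baseline conversion
treatment_users, treatment_conversions = 10_000, 545  # 5.45% with the new feature

p_c = control_conversions / control_users
p_t = treatment_conversions / treatment_users

# Pooled standard error for the difference in proportions
p_pool = (control_conversions + treatment_conversions) / (control_users + treatment_users)
se = (p_pool * (1 - p_pool) * (1 / control_users + 1 / treatment_users)) ** 0.5

z = (p_t - p_c) / se
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided test

print(f"lift: {p_t - p_c:+.3%}, z = {z:.2f}, p = {p_value:.3f}")
```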

Despite being such a powerful tool, the overall impact of A/B testing is often limited because it is positioned at the end of the product development process, which typically goes something like this:

1. A new feature is added to the product roadmap.
2. The feature is designed and built.
3. The feature is launched to a random subset of users as an A/B test.
4. If the test shows a significant positive effect on success metrics, the feature is rolled out to everyone; otherwise it's rolled back.

This process reduces the consequences of the A/B test to a binary: either the feature is a success and it’s fully rolled out, or it’s a failure and it’s rolled back. The A/B test is merely a final check-box in the product development process to ensure that the new feature is “doing no harm” before moving on to the next item in the roadmap.

This post-hoc approach to evaluating features ignores the true value of A/B testing: the opportunity to collect and learn from high-quality data on user behavior. A more judicious approach to testing would mean designing experiments specifically to generate insights about users and understand which future features might be the most impactful - rather than testing those features only after the time has already been invested to build them. It’s also important to analyze how different user segments responded to the test condition in order to understand why a given test succeeded or failed, rather than treating the result as a binary.
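As a rough illustration of what that segment-level analysis might look like, here's a sketch that assumes per-user results live in a pandas DataFrame with hypothetical variant, segment, and converted columns (the schema and numbers are assumptions, not a prescription):

```python
import pandas as pd

# Hypothetical per-user results: which arm each user saw, whether they
# converted, and a segment label (e.g. new vs. returning users).
df = pd.DataFrame({
    "variant":   ["control", "treatment"] * 4,
    "segment":   ["new", "new", "new", "new",
                  "returning", "returning", "returning", "returning"],
    "converted": [0, 1, 1, 1, 0, 0, 1, 0],
})

# Conversion rate per (segment, variant), then the lift within each segment.
rates = (
    df.groupby(["segment", "variant"])["converted"]
      .mean()
      .unstack("variant")
)
rates["lift"] = rates["treatment"] - rates["control"]
print(rates)
```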

Startup product teams use A/B testing to obsess over tiny changes rather than seeing the big picture

Because an A/B test paints such a binary picture of success or failure for new features, it’s easy to ignore the magnitude of the change. Given a sufficiently large sample size, even a minuscule test effect can be “statistically significant”.

I’ve encountered this frequently when working with smaller startups. A product manager will ask, “how long do we need to run this test in order to see a significant result?” Ultimately, whether or not a result is statistically significant (which is commonly determined via a p-value) is a function of two things: the magnitude of the effect and the sample size observed. With a large sample size, even a small test effect can be detected as statistically significant; with more data in hand, it’s possible to distinguish between a small effect and zero. On the other hand, if the sample size is relatively small, only large test effects will be considered significant.
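This trade-off can be made explicit with the standard sample-size approximation for a two-proportion test; the baseline conversion rate, lifts, and power settings below are hypothetical:

```python
# Rough sketch: users needed per arm to detect a given lift in conversion rate
# with a two-sided two-proportion z-test. All inputs are illustrative.
from scipy.stats import norm

def users_per_arm(baseline, lift, alpha=0.05, power=0.8):
    p1, p2 = baseline, baseline + lift
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / lift ** 2

# A small lift needs orders of magnitude more traffic than a large one.
for lift in (0.001, 0.005, 0.02):
    print(f"lift {lift:+.1%}: ~{users_per_arm(0.05, lift):,.0f} users per arm")
```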

So when someone asks me “how long do we need to run this test,” this context is crucial: running a test for a longer period of time to collect additional data only helps detect test effects that are very small. If an A/B test doesn’t return a significant positive result within a short period of time, we can already conclude that a large test effect is unlikely!
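The same approximation can be inverted: given the traffic a test has already collected, what is the smallest lift it could plausibly have detected? Again, the traffic numbers here are hypothetical:

```python
# Rough sketch: the minimum detectable lift for a given amount of traffic,
# using the same normal-approximation assumptions as above. Inputs are illustrative.
from scipy.stats import norm

def min_detectable_lift(baseline, users_per_arm, alpha=0.05, power=0.8):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    # Approximate both arms' variance with the baseline rate.
    return z * (2 * baseline * (1 - baseline) / users_per_arm) ** 0.5

# If a week of traffic gives ~5,000 users per arm, only fairly large lifts
# are detectable; running another month mostly buys sensitivity to tiny ones.
for n in (5_000, 20_000, 200_000):
    print(f"{n:>7,} users per arm: detectable lift >= {min_detectable_lift(0.05, n):.2%}")
```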

This distinction between large and small effects is particularly relevant for low-scale startups. These startups inherit a lot of A/B testing dogma from larger tech companies like Google or Facebook, either because the product managers working for the startup used to work for larger companies, or because the practices of these companies are widely publicized. However, large tech behemoths are operating in a fundamentally different landscape when it comes to testing product changes. Since they have such huge scale, they can easily collect statistically significant results for tests regardless of how small the test effect is. Additionally, even a tiny positive (or negative) effect from a new feature can translate into a huge difference in revenue or usage - more than enough to offset the salary of the product manager squinting at the results.
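A back-of-the-envelope calculation shows why the same lift means very different things at different scales; every number here is made up for illustration:

```python
# Back-of-the-envelope: what a +0.2% conversion lift is worth at different scales.
# All numbers below are hypothetical.
lift = 0.002                 # absolute lift in conversion rate
revenue_per_conversion = 50  # dollars, hypothetical

for label, monthly_users in [("startup", 20_000), ("big tech product", 200_000_000)]:
    extra_revenue = monthly_users * lift * revenue_per_conversion
    print(f"{label:>16}: ~${extra_revenue:,.0f} per month from a +0.2% lift")
```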

For low-scale startups, this same approach takes the product team in all the wrong directions - they become conditioned to wait long periods of time to determine whether a feature was a success, when those small margins of success will only translate into a tiny amount of additional revenue or usage anyway.

Product teams are not prepared to actually learn from “failed” tests

Ultimately, the vast majority of product A/B tests will indicate that the new feature has no impact on success metrics, positive or negative; only the rare few will actually result in a meaningful improvement.

In an ideal world, this would provoke an existential crisis on the part of product managers. After all, each new feature should be developed with the intention that it improves the user experience in a measurable way and furthers the company’s path to revenue. If A/B testing concludes that these features don’t actually matter, that should prompt a re-evaluation of the roadmap.

Unfortunately, the post-hoc testing process that I described above greatly limits the ability to introspect on failed A/B tests. On multiple occasions I’ve told a product manager that a new feature had a neutral or negative result, only to be told “well, we need to roll this feature out to keep making progress on the roadmap.”

“Failed” A/B tests are just as valuable as successful ones in terms of what can be learned about users, but in order to reap that value the results of the test need to be properly internalized. Product managers need to be more existential and ask themselves:

- Why did we believe this feature would improve the user experience in a measurable way?
- What does this result tell us about our users and what they actually value?
- Does the rest of the roadmap rest on the same assumptions that this test just called into question?

This requires a degree of flexibility around the product roadmap to pivot and react to learnings, as well as a greater degree of introspection at the beginning of the product development process. In most cases, it will become clear that there never really was any ex-ante evidence that the new feature was going to be successful. That doesn’t mean product teams shouldn’t take risks on new features, but they should be able to recognize when they are making a risky bet and be prepared to adjust in the likely event that the bet doesn’t pay off.

Conclusion

A/B testing still has the potential to be an incredibly valuable tool for product analytics, but the culture around its implementation needs to change. Instead of treating testing as an obligation to validate features and generate certainty, product teams need to see it as an opportunity to derive insights and strategic direction that can’t easily be determined via observational analysis.