Over the last ten years or so, A/B testing has become table stakes for product teams at tech companies; any organization worth its salt is committed to evaluating every product change or new feature with an A/B test to determine whether it has a positive impact on key success metrics. In many cases, these testing strategies are implemented with care by a dedicated data analytics team and are quite rigorous. Unfortunately, in my experience as a data professional, the prevailing culture and incentive structures at most tech companies mean that these efforts are often wasted and don't actually provide any benefit to product development.
A/B tests are so appealing, in part, because they provide the ultimate form of quantitative validation: if the random subset of users exposed to a new feature is more likely to take a preferred action, and this result is statistically significant, then we can be confident that the feature was a success. This level of certainty is a stark contrast to the ambiguity inherent in the observational analyses more commonly performed by data analysts. If the right testing infrastructure is in place, an A/B test can also involve a lot less work than an observational analysis.
Despite being such a powerful tool, A/B testing often has limited overall impact because it is positioned at the end of the product development process, which typically goes something like this:

1. A feature is added to the roadmap and prioritized.
2. The feature is designed and built.
3. An A/B test is launched to validate the finished feature.
4. The result is reviewed, and the feature is rolled out.
This post-hoc approach to evaluating features ignores the true value of A/B testing: the opportunity to collect and learn from high-quality data on user behavior. A more judicious approach to testing would mean designing experiments specifically to generate insights on users and understand which future features might be the most impactful - rather than testing those features after the time has already been invested to build them. It’s also important to analyze how different user segments responded to the test condition in order to understand why a given test resulted in success or failure, rather than treating the test result as a binary.
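As an illustration, here is a minimal sketch of a segment-level readout, assuming a hypothetical per-user export with `variant`, `segment`, and `converted` columns (all of these names and values are invented for the example):

```python
import pandas as pd

# Hypothetical experiment export: one row per user, with the assigned variant,
# a segment label (e.g. acquisition channel or tenure), and a conversion flag.
df = pd.DataFrame({
    "variant":   ["control", "treatment"] * 4,
    "segment":   ["new", "new", "returning", "returning"] * 2,
    "converted": [0, 1, 1, 1, 0, 0, 1, 0],
})

# Conversion rate per (segment, variant), then the lift within each segment.
rates = (
    df.groupby(["segment", "variant"])["converted"]
      .mean()
      .unstack("variant")
)
rates["lift"] = rates["treatment"] - rates["control"]
print(rates)
```

Even when the overall result is flat, a breakdown like this can reveal that one segment responded strongly while another dragged the average down - exactly the kind of insight that should feed the next iteration of the roadmap.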
Because an A/B test paints such a binary picture of success or failure for new features, it’s easy to ignore the magnitude of the change. Given a sufficiently large sample size, it’s possible for even a minuscule test effect to be “statistically significant”.
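To make that concrete, here is a minimal sketch (the baseline rate, lift, and traffic numbers are invented for illustration) showing that a 0.1-point lift on a 10% baseline becomes overwhelmingly “significant” once each arm has millions of users:

```python
from statsmodels.stats.proportion import proportions_ztest

# Invented numbers: a 10.0% baseline conversion rate, a 0.1-point absolute lift,
# and five million users in each arm of the test.
n = 5_000_000
conversions = [int(0.101 * n), int(0.100 * n)]  # treatment, control
exposures = [n, n]

z_stat, p_value = proportions_ztest(conversions, exposures)
print(f"z = {z_stat:.2f}, p = {p_value:.2g}")  # p-value far below 0.05 for a ~1% relative lift
```

The test “passes” decisively, even though the underlying change is a one-percent relative improvement that may not justify the cost of building the feature.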
I’ve encountered this frequently when working with smaller startups. A product manager will ask, “how long do we need to run this test in order to see a significant result?” Ultimately, whether or not a result is statistically significant (which is commonly determined via a p-value) is a function of two things: the magnitude of the effect and the sample size observed. With a large sample size, even a small test effect can be detected as statistically significant; with more data in hand, it’s possible to distinguish between a small effect and zero. On the other hand, if the sample size is relatively small, only large test effects will be considered significant.
If someone asks me “how long do we need to run this test,” I consider this context to be very important, because running a test for a longer period of time to collect additional data only helps if the test effect is very small; a large effect would have reached significance quickly. If an A/B test doesn’t return a significant positive result within a short period of time, we can already conclude that a large test effect is unlikely!
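A standard power calculation makes this relationship explicit. The sketch below (the baseline rate and candidate lifts are assumptions for illustration) estimates how many users each arm needs before a given absolute lift on a 10% conversion baseline becomes detectable at conventional settings (α = 0.05, 80% power):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10            # assumed baseline conversion rate
alpha, power = 0.05, 0.80  # conventional significance level and power

# Required sample size per arm shrinks dramatically as the lift gets larger.
for lift in [0.001, 0.005, 0.01, 0.02]:  # absolute lift in conversion rate
    effect = proportion_effectsize(baseline + lift, baseline)  # Cohen's h
    n_per_arm = NormalIndPower().solve_power(
        effect_size=effect, alpha=alpha, power=power, ratio=1.0
    )
    print(f"lift of {lift:.3f}: ~{n_per_arm:,.0f} users per arm")
```

The required sample size scales roughly with the inverse square of the effect size, so detecting a lift half as large takes about four times as many users - which is why a big effect shows up fast and a long, inconclusive test points to a small one.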
This distinction between large and small effects is particularly relevant for low-scale startups. These startups inherit a lot of A/B testing dogma from larger tech companies like Google or Facebook, either because the product managers working for the startup used to work for larger companies, or because the practices of these companies are widely publicized. However, large tech behemoths are operating in a fundamentally different landscape when it comes to testing product changes. Since they have such huge scale, they can easily collect statistically significant results for tests regardless of how small the test effect is. Additionally, even a tiny positive (or negative) effect from a new feature can translate into a huge difference in revenue or usage - more than enough to offset the salary of the product manager squinting at the results.
For low-scale startups, this same approach takes the product team in all of the wrong directions - they become conditioned to waiting long periods of time to conclude whether or not a feature was a success, when margins of success that small will only translate into a tiny amount of additional revenue or usage anyway.
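To put rough numbers on the scale gap, here is a sketch using the standard normal-approximation formula with Cohen's arcsine effect size; the baseline rate, traffic figures, and two-week test window are all assumptions for illustration:

```python
import numpy as np
from scipy.stats import norm

baseline = 0.10  # assumed baseline conversion rate

def min_detectable_lift(weekly_users, weeks=2, alpha=0.05, power=0.80):
    """Smallest absolute lift detectable with a 50/50 split over the test window."""
    n_per_arm = weekly_users * weeks / 2
    # Required Cohen's h for a two-sided test at the given alpha and power.
    h = (norm.ppf(1 - alpha / 2) + norm.ppf(power)) * np.sqrt(2 / n_per_arm)
    # Convert the detectable h back into an absolute lift over the baseline rate.
    p_treatment = np.sin(np.arcsin(np.sqrt(baseline)) + h / 2) ** 2
    return p_treatment - baseline

for label, weekly_users in [("low-scale startup", 20_000), ("tech behemoth", 50_000_000)]:
    print(f"{label}: minimum detectable lift ≈ {min_detectable_lift(weekly_users):.4f}")
```

Under these assumptions, the startup can only hope to detect a lift on the order of a full percentage point in two weeks, while the behemoth can resolve lifts roughly fifty times smaller - the kind of effect that is financially meaningful at their scale but a rounding error at a startup's.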
Ultimately, the vast majority of product A/B tests will indicate that the new feature has no impact on success metrics, positive or negative; only the rare few will actually result in a meaningful improvement.
In an ideal world, this would provoke an existential crisis on the part of product managers. After all, each new feature should be developed with the intention that it improves the user experience in a measurable way and furthers the company’s path to revenue. If A/B testing concludes that these features don’t actually matter, that should prompt a re-evaluation of the roadmap.
Unfortunately, the post-hoc testing process that I described above greatly limits the ability to introspect on failed A/B tests. On multiple occasions I’ve told a product manager that a new feature had a neutral or negative result, only to be told “well, we need to roll this feature out to keep making progress on the roadmap.”
“Failed” A/B tests are just as valuable as successful ones in terms of what can be learned about users, but in order to reap that value the results of the test need to be properly internalized. Product managers need to be more existential and ask themselves:

- Why didn’t users respond to this feature the way we expected?
- What does this result tell us about our users and what they actually need?
- Should the rest of the roadmap change in light of what we learned?
A/B testing still has the potential to be an incredibly valuable tool for product analytics, but the culture around its implementation needs to change. Instead of treating testing as an obligation to validate features and generate certainty, product teams need to see it as an opportunity to derive insights and strategic direction that can’t easily be determined via observational analysis.