Los Altos, California, United States
32K followers
500+ connections
About
Consulting and teaching…
Articles by Ron
-
Should you suggest or enforce a template for hypotheses in A/B tests?
By Ron Kohavi
-
When should you use quasi-experiments instead of controlled experiments, or A/B tests? The barometer question analogy
By Ron Kohavi
Activity
-
Seen this? Love how the Kameleoon list of thought leaders in experimentation is helping surface other should-be-seen lists. This list of…
Liked by Ron Kohavi
-
CRO News | Weekly Roundup | July 22nd The best of #ABtesting and #optimization on #LinkedIn last week 👇 💬 It's…
Liked by Ron Kohavi
-
❗ FACT: The most problematic form field is "Password" (we have Zuko Analytics data to back this up) ❗ As part of a series on optimizing the #ux of…
Liked by Ron Kohavi
Experience & Education
Licenses & Certifications
Publications
-
Statistical Challenges in Online Controlled Experiments: A Review of A/B Testing Methodology
The American Statistician
The rise of internet-based services and products in the late 1990s brought about an unprecedented opportunity for online businesses to engage in large scale data-driven decision making. Over the past two decades, organizations such as Airbnb, Alibaba, Amazon, Baidu, Booking.com, Alphabet’s Google, LinkedIn, Lyft, Meta’s Facebook, Microsoft, Netflix, Twitter, Uber, and Yandex have invested tremendous resources in online controlled experiments (OCEs) to assess the impact of innovation on their customers and businesses. Running OCEs at scale has presented a host of challenges requiring solutions from many domains. In this article we review challenges that require new statistical methodologies to address them. In particular, we discuss the practice and culture of online experimentation, as well as its statistics literature, placing the current methodologies within their relevant statistical lineages and providing illustrative examples of OCE applications. Our goal is to raise academic statisticians’ awareness of these new research opportunities to increase collaboration between academia and the online industry.
-
Online Controlled Experiments and A/B Tests
Springer, New York, NY
Many good resources are available with motivation and explanations about online controlled experiments (Kohavi et al. 2009a, 2020; Thomke 2020; Luca and Bazerman 2020; Georgiev 2018, 2019; Kohavi and Thomke 2017; Siroker and Koomen 2013; Goward 2012; Schrage 2014; King et al. 2017; McFarland 2012; Manzi 2012; Tang et al. 2010). For organizations running online controlled experiments at scale, Gupta et al. (2019) provide an advanced set of challenges. We provide a motivating visual example of a controlled experiment that ran at Microsoft’s Bing. The team wanted to add a feature allowing advertisers to provide links to the target site. The rationale is that this will improve ads quality by giving users more information about what the advertiser’s site provides and allow users to directly navigate to the sub-category matching their intent. Visuals of the existing ads layout (Control) and the new ads layout (Treatment) with site links added are shown in Fig. 1.
-
A/B Testing Intuition Busters: Common Misunderstandings in Online Controlled Experiments
Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’22)
A/B tests, or online controlled experiments, are heavily used in industry to evaluate implementations of ideas. While the statistics behind controlled experiments are well documented and some basic pitfalls known, we have observed some seemingly intuitive concepts being touted, including by A/B tool vendors and agencies, which are misleading, often badly so. Our goal is to describe these misunderstandings, the “intuition” behind them, and to explain and bust that intuition with solid statistical reasoning. We provide recommendations that experimentation platform designers can implement to make it harder for experimenters to make these intuitive mistakes.
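One recurring family of misunderstandings the paper targets concerns how p-values behave in A/B comparisons. As a minimal sketch (not code from the paper, using hypothetical conversion counts), the standard two-proportion z-test behind a typical A/B comparison looks like this:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 2,000 conversions of 100,000 users in control,
# 2,150 of 100,000 in treatment (a 7.5% relative lift on a 2% baseline).
z, p = two_proportion_z_test(2000, 100_000, 2150, 100_000)
```

With these hypothetical numbers the lift is significant at the 0.05 level but not at 0.01, a reminder that a seemingly large relative lift can still be statistically marginal at realistic traffic.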
-
Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing
Cambridge University Press
Getting numbers is easy; getting numbers you can trust is hard. This practical guide by experimentation leaders at Google, LinkedIn, and Microsoft will teach you how to accelerate innovation using trustworthy online controlled experiments, or A/B tests.
-
Online randomized controlled experiments at scale: lessons and extensions to medicine
Trials 21, 150 (2020)
Many technology companies, including Airbnb, Amazon, Booking.com, eBay, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, Uber, and Yahoo!/Oath, run online randomized controlled experiments at scale, namely hundreds of concurrent controlled experiments on millions of users each, commonly referred to as A/B tests. Originally derived from the same statistical roots, randomized controlled trials (RCTs) in medicine are now criticized for being expensive and difficult, while in technology, the marginal cost of such experiments is approaching zero and the value for data-driven decision-making is broadly recognized.
-
Top Challenges from the first Practical Online Controlled Experiments Summit
SIGKDD Explorations
Online controlled experiments (OCEs), also known as A/B tests, have become ubiquitous in evaluating the impact of changes made to software products and services. While the concept of online controlled experiments is simple, there are many practical challenges in running OCEs at scale that encourage further academic and industrial exploration. To understand the top practical challenges in running OCEs at scale, representatives with experience in large-scale experimentation from thirteen different organizations (Airbnb, Amazon, Booking.com, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, Uber, Yandex, and Stanford University) were invited to the first Practical Online Controlled Experiments Summit. All thirteen organizations sent representatives. Together these organizations tested more than one hundred thousand experiment treatments last year. Thirty-four experts from these organizations participated in the summit in Sunnyvale, CA, USA on December 13-14, 2018.
While there are papers from individual organizations on some of the challenges and pitfalls in running OCEs at scale, this is the first paper to provide the top challenges faced across the industry for running OCEs at scale and some common solutions.
[LinkedIn limits authors to 10, and I can't even list them here because I exceed the description size limit.]
-
The Surprising Power of Online Experiments
Harvard Business Review
Today, Microsoft and several other leading companies conduct more than 10,000 online controlled experiments annually, with many tests engaging millions of users. These organizations have discovered that an “experiment with everything” approach has surprisingly large payoffs.
At a time when the web is vital to almost all businesses, rigorous online experiments should be standard operating procedure. If a company develops the software infrastructure and organizational skills to conduct them, it will be able to assess not only ideas for websites but also potential business models, strategies, products, services, and marketing campaigns—all relatively inexpensively. Controlled experiments can transform decision making into a scientific, evidence-driven process—rather than an intuitive reaction. Without them, many breakthroughs might never happen, and many bad ideas would be implemented, only to fail, wasting resources.
-
Pitfalls of Long-Term Online Controlled Experiments
IEEE Big Data 2016
Online controlled experiments (e.g., A/B tests) are now regularly used to guide product development and accelerate innovation in software. Product ideas are evaluated as scientific hypotheses, and tested on web sites, mobile applications, desktop applications, services, and operating system features.
One of the key challenges for organizations that run controlled experiments is to select an Overall Evaluation Criterion (OEC), i.e., the criterion by which to evaluate the different variants. The difficulty is that short-term changes to metrics may not predict the long-term impact of a change. For example, raising prices likely increases short-term revenue but also likely reduces long-term revenue (customer lifetime value) as users abandon. Degrading search results in a Search Engine causes users to search more, thus increasing query share short-term, but increasing abandonment and thus reducing long-term customer lifetime value. Ideally, an OEC is based on metrics in a short-term experiment that are good predictors of long-term value.
To assess long-term impact, one approach is to run long-term controlled experiments and assume that long-term effects are represented by observed metrics. In this paper we share several examples of long-term experiments and the pitfalls associated with running them. We discuss cookie stability, survivorship bias, selection bias, and perceived trends, and share methodologies that can be used to partially address some of these issues.
While there is clearly value in evaluating long-term trends, experimenters running long-term experiments must be cautious, as results may be due to the above pitfalls more than the true delta between the Treatment and Control. We hope our real examples and analyses will sensitize readers to the issues and encourage the development of new methodologies for this important problem.
-
Pitfalls in Online Controlled Experiments
MIT CODE: Conference On Digital Experimentation
It's easy to run a controlled experiment and compute a p-value with five digits after the decimal point. While getting such precise numbers is easy, getting numbers you can trust is much harder. We share practical pitfalls from online controlled experiments across multiple groups at Microsoft.
-
Challenging Problems in Online Controlled Experiments
The Conference on Digital Experimentation @ MIT (CODE 2015)
Online controlled experiments are now widely run in the software industry. I share several challenging problems and motivate their importance. These include high-variance metrics, issues with p-values, metric-driven vs. design-driven decisions, novelty effects, and leaks.
-
Online Controlled Experiments: Lessons from Running A/B/n Tests for 12 years
KDD 2015 Invited Keynote
The Internet provides developers of connected software, including web sites, applications, and devices, an unprecedented opportunity to accelerate innovation by evaluating ideas quickly and accurately using trustworthy controlled experiments (e.g., A/B tests and their generalizations). From front-end user-interface changes to backend recommendation systems and relevance algorithms, from search engines (e.g., Google, Microsoft’s Bing, Yahoo) to retailers (e.g., Amazon, eBay, Netflix, Etsy) to social networking services (e.g., Facebook, LinkedIn, Twitter) to travel services (e.g., Expedia, Airbnb, Booking.com) to many startups, online controlled experiments are now utilized to make data-driven decisions at a wide range of companies. While the theory of a controlled experiment is simple, and dates back to Sir Ronald A. Fisher’s experiments at the Rothamsted Agricultural Experimental Station in England in the 1920s, the deployment and mining of online controlled experiments at scale (e.g., hundreds of experiments run every day at Bing) and deployment of online controlled experiments across dozens of web sites and applications has taught us many lessons. We provide an introduction, share real examples, key lessons, and cultural challenges.
-
Seven Rules of Thumb for Web Site Experimenters
KDD 2014
Web site owners, from small web sites to the largest properties that include Amazon, Facebook, Google, LinkedIn, Microsoft, and Yahoo, attempt to improve their web sites, optimizing for criteria ranging from repeat usage, time on site, to revenue. Having been involved in running thousands of controlled experiments at Amazon, Booking.com, LinkedIn, and multiple Microsoft properties, we share seven rules of thumb for experimenters, which we have generalized from these experiments and their results. These are principles that we believe have broad applicability in web optimization and analytics outside of controlled experiments, yet they are not provably correct, and in some cases exceptions are known.
To support these rules of thumb, we share multiple real examples, most being shared in a public paper for the first time. Some rules of thumb have previously been stated, such as “speed matters,” but we describe the assumptions in the experimental design and share additional experiments that improved our understanding of where speed matters more: certain areas of the web page are more critical.
This paper serves two goals. First, it can guide experimenters with rules of thumb that can help them optimize their sites. Second, it provides the KDD community with new research challenges on the applicability, exceptions, and extensions to these, one of the goals for KDD’s industrial track.
-
Online Controlled Experiments at Large Scale
KDD 2013
Web-facing companies, including Amazon, eBay, Etsy, Facebook, Google, Groupon, Intuit, LinkedIn, Microsoft, Netflix, Shop Direct, StumbleUpon, Yahoo, and Zynga use online controlled experiments to guide product development and accelerate innovation. At Microsoft’s Bing, the use of controlled experiments has grown exponentially over time, with over 200 concurrent experiments now running on any given day. Running experiments at large scale requires addressing multiple challenges in three areas: cultural/organizational, engineering, and trustworthiness. On the cultural and organizational front, the larger organization needs to learn the reasons for running controlled experiments and the tradeoffs between controlled experiments and other methods of evaluating ideas. We discuss why negative experiments, which degrade the user experience short term, should be run, given the learning value and long-term benefits. On the engineering side, we architected a highly scalable system, able to handle data at massive scale: hundreds of concurrent experiments, each containing millions of users. Classical testing and debugging techniques no longer apply when there are millions of live variants of the site, so alerts are used to identify issues rather than relying on heavy up-front testing. On the trustworthiness front, we have a high occurrence of false positives that we address, and we alert experimenters to statistical interactions between experiments. The Bing Experimentation System is credited with having accelerated innovation and increased annual revenues by hundreds of millions of dollars, by allowing us to find and focus on key ideas evaluated through thousands of controlled experiments. A 1% improvement to revenue equals $10M annually in the US, yet many ideas impact key metrics by 1% and are not well estimated a-priori. The system has also identified many negative features that we avoided deploying, despite key stakeholders’ early excitement, saving us similar large amounts
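The trustworthiness point above (a high occurrence of false positives when hundreds of experiments run concurrently) follows directly from the arithmetic of significance testing. A hedged illustration, not the paper's method, with a hypothetical count of 200 concurrent experiments:

```python
# With k independent tests at significance level alpha and no real
# effects at all, false positives arrive as a matter of course.
alpha, k = 0.05, 200  # hypothetical: 200 concurrent experiments

expected_false_positives = alpha * k       # 10 "wins" expected by chance
p_at_least_one = 1 - (1 - alpha) ** k      # near-certain

# A crude guard: Bonferroni correction tests each at alpha / k,
# at the cost of much lower power for every individual experiment.
bonferroni_level = alpha / k
```

The abstract notes that the Bing team addresses this in its platform; the numbers above only show why some guard, such as replication of surprising results, is needed at scale.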
-
Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data
WSDM 2013: The Sixth ACM International Conference on Web Search and Data Mining
Online controlled experiments are at the heart of making data-driven decisions at a diverse set of companies, including Amazon, eBay, Facebook, Google, Microsoft, Yahoo, and Zynga. Small differences in key metrics, on the order of fractions of a percent, may have very significant business implications. At Bing it is not uncommon to see experiments that impact annual revenue by millions of dollars, even tens of millions of dollars, either positively or negatively. With thousands of experiments being run annually, improving the sensitivity of experiments allows for more precise assessment of value, or equivalently running the experiments on smaller populations (supporting more experiments) or for shorter durations (improving the feedback cycle and agility). We propose an approach (CUPED) that utilizes data from the pre-experiment period to reduce metric variability and hence achieve better sensitivity. This technique is applicable to a wide variety of key business metrics, and it is practical and easy to implement. The results on Bing’s experimentation system are very successful: we can reduce variance by about 50%, effectively achieving the same statistical power with only half of the users, or half the duration.
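The core CUPED adjustment described above is small enough to sketch. This is an illustrative simulation on synthetic data (not the paper's code): the in-experiment metric Y is adjusted by a pre-experiment covariate X using theta = cov(Y, X) / var(X), which leaves the treatment effect unbiased while shrinking variance by roughly the squared correlation between X and Y.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic users: the pre-experiment metric correlates strongly with
# the in-experiment metric; treatment adds a small true effect of 0.1.
pre = rng.normal(10.0, 2.0, n)         # pre-experiment covariate X
treatment = rng.integers(0, 2, n)      # random 50/50 assignment
y = pre + rng.normal(0.0, 1.0, n) + 0.1 * treatment

# CUPED: subtract the part of Y that is predictable from X.
theta = np.cov(y, pre)[0, 1] / np.var(pre, ddof=1)
y_cuped = y - theta * (pre - pre.mean())

variance_reduction = 1.0 - np.var(y_cuped) / np.var(y)
effect = y_cuped[treatment == 1].mean() - y_cuped[treatment == 0].mean()
```

The paper reports roughly 50% variance reduction on Bing with real pre-experiment data; the larger reduction in this simulation is an artifact of the synthetic correlation, which is higher than typical.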
-
Online Controlled Experiments: Introduction, Learnings, and Humbling Statistics
Sixth ACM Conference on Recommender Systems
The web provides an unprecedented opportunity to accelerate innovation by evaluating ideas quickly and accurately using controlled experiments (e.g., A/B tests and their generalizations). Whether for front-end user-interface changes, or backend recommendation systems and relevance algorithms, online controlled experiments are now utilized to make data-driven decisions at Amazon, Microsoft, eBay, Facebook, Google, Yahoo, Zynga, and at many other companies. While the theory of a controlled experiment is simple, and dates back to Sir Ronald A. Fisher's experiments at the Rothamsted Agricultural Experimental Station in England in the 1920s, the deployment and mining of online controlled experiments at scale-thousands of experiments now-has taught us many lessons. We provide an introduction, share real examples, key learnings, cultural challenges, and humbling statistics.
-
Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
KDD 2012
Online controlled experiments are often utilized to make data-driven decisions at Amazon, Microsoft, eBay, Facebook, Google, Yahoo, Zynga, and at many other companies. While the theory of a controlled experiment is simple, and dates back to Sir Ronald A. Fisher’s experiments at the Rothamsted Agricultural Experimental Station in England in the 1920s, the deployment and mining of online controlled experiments at scale—thousands of experiments now—has taught us many lessons. These exemplify the proverb that the difference between theory and practice is greater in practice than in theory. We present our learnings as they happened: puzzling outcomes of controlled experiments that we analyzed deeply to understand and explain. Each of these took multiple-person weeks to months to properly analyze and get to the often surprising root cause. The root causes behind these puzzling results are not isolated incidents; these issues generalized to multiple experiments. The heightened awareness should help readers increase the trustworthiness of the results coming out of controlled experiments. At Microsoft’s Bing, it is not uncommon to see experiments that impact annual revenue by millions of dollars, thus getting trustworthy results is critical and investing in understanding anomalies has tremendous payoff: reversing a single incorrect decision based on the results of an experiment can fund a whole team of analysts. The topics we cover include: the OEC (Overall Evaluation Criterion), click tracking, effect trends, experiment length and power, and carryover effects.
-
Online Experiments: Practical Lessons
IEEE Computer, Vol 43, issue 9, pp. 82-85
When running online experiments, getting numbers is easy; getting numbers you can trust is hard.
-
Controlled Experiments on the Web: Survey and Practical Guide
Data Mining and Knowledge Discovery journal, Vol 18(1), p. 140-181
The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments, A/B tests (and their generalizations), split tests, Control/Treatment tests, MultiVariable Tests (MVT), and parallel flights. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. We provide a practical guide to conducting online experiments, where end users can help guide the development of features. Our experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person’s Opinion (HiPPO). We provide several examples of controlled experiments with surprising results. We review the important ingredients of running controlled experiments, and discuss their limitations (both technical and organizational). We focus on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction. We describe common architectures for experimentation systems and analyze their advantages and disadvantages. We evaluate randomization and hashing techniques, which we show are not as simple in practice as is often assumed. Controlled experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements. Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses. Based on our extensive practical experience with multiple systems and organizations, we share key lessons that will help practitioners in running trustworthy controlled experiments.
-
Online Experimentation at Microsoft
Microsoft ThinkWeek Paper, recognized as top 30
Controlled experiments, also called randomized experiments and A/B tests, have had a profound influence on multiple fields, including medicine, agriculture, manufacturing, and advertising. Through randomization and proper design, experiments allow establishing causality scientifically, which is why they are the gold standard in drug tests. In software development, multiple techniques are used to define product requirements; controlled experiments provide a valuable way to assess the impact of new features on customer behavior. At Microsoft, we have built the capability for running controlled experiments on web sites and services, thus enabling a more scientific approach to evaluating ideas at different stages of the planning process. In our previous papers, we did not have good examples of controlled experiments at Microsoft; now we do! The humbling results we share bring to question whether a-priori prioritization is as good as most people believe it is. The Experimentation Platform (ExP) was built to accelerate innovation through trustworthy experimentation. Along the way, we had to tackle both technical and cultural challenges and we provided software developers, program managers, and designers the benefit of an unbiased ear to listen to their customers and make data-driven decisions. A technical survey of the literature on controlled experiments was recently published by us in a journal (Kohavi, Longbotham, Sommerfield, & Henne, 2009). The goal of this paper is to share lessons and challenges focused more on the cultural aspects and the value of controlled experiments.
-
An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants
Machine Learning journal, Vol 36, Nos. 1/2, pages 105-139
Methods for voting classication algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classiers for articial and realworld datasets. We review these algorithms and describe a large empirical study comparing several variants in conjunction with a decision tree inducer (three variants) and a Naive-Bayes inducer. The purpose of the study is to improve our understanding of why and when these algorithms,which use perturbation…
Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets. We review these algorithms and describe a large empirical study comparing several variants in conjunction with a decision tree inducer (three variants) and a Naive-Bayes inducer. The purpose of the study is to improve our understanding of why and when these algorithms, which use perturbation, reweighting, and combination techniques, affect classification error. We provide a bias and variance decomposition of the error to show how different methods and variants influence these two terms. This allowed us to determine that Bagging reduced the variance of unstable methods, while boosting methods (AdaBoost and Arc-x4) reduced both the bias and variance of unstable methods but increased the variance for Naive-Bayes, which was very stable. We observed that Arc-x4 behaves differently than AdaBoost if reweighting is used instead of resampling, indicating a fundamental difference. Voting variants, some of which are introduced in this paper, include: pruning versus no pruning, use of probabilistic estimates, weight perturbations (Wagging), and backfitting of data. We found that Bagging improves when probabilistic estimates in conjunction with no-pruning are used, as well as when the data was backfit. We measure tree sizes and show an interesting positive correlation between the increase in the average tree size in AdaBoost trials and its success in reducing the error. We compare the mean-squared error of voting methods to non-voting methods and show that the voting methods lead to large and significant reductions in the mean-squared errors. Practical problems that arise in implementing boosting algorithms are explored, including numerical instabilities and underflows.
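As a refresher on the first technique compared, here is a minimal sketch of Bagging: each round trains a base learner (a one-dimensional decision stump, an unstable learner of the kind the paper discusses) on a bootstrap resample, and predictions are made by majority vote. Function names and data are illustrative, not from the paper:

```python
import random
from collections import Counter

def train_stump(points):
    """Fit a 1-D decision stump: pick the threshold and side labels
    with the fewest training errors."""
    best = None
    for t in sorted({x for x, _ in points}):
        for lo, hi in ((0, 1), (1, 0)):
            err = sum((lo if x <= t else hi) != y for x, y in points)
            if best is None or err < best[0]:
                best = (err, t, lo, hi)
    _, t, lo, hi = best
    return lambda x: lo if x <= t else hi

def bagged_predict(points, x, n_rounds=25, seed=0):
    """Bagging: train one stump per bootstrap resample, then majority-vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_rounds):
        resample = [rng.choice(points) for _ in points]  # sample with replacement
        votes[train_stump(resample)(x)] += 1
    return votes.most_common(1)[0][0]

# Illustrative 1-D data: the label is 1 exactly when x > 5.
data = [(i, int(i > 5)) for i in range(10)]
```

Averaging over resamples is what reduces the variance of the unstable base learner, the effect the bias/variance decomposition in the paper quantifies.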
-
Wrappers for Feature Subset Selection
Artificial Intelligence journal (97)
In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider how the algorithm and the training set interact. We explore the relation between optimal feature subset selection and relevance. Our wrapper method searches for an optimal feature subset tailored to a particular algorithm and a domain. We study the strengths and weaknesses of the wrapper approach and show a series of improved designs. We compare the wrapper approach to induction without feature subset selection and to Relief, a filter approach to feature subset selection. Significant improvement in accuracy is achieved for some datasets for the two families of induction algorithms used: decision trees and Naive Bayes.
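The wrapper idea — scoring candidate feature subsets with the target learning algorithm itself — can be sketched as a greedy forward search. The `toy_eval` scorer below is a hypothetical stand-in for the cross-validated accuracy a real wrapper would compute:

```python
def forward_select(features, evaluate):
    """Greedy wrapper search: grow the selected feature subset as long as
    the learner's estimated accuracy (via `evaluate`) keeps improving."""
    selected, best = [], evaluate([])
    while True:
        gains = [(evaluate(selected + [f]), f)
                 for f in features if f not in selected]
        if not gains:
            break
        score, f = max(gains)
        if score <= best:
            break
        selected.append(f)
        best = score
    return selected, best

def toy_eval(subset):
    """Hypothetical accuracy estimate: rewards two relevant features and
    mildly penalizes irrelevant ones (stands in for cross-validation)."""
    relevant = {"age", "income"}
    return 0.5 + 0.2 * len(relevant & set(subset)) - 0.05 * len(set(subset) - relevant)
```

Because every candidate subset is scored by (a proxy for) the induction algorithm itself, the search is tailored to that algorithm and dataset — the defining property of the wrapper approach, in contrast to filters like Relief.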
-
A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection
IJCAI
We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment -- over half a million runs of C4.5 and a Naive-Bayes algorithm -- to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation, we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.
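The recommended procedure, ten-fold stratified cross-validation, can be sketched as follows; `stratified_folds` deals each class out round-robin so class proportions are preserved in every fold (an illustrative implementation, not the paper's code):

```python
import random
from collections import defaultdict

def stratified_folds(labeled, k=10, seed=0):
    """Split (example, label) pairs into k folds, dealing each class
    round-robin so class proportions are preserved (stratification)."""
    by_class = defaultdict(list)
    for pair in labeled:
        by_class[pair[1]].append(pair)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for items in by_class.values():
        rng.shuffle(items)
        for i, pair in enumerate(items):
            folds[i % k].append(pair)
    return folds

def cross_validate(labeled, train, k=10):
    """Average held-out accuracy of `train` over k stratified folds."""
    folds = stratified_folds(labeled, k)
    accuracies = []
    for i in range(k):
        held_out = folds[i]
        rest = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        model = train(rest)  # `train` returns a callable classifier
        accuracies.append(sum(model(x) == y for x, y in held_out) / len(held_out))
    return sum(accuracies) / k
```

Stratification keeps each fold's class mix close to the overall distribution, which is what reduces the variance of the accuracy estimate relative to plain random folds.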
-
Supervised and Unsupervised Discretization of Continuous Features
Machine Learning
Many supervised machine learning algorithms require a discrete feature space. In this paper, we review previous work on continuous feature discretization, identify defining characteristics of the methods, and conduct an empirical evaluation of several methods. We compare binning, an unsupervised discretization method, to entropy-based and purity-based methods, which are supervised algorithms. We found that the performance of the Naive-Bayes algorithm significantly improved when features were discretized using an entropy-based method. In fact, over the 16 tested datasets, the discretized version of Naive-Bayes slightly outperformed C4.5 on average. We also show that in some cases, the performance of the C4.5 induction algorithm significantly improved if features were discretized in advance; in our experiments, the performance never significantly degraded, an interesting phenomenon considering the fact that C4.5 is capable of locally discretizing features.
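The core step of entropy-based discretization is choosing a cut point that minimizes the weighted class entropy of the two resulting intervals; the full method applies this recursively with a stopping criterion. A minimal sketch of the single-cut step (illustrative, not the paper's implementation):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((labels.count(y) / n) * math.log2(labels.count(y) / n)
                for y in set(labels))

def best_split(pairs):
    """Pick the cut point on a continuous feature that minimizes the
    weighted class entropy of the two resulting intervals."""
    pairs = sorted(pairs)  # sort by feature value
    best = None
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbors
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or score < best[0]:
            best = (score, cut)
    return best[1]
```

Unlike equal-width binning, this supervised step places cut points where the class labels actually change, which is why entropy-based discretization helped Naive-Bayes in the study.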
Patents
-
Changing results after back button use or duplicate request
Issued US 9,129,018
Enhancements of the user experience are provided when a user returns to a previously viewed page, such as a previously viewed page of search results. When a user returns to a previously viewed page, additional context information from a user's actions since the initial view of a page can be used to modify the previously viewed page and/or obtain a new version of the previously viewed page. In situations where the previously viewed page corresponds to a page of responsive results from a search engine, the modified and/or new version of the search engine results page can include an expanded or reduced group of results, different types of results, different rankings for existing results, or a combination thereof.
-
Active hip
Issued US 8,433,916
Computing services that unwanted entities may wish to access for improper, and potentially illegal, use can be more effectively protected by using Active HIP systems and methodologies. An Active HIP involves dynamically swapping one random HIP challenge, e.g., but not limited to, image, for a second random HIP challenge, e.g., but not limited to, image. An Active HIP can also, or otherwise, involve stitching together, or otherwise collecting and including, within Active HIP software, i.e., a HIP web page, to be executed by a computing device of a user seeking access to a HIP-protected computing service x number of software executables randomly selected from a pool of y number of software executables. The x number of software executables, when run, generates a random Active HIP key. If the generated Active HIP key accompanies a correct user response to the valid HIP challenge the system and/or methodology can assume with a degree of certainty that the current user is a legitimate human user and allow the current user access to the requested computing service.
-
Method and System for Determining Whether an Offering is Controversial Based on User Feedback
Issued US 8412557
The controversiality of an offering in a computer implemented system is computed based on user satisfaction feedback. A controversiality index can be provided to indicate the extent to which the offering is controversial.
-
Continuous usability trial for a website
Issued US 8,185,608
A continuous website trial allows ongoing observation of user interactions with a website for an indefinite period of time that is not ascertainable at initiation of the trial. Users are randomly assigned to either a control group or one or more test groups. The control and test groups are served different sets of web pages, even though they access the same website. During the trial, the web pages for the control group are held constant over time, while the web pages for the test group(s) undergo multiple modifications on separate occasions over time. As the web pages for the test group(s) are modified, statistical data collection continues to learn how user behavior changes as a result of the modifications. The statistical data obtained from the users of the various groups may be compared and contrasted and used to gain a better understanding of customer experience with the website.
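Random assignment of users to control and test groups is commonly implemented in practice with deterministic hashing, so a returning user always lands in the same group. A sketch of that common technique (not the specific mechanism claimed in the patent); the 50/50 weights are illustrative:

```python
import hashlib

def assign_variant(user_id, experiment, weights=(("control", 50), ("test", 50))):
    """Deterministically assign a user to a variant by hashing the user id
    together with the experiment name, then mapping the resulting bucket
    to weighted variants. The same user always gets the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    for variant, weight in weights:
        if bucket < weight:
            return variant
        bucket -= weight
    return weights[-1][0]
```

Including the experiment name in the hash decorrelates bucket assignments across experiments, so a user in "test" for one trial is not systematically in "test" for the next.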
-
Detection of behavior-based associations between search strings and items
Issued US 8,112,429
A system and method are disclosed for automatically detecting associations between particular sets of search criteria, such as particular search strings, and particular items. Actions of users of an interactive system, such as a web site, are monitored over time to generate event histories reflective of searches, item selection actions, and possibly other types of user actions. An analysis component collectively analyzes the event histories to automatically identify and quantify associations between specific search strings (or other types of search criteria) and specific items. As part of this process, a decay function reduces the weight given to a post-search item selection event based on intervening events that occur between the search event and the item selection event.
-
Bayes rule based and decision tree hybrid classifier
US 6,026,399
The present invention provides a hybrid classifier, called the NB-Tree classifier, for classifying a set of records.
-
Method system and computer program product for visualizing an evidence classifier
US 6,460,049
A method, system, and computer program product visualizes the structure of an evidence classifier.
-
Method, system, and computer program product for visualizing a decision-tree classifier
US 6,278,464
A method, system, and a computer program product for visualizing a decision-tree classifier are provided.
-
Strategies for providing diverse recommendations
US 7,542,951
-
Strategies for providing novel recommendations
US 7,584,159
-
System and method for selection of important attributes
US 6,026,399
A system and method determines how well various attributes in a record discriminate different values of a chosen label attribute.
Honors & Awards
-
59 A/B testing influencers you need to follow in 2023
Kameleoon
-
60 influencers in A/B testing you need to follow in 2022
Kameleoon
https://www.kameleoon.com/en/blog/60-influencers-ab-testing-you-need-follow-2022
-
82 influencers in A/B testing that you need to know in 2021
Kameleoon
https://www.kameleoon.com/en/blog/top-ab-testing-influencers
-
Over 50,000 citations to papers
https://scholar.google.com/citations?hl=en&user=O3RYHGwAAAAJ&view_op=list_works&pagesize=100
-
Experimentation lifetime achievement award
https://experimentationcultureawards.com/
https://experimentationcultureawards.com/#ronnykohavi
https://www.linkedin.com/posts/ronnyk_expca2020-abtest-experimentguide-activity-6714985210947207168-7Hec
-
Quora Most Viewed Writer in A/B Testing (adjusts real-time, usually top 3)
Quora
https://www.quora.com/topic/A-B-Testing/writers
-
AMiner 5th most influential scholar in AI, 26th most influential scholar in Machine Learning
https://aminer.org/mostinfluentialscholar/ai
https://aminer.org/mostinfluentialscholar/ml
-
Forbes article: A Massive Social Experiment On You Is Under Way, And You Will Love It
Forbes
Quoted in http://www.forbes.com/sites/parmyolson/2015/01/21/jawbone-guinea-pig-economy/
-
IEEE Tools with Artificial Intelligence best paper award
IEEE
IEEE Tools With Artificial Intelligence Best Paper Award for the paper Data Mining using MLC++, a Machine Learning Library in C++ by Kohavi, Sommerfield, and Dougherty.
-
President's award (top 5%) each year of BA degree
Technion
Languages
-
Hebrew
Native or bilingual proficiency
-
English
Native or bilingual proficiency
Organizations
-
SIGKDD
-
Recommendations received
22 people have recommended Ron
More activity by Ron
-
I cancelled my Loom membership last week. And I was shocked. Pleasantly shocked. Shocked at how easy it was to cancel. It took me 5 screens and 6…
Liked by Ron Kohavi
-
I'm not big on the whole influencer thing but I'd like to thank Kameleoon for including me in the 2024 list with so many other kickass people. I'm…
Liked by Ron Kohavi
-
Happy Friday to everyone... but an ESPECIALLY happy Friday to those grinders out there who are being recognized for their work in helping shape /…
Liked by Ron Kohavi
-
𝗪𝗵𝗲𝗿𝗲 𝘁𝗼 𝗙𝗶𝗻𝗱 𝗜𝗱𝗲𝗮𝘀 𝗳𝗼𝗿 𝗔/𝗕 𝗧𝗲𝘀𝘁𝘀? I often encounter a problem where the team is stuck, not knowing where to get ideas for…
Liked by Ron Kohavi
-
The two-minute video for my KDD 2024 paper on False Positives in A/B tests with Nanyu Chen was posted by ACM at https://lnkd.in/gNwnvhfE The paper…
Shared by Ron Kohavi
-
A/B Testing myth: “Longer tests = more sample size = more power” The shared post gives a great formal intuition why “longer tests = more sample size…
Liked by Ron Kohavi
Others named Ron Kohavi
-
Ron Kohavi Nusbaum
Dynamics365 ERP consultant - Implementation
-
Ron Kohavi
Warehouse Operator at Comett progress
2 others named Ron Kohavi are on LinkedIn