A/B Testing the Website of a Grocery Store Chain
Introduction
A large grocery chain wants to drive more customers to download its mobile app and register for the loyalty program. The manager is curious whether changing the app store text link to a button will improve users' ability to download the app. Here is the existing app store link.
The manager asked for an A/B testing plan for changing the app store link to a button, with the expectation that it will increase users' interest in downloading the app.
Setting Up the Problem
1. Experiment Goal
To determine whether changing the app store link to a button can increase user interest in downloading the mobile application.
2. Choosing Metrics
Goal Metric : User count
User count means the number of people who have downloaded the app and installed it on their device [1].
- Represents the company's purpose or core business.
  - Objective : Drive more customers to download our mobile app and register for the loyalty program.
  - Reason : User count measures how many interested customers have downloaded the mobile app onto their devices.
- Simple to communicate with stakeholders.
  - Stakeholder : Internal team, manager, executives.
  - Reason : With this metric, stakeholders can understand how far the mission/goal has been achieved.
Driver Metric : Click-Through Rate (CTR)
CTR is the ratio of the number of clicks on a specific link or an element of interface to the number of times people were exposed to the link or element [2].
Guardrail Metric : Mobile app loading time
Mobile app load time is the time the application takes to fully load before the interface opens and the app becomes responsive or interactive for the user [3].
If mobile app loading time increases by even a few ms -> decreased satisfaction -> users abandon/uninstall the mobile app -> lost users -> potential revenue loss.
3. Define Variants
Control : The existing text link.
Treatment : The new link displayed as a button (image instead of text).
4. Define Hypothesis
H0 (Null Hypothesis) : The CTR of the new button-style link (image instead of text) is equal to or less than the CTR of the existing link.
H1 (Alternative Hypothesis) : The CTR of the new button-style link (image instead of text) is greater than the CTR of the existing link.
Designing Experiments
1. Randomization Unit : User.
2. Target of Randomization Unit : All users who visit the web pages of the grocery store chain that contain the link.
3. Sample Size :
- Significance level (α) is the probability that the experiment will produce a false-positive result (a Type I error) [4]. This error must be minimized, so a smaller alpha is better for the experiment but requires a larger sample. The significance level in this experiment is 5% (0.05), the most commonly used level and the industry standard, indicating that the acceptable risk of error is 5%. A significance level of 0.05 means we accept a 5% chance of making a Type I error (rejecting a true null hypothesis) in exchange for 95% confidence that the result is not due to chance. It was chosen because it balances the risks of Type I and Type II errors: a Type I error occurs when H0 (the null hypothesis) is rejected although it is true, while a Type II error occurs when we fail to reject H0 although it is false. A significance level of 0.01 is more conservative than 0.05, appropriate when the experimental results need to be as accurate as possible.
- Power level (1-β) is the probability that the experiment will reject the null hypothesis when it is false [4]. This should be maximized, so a higher power level is better for the experiment but requires a larger sample. The power level in this experiment is 80% (0.8): a high power level indicates that the experiment has a strong ability to detect a significant difference between the tested groups if such a difference exists, and the higher the power, the more likely the experiment is to detect a true difference. If Type II errors are the greater concern, an even higher power level can be chosen to minimize that risk.
- Standard deviation of the population (σ) is a measure of how much variation there is among individual data points in a population; it quantifies how spread out the data is from its mean. A small standard deviation means the data points are generally close to the mean, while a large standard deviation means the data is more dispersed [5]. Since there is no historical data for this experiment, the population standard deviation is assumed to be 0.5.
- Difference between control and treatment (δ) is the minimum effect (difference) to be detected. The minimum detectable effect (MDE) is the effect size, set by the researcher, that an impact evaluation is designed to estimate at a given significance level; it is a critical input for power calculations and is closely related to power, sample size, and survey and project budgets [6]. The MDE is generally expressed as a percentage or proportion of the difference between the groups being compared. For example, to detect a difference in conversions between two groups at a 95% significance level, we might set the MDE at 5%: if the true difference between the two groups is less than 5%, the experiment may not be able to distinguish them significantly. In this experiment, the difference between control and treatment is 2%, i.e., we want to be able to detect a difference of 2%. The sample size is then:
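As a minimal sketch, the per-variant sample size can be computed with the standard two-sample z-test formula n = (z(1-α/2) + z(1-β))² · 2σ² / δ² (the variable names below are illustrative):

```python
from scipy.stats import norm

alpha = 0.05   # significance level
power = 0.80   # 1 - beta
sigma = 0.5    # assumed population standard deviation
delta = 0.02   # minimum detectable effect

z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value, ~1.96
z_beta = norm.ppf(power)           # ~0.84

# n per variant = (z_alpha + z_beta)^2 * 2 * sigma^2 / delta^2
n_per_variant = (z_alpha + z_beta) ** 2 * 2 * sigma ** 2 / delta ** 2
print(round(n_per_variant))        # ~9,811, rounded up to 10,000
```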
The sample size is 10,000 for one variant, so the total for 2 variants is 10,000 × 2 = 20,000.
- Since this experiment requires a very large sample, the time needed to run it depends on the number of visitors to the website. If the experiment runs for 6 full weeks with at least 500 users visiting the website per day, the total number of users in the experiment is 42 days × 500 = 21,000. Collecting enough data therefore takes at least 6-8 weeks; running the experiment this long also helps avoid primacy and novelty effects, which occur when initial interest is high but drops as time goes by.
Analyzing and Interpreting the Data
The dataset used in this project comes from Grocery website data for AB test. The dataset has 184,588 records with 5 variables:
- RecordID : identifier of the data row.
- IP Address : address of the user visiting the website.
- LoggedInFlag : 1 when the user has an account and is logged in.
- ServerID : one of the servers the user was routed through.
- VisitPageFlag : 1 when the user clicked through to the loyalty program page.
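As a sketch, the dataset can be loaded with pandas; the file name below is hypothetical and should be replaced with the path of the downloaded dataset:

```python
import pandas as pd

# Hypothetical file name for the "Grocery website data for AB test" dataset
df = pd.read_csv("grocery_website_data.csv")

print(df.shape)   # expected: (184588, 5)
print(df.head())
```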
From the sample size calculation, 10,000 users are needed for each variant, so simple random sampling is carried out to obtain the sample. The data is analyzed and interpreted in the following steps:
1. Ensure the trustworthiness of the data
- Check the data quality (missing values, duplicate data, distribution of the data)
There are no missing values. Next, check for duplicate data.
There are 85,072 duplicate rows, so the duplicates are deleted.
Now there are 99,516 total records without duplicates, and the data is ready for analysis.
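A minimal sketch of these quality checks with pandas, continuing from the loaded DataFrame df:

```python
# Check for missing values per column (none are expected)
print(df.isnull().sum())

# Count duplicate rows, then drop them
print(df.duplicated().sum())   # 85,072 duplicates
df = df.drop_duplicates()
print(len(df))                 # 99,516 records remain
```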
- Data exploration (how many users are in each group, and other insights from the dataset)
Then calculate the CTR for both groups.
To compare the control group and the treatment group more clearly, create the following code.
Next, make a visualization to compare the CTR of each variant, as shown in the sketch below.
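A sketch of these steps follows. The mapping from ServerID to variant is an assumption for illustration (the original assignment rule is not stated here), and the simple random sample of 10,000 users per variant follows the sampling plan described above:

```python
import matplotlib.pyplot as plt

# Assumption: ServerID 1 serves the treatment variant, the rest the control
df["group"] = df["ServerID"].apply(lambda s: "treatment" if s == 1 else "control")

# Simple random sampling: 10,000 users per variant (fixed seed for reproducibility)
sample = (df.groupby("group", group_keys=False)
            .apply(lambda g: g.sample(n=10_000, random_state=42)))

# CTR per group: share of users who clicked through (VisitPageFlag == 1)
ctr = sample.groupby("group")["VisitPageFlag"].mean()
print(ctr)

# Bar chart comparing the CTR of each variant
ctr.plot(kind="bar", color=["steelblue", "seagreen"])
plt.ylabel("CTR")
plt.title("CTR by variant")
plt.show()
```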
- Perform an SRM test with a chi-square test
- Define the null and alternative hypotheses (H0 and H1)
H0 : No SRM detected
H1 : SRM detected
- Calculate the chi-square statistic
- Define the decision rules
In making the statistical test decision, we can use either:
- Comparison of the chi-square statistic with the critical value
- Comparison of the p-value with alpha
The degrees of freedom (df) are calculated as the number of variants minus one, so df = 2 - 1 = 1.
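A sketch of the SRM check, reusing the sampled data from above:

```python
from scipy.stats import chisquare, chi2

# Observed users per variant vs. an expected 50/50 split
observed = sample["group"].value_counts().values
expected = [observed.sum() / 2] * 2

chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)

dof = len(observed) - 1             # 2 variants - 1 = 1
critical = chi2.ppf(1 - 0.05, dof)  # ~3.841

# No SRM if chi2_stat < critical (equivalently, p_value > alpha)
print(chi2_stat, p_value, critical)
```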
Based on this chi-square test, no SRM was detected.
2. Conduct hypothesis testing and analyze the result
- Define null hypothesis H0 and alternative hypothesis H1
H0 (Null Hypothesis) : CTR of the new button-style link (image instead of text) ≤ CTR of the existing link
H1 (Alternative Hypothesis) : CTR of the new button-style link (image instead of text) > CTR of the existing link
First, define the critical z-value (Zcrit), the z-statistic, and the p-value. To calculate the z-statistic and p-value, use a proportions z-test function. Set the alternative for this hypothesis test to 'larger', because we want to show that the CTR of the new link is greater than the CTR of the old link.
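A sketch of this test using the proportions z-test from statsmodels (an assumption; the original post does not name the exact function, but its 'larger' alternative matches the keyword used here), continuing from the sampled data:

```python
from statsmodels.stats.proportion import proportions_ztest
from scipy.stats import norm

alpha = 0.05

# Clicks (successes) and observations per group, from the sampled data
clicks = sample.groupby("group")["VisitPageFlag"].sum()
nobs = sample.groupby("group")["VisitPageFlag"].count()

# alternative='larger' tests H1: first proportion > second proportion,
# so the treatment group is passed first
z_stat, p_value = proportions_ztest(
    count=[clicks["treatment"], clicks["control"]],
    nobs=[nobs["treatment"], nobs["control"]],
    alternative="larger",
)

z_crit = norm.ppf(1 - alpha)   # one-sided critical value, ~1.645
print(z_stat, p_value, z_crit)
```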
There is a relative increase of 34.7%. Next, summarize the statistical test results.
Next, visualize the statistical test results as a z-value distribution graph; for this, find the critical z-value at alpha = 0.05. The resulting visualization is shown in the following figure.
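A sketch of this plot, reusing z_stat and z_crit from the test above:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-4, 7, 500)
plt.plot(x, norm.pdf(x), color="black")

# Red: rejection region beyond the critical value (alpha = 0.05, one-sided)
plt.fill_between(x, norm.pdf(x), where=x >= z_crit, color="red", alpha=0.4,
                 label=f"z critical = {z_crit:.3f}")

# Green: area beyond the observed z statistic (the p-value)
plt.fill_between(x, norm.pdf(x), where=x >= z_stat, color="green", alpha=0.6,
                 label=f"z statistic = {z_stat:.3f}")

plt.xlabel("z value")
plt.ylabel("density")
plt.legend()
plt.show()
```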
In the visualization above, the green region (the area beyond the observed z-statistic, i.e., the p-value) is smaller than the red region (the rejection region where H0 is rejected). This means the probability of observing a result at least this extreme under H0 is smaller than the chosen alpha. Statistically, there is not enough evidence to retain H0 (the p-value is smaller than alpha), so H0 is rejected.
3. Calculate confidence interval of difference between treatment and control
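A minimal sketch using the normal-approximation (Wald) interval for the difference in proportions, reusing the click counts from the z-test above:

```python
import numpy as np
from scipy.stats import norm

p_treat = clicks["treatment"] / nobs["treatment"]
p_ctrl = clicks["control"] / nobs["control"]

# Wald 95% confidence interval for the difference in proportions
se = np.sqrt(p_treat * (1 - p_treat) / nobs["treatment"]
             + p_ctrl * (1 - p_ctrl) / nobs["control"])
z = norm.ppf(1 - 0.05 / 2)   # ~1.96

diff = p_treat - p_ctrl
print(diff - z * se, diff + z * se)
```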
Based on this result, we can be 95% confident that the difference in the proportion of users who clicked the link (CTR) between the treatment (B) and control (A) groups is between 0.011 and 0.024. In other words, the new button-style link (treatment) increases the proportion of users who go on to download the app by 0.011 to 0.024.
Conclusion and Recommendation
• P-value (6.06 × 10⁻⁸) < α (0.05) -> Reject H0
• Z statistic (5.2916) > Z critical (1.645) -> Reject H0
At the 5% significance level, there is sufficient evidence that the CTR of the new button-style link (treatment) is greater than that of the existing link (control). In other words, the new button-style link will increase user interest in downloading the application.
Recommendations for the website of the grocery store chain :
- The statistical test result is statistically significant: the p-value is far below α = 0.05, meaning there is less than a 5% probability that a difference this large would be observed by chance, or due to factors unrelated to the variable being tested, if the change had no real effect.
- However, to decide whether or not to change the link to an app store button, the result must also be practically significant, considering for example:
1. The resources and costs required to implement the change. If the cost of changing the app store link to a button on the website is very high and out of proportion to its impact on mobile app downloads, the change may not be considered practically significant.
2. The difference in performance before and after the change. If changing the link to a button increases mobile app downloads by 1% or more, the change may be considered practically significant; if it increases downloads by only 0.1% or less, it may not be.
- Based on the above considerations, the change is considered practically significant.
Recommendations for the next experiment :
1. Download page variants: change the layout or content of the mobile app download page, such as adding images or positive reviews from other users.
2. App description: Change the app description on the website, such as highlighting the benefits or advantages of the app.
3. Changes to the overall appearance and content of the website: Changing the overall layout, design, and content of the website.
4. Target audience: There may be certain groups of users who are more likely to download apps than others, so changing the look and content of the website to appeal more to certain target groups could be a recommendation for future experiments.
References
1. mobileappdaily.com, "Top 8 App Engagement Metrics For Mobile Apps To Track in 2023," March 14, 2023. [Accessed April 1, 2023]. https://www.mobileappdaily.com/top-metrics-to-measure-user-engagement
2. Damaševičius, Robertas and Zailskaitė-Jakštė, Ligita, "Usability and Security Testing of Online Links: A Framework for Click-Through Rate Prediction Using Deep Learning," 2022.
3. storyly.io, "App Loading." [Accessed April 1, 2023]. https://www.storyly.io/glossary/app-loading
4. Festing, Michael F. W., "On determining sample size in experiments involving laboratory animals," 2017.
5. khanacademy.org, "Population standard deviation." [Accessed April 1, 2023]. https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/variance-standard-deviation-population/v/population-standard-deviation
6. dimewiki.worldbank.org, "Minimum Detectable Effect." [Accessed April 9, 2023]. https://dimewiki.worldbank.org/Minimum_Detectable_Effect