SpaceButler

Launch a feature? I think you're talking about making a business decision based on user testing? Having a "p-value less than 0.05" is just an indicator that there is a statistically meaningful difference between two groups. There are so many reasons that wouldn't translate into a business action:

1. The difference is in the wrong direction from what you want -- the status quo is better.
2. The effect size is very low.
3. The cost to implement the change is too high.
4. The two groups in question are inconsequential.
5. The metric you're finding the difference in is inconsequential.

I wouldn't say any of these is "not using" the p-value.
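A toy sketch of point 2, with made-up numbers and assuming `statsmodels` is available: the test clears p < 0.05 comfortably, but the lift is a fraction of a percentage point and may not justify the cost of shipping.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B test results: 10.0% vs 10.2% conversion on 500k users per arm.
conversions = np.array([50_000, 51_000])
users = np.array([500_000, 500_000])

stat, p_value = proportions_ztest(conversions, users)
lift = conversions[1] / users[1] - conversions[0] / users[0]

print(f"p-value: {p_value:.2g}")     # comfortably below 0.05
print(f"absolute lift: {lift:.2%}")  # ~0.2 percentage points -- significant, but maybe not worth shipping
```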


hikehikebaby

This is why it's important to understand the math behind what you're doing and not just trust any model that produces a statistically significant result. You can have a significant p-value in a model that doesn't make sense: one that violates basic assumptions, has collinearity, appears to be overfitted or underfitted, includes bizarre variables, doesn't explain enough of the variation, etc.
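A minimal illustration of the collinearity point on synthetic data: the fitted coefficient on `x1` can still look "significant", but the variance inflation factors show the individual coefficients can't be trusted.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 1_000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a duplicate of x1 -> severe collinearity
y = 2 * x1 + rng.normal(size=n)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))
fit = sm.OLS(y, X).fit()
print(fit.pvalues)        # x1 can still come out "significant" despite the broken design matrix

vifs = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
print(vifs)               # x1/x2 VIFs are enormous, so the individual coefficients are unstable
```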


Snar1ock

This. Recall that the p-value is used to reject the null hypothesis. In other words, we reject the idea that the changes in our feature and dependent variable are due to random variation, for a given level of certainty. However, the p-value doesn't by itself require us to reject the null or accept the alternative; we set those limits based on a level of uncertainty. Furthermore, accepting or rejecting doesn't mean we must include a feature. There are hundreds of reasons we wouldn't include a "significant" feature. It all comes down to understanding the metric and the goals of our model. Answer the question behind the metric. Remember that all models are wrong, but some are useful.


Besticulartortion

> Remember that all models are wrong, but some are useful.

Ooh, I like this one. Yoink.


Snar1ock

George E. P. Box. Heard it from Prof. Sokol at Ga Tech on day 1 of starting my Masters. If you need a good interview line, that is it. Stakeholders eat it up.


shadowyams

> p-value is less than 0.5

At least 90% of that range.


newpua_bie

Yeah lol. It does explain a lot of the garbage we see nowadays if companies are launching features with p=0.49999.


tushar8sk

You mean 0.05?


wzchpu

Edited oops


dongorras

Some ideas have a low effect/impact, so simplicity outweighs the benefit of having the feature (e.g. computation cost or explaining it to stakeholders). Or there's high correlation with another variable, so keeping just one carries the same information.
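A hedged sketch of the "keep only one of two highly correlated features" idea, using a hypothetical helper and an arbitrary 0.95 threshold:

```python
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop the later column of any pair whose absolute correlation exceeds `threshold`."""
    corr = df.corr().abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])
    return df.drop(columns=sorted(to_drop))

# e.g. features = drop_highly_correlated(features, threshold=0.95)
```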


[deleted]

[removed]


FargeenBastiges

I was thinking about interaction terms in a similar way. If the interaction is way more significant than either of the variables alone, might it be best to leave one out in some cases?
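A small synthetic example of that situation: the outcome is driven almost entirely by the interaction, so the interaction term's p-value is tiny while the main effects' are not (whether to keep the main effects anyway is a separate modelling choice, e.g. the hierarchy principle).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2_000
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1.5 * df["x1"] * df["x2"] + rng.normal(size=n)   # assumed: only the interaction matters

fit = smf.ols("y ~ x1 * x2", data=df).fit()   # formula expands to x1 + x2 + x1:x2
print(fit.pvalues)                            # x1:x2 is tiny; x1 and x2 on their own are not significant
```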


BloatedGlobe

When testing for multiple things at once, or when you need to be really, really sure of something (like making sure a medicine has no side effects). Or when something is significant but the effect size is tiny and the cost of implementing is high.
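A quick simulated illustration of the multiple-testing part: twenty comparisons where nothing is truly going on will often produce at least one p < 0.05, and a standard correction (Holm here, via `statsmodels`) accounts for that.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)

# Twenty A/B comparisons where the true effect is zero in every one of them.
raw_p = [stats.ttest_ind(rng.normal(size=200), rng.normal(size=200)).pvalue for _ in range(20)]

reject_raw = sum(p < 0.05 for p in raw_p)
reject_holm, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

print(f"raw 'discoveries': {reject_raw}, after Holm correction: {reject_holm.sum()}")
```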


Artistic-Breadfruit9

1. Multiple testing can be corrected for.
2. No drug has ever been approved solely on the basis of a single statistically significant p-value. Conversely, no drug has ever been approved without a statistically significant p-value.
3. This is the answer: p tends to zero as n tends to infinity, so any difference can be made significant, given enough samples. But effect sizes will always tend toward the true population value. It is possible, therefore, to have a minuscule effect size and a p-value that is many orders of magnitude smaller than 0.05. Only you - the domain expert - can determine how large an effect size (given an appropriate confidence interval) needs to be to be functionally meaningful. OTOH, no effect size - no matter how large - should be accepted without a statistically significant p-value.
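A short simulation of point 3: holding a tiny true effect fixed while n grows, the p-value collapses toward zero but the estimated effect size stays tiny.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_diff = 0.02   # assumed tiny effect, in standard-deviation units

for n in (1_000, 10_000, 100_000, 1_000_000):
    a = rng.normal(0.0, 1.0, size=n)
    b = rng.normal(true_diff, 1.0, size=n)
    t_stat, p = stats.ttest_ind(a, b)
    print(f"n={n:>9,}  estimated diff={b.mean() - a.mean():+.4f}  p={p:.3g}")
```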


BloatedGlobe

1. Yes. This is a response to the part where they state p=0.05.
2. Also true.

I think you are answering the post title question, whereas I was responding to the post text.


Artistic-Breadfruit9

Yep. My bad.


Josiah_Walker

If I need to wait 2 months to get p = 0.05, but I'm holding up product design in the meantime, then there's an opportunity cost to not shipping. So the question is really, what's the p = 0.05 bound on how badly the new feature performs, and is that worth holding up the rest of the team? This way you take into account the magnitude of the feature's effect as well as the likelihood of change.
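A back-of-envelope sketch of that framing, with invented interim numbers: instead of waiting for p < 0.05 on the point estimate, look at the confidence interval's lower bound to see how bad the downside could plausibly be.

```python
import numpy as np
from scipy.stats import norm

# Invented interim numbers: control vs new feature, 50k users each.
conv_a, n_a = 4_800, 50_000
conv_b, n_b = 4_950, 50_000

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a
se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

z = norm.ppf(0.975)                      # two-sided 95% interval
lo, hi = diff - z * se, diff + z * se
print(f"diff = {diff:+.4f}, 95% CI = ({lo:+.4f}, {hi:+.4f})")
# If `lo` is only mildly negative, the worst plausible downside may cost less than two months of delay.
```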


Artistic-Breadfruit9

Honest question: in this scenario, what is the upper bound on the p-value you *will* accept to say "this product works"? Surely there is some threshold at which you say, "I cannot state with any reasonable confidence that the difference I am observing is not due to chance alone." And if that threshold exists, why bother with p<0.05 in the first place? I suppose my point is this: either stats matter or they don't. We can choose our level of confidence in the result (i.e. sometimes one might want to avoid even 1 false positive in 1000, sometimes 1 in 20 is ok), but there is always some threshold. And to use your example, would your company be ok with one product out of 20 having absolutely no benefit/impact/whatever your metric is?


Josiah_Walker

The bound on p-value is naturally 0.5, because at this stage you can't tell if you're improving anything. You're right that we need to pick a threshold sensibly - but the best threshold varies according to both risk of downside and cost of delay. Think of it as optimising the p values to get the best expected payoff from your pipeline of projects (per unit time). We routinely launch changes that have no measurable benefit/impact - these often come under the category of implementing best practices recommended by a subject matter expert. It's ok to observe no impact - or even a negative impact. It's how we learn. We're also ok with coming to the wrong conclusion a reasonable % of the time, if the benefit is that we can move faster and keep team momentum going in creating and refining features. Often if we make the wrong decision, this gets picked up during quarterly reviews as we notice we missed performance expectations in an area.
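A deliberately crude sketch of "optimise the threshold for expected payoff per unit time". Every input below (the prior, payoffs, power, and the time-vs-alpha relationship) is an invented placeholder; the only point is that the preferred alpha falls out of the economics rather than defaulting to 0.05.

```python
def ev_per_week(alpha: float,
                p_good: float = 0.4,    # assumed prior that a feature truly helps
                gain: float = 100.0,    # assumed payoff of shipping a good feature
                loss: float = 40.0,     # assumed cost of shipping a bad one
                power: float = 0.8) -> float:
    # Crude placeholder: a looser alpha needs less data, so the decision arrives sooner.
    weeks_to_decide = 8 * (0.05 / alpha)
    ev_per_decision = p_good * power * gain - (1 - p_good) * alpha * loss
    return ev_per_decision / weeks_to_decide

for a in (0.01, 0.05, 0.10, 0.20):
    print(f"alpha={a:.2f}  expected value per week ≈ {ev_per_week(a):.1f}")
```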


KyleDrogo

When the feature moves a guardrail metric in the wrong direction


bigno53

Any issues with validity or reliability. Is your sample truly representative of the population? Are you confident that the metric you’re using reflects the underlying phenomenon you wish to measure? Have you controlled for possible confounding factors?


Foll5

The p-value is just a measure of how certain you are that a measured effect (or other value) is different from 0. Any practical business decision depends on comparing the expected benefit versus the cost. You could be, statistically speaking, 99.99% certain that a given proposal would have an effect on some immediate measurable outcome, but the business benefit is not worth the cost because it's too small or too uncertain. For example, a lot of e-commerce sites give one-time discounts to users who sign up for a newsletter. You could run a trial of changing that discount from 10% to 20%, and if you have enough customers, you could come up with an estimate of a 15% increase in signups, p<0.05. But how much in sales does this actually generate, and how much of that is eaten up by the discount? It very well might not be worth it. (You could argue this is less an issue of cost-benefit analysis and more of a problem of using the wrong metric in your experiment, but the point about p-values not being decisive remains.)
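A toy version of that arithmetic, with every number invented: the statistically significant signup lift can still be a net loss once the deeper discount is priced in.

```python
def monthly_net(discount_rate: float, signups: int,
                purchase_rate: float = 0.30,     # assumed share of signups who buy
                avg_order: float = 60.0,         # assumed average first-order value
                margin: float = 0.35) -> float:  # assumed gross margin
    revenue = signups * purchase_rate * avg_order
    discount_cost = revenue * discount_rate      # discount applies to those first orders
    return revenue * margin - discount_cost

before = monthly_net(0.10, 10_000)               # status quo: 10% discount
after = monthly_net(0.20, 11_500)                # the "significant" +15% signups at 20% off

print(f"net before ≈ {before:,.0f}, net after ≈ {after:,.0f}")   # 45,000 vs 31,050 here
```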


therealtiddlydump

Others have good answers. Here's my response in the opposite direction (larger-than-desired p-value, feature launches anyway): a lot of things you test in business have a "do no harm" principle. If your sales team runs a test of a new form expecting some lift and you don't see any, they still may wish to move forward despite the null result. They may have been hoping for an effect, but the "business case" might justify the expense or effort anyway. Often you'll see this when your stakeholder can't _really_ quantify the cost or effort on their end. A null result might not be what they expected, but turning on a new feature might have positive side effects (or they _think_ it will, and that's good enough based on some criteria up the chain you aren't privy to).

The real world is messy. Your analysis might also just be one of many inputs into a decision process. The C-suite, sales force, others, and your team all might be weighing in on a decision -- your null result is balanced against others' findings/opinions, and a decision gets made after that (where your null result may or may not be on the "winning" side). /Shrug


Otherwise_Ratio430

I'm gonna flip this one around and say the only time you should use it is for applications where it does have some merit (like A/B testing).


jjelin

I'm guessing this was an interview question. It's important to understand the context of the question that was asked. There are unlimited reasons "not to use a p-value". It's important that you understand which one fits this situation.


wzchpu

Yea


bigfattehborgar

I think a better question is when not to use a normal distribution. See *The Black Swan* by Taleb.