P-Hacking in Startups

Science has a problem.

It’s kind of broken.

Well. Not all of it. Mostly the social sciences and medicine. And I don’t just mean the fact that they consider Freud canon.

It started with a trickle. A retracted paper here. A study that couldn’t be repeated, there.

Then someone decided to get systematic. It opened the floodgates. A 2016 survey in Nature found that more than 70% of researchers had tried and failed to reproduce another scientist's experiments, and more than half had failed to reproduce their own.

Reproducibility is fundamental to the scientific method — it’s supposed to be a study of the natural world, which doesn’t change all that often — so what does its absence mean? Are we incompetent? Can we trust anything? Do we know anything?

The high failure rate of venture-backed startups is its own kind of replication crisis: “How could my company fail? I followed the growth-hacking, blitz-scaling advice from the founders who made it big!” I don’t mean to give blogs and podcasts the weight of peer-reviewed science. But our industry seems to trust them as if they deserve it.

What does it mean if a founder can’t get similar results when following the practices of another?

Science has begun to heal itself. It’s time for startups to go through their own reckoning. Their methods are failing most people. It’s time to learn why and how to get better.

What’s wrong with science?

The crisis in science has multiple, interconnected causes. A lot of them come down to taking techniques from simpler systems and applying them to the far more complex study of humans. The practices useful for studying minerals also worked great on metals, but with people? Not so much.

One of the most famous examples of these studies that fizzle under scrutiny is the marshmallow experiment, conducted at Stanford University in 1972 on the children of students enrolled there. It produced original, important conclusions about children's ability to delay gratification, and later studies showed that ability was highly correlated with success later in life. Suddenly we had a new tool for predicting, at a very young age, how successful someone would be.

Or… maybe not. Further studies showed the original work was actually just exposing the socioeconomic background of the kids. If your family is well off, you are comfortable with delayed gratification and, just coincidentally, are also likely to be well off when you’re older. If you’re from a poor family, delayed gratification is harder to accept and, huh, you’re also more likely to be poor than those kids of rich parents.

Once someone reran the study with a larger group of kids (900 instead of 90) and controlled for socioeconomic background… the effect largely disappeared. It’s not all that surprising that kids with no food insecurity are better at delaying gratification and also will be more successful in life. It certainly doesn’t grab the headlines like announcing that kids who can wait five minutes to eat a marshmallow will earn more money than those who can’t. No HBR article for that one.
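
If you want to see how a confounder pulls that trick, here is a minimal sketch in Python. The numbers are invented, not the replication's data: both the minutes waited and the adult outcome are driven by a hidden socioeconomic variable, so the raw correlation looks impressive and then mostly vanishes once you control for it.

```python
# Illustrative simulation only: invented numbers, not the replication's data.
# Both "minutes waited" and "adult outcome" are driven by a shared confounder
# (socioeconomic status), so they correlate without either causing the other.
import numpy as np

rng = np.random.default_rng(0)
n = 900

ses = rng.normal(0, 1, n)                      # hidden driver: socioeconomic status
wait = 5 + 2 * ses + rng.normal(0, 2, n)       # minutes the child waits
outcome = 50 + 10 * ses + rng.normal(0, 5, n)  # later-life outcome score

print("raw correlation:", np.corrcoef(wait, outcome)[0, 1])

def residuals(y, x):
    """What's left of y after removing the part explained by x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# "Controlling" for SES: correlate what's left after regressing each on SES.
print("after controlling for SES:",
      np.corrcoef(residuals(wait, ses), residuals(outcome, ses))[0, 1])
```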

It’s been almost fifty years since that study was published. That’s five decades of science based on flawed work, five decades of science that has to be unwound and retried. The longer these mistakes last, the more expensive they are to fix. And like the HBR article touting the original result, many conclusions never get retracted.

One particular “technique” has helped trigger the crisis in science. Many a growth-hacking product manager has fallen into the same trap. They can only be rescued through discipline and rigor.

The how and why of P-hacking

Abusing data is a sure way to get bad results. Unlike startups, scientists rarely just make up their data. They make more subtle mistakes, like p-hacking. The name probably sounds pretty cool, but it’s actually a common form of data misuse. Wikipedia describes it this way:

…performing many statistical tests on the data and only reporting those that come back with significant results.

It works like this:

A researcher comes up with an idea for a study. He collects a bunch of data, runs the experiment and… no dice. The idea didn’t pan out.

Hmm. “I have all this data. I can’t just throw it away.”

So he starts slicing the data, looking for something that stands out. After a while, sure enough, he finds some correlation strong enough to stand up — its p-value comes in under 0.05, so it counts as statistically significant. He publishes it in a paper and looks like a genius. It gets big exposure in the press. Journalists love weird and surprising science. They can report on it without understanding it.
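
You can watch this happen in a few lines of Python. Everything below is pure noise by construction, so any "significant" result is a false positive. Slice one dataset enough ways, though, and on average one test in twenty will clear the 0.05 bar anyway.

```python
# Toy demonstration of p-hacking: random data, many tests, a few "discoveries".
# Every p-value below 0.05 here is a false positive by construction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_subjects, n_slices = 200, 20

group = rng.integers(0, 2, n_subjects)              # e.g. breastfed vs. bottle-fed
measures = rng.normal(size=(n_subjects, n_slices))  # 20 unrelated outcome measures

spurious = []
for i in range(n_slices):
    a = measures[group == 0, i]
    b = measures[group == 1, i]
    p = stats.ttest_ind(a, b).pvalue
    if p < 0.05:
        spurious.append((f"measure_{i}", round(p, 3)))

# On average, one of the twenty tests clears p < 0.05 on luck alone.
print("spurious 'findings':", spurious)
```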

But no one can reproduce the work. The paper gets retracted. He gets uninvited from the big conferences. (Don’t worry. The newspapers never follow up and report the retraction.)

What went wrong?

He left out one key piece: How he got the data.

Let’s say he thinks breastfed kids are healthier than bottle-fed kids. He sets up a study that tries to isolate just these variables, which means he wants his population to be reasonably homogeneous (similar quality of life, similar locations, etc.). Put simply, the difference being researched should be the only material one in the population (unlike in the marshmallow experiment).

But then he looks at the data and — like most of these studies — finds there’s no significant difference in health outcomes between breastfed and bottle-fed kids.

He could just toss the data. But, well, he’s already paid to collect it. He’s got all these graduate students who are working nearly for free. He might as well try something. So he puts a student or two on trying to find useful results.

They nearly always do, but… that success kills his work. All the controls that made the data fit his original experiment fatally bias it for any other study.

Let’s say he discovers that the study participants who were bottle-fed tended to move around a lot more than people who were breastfed. He concludes, oh, wow, getting bottle-fed causes you to hate your parents and move away. (Yes, this is exactly the kind of headline that would get picked for a result like this.)

He has not proven that. All he has shown is that, in this particular — probably small, and certainly narrow — data set, that happens to be the case.

He should throw away all the existing data and start from scratch, controlling for everything except the new variable under test. Only then can he look for correlations between how a baby was fed and later mobility.

But he was too lazy or scared to do that. He found a match in that smaller, biased data set, and then published the results without admitting the problems in either his data or his methods. A few decades ago he would have gotten away with it: A big splashy result on publication, and then everyone just assuming this was true, with no attempt to reproduce and no real questioning of the result.

Today, no chance. Science has developed defenses against this kind of malpractice.

Preregistration of experiments is a key tool.

Researchers register with a central database that they are going to study the health of breastfed vs. bottle-fed babies. When they get results, they point to that registration and say, see, this is what led to my data collection.

If they then wanted to publish some other study, people would say, no, you didn’t preregister this, which makes us suspect you’re p-hacking, so we’re going to do a deep dive on how you got your data. On second thought, we’re just going to reject your paper. Come back when the results hold on a clean dataset.
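
A preregistration entry doesn’t have to be elaborate. Here’s a rough sketch of the information one pins down before any data exists; the field names are mine, not any registry’s actual schema.

```python
# Illustrative only: field names are invented, not a real registry's schema.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Preregistration:
    hypothesis: str       # the claim, stated before any data is collected
    primary_outcome: str  # the one measure the claim will be judged on
    analysis_plan: str    # the exact test, decided in advance
    sample_size: int
    registered_on: date

entry = Preregistration(
    hypothesis="Breastfed infants have better health outcomes than bottle-fed infants",
    primary_outcome="composite health score at 24 months",
    analysis_plan="two-sample t-test, alpha = 0.05, no interim looks",
    sample_size=400,
    registered_on=date(2024, 1, 15),
)
print(entry.hypothesis)
```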

From social science to startups

This might not initially seem to have anything to do with startups. Product managers and marketers aren’t commissioning studies — and they certainly aren’t controlling for variables!

Hmm. If you look at it a bit funny… Every data-backed marketing campaign and feature launch is an experiment.

Let’s build an analogous example.

A product manager builds a new feature, and because he’s growth hacking, he has lots of telemetry to tell him exactly how people are using it.

His theory is that people will use this new feature in some specific way. But he builds it, ships it, and observes, well, hmm, no, almost no one is using it. It’s a bust. I’m sure you’ve never worked on a project like this, but trust me, it happens.

Except… hey, there’s this small group that is using it, and heavily. He looks into it more closely and realizes they’re using it at 10x the rate people use the rest of the product. So he changes plans and rebuilds the feature around the specific thing those few people were doing with it.

Wait, what? No one uses that feature, either, and even worse, the people who originally used it aren’t using it anymore, now that it’s focused on what they were actually doing!

What went wrong?

You got caught p-hacking

The data set from his failed feature is bad data. He got the most important result: This feature did not work well for his users. He wasn’t willing to let go of failed work. Just like the scientists, he went looking for some other way to reuse it. And instead of developing new hypotheses and running new experiments, he took his biased data and tried to find new correlations cheaply.

Unfortunately for him, he did.

But when he ships the new feature, he is faced with a harsh truth: those few people who were using the feature in unexpected ways don’t look like the rest of his users. A feature built just for them doesn’t help everyone else. And because he relied on data to make his decisions instead of talking to actual users, he learned too late that those unrepresentative users were doing something even weirder. His simplified feature removed exactly that weirdness in the name of building something everyone could use.

So now he’s two features in with nothing to show for it. So much for growth hacking.
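
It’s worth seeing how little it takes to manufacture a "standout" group like the one that derailed him. A toy sketch with invented numbers, not anyone’s real telemetry: usage of a mostly dead feature tends to be heavy-tailed, so a handful of unrepresentative users can dominate the event counts.

```python
# Toy sketch, invented numbers: most users never touch the feature, and the
# few who do follow a heavy-tailed distribution, so a tiny group of outliers
# dominates the raw telemetry for an otherwise dead feature.
import numpy as np

rng = np.random.default_rng(7)
n_users = 10_000

touched = rng.random(n_users) < 0.03            # ~3% of users ever use the feature
events = rng.pareto(a=1.5, size=n_users) * touched

print(f"users who touched the feature at all: {touched.mean():.1%}")

top10 = np.sort(events)[-10:]                   # the ten heaviest users
print(f"share of all feature events from the top 10 users: "
      f"{top10.sum() / events.sum():.0%}")
```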

How do I fix it?

The solution is very similar to what science has done.

Connect your data to experiments. With discipline. You must get new, clean data for each new test. I know this is anathema to modern data-oriented product management. But it’s the only real way to trust your results.

That word discipline is key. You don’t need to build some international central registry. Whatever your mission statement says, you’re not really saving the world, and you’re not actually doing science. You’re just trying to build a product people love. What you need is rigorous internal practices, and to hold each other accountable so you can’t cheat at statistics.
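
In practice that discipline can be as boring as a shared rule: every dataset gets tagged with the hypothesis it was collected to test, and anything analyzed against a different hypothesis is exploration, not evidence. A minimal sketch, with made-up names:

```python
# Sketch of the team rule, not a real tool. Every dataset is tagged with the
# hypothesis it was collected to test; conclusions on any other hypothesis
# require a new experiment and new data.
DATASET_HYPOTHESIS = {
    "feature_x_launch_2024_06": "Feature X increases weekly active editing",
    "onboarding_funnel_2024_02": "A shorter signup flow increases activation",
}

def may_conclude(dataset_id: str, hypothesis: str) -> bool:
    """Allow a conclusion only on data collected for that same hypothesis."""
    return DATASET_HYPOTHESIS.get(dataset_id) == hypothesis

print(may_conclude("feature_x_launch_2024_06",
                   "Feature X increases weekly active editing"))  # True
print(may_conclude("feature_x_launch_2024_06",
                   "Power users want a bulk-edit mode"))          # False: new test, new data
```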

Unfortunately, this requires you let go of one of Silicon Valley’s most cherished and wrong beliefs.

No, you don’t learn more from failure than success.

Experiments fail. This might be an important part of the process, but it’s not very valuable. Congratulations. Of all the possible ways you could fail, you’ve discovered one of them. Don’t let it go to your head.

Don’t work too hard to salvage that failure. You’re p-hacking, and just making it worse. Yes, obviously, you get personal lessons. You might be lucky enough to learn something that triggers your next experiment. But you have to go run that separately.

You can’t build on the detritus of failure.

So my data is now worthless?!

Of course not. I still rely on data for all kinds of problems. One of the great things about building a company today is how easily you can get information at scale.

But never let yourself forget that your data is heavily biased, especially by how it was collected. One of my favorite examples is from when YouTube dramatically improved how fast its pages loaded. The average measured load time went up! Suddenly people with much worse connectivity found the site worth using, and their slow sessions dragged the average up. The developers thought they were helping existing users, but the biggest impact was in creating new ones.
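
That’s a mix-shift effect, and it’s easy to reproduce with toy numbers (invented, not YouTube’s): every group’s experience improves, yet the average over measured sessions gets worse, because slow-connection users who used to be invisible finally show up in the data.

```python
# Toy numbers, not YouTube's data. Existing users get faster and new users appear,
# but the average over measured sessions gets worse because the measured
# population changed.
fast_users, slow_users = 1_000_000, 3_000_000

# Before: the page is too heavy for slow connections, so only fast users show up.
before_sessions = [(fast_users, 2.0)]                      # (session count, seconds)

# After: the page is lighter. Fast users get faster AND slow users can finally use it.
after_sessions = [(fast_users, 1.0), (slow_users, 7.0)]

def average_seconds(sessions):
    return sum(n * t for n, t in sessions) / sum(n for n, _ in sessions)

print(f"average load time before: {average_seconds(before_sessions):.1f}s")  # 2.0s
print(f"average load time after:  {average_seconds(after_sessions):.1f}s")   # 5.5s: worse on paper, better in reality
```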

You have to recognize your job isn’t to find some way to make the data valuable. Your job is to make high-quality decisions. Use data when you can. If you don’t have data, go get it.

But the job of the data is to inform you, not give you answers. Use it to hone your instinct, to improve your decision-making. When something doesn’t add up, go talk to the actual humans who are the source of the data. And spend some time with people who aren’t represented in it at all.

If you’re working at a software startup, you’re not doing science (even if, like me, you have a science degree). But you should still take advantage of its discipline and practices.

Don’t stop at protecting yourself from p-hacking. One founder’s success might be hard to replicate for many reasons. Gain what lessons you can. But don’t blindly trust others’ stories of their work.

Because failure on your part won’t mean the retraction of a Nature paper; it’ll mean a layoff announcement on TechCrunch.
