As we've seen, predicting your batch attenuation based on the average strain attenuation given by the yeast manufacturer is a bad idea: there's not a very strong relationship between the strain's average attenuation in the lab and results in real world batches. But, as many of you pointed out in comments I received, advanced brewers are experienced enough to know that mash temperature, the amount of specialty grains in the recipe, the amount of simple sugars in the recipe, and original gravity also have an impact on attenuation. Fair point! Advanced brewers have a different standard method than beginners. So instead of looking at how actual attenuation relates to the yeast manufacturer's average, let's see what the data says about how actual attenuation relates to the factors that more advanced brewers tend to consider. This is going to get a bit technical, but I'll leave out all the Python code I used and stick to key concepts and results.

When we investigated the relationship between the yeast manufacturer's average attenuation rating and actual attenuation, we could look at a simple correlation score and see that there was only a weak relationship. Things are a bit harder when there's a whole cluster of factors we want to study rather than just a single one. Rather than correlation, what we can do is assume that X units of, say, simple sugar lead to Y units of attenuation increase (and so on with the other factors), make some predictions, and see how accurate they are. In other words, we construct a 'linear model' and test it by looking at its R-squared score, which measures (very roughly speaking) how close our model's predictions are to reality. The closer we can get to an R-squared of 1, the more accurate our model is. If we're close to 0, our model sucks. In some cases, the score can even drop below 0 because of the way it's calculated: that happens when the model's predictions are further off than if we had just guessed the average every time.
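For the curious, here's a minimal sketch of how an R-squared score can be calculated. It's an illustration, not the code behind the results in this post: the function name `r_squared` and the five attenuation numbers are made up for the example.

```python
def r_squared(actual, predicted):
    """1 minus (the model's squared error / the squared error of always guessing the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up attenuation percentages for five batches (illustrative only).
actual = [70.0, 72.0, 75.0, 78.0, 80.0]

perfect = list(actual)                     # predictions that match reality exactly
guess_mean = [75.0] * 5                    # always predict the average (75)
way_off = [80.0, 78.0, 75.0, 72.0, 70.0]   # predictions that run the wrong way

print(r_squared(actual, perfect))     # 1.0
print(r_squared(actual, guess_mean))  # 0.0
print(r_squared(actual, way_off))     # -3.0
```

The last case shows where negative scores come from: the predictions are worse than the always-guess-the-average baseline, so the ratio in the formula is bigger than 1.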

There are lots of different types of linear model we could try. The classic is a 'linear regression'. Imagine that we recorded a bunch of data about how many cats people own and how much cat fur is on their floors. We would end up with a bunch of dots, like the graph below:

What a linear regression does, basically, is to draw a straight line through those dots that is as close to each of them as you can get, at least on the whole. The R-squared captures how close the line was able to get, with scores closer to 1 indicating a more accurate model. Looking at our data, a good linear regression (this guy has an R-squared of .986) would look like this:
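Here's a rough sketch of that line-drawing step for a single predictor, using the standard least-squares formulas. The cat and fur numbers are invented for illustration (they are not the data behind the graphs), and `fit_line` is just a name I picked for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical data: number of cats owned vs. grams of fur on the floor.
cats = [1, 2, 3, 4, 5, 6]
fur = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

slope, intercept = fit_line(cats, fur)
predicted = [slope * c + intercept for c in cats]
score = r_squared(fur, predicted)

print(f"fur = {slope:.2f} * cats + {intercept:.2f}")
print(f"R-squared: {score:.3f}")
```

With several predictors (as in the batch data) the idea is the same, but the line becomes a plane and the arithmetic is best left to a library.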

Now, having a regression that syncs closely with our data is good, but that's not the whole story. If we wanted to, we could skip the linear regression and just draw an ultra-complicated, zig-zag line directly through all the dots and our R-squared would be perfect:

The danger here is that our model adheres so closely to our available data that it won't accurately predict anything new. All the model has done is memorize what we showed it, rather than finding a deeper pattern that will also show up elsewhere. This is called 'overfitting'. How do we avoid overfitting? One method is to subdivide our data, using only part of it to draw our line and reserving the rest to test whether the model can cope with cases it hasn't already seen. So, we divide our data into a training set (which we use to draw the original line) and a test set (which we use to see if it can predict new stuff) and look at the R-squared of both. If the R-squared is super high on the training set, but low on the test set, we probably have overfitting. If the R-squared is high on both, we have a good model that can deal with new cases. And if the R-squared is low on both, we just have a bad model.

If we want to, we can even keep going by repeating this with different random subsets of our original data. In other words, we draw a line through some randomly selected points from our original graph, then see how accurate it is when we predict the points that we left out, and do that several times over. This is called 'cross validating' the model. Ideally, the R-squared for each of these random subsets will also be high.
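To make the train/test and cross-validation idea concrete, here's a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the data is randomly generated, the 45/15 split and the three folds are arbitrary choices, and the cross-validation shown is the common "k-fold" flavor (hold out one chunk, train on the rest, rotate).

```python
import random

def fit_line(xs, ys):
    """Least-squares line with one predictor: returns (slope, intercept)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def score(model, xs, ys):
    slope, intercept = model
    return r_squared(ys, [slope * x + intercept for x in xs])

def pick(values, indices):
    return [values[i] for i in indices]

# Synthetic data: a straight-line relationship plus some noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

# Shuffle, then keep 45 points for training and 15 for testing.
order = list(range(60))
rng.shuffle(order)
train_idx, test_idx = order[:45], order[45:]

model = fit_line(pick(xs, train_idx), pick(ys, train_idx))
train_r2 = score(model, pick(xs, train_idx), pick(ys, train_idx))
test_r2 = score(model, pick(xs, test_idx), pick(ys, test_idx))

# Three rounds of cross-validation: hold out each fold once, train on the others.
k = 3
folds = [order[i::k] for i in range(k)]
cv_scores = []
for held_out in folds:
    rest = [i for i in order if i not in held_out]
    fold_model = fit_line(pick(xs, rest), pick(ys, rest))
    cv_scores.append(score(fold_model, pick(xs, held_out), pick(ys, held_out)))

print(f"train: {train_r2:.3f}  test: {test_r2:.3f}")
print("cross-validation:", [round(s, 3) for s in cv_scores])
```

Because this fake data really does follow a straight line, all of the scores come out high. The interesting cases are the ones where the training score is high and the others aren't.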

OK, enough theory. What happens when we run a regression on the data I have available, using average strain attenuation, sugar percentage, specialty grain percentage, mash temp, and original gravity to predict actual attenuation? Here are the results, with three rounds of cross-validation:

R-squared for training set: 0.315

R-squared for test set: 0.275

R-squared in cross-validation: 0.201, 0.370, 0.496

Now the fact that all of those numbers are closer to 0 than to 1 should already make you suspect that our model is not very good. But just to make sure, let's compare these numbers to some benchmarks. One good benchmark is a 'dummy regression'. As the name suggests, this is a model designed to be total shit so that you can see if your real model is better or worse than total shit. I used a dummy model that makes predictions by just assuming every batch will attenuate to the average attenuation of the whole data set. You might suspect that this is a stupid idea, and you'd be right:

R-squared for training set: 0.000

R-squared for test set: 0.000

R-squared in cross-validation: 0.000, -0.0281, -0.007

So, this model explains literally none of the variation in attenuation from one batch to the next. At least our linear regression is more accurate than this. Notice that our new advanced model is also more accurate than the linear regression that just used the manufacturer's average attenuation, the R-squared of which was only 0.047.
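A mean-guessing baseline like this is easy to sketch. The batch numbers below are invented for illustration, and this is not the code behind the results above, but it shows why the training score lands on exactly 0 while scores on other data can dip slightly below it.

```python
def r_squared(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical attenuation percentages (illustrative only).
train_y = [70.0, 74.0, 76.0, 80.0]
test_y = [68.0, 73.0, 79.0]

# The "dummy" model: ignore every input and always predict the training average.
baseline = sum(train_y) / len(train_y)

train_r2 = r_squared(train_y, [baseline] * len(train_y))
test_r2 = r_squared(test_y, [baseline] * len(test_y))

print(f"baseline prediction: {baseline}")
print(f"train R-squared: {train_r2:.3f}")
print(f"test R-squared: {test_r2:.3f}")
```

On the training data the guess is the data's own average, which is the definition of an R-squared of 0. The test batches have a slightly different average, so the same guess does a little worse than their own average would, and the score goes negative.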

But maybe we've been too simple. There are a lot of very complicated machine learning algorithms out there, so maybe we could improve on our regression and get even farther from our ridiculous dummy model. To see if this is right, I also tried what's called a 'gradient boosting regression'. Though the math behind this approach is complex, the basic idea is that you start with something like our dummy regression above and slowly complicate it to make it more accurate. You make a first round of inaccurate predictions with your bad model, then make a second model that explains how far off you were. You do a second round of prediction taking that into account, then see how far off you were this time and construct a third model that explains how far off you still are. You can repeat this as long as you want, adjusting things over many generations until they get more and more accurate.
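The round-by-round idea can be sketched in a few dozen lines. This is a stripped-down toy, not the algorithm a real library would run and not the code behind the results in this post. It assumes a single predictor, synthetic data, very simple "stump" models (one yes/no split each), and a learning rate of 0.1. The names `fit_stump`, `boost`, and `predict` are made up for the example.

```python
import random

def r_squared(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def fit_stump(xs, residuals):
    """Find the single split on x that best explains the current errors."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, rounds, learning_rate=0.1):
    """Start from the mean (the dummy model), then repeatedly model what's left over."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # how far off we still are
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, stumps, learning_rate

def predict(model, x):
    base, stumps, learning_rate = model
    return base + sum(learning_rate * (l if x <= t else r) for t, l, r in stumps)

# Synthetic, noisy data (illustrative only).
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(30)]
ys = [2 * x + 1 + rng.gauss(0, 3) for x in xs]

few = boost(xs, ys, rounds=5)
many = boost(xs, ys, rounds=200)
r2_few = r_squared(ys, [predict(few, x) for x in xs])
r2_many = r_squared(ys, [predict(many, x) for x in xs])

print(f"training R-squared after 5 rounds: {r2_few:.3f}")
print(f"training R-squared after 200 rounds: {r2_many:.3f}")
```

More rounds always push the training score up, because each round is built to shrink the errors on exactly the data it was shown. Whether that helps on new data is a separate question.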

I ran a gradient boost 1000 generations long. Here are the results:

R-squared for training set: 0.999

R-squared for test set: -0.427

R-squared in cross-validation: -0.260, 0.0251, -0.175

You might be impressed with the super high score on the training set, since it means our model fits that data extremely well. But notice how low the other scores are. Looks like we ended up overfitting again. Our best model still turns out to be the linear regression version of the advanced method.

But what if the relationship between things like specialty grain and attenuation isn't a straight line, but a curvy one? For instance, what if you lose a certain amount of attenuation for every unit of specialty grain only up to a certain point, but then things flatten out? This too is testable. Instead of drawing a straight line through our dots, we use a line defined by a polynomial equation, giving us the required curve. Here's what it would look like in the cat hair example with a 5th degree polynomial:

I tried a few different polynomials on the batch data and the best I could do was this:

R-squared for training set: 0.422

R-squared for test set: -0.234

R-squared in cross-validation: 0.383, -0.140, 0.318

This is slightly better than the straight line model on the training data, but much worse than the straight line on the test data, so again we seem to have some overfitting, just like our zig-zag line from before.
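Here's a small sketch of fitting a curve instead of a straight line, using a least-squares polynomial fit on a single predictor. The specialty grain and attenuation numbers are invented to show a relationship that flattens out. The helper names (`solve`, `fit_polynomial`, `predict`) and the choice of a 2nd degree polynomial are assumptions for the example, not what was run on the batch data.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    rows = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    solution = [0.0] * n
    for i in reversed(range(n)):
        known = sum(rows[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (rows[i][n] - known) / rows[i][i]
    return solution

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients, lowest power first."""
    powers = range(degree + 1)
    matrix = [[sum(x ** (i + j) for x in xs) for j in powers] for i in powers]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in powers]
    return solve(matrix, rhs)

def predict(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))

def r_squared(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical data: attenuation drops with specialty grain, then levels off.
grain_pct = [0, 5, 10, 15, 20, 25, 30]
attenuation = [80.0, 77.0, 75.0, 74.0, 73.5, 73.2, 73.0]

line = fit_polynomial(grain_pct, attenuation, degree=1)
curve = fit_polynomial(grain_pct, attenuation, degree=2)
r2_line = r_squared(attenuation, [predict(line, g) for g in grain_pct])
r2_curve = r_squared(attenuation, [predict(curve, g) for g in grain_pct])

print(f"straight line, training R-squared: {r2_line:.3f}")
print(f"2nd degree curve, training R-squared: {r2_curve:.3f}")
```

A higher-degree curve can never score worse than the straight line on the data it was fitted to, since a straight line is just a curve with the extra terms set to zero. That is exactly why the training score alone can't tell you whether the extra bendiness is real or just overfitting.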

What does all this mean? If the more advanced homebrewer method of taking into account grist, mash temperature, and OG in addition to a yeast's average attenuation were solid, we would expect at least one of the models I tested to have a high score on the test set. But even the model with the highest test score (the plain linear regression, at 0.275) was nowhere near the 1.0 that we'd like to see. In other words, the advanced method doesn't predict things very accurately. It's better than always guessing the average or trusting the company that makes your yeast, but it's probably best not to put too much stock in it.