Data-science versus system-engineering for nutrition


Let me start this blog post with a big warning. On a scale from zero to one hundred, zero being Vulcan-style emotional detachment and one hundred being a full-on rant, this post is a fat ninety-five.

I got interested in nutrition 6 years ago, after the disastrous effects that dietary advice and medication, given to me by medical and nutritional professionals, had on my personal health. Since then I’ve been trying to apply my knowledge of and experience with data engineering in a forensics setting to my growing knowledge about nutrition, and some aspects of nutritional science have led to a growing discontent with the whole field. To start off with I thought: well, I’m new to the field, so maybe I don’t get why things are done like this. But as my knowledge of nutrition, biochemistry and epidemiology has grown over the years, so has the seed that started off as the simple thought that there is something fundamentally wrong with the way nutritional science applies the tools handed to it by the field of data science. There are some great studies out there by people who clearly have a good grasp of data-sci. But many of the studies that come out, and many of the pop-science books about nutritional science, including those that are widely lauded as game changing, seem to use a definition of solid proof that, as far as I can judge, matches what the rest of science would agree constitutes only a possibly interesting link that may warrant further research. I will not be getting too technical, so as not to alienate the non-technical readers, but I still need to touch on a few technical and data-science concepts.

Obsession with linearity

For some strange unexplained reason, nutritional science seems to have an obsession with linear associations. When I knew little about the subject and read nutritional science papers, after a while I started to assume that something fundamental about biochemistry made linear associations a fundamental truth of the field. After getting deeper into the subject, however, there appears to be absolutely nothing to justify the obsession of large groups of nutritional scientists with the idea that associations somehow need to be linear. So what is a linear relationship? Let’s say there is a suspected association between the intake of, for example, a specific amino acid and mortality from a specific ailment. A study could look at the intake of this amino acid in a number of regions and then look at mortality numbers and recorded cause of death in these regions. If you plot the per-region mortality from our ailment against the average intake of our amino acid in the given region, you get a scatter plot representing the relationship between the two variables within our data set. Well, it’s a bit more involved than that, as we will see later, but that is the basic idea. Now if, as nutritional scientists seem to always do, we define a straight line by:

Y=aX + b

In such a way as to minimize the average of a simple function of the error between the line and the points on our scatter plot (typically the squared error), then this line, together with those errors, constitutes the hypothesis of a linear relationship between X and Y. You can also fit functions that are not linear, and you can, as other fields of science tend to do, fit Y against multiple variables at once using a random subset of the data and validate that fit using the remaining data points. For some reason, though, these common techniques seem quite rare in both nutritional science and epidemiological science papers. Why is this? I really couldn’t tell. All I can guess is that there is some strange schism between these fields and the rest of science.
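To make the fit-and-validate idea concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made up for illustration: the synthetic "regions", the bathtub-shaped relationship, the noise level, the 70/30 split and the helper names `fit_line` and `mean_squared_error`. It fits a straight line and a simple U-shaped alternative on a random subset of the data and compares both on the remaining points.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mean_squared_error(xs, ys, a, b):
    """Average squared error of the line y = a*x + b on the given points."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(42)

# Made-up data: 200 "regions" with a bathtub-shaped intake/mortality relation.
intake = [rng.uniform(0.0, 10.0) for _ in range(200)]
mortality = [(x - 5.0) ** 2 + rng.gauss(0.0, 2.0) for x in intake]

# Random split: fit on 70% of the regions, validate on the remaining 30%.
order = list(range(len(intake)))
rng.shuffle(order)
train, holdout = order[:140], order[140:]
x_train = [intake[i] for i in train]
y_train = [mortality[i] for i in train]
x_hold = [intake[i] for i in holdout]
y_hold = [mortality[i] for i in holdout]

# Model 1: a straight line in the raw intake.
a_lin, b_lin = fit_line(x_train, y_train)
mse_linear = mean_squared_error(x_hold, y_hold, a_lin, b_lin)

# Model 2: a U-shape, i.e. a straight line in the squared distance
# from the mean intake of the training regions.
centre = sum(x_train) / len(x_train)
u_train = [(x - centre) ** 2 for x in x_train]
u_hold = [(x - centre) ** 2 for x in x_hold]
a_u, b_u = fit_line(u_train, y_train)
mse_u_shape = mean_squared_error(u_hold, y_hold, a_u, b_u)

print(f"hold-out MSE, straight line: {mse_linear:.1f}")
print(f"hold-out MSE, U-shape:       {mse_u_shape:.1f}")
```

On data like this the straight line comes out nearly flat and predicts the held-out regions far worse than the U-shape, which is exactly the kind of thing a hold-out check is meant to expose.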


It isn’t just observational data where many scientists assume linearity; it is randomized controlled trials as well. This “gold standard” of research in nutritional science often leads to trial design that implicitly assumes linearity. A trial that compares just two intake levels, for instance, can only ever draw a straight line between two points.






There are two problems with nutritional studies that aggravate each other. The first is the assumption that everything is linear. The second is a combination of truly low standards and a widespread misunderstanding of the concept of p-values. The p-value issues seem to allow scatter plots that, on visual inspection, are clearly not a good fit for ‘linear’ regression not only to be falsely assumed linear, but even to pass the arbitrary p-value threshold used to signify significance of that linear association. So what are p-values anyway?

“The p-value is defined as the probability, under the assumption of the null hypothesis H, of obtaining a result equal to or more extreme than what was actually observed.”

While this definition is correct, it is also the source of a whole lot of confusion that I won’t go into, for the sake of not getting too technical. The most important thing we need to focus on is the true p-value versus the p-value calculated from a sample. I am not a data scientist in the strict sense of the word. I’m not doing raw math most of the time. Instead, most of my data work revolves around writing and running simulations, usually many times over, from a data-engineering rather than a data-science perspective. One of the things anyone who has run such simulations will know is that sample p-values can behave rather erratically across runs. It thus came as no surprise when Nassim N. Taleb came up with his meta-distribution for p-values. I know a whole lot of people will probably stop reading here. Sure, Taleb has a bit of an inflated ego and lacks the social sense to not refer to people below his intellectual level as imbeciles, but his work on the p-value meta-distribution brings to data science what data engineers have known through experience from simulations. The arbitrary threshold for the p-value as used by nutritional scientists is way too high if we account for the properties of the p-value meta-distribution. I’m probably already too technical here for most readers. The most important thing to take from this is that a sample p-value marginally below the nut-science threshold has a very sizeable probability of belonging to a true p-value well above the threshold below which significance is declared. This comes on top of the general over-reliance on p-values, which isn’t just a nut-science issue, and the widespread malpractice of p-hacking, which appears to be unusually commonplace in nut-science.
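The erratic behaviour of sample p-values across runs is easy to reproduce. The sketch below is not Taleb's derivation, just a brute-force simulation under assumptions chosen for illustration: a real but modest effect (`TRUE_EFFECT`), a known standard deviation, 30 observations per study, and a plain two-sided z-test. It repeats the same "study" 2000 times and looks at how the resulting p-values are spread.

```python
import math
import random


def two_sided_p(sample, sigma):
    """Two-sided z-test p-value for H0: mean = 0, with known sigma."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n) / sigma
    return math.erfc(abs(z) / math.sqrt(2.0))


rng = random.Random(1)

# Hypothetical setup: identical study design repeated RUNS times.
RUNS, N, TRUE_EFFECT, SIGMA = 2000, 30, 0.3, 1.0

p_values = sorted(
    two_sided_p([rng.gauss(TRUE_EFFECT, SIGMA) for _ in range(N)], SIGMA)
    for _ in range(RUNS)
)

# Spread of the p-value across identical studies of the same true effect.
p_05 = p_values[int(0.05 * RUNS)]
p_50 = p_values[int(0.50 * RUNS)]
p_95 = p_values[int(0.95 * RUNS)]
share_significant = sum(p < 0.05 for p in p_values) / RUNS

print(f"5th / 50th / 95th percentile of p: {p_05:.4f} / {p_50:.4f} / {p_95:.4f}")
print(f"share of runs with p < 0.05: {share_significant:.2f}")
```

The same underlying effect yields p-values ranging from far below 0.05 to far above it, so a single study landing marginally under the threshold says much less than it appears to.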

Missing variables


When looking into biochemistry and when studying actual data sets, there is one big chemistry variable that is omitted in most nutritional science data sets: heat exposure. You don’t need more than a few months’ worth of chemistry lessons to know that exposure to heat can make a whole lot of difference. Studies looking at things like macronutrient consumption, though, hardly ever seem to record heat exposure, or even to differentiate between something as crucial as consumption of unheated PUFA versus consumption of PUFA in cooking oil used for baking. Another often omitted variable is lean body mass. While loss of body fat, especially of the visceral type, is beneficial to most people in western countries, loss of lean body mass is really bad news. Too many studies that record weight change, however, fail to also record body composition.

Sub-population variance and skew

For epi studies there is quite a different type of missing-variable problem. If we accept that an association is often nonlinear, and knowing that the average for a sub-population is just that, a sample average, we really need to pull some more advanced techniques out of the hat than just working with sample averages. The point is, we can’t just assume that all sub-populations differ only in sample mean while sharing the exact same variance. Nor can we assume the sub-population distributions to be free of skew. Again, these are somewhat technical terms, so let’s look at them a bit. If a true association between a nutrient or food and mortality from a specific ailment is shaped like a bathtub, then many of the deaths in a population might arise from one of the extremes of the sub-population distributions. The share of people contained in these extremes depends on distribution properties like variance, skew and possibly kurtosis, which together define the shape of the distribution. If these moments differ between sub-populations, then any conclusion drawn from looking at the population through just the sample mean is bound to be misleading. When recorded, differing variance across sub-populations, or sub-distributions with significant skew or kurtosis, should be reason to call in the big guns.
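A small simulation shows how much the sample mean can hide. The numbers below are all made up for illustration: three hypothetical sub-populations with the same average intake of 5 units (one narrow, one wide, one right-skewed), and a hypothetical bathtub-shaped `risk` function with a safe band between 2 and 8.

```python
import random


def risk(intake):
    """Hypothetical bathtub: low risk inside a safe band, high risk outside."""
    return 0.01 if 2.0 <= intake <= 8.0 else 0.20


rng = random.Random(7)
SIZE = 20000

# Three made-up sub-populations, all with a true mean intake of 5.
populations = {
    "narrow": [rng.gauss(5.0, 1.0) for _ in range(SIZE)],
    "wide": [rng.gauss(5.0, 2.5) for _ in range(SIZE)],
    "skewed": [rng.gammavariate(4.0, 1.25) for _ in range(SIZE)],
}

summary = {}
for name, intakes in populations.items():
    mean_intake = sum(intakes) / SIZE
    mean_risk = sum(risk(x) for x in intakes) / SIZE
    summary[name] = (mean_intake, mean_risk)
    print(f"{name:>6}: mean intake {mean_intake:.2f}, mean risk {mean_risk:.3f}")
```

All three groups would land on practically the same X value in a scatter plot of sub-population averages, yet the wide and skewed groups carry several times the risk of the narrow one, purely because more of their members sit in the extremes.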

Static versus differential markers.

Now let’s assume nutritional scientists found a solid association and got the shape of the association right. Let’s say higher levels of marker X are strongly associated with higher levels of mortality number Y. Does that imply that a negative ΔX for an individual will lead to a negative ΔP(Y)? Well, maybe, but there is also a possibility that it will lead to a positive ΔP(Y). The point is: a marker, even a good marker, isn’t necessarily a suitable variable to use as a controlling variable for Y. For an engineer these things are blindingly obvious. Many nutritional scientists, though, seem to assume by default that all that is needed for a good controlling variable is a strong correlation.
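One of several ways a good marker can be a useless control variable is a shared hidden cause. The toy model below assumes exactly that, and nothing in it comes from real data: a hypothetical `hidden` factor drives both the marker and the risk score, while the marker itself has no causal role in the `outcome` function.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def outcome(hidden_cause, noise):
    """Risk score driven by the hidden cause only; the marker plays no part."""
    return hidden_cause + noise


rng = random.Random(3)
PEOPLE = 5000

hidden = [rng.gauss(0.0, 1.0) for _ in range(PEOPLE)]
marker = [h + rng.gauss(0.0, 0.3) for h in hidden]
noise = [rng.gauss(0.0, 0.3) for _ in range(PEOPLE)]

risk_before = [outcome(h, e) for h, e in zip(hidden, noise)]
correlation = pearson(marker, risk_before)

# Intervention: push everybody's marker down by one unit (say, with a drug
# that acts on the marker alone) without touching the hidden cause.
marker_after = [m - 1.0 for m in marker]
risk_after = [outcome(h, e) for h, e in zip(hidden, noise)]

mean_risk_before = sum(risk_before) / PEOPLE
mean_risk_after = sum(risk_after) / PEOPLE
print(f"marker/risk correlation: {correlation:.2f}")
print(f"mean risk before: {mean_risk_before:.3f}, after: {mean_risk_after:.3f}")
```

The marker correlates strongly with the risk score, yet lowering the marker leaves the risk exactly where it was. Whether a real marker behaves like this can only be settled by looking at what changing it does, not by the strength of the static correlation.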

Surrogate endpoints and other die-hard misleading variables

In nutritional science there is one step beyond the wrongful use of markers as control variables: surrogate endpoints. Nutritional science has an abundance of questionable surrogate endpoints that are stand-ins for real endpoints (mortality) for different diseases: things like LDL-C, BMI and high blood pressure. I’ve talked about LDL-C and BMI in earlier blog posts and won’t rehash them here. The big problem with surrogate endpoints is that they persist and contribute to poor follow-up studies. For example, using BMI as a surrogate, one might conclude that an increase in lean body mass (LBM) would increase the risk of CVD death. This while all the data shows the actual reverse to be a very viable hypothesis.
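The BMI example is simple arithmetic, since BMI is just weight in kilograms divided by height in metres squared. The person below is hypothetical: 1.80 m tall, 80 kg, gaining 5 kg of lean mass with body fat unchanged.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kg divided by height in metres squared."""
    return weight_kg / height_m ** 2


HEIGHT_M = 1.80          # hypothetical person
WEIGHT_BEFORE_KG = 80.0
LEAN_MASS_GAIN_KG = 5.0  # fat mass assumed unchanged

bmi_before = bmi(WEIGHT_BEFORE_KG, HEIGHT_M)
bmi_after = bmi(WEIGHT_BEFORE_KG + LEAN_MASS_GAIN_KG, HEIGHT_M)

print(f"BMI before: {bmi_before:.1f}, after: {bmi_after:.1f}")
```

The BMI moves from just under 25 to just over 26, crossing the commonly used overweight cut-off of 25, even though the only thing that changed was an increase in lean body mass.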

Adding it all up


If we look at all the issues discussed above, we could ask whether nutritional science is fixable. Looking at the actual data sets from many studies through a pair of data-engineering spectacles, we must come to the conclusion that, from a data-sci perspective, most data is very much inconclusive with respect to the claims being made. If we fix nut-science, the field should become quite boring, with very few claims being made by either observational studies or RCTs. It seems the influence of individual components of nutritional profiles on mortality numbers is either very small or non-existent. Possibly, when missing variables such as heat exposure are added, there might be interesting claims that could stand up to modern data-sci scrutiny, and that is something that should definitely be looked at. But who is going to fund studies when nine out of ten studies end up inconclusive? So what other options are there? While scientists tend to look down on N=1 stuff, engineering thrives on it. Using engineering practices, especially multivariate control-feedback theory, an individual might find his/her personal ideal nutrition and exercise program all by him/herself. Do we still need nut-science then? Yes we do, but basically only at the extremes. An N=1 control-feedback loop obviously can’t factor mortality numbers into the feedback loop, yet individual progress might lead to extremes (100% meat diet, 100% bananas diet) that in turn might result in sudden death of superficially healthy individuals. If we accept that nutritional science hasn’t got that much to say about the mid-range of most foods and nutrients that is both useful and conclusive, and if we accept the non-linear nature of many things, including nutrition/mortality associations, the field could focus on the extremes as an important way of supplementing what should be a primarily N=1 engineering effort aimed at individuals.
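To give a feel for what an N=1 control-feedback loop looks like, here is a deliberately tiny sketch. It is a single-variable proportional controller, not the multivariate kind mentioned above, and everything in it is hypothetical: the `measured_marker` response function, the target value, the gain and the weekly update. The controller never sees the response function; it only reacts to the gap between the measurement and the target.

```python
def measured_marker(intake):
    """Hypothetical personal response; unknown to the controller."""
    return 70.0 + 0.2 * intake


TARGET = 80.0   # desired marker value (made up)
GAIN = 2.0      # how strongly to react to the measured error (made up)

intake = 100.0  # starting intake in arbitrary units
history = []
for week in range(20):
    marker = measured_marker(intake)
    history.append((week, intake, marker))
    # Proportional feedback: adjust the intake in proportion to the error.
    intake += GAIN * (TARGET - marker)

final_marker = measured_marker(intake)
print(f"intake after 20 weeks: {intake:.2f}, marker: {final_marker:.2f}")
```

With this response and gain the error shrinks by a fixed factor every week and the loop settles on the intake that hits the target. A real loop would have to cope with noisy measurements, delays and several interacting variables, and, as noted above, it has no way of seeing mortality risk at the extremes.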

So that rounds up what I already warned was a bit of a rant. I hope I’m being way too pessimistic with all this, but the more nut-sci papers I read and the more data sets I actually end up being allowed to look at, the more pessimistic I’m growing about this field of science that, as it seems, can only exist and can only make funding-worthy claims as long as it keeps ignoring what data science has to offer. So my conviction that we truly need to develop the field of nutritional engineering, to fill in the gaps that nutritional science is failing to fill, sadly just keeps growing with (almost) every paper I read.

This entry was posted on 1st August 2016.