A Bit About Statistics

Dr. Chris Floyd in front of Oh-Be-Joyful cabin on June 1, 1995

In 1995, Gothic experienced both the greatest snowfall and latest snow melt date in billy barr’s observations. Not a statistician? What do all those numbers under the graphs mean anyway? The mean and range are both descriptive statistics used to summarize properties of a data set. In this activity, the mean (average) value is one measure of the central tendency of the data set and the range provides a measure of its dispersion or spread around the central tendency. The range is simply the maximum value in a data set minus the minimum value.
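To make the mean and range concrete, here is a minimal sketch in Python using made-up snowfall totals (the values are hypothetical, not billy barr's actual observations):

```python
# Hypothetical seasonal snowfall totals, in inches (for illustration only)
snowfall_inches = [400, 520, 310, 615, 450]

# Mean: a measure of the central tendency of the data set
mean = sum(snowfall_inches) / len(snowfall_inches)

# Range: maximum minus minimum, a measure of dispersion
data_range = max(snowfall_inches) - min(snowfall_inches)

print(mean)        # 459.0
print(data_range)  # 305
```

Excel's AVERAGE, MAX, and MIN functions compute the same quantities.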

The plots you create in Excel (and those you see on the data visualizer) are all scatter plots that graphically relate two variables. The scatter plots show how billy barr’s observations of weather or phenological events (on the y-axis) change from year to year (x-axis), or how phenological events change with meteorological events (figure 2 from Inouye et al., 2000).


If we assume that the y-axis data are somehow dependent on the x-axis data (something about the year in question influences the amount of snowfall or the timing of marmot emergence from hibernation), then the trend in the data can be used to make predictions. So how do we quantify the trend? Regression analysis fits a straight line (y = a + bx) to the messy scatter plot (where a is the y intercept of the line and b is the slope of the line). This line goes by many names (the least-squares line, the line of best fit, the prediction line, the estimated regression line, etc.) and the equation for the line is usually calculated using computer software.

Why are data plotted in scatter plots messy? Because the variable on the x-axis is probably not the only factor influencing the variable shown on the y-axis. We measure how well future outcomes are likely to be predicted by the best fit line using the coefficient of determination, which is usually just referred to as r-squared and is calculated at the same time as the best fit line. r-squared is the proportion of variability in a data set that is accounted for by the statistical model, in this case, the best fit line. Values for r-squared range between 0 and 1. If r-squared = 1, the data fall perfectly on the best fit line; if r-squared = 0, the estimated regression line does not explain the variation in the y values. The variation in the y values explained by the x values is often expressed as a percentage. For example, in figure 2 above, 85.3% of the variation in the date of first flowering of bluebells is explained by the date of snow melt.

The p value is used in statistical hypothesis testing. In this case, we want to know if there is no relationship between the variables (the null hypothesis) or if there is a statistically significant relationship between the variables we’ve chosen (the alternative hypothesis).
In this case, if the p value is less than 0.05, the null hypothesis is rejected and the regression is statistically significant – there is a correlation between the variables.
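As a sketch of what the software is doing, the code below fits the least-squares line y = a + bx and computes r-squared by hand, using hypothetical snow melt dates (x) and first-flowering dates (y), both as day of year. The numbers are invented for illustration; in practice Excel or other software does this calculation for you.

```python
def least_squares(x, y):
    """Fit y = a + bx by least squares and return (a, b, r_squared)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    b = sxy / sxx                        # slope of the best fit line
    a = y_mean - b * x_mean              # y intercept
    r_squared = sxy ** 2 / (sxx * syy)   # coefficient of determination
    return a, b, r_squared

# Hypothetical data: snow melt date vs. first flowering date (day of year)
snow_melt_day = [120, 135, 128, 150, 142, 125]
first_flower_day = [140, 158, 150, 175, 165, 146]

a, b, r2 = least_squares(snow_melt_day, first_flower_day)
print(f"y = {a:.1f} + {b:.2f}x, r-squared = {r2:.3f}")
# prints: y = 2.2 + 1.15x, r-squared = 0.999
```

Statistical software also reports the p value alongside the slope, intercept, and r-squared; its calculation (based on the t distribution) is omitted here for brevity.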

In open systems, alternative hypotheses could be numerous and varied. For example, the date of first sighting of a marmot could be related to the snow melt date, or something completely different, or a combination of two or more factors. Remember, correlation does not equal causation! Disclaimer: this discussion is not intended to be a complete treatment of the statistics used in Inouye et al., 2000 or a complete description of the statistical methods described above.