# R-squared Shrinkage and Power and Sample Size Guidelines for Regression Analysis

To avoid the prospect of drunks sucking on gas pumps, fuel ethanol is “denatured” with chemical additives (if you drink it, you’ll end up dead or, at best, in the hospital). It can be distilled from a variety of plants, including sugar cane and switchgrass. Most vehicles can’t run on pure ethanol, but E85, a mix of eighty-five percent ethanol and fifteen percent gasoline, requires only slight engine modifications. It’s possible that you’re including different forms of the same variable for both the response variable and a predictor variable.

Whether viewed as an opportunity to be seized or a problem to be solved, the energy sector is squarely focused on achieving measurable ESG results. And most of those results will come from reductions in emissions, particularly carbon dioxide. In 2020, the KPMG Survey of Sustainability Reporting found that there was a sustainability reporting rate of 96% for G250 companies — the world’s largest 250 companies. Within the oil and gas sector, the rate was 100% for G250 companies. This growing awareness has resulted in tremendous pressure on the energy sector to adapt to a new reality. Clean energy, decarbonization, and distributed power have become drivers of investment in the energy industry.

This approach directly assesses the model’s precision, which is far better than choosing an arbitrary R-squared value as a cut-off point. If your main goal is to determine which predictors are statistically significant and how changes in the predictors relate to changes in the response variable, R-squared is almost totally irrelevant. The fitted line plot shows that these data follow a nice tight function and the R-squared is 98.5%, which sounds great. However, look closer to see how the regression line systematically over and under-predicts the data (bias) at different points along the curve.
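To see how a high R-squared can coexist with a biased fit, here is a minimal sketch. It uses made-up curved data (`y = x ** 2`), not the data behind the fitted line plot described above, and fits a straight line with ordinary least squares written out by hand.

```python
# Illustrative data only: a curved relationship, not the data from the plot.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Ordinary least squares for a straight line y = intercept + slope * x.
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# R-squared = 1 - (unexplained variation / total variation).
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - y_mean) ** 2 for y in ys)
r2 = 1 - ss_res / ss_tot

print(f"R-squared: {r2:.3f}")
for x, r in zip(xs, residuals):
    print(f"x={x:2d}  residual={r:+.1f}")
```

In this toy case R-squared comes out at roughly 0.95, yet the residuals are positive at both ends of the range and negative in the middle. That pattern is the systematic over- and under-prediction that a single R-squared number hides, which is why checking the residuals matters.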

These intervals account for the margin of error around the mean prediction. In general, the higher the R-squared, the better the model fits your data. However, there are important conditions for this guideline that I’ll talk about both in this post and my next post.

## Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?

For instance, let’s assume that an investor wants to purchase an investment fund that is strongly correlated with the S&P 500. The investor would look for a fund that has an R-squared value close to 1. To understand what R-squared tells us, you must understand the word variability. When I say variability, you should think of the word “differs.” Now, I’m going to explain to you what R-squared means.
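As a rough illustration of "variability explained," the sketch below computes R-squared as one minus the ratio of unexplained variation to total variation. The function name `r_squared` and the return figures are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def r_squared(actual, predicted):
    """Share of the variability in `actual` that the predictions account for."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    # Total variation: how much the actual values differ from their own mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    # Unexplained variation: how much the actual values differ from the predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    if ss_tot == 0:
        raise ValueError("actual values do not vary, so R-squared is undefined")
    return 1 - ss_res / ss_tot


# Hypothetical monthly fund returns (%) and the returns predicted from the index.
fund_returns = [1.0, 2.0, 3.0, 4.0]
predicted_from_index = [1.1, 1.9, 3.2, 3.8]

fund_r2 = r_squared(fund_returns, predicted_from_index)
print(f"R-squared: {fund_r2:.2f}")
```

With these made-up numbers the total variation is 5.0 and the unexplained variation is 0.10, so R-squared is about 0.98. An investor looking for a fund that closely follows the index would want a value like this, close to 1.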

- Ultimately, R-squared is only one measure of accuracy – other metrics such as Mean Absolute Error or Root Mean Square Error may be more appropriate for certain contexts.
- It represents the variability that is not explained by the independent variables.
- You can get a sense of this by looking at it, but the best way to know how well the model explains the relationship is with the r-squared number.
- Used together, R-squared and beta can give investors a thorough picture of the performance of asset managers.
- One data point that could be worth plugging into a regression is the start of a new bull market and what correlates with it.
- In response to this growing trend, most companies have developed policies on Environmental, Social, and Corporate Governance (ESG).

And a study by the International Institute for Sustainable Development found that ethanol subsidies amount to as much as $1.38 per gallon — about half of ethanol’s wholesale market price. Now we can set up a monitor for the model, perform root cause analysis, and also find the slice causing a dip in performance. Ultimately, the best way to use and understand R-squared is to experiment with different models and compare the results. With practice and experience, you will soon become familiar with this powerful metric and be able to leverage it for robust machine learning solutions.

## What does this p-value mean relative to our dataset?

Let’s use the example below to understand how the p-value applies to energy use analysis. But one sector that did take a step back in 2021 was the solar sector. Even though the solar sector continues to grow, and is up by triple digits over the past five years, solar companies sold off in 2021.

## The Ethanol Scam

Businesses that fail to consider such metrics can experience a significant financial impact. MSCI Inc., a global provider of financial and portfolio analysis tools, conducted a four-year study on this issue. The study found that companies with high ESG scores experienced lower costs of capital, lower equity costs, and lower debt costs compared to companies with poor ESG scores. But in 2021, the fund lost 27% as rising input costs hit renewable energy companies across the board. However, I believe these input prices will start to abate in 2022, and the solar sector will get back on track.

## R Squared: Understanding the Coefficient of Determination

With a sample size of 40 observations for a simple regression model, the margin of error for a 90% confidence interval is +/- 20%. For multiple regression models, the sample size guidelines increase as you add terms to the model. To calculate the root mean squared error (RMSE), you begin by squaring the difference between each predicted value and the corresponding actual value. This difference (the residual) represents the variation in the dependent variable that the model leaves unexplained. Adding all the squared residuals, dividing by the number of observations, and taking the square root of the result gives the RMSE. This indicates the absolute fit of the model and shows how close the predicted values are to the actual data points.
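The steps described above (square the residuals, average them, take the square root) can be sketched in a few lines. The function name `rmse` and the sample values are illustrative assumptions, not taken from any particular dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: sqrt of the mean of the squared residuals."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Step 1: square each residual (actual minus predicted).
    squared_residuals = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    # Step 2: average the squared residuals over the number of observations.
    mean_squared_error = sum(squared_residuals) / len(squared_residuals)
    # Step 3: take the square root to return to the units of the response.
    return math.sqrt(mean_squared_error)


# Hypothetical observed values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]

rmse_value = rmse(actual, predicted)
print(f"RMSE: {rmse_value:.3f}")
```

Here the squared residuals are 1, 0, 1 and 0, so the mean is 0.5 and the RMSE is about 0.707. Because it is expressed in the same units as the response variable, it can be read directly as a typical prediction error.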

Adjusted R-squared helps to determine whether adding more variables improves the model’s accuracy and whether the increase in explanatory power justifies the additional variables. R-squared is not ideal for certain machine learning models, such as those involving non-linear regression or time series prediction. Another metric, the root mean squared error (RMSE), might be used as an alternative in some cases. RMSE is a measure of model accuracy that takes into account the size of the errors in the predictions made by a model. It is the square root of the average squared difference between predicted and actual values, so large errors weigh more heavily than small ones, and it can be helpful for comparing machine learning models.
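The question of whether extra variables justify themselves is commonly handled with adjusted R-squared, which shrinks R-squared according to the number of predictors. Assuming that is the metric the paragraph above has in mind, here is a minimal sketch using the standard formula. The function name and the input values are hypothetical.

```python
def adjusted_r_squared(r2, n_observations, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    degrees_of_freedom = n_observations - n_predictors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n_observations - 1) / degrees_of_freedom


# Hypothetical comparison: the same 40 observations, before and after adding
# seven extra predictors that raise plain R-squared only slightly.
small_model = adjusted_r_squared(r2=0.90, n_observations=40, n_predictors=3)
large_model = adjusted_r_squared(r2=0.905, n_observations=40, n_predictors=10)

print(f"3 predictors:  adjusted R-squared = {small_model:.4f}")
print(f"10 predictors: adjusted R-squared = {large_model:.4f}")
```

In this made-up comparison the larger model has the higher plain R-squared (0.905 versus 0.90) but the lower adjusted value, which suggests the added terms are not earning their place.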

## The Link Between Oil Reserves and Oil Prices

Khosla is even higher on the prospects for cellulosic ethanol, a biofuel that can be made from almost any plant matter, including wood waste and perennial grasses like miscanthus and switchgrass. Among other virtues, cellulosic ethanol would not cut into the global food supply (nobody eats miscanthus or switchgrass), and it could significantly cut global-warming pollution. Even more important, it could provide a gateway to a much larger biotech revolution, including synthetic microbes that could one day be engineered to gobble up carbon dioxide or other pollutants. Because the whole point of corn ethanol is not to solve America’s energy crisis, but to generate one of the great political boondoggles of our time. Corn is already the most subsidized crop in America, raking in a total of $51 billion in federal handouts between 1995 and 2005 — twice as much as wheat subsidies and four times as much as soybeans. Ethanol itself is propped up by hefty subsidies, including a fifty-one-cent-per-gallon tax allowance for refiners.

It seems to be a reasonable bet to me that 2022 prices will average higher than they did last year, but I don’t expect us to see a huge price increase as we did a year ago. Thus, I am going to predict that the average annual price ends up being in the $70-$75/bbl range, which is where it is presently. I also think we will see significantly less volatility in oil prices than we did in 2021. Because oil is still the world’s most important commodity, I generally lead off with a prediction on the direction of oil prices.