Predicting Solar Energy Production with Machine Learning


Over the past 10 years, installation costs for solar energy technology have dropped an astonishing 60%¹. This form of renewable energy is more accessible now than ever before. Yet over that same period, soft costs, such as sales and marketing, have remained almost completely stagnant. According to the Solar Energy Industries Association (SEIA), in Q4 2016, soft costs accounted for 67% of installation costs for residential solar². This has moved the impetus for growth in the solar industry from the development of cheaper technologies to a focus on ways to attack soft costs by more efficiently spreading information to potential solar customers.

Executive Summary

My goal with this project is to build a tool that makes information regarding one’s potential for switching to solar available to a wider audience, bringing down the cost of sales and marketing. I accomplished this goal in three stages:

  1. Building a machine learning model that predicts the annual energy production of a prospective solar installation.
  2. Building a model that predicts installation cost.
  3. Implementing these models in a user-friendly web app that shows users how much they should expect to save on their energy bill each year by switching to solar.
A preview of the solar output calculator web app, which predicts a user’s expected return on solar panels.

Using a random forest model, I was able to predict expected annual savings to within around +/- $15.00, a 75% reduction in median error compared with the baseline predictions (roughly +/- $60.00) generated by the National Renewable Energy Laboratory (NREL).

The Data

The NREL is the U.S. government’s primary laboratory for research and development of renewable energy sources.

Data for this project came from two sources, both managed by the NREL. The first is The OpenPV Project, which contains data related to over one million solar panel installations across the U.S. This dataset includes the following:

  • Annual energy production
  • Installation cost
  • Size
  • Orientation
  • Tilt
  • Installer
  • Technology type
  • etc.

The second dataset comes from the National Solar Radiation Database (NSRDB) API. This dataset includes hourly measures of:

  • Radiation
  • Temperature
  • Wind speed
  • Position of the sun

Gathering the Data

I used an AWS EC2 instance to gather solar radiation data for fifteen thousand ZIP codes over two weeks.

The NSRDB API only allows one thousand queries per day, so to gather local radiation data for the roughly fifteen thousand ZIP codes in the OpenPV dataset, I wrote a Python script, set it to run every twenty-four hours, and deployed it on a remote Amazon Web Services EC2 instance. The script pulled hourly data for all of 2015 (the most recent data available), averaged it over the year, and saved it to a local MySQL database. Once I had radiation data for every ZIP code, I merged it with the roughly one million OpenPV installation records by ZIP code, the most granular location measure available.
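The daily job described above can be sketched roughly as follows. This is a minimal illustration, not the original script: the field names and the `fetch_hourly` and `save` callables are hypothetical placeholders for the NSRDB request and the MySQL write, which are not shown in this article.

```python
from statistics import mean

DAILY_QUERY_LIMIT = 1000  # daily quota stated in the article


def next_batch(all_zips, done_zips, limit=DAILY_QUERY_LIMIT):
    """Return up to `limit` ZIP codes that have not been fetched yet."""
    done = set(done_zips)
    pending = [z for z in all_zips if z not in done]
    return pending[:limit]


def annual_means(hourly_rows, fields=("dni", "dhi", "temperature", "wind_speed")):
    """Average each (hypothetically named) field over one ZIP code's hourly rows."""
    return {f: mean(row[f] for row in hourly_rows) for f in fields}


def run_daily_job(all_zips, done_zips, fetch_hourly, save):
    """One daily run: fetch a quota-sized batch, average it, and store it.

    `fetch_hourly(zip_code)` and `save(zip_code, averages)` are stand-ins
    supplied by the caller; they are assumptions, not real API calls.
    """
    for zip_code in next_batch(all_zips, done_zips):
        save(zip_code, annual_means(fetch_hourly(zip_code)))
```

At one thousand ZIP codes per run, roughly fifteen thousand ZIP codes take about fifteen daily runs, which matches the two-week collection period.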


Based on exploratory analysis, I chose the following variables to build a model for annual energy production.

Direct Normal Irradiance (DNI) is the amount of radiation that travels directly from the sun to the earth, whereas Diffuse Horizontal Irradiance (DHI) is the amount of radiation that reflects off particles in the air before hitting the surface of the earth. Diffuse irradiance is therefore higher in places with more cloud cover or more dust in the air to block the travel of sunlight to the earth.

An example of tilted solar panels.

In the northern hemisphere, the further north an installation is located, the more it ought to be tilted. The optimal tilt difference is simply the difference between the tilt of a panel and its latitude. Technology type refers to what type of silicon is used in the panel. Some panels move to track the sun, and tracking type refers to whether a panel is fixed or has some kind of tracking.

Exploring The Data

Figure 1. In this graph of the western U.S., size represents number of installations per ZIP code, and color represents amount of direct radiation. Dark red indicates higher radiation, and larger circles indicate more solar panel installations.

Though most solar installations are in places where direct irradiance (DNI) is high, it appears the size of an installation actually plays the biggest role in determining a panel’s energy output.

Figure 2. A scatter plot of the size of solar panels and their annual energy production.

Size is highly correlated with annual energy production. When we look at a scatter plot of size and annual production, the linear relationship between the two is clear. The larger an installation, the higher its capacity for transforming solar radiation into electricity, and thus the observed relationship.

Partial Correlations

Figure 3. The partial correlations between radiation factors and installation factors. The red box highlights partial correlations with annual energy production.

The above graph shows the partial correlations between installation factors and radiation factors. The red box highlights the partial correlations between these factors and annual energy production. Partial correlations are useful because they show the correlation between two variables when the effects of other variables are held constant. Dark red indicates high positive correlation, dark blue indicates high negative correlation, and white indicates no correlation.
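To make the idea of "holding other variables constant" concrete, here is a small sketch that computes partial correlations from the inverse of the correlation matrix. The data are synthetic and the variable names are illustrative assumptions; this is not necessarily how the figure above was produced.

```python
import numpy as np


def partial_correlations(data):
    """Partial correlation matrix for the columns of `data` (rows are samples).

    Entry (i, j) is the correlation between columns i and j after
    controlling for all of the remaining columns.
    """
    corr = np.corrcoef(data, rowvar=False)
    precision = np.linalg.inv(corr)
    diag = np.diag(precision)
    partial = -precision / np.sqrt(np.outer(diag, diag))
    np.fill_diagonal(partial, 1.0)
    return partial


# Synthetic example: `proxy` is related to `output` only through `size`.
rng = np.random.default_rng(0)
size = rng.normal(size=5000)
output = 2.0 * size + rng.normal(scale=0.5, size=5000)
proxy = size + rng.normal(scale=0.5, size=5000)
data = np.column_stack([size, output, proxy])

plain = np.corrcoef(data, rowvar=False)  # ordinary correlations
partial = partial_correlations(data)     # correlations with the rest held constant
```

In this toy data the ordinary correlation between `output` and `proxy` is strong, while their partial correlation is close to zero once `size` is controlled for.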

Surprisingly, there appear to be only very slight correlations between the radiation measures (DNI and DHI) and energy production. We might expect, given that we're talking about solar energy after all, that these factors would play a large role in determining how much energy a solar panel generates. In the U.S., this may not be the case because solar radiation is high enough that panels reach their capacity for energy production and are unable to produce more energy even with more sun exposure. If we were to compare the performance of solar panels in the U.S. to solar panels in the Arctic, we might see that radiation is more highly correlated with energy production than we see here. An alternative explanation is that the strength of the correlation between size and output is biasing the results of the partial correlations.

As in the scatter plot, there is again a very strong positive correlation between size and energy production. It is interesting to note that there is a slight negative partial correlation between temperature and energy production. Though temperature is obviously high in areas of high sun exposure, when controlling for the effect of radiation, high temperatures actually cause solar panels to produce energy less efficiently and degrade more quickly³.

Baseline Model

Figure 4. Comparing the predicted and actual measures of energy output from the OpenPV dataset.

The NREL OpenPV dataset includes two measures of annual energy production: a self-reported value, and a predicted value based on a user's inputs. Using these two values, I established a baseline R² score of .915. An R² score measures how much of the variance around the mean a prediction captures. So, if predictions were only as good as guessing the average of the predicted variable every time, the R² score would be zero, and if the predictions were perfect, the R² would be one. At an R² score of .915, the baseline predictions are very accurate. Above is a scatter plot of the predicted and reported values, showing how closely they correlate.
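As a concrete illustration of the R² definition above, here is a minimal sketch that computes it from scratch. The function name is my own; in practice a library routine such as scikit-learn's `r2_score` does the same job.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(actual, actual)           # predictions match exactly
mean_only = r_squared(actual, [2.5] * 4)      # always guess the mean
```

Perfect predictions give a score of one, and always guessing the mean gives zero, which are the two reference points described above.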

Figure 5. A distribution of the differences in predicted return from actual return shows a very strong positive skew, due to large prediction errors for utility scale solar panels.

At an average return rate from a utility company of $0.1024 per kilowatt-hour of energy generated, the median error in the baseline is equivalent to approximately +/- $60.00 in predicted savings on energy. The above chart shows the distribution of the differences between predicted and actual output for the baseline model, in dollars. The median is a more robust measure of central tendency than the mean in this situation, given the large positive skew in the data. It's also a better measure of the prediction error for small-scale, residential solar customers. Over the lifetime of a solar panel, this error accumulates, amounting to a difference of thousands of dollars between the predicted and actual return.
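The conversion from energy error to dollar error, and the reason the median is preferred here, can be sketched as follows. The kWh figures are made up for illustration (three residential-sized systems and one utility-scale outlier); only the $0.1024 rate comes from the article.

```python
from statistics import mean, median

RATE_USD_PER_KWH = 0.1024  # average utility return rate quoted in the article


def dollar_errors(predicted_kwh, actual_kwh, rate=RATE_USD_PER_KWH):
    """Absolute prediction error for each installation, converted to dollars."""
    return [abs(p - a) * rate for p, a in zip(predicted_kwh, actual_kwh)]


# Illustrative values only: the last entry mimics a utility-scale installation.
predicted = [5000, 6200, 4800, 2_000_000]
actual = [5500, 6000, 5400, 1_500_000]

errors = dollar_errors(predicted, actual)
median_error = median(errors)  # stays at residential scale
mean_error = mean(errors)      # pulled far upward by the single outlier
```

With a strongly skewed error distribution, the mean is dominated by a handful of very large installations, while the median stays representative of a typical residential system.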

Model Selection

Figure 6. The distribution of annual energy production is highly positively skewed, with some utility-scale installations producing many hundreds of times more energy per year than the median.

The final model I used for production was a random forest regression model, which has several benefits in this context. It is a non-parametric model, which means it makes no assumption about how the target variable is distributed and can predict a variable that is far from normally distributed. Because there is a wide range of solar panels in the OpenPV dataset, with some utility-scale installations producing thousands of times more energy per year than small, residential panels, the data are very positively skewed. This means that to use a model like linear regression without biased results, it would be necessary to log-transform the data or use a generalized linear model such as a Poisson regression. Random forest models also do well with categorical features, and in this case there were a few such features, including technology type and tracking type.

The performance of this model was compared against three others, including:

  1. Ordinary Least-Squares Regression
  2. Elastic Net Regression
  3. 2-Layer Feed Forward Neural Network

I built the regression models and the random forest model using scikit-learn, and the neural network using Keras with a TensorFlow backend.
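For readers who want to see the general shape of such a model, here is a minimal scikit-learn sketch on synthetic data. The feature names, the data-generating formula, and the hyperparameters are all assumptions made for illustration; they are not the OpenPV data or the settings used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical features, not the real dataset).
rng = np.random.default_rng(42)
n = 2000
size_kw = rng.lognormal(mean=1.5, sigma=0.6, size=n)  # positively skewed sizes
dni = rng.uniform(3.0, 8.0, size=n)                   # direct irradiance
tracking = rng.integers(0, 2, size=n)                 # 0 = fixed, 1 = tracking
annual_kwh = size_kw * (900 + 120 * dni + 150 * tracking) + rng.normal(0, 200, n)

X = np.column_stack([size_kw, dni, tracking])
X_train, X_val, y_train, y_val = train_test_split(
    X, annual_kwh, test_size=0.25, random_state=0
)

# Fit on the training split and score on held-out validation data.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
val_r2 = r2_score(y_val, model.predict(X_val))
```

Note that the skewed target is used as-is, with no log transform, which is one of the conveniences of tree-based models mentioned above.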


Using the random forest model, I achieved an R² value of .973 on validation data. The validation data were the same values used to generate the baseline score, so the comparison is completely like-for-like. In terms of annual savings, the median error drops from the baseline of approximately +/- $60.00 to approximately +/- $15.00 in predicted savings on energy.

Figure 7. Feature importances of the random forest model

Feature importances of a random forest model are a clear and easy way to interpret how much different variables contribute to predictions. More specifically, in Figure 7 they show the percent increase in mean squared error if a variable were excluded from the model. We see that size contributes the most to predicting energy production, with an 87% importance. Direct and diffuse irradiance also play a role. These findings indicate that while it's most important to build as large an installation as possible, building in places with high direct irradiance and low diffuse irradiance will help to produce more energy. Wind speed also plays a role, likely because high wind speeds correlate with areas of less shade, and high winds keep debris from collecting on panels. The last feature here is optimal tilt difference, which plays a comparatively small role in determining energy output. The further a solar panel is from its optimal tilt, the less energy it will generate. Other factors in the model have negligible importance.
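There is more than one way to compute feature importances, so here is a hedged sketch of two common options in scikit-learn, again on synthetic data with made-up feature names. The built-in `feature_importances_` attribute is impurity-based and sums to one, while `permutation_importance` with an MSE scorer measures how much the error grows when a feature's values are shuffled, which is closer in spirit to the "increase in mean squared error" reading. Which of these corresponds to Figure 7 is not stated here, so treat this as illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: size dominates, irradiance matters, "noise" is irrelevant.
rng = np.random.default_rng(7)
n = 1500
feature_names = ["size_kw", "dni", "noise"]  # hypothetical names
X = np.column_stack(
    [rng.uniform(1, 20, n), rng.uniform(3, 8, n), rng.normal(size=n)]
)
y = X[:, 0] * (1000 + 100 * X[:, 1]) + rng.normal(0, 300, n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances (normalized to sum to one).
impurity = dict(zip(feature_names, model.feature_importances_))

# Permutation importances: mean increase in MSE when a column is shuffled.
perm = permutation_importance(
    model, X, y, scoring="neg_mean_squared_error", n_repeats=5, random_state=0
)
permuted = dict(zip(feature_names, perm.importances_mean))
```

Both measures rank the size feature first in this toy setup, but they are on different scales and should not be compared directly.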

Comparing Model Performance

While the random forest model ultimately performs best, it has several tradeoffs. It took the longest time to home in on the optimal parameters, and the interpretability of its results is lacking compared to linear regression and elastic net. The latter two models give a clear relationship between changes to factors like size and radiation and expected annual energy production, whereas random forest only provides feature importance. The neural network model suffers from both the drawback of time and a lack of interpretability. Given the relative simplicity of predicting energy output, a neural network appears to be an unnecessarily complex model for this situation. With more training time and parameter tuning, the neural network model would likely match the performance of the random forest model, but the marginal benefits are clearly limited.

Modeling Output/Size

Because size dominates the performance of the models when used as a predictive factor, I built a regression model predicting output per unit size to gain a more granular understanding of the relationship between other factors and energy production. Though this model performs more poorly than when size is included as a feature, it helps demonstrate the effect of radiation and installation factors on output. The model scores an R² of .683 on validation data.

Figure 8. The coefficients on an elastic net regression model predicting output per unit size. These values show the positive or negative effect a change to these factors has on the predicted variable.

I again used Scikit-Learn to build the elastic net regression for this model. Elastic net uses two types of regularization, lasso and ridge, which help to prevent overfitting due to outliers and inconsequential features. Put simply, overfitting is when a model learns the intricacies of its training data so specifically that it does not generalize well to other data.
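To show how the lasso and ridge penalties combine, here is a small sketch of the elastic net objective in the form documented for scikit-learn's `ElasticNet` (ignoring the intercept). The function name and the toy inputs are my own; the `alpha` and `l1_ratio` values are arbitrary and are not the ones tuned for this project.

```python
import numpy as np


def elastic_net_objective(X, y, w, alpha=1.0, l1_ratio=0.5):
    """Squared-error loss plus a weighted mix of L1 (lasso) and L2 (ridge) penalties."""
    n_samples = X.shape[0]
    residual = y - X @ w
    loss = residual @ residual / (2 * n_samples)
    l1_penalty = alpha * l1_ratio * np.abs(w).sum()
    l2_penalty = 0.5 * alpha * (1 - l1_ratio) * (w @ w)
    return loss + l1_penalty + l2_penalty


# Toy inputs: the weights fit the data exactly, so only the penalties remain.
X = np.eye(2)
y = np.array([1.0, 2.0])
w = np.array([1.0, 2.0])

mixed = elastic_net_objective(X, y, w, alpha=1.0, l1_ratio=0.5)
pure_lasso = elastic_net_objective(X, y, w, alpha=1.0, l1_ratio=1.0)
pure_ridge = elastic_net_objective(X, y, w, alpha=1.0, l1_ratio=0.0)
```

Setting `l1_ratio` to one recovers a pure lasso penalty, which can shrink inconsequential coefficients to exactly zero, and setting it to zero recovers a pure ridge penalty, which shrinks all coefficients smoothly.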

The results of the model show how much positive or negative effect an increase in each factor has on annual energy production per unit size of an installation. Direct irradiance is the most important feature, with a large positive impact. Both diffuse irradiance and optimal tilt difference have substantial negative impacts on energy output, which is a fairly intuitive result. We see that a fixed (non-tracking) solar panel will produce less energy as well, because a panel that does not track will have less exposure to radiation. Presumably it requires energy to operate a tracking solar panel, and this finding suggests that, generally, the energy gained from tracking outweighs the energy required to operate the tracking system. Finally, mono-crystalline panels outperform poly-crystalline panels. We see this result because mono-crystalline panels contain purer silicon, which leads to increased efficiency.

Building The Web App

Using the random forest model, I built a web app that allows users to input information about where they want to build solar panels, and learn how much they would save on their energy bill. The web app also uses a second model to predict installation cost based on the following factors:

  • Size
  • Installer
  • Tracking
  • Technology
“Django is a high-level Python web framework that encourages rapid development and clean, pragmatic design.”

I built the website using Django, a Python-based web framework. I set it up to run on an Amazon Web Services EC2 instance with a Gunicorn HTTP server. When a user inputs their information, the app queries the NREL solar radiation database to find their local radiation data, and calculates their expected annual energy production and installation cost. I wrote a Scrapy web crawler that finds their local average per-kilowatt-hour return from a utility company, and the app uses that to calculate their average savings per year, as well as how long it would take to pay off the installation cost.
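The final calculation the app performs can be sketched in a few lines. This is a simplified illustration with hypothetical function names and example numbers; it assumes a constant utility rate and constant yearly production, and ignores incentives, financing, and panel degradation.

```python
def annual_savings(annual_kwh, rate_usd_per_kwh):
    """Yearly savings: predicted production times the local per-kWh return."""
    return annual_kwh * rate_usd_per_kwh


def payback_years(install_cost_usd, annual_kwh, rate_usd_per_kwh):
    """Years needed for cumulative savings to cover the installation cost."""
    savings = annual_savings(annual_kwh, rate_usd_per_kwh)
    if savings <= 0:
        raise ValueError("annual savings must be positive to compute a payback period")
    return install_cost_usd / savings


# Example inputs are made up; only the $0.1024 rate appears in the article.
example_savings = annual_savings(8000, 0.1024)
example_payback = payback_years(16384, 8000, 0.1024)
```

In the real app, the production and cost inputs would come from the two models, and the rate from the crawler described above.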


On the whole, those looking to install solar panels ought to be doing everything in their power to maximize scale. If there is a choice between paying to install tracking, more tilt, etc., versus installing additional panels, the additional panels will almost always be the right move.

Using machine learning, I built a model that gives highly accurate predictions of the expected return on energy generated by a prospective solar panel, and made it easily accessible through a web app. Tools such as this, which use the machine learning techniques described above, will make information regarding one’s ability to switch to solar more widely available, ultimately bringing down soft costs of installation and accelerating the transition to renewable energy.

Data Scientist @ Calm, house music enthusiast, basketball nerd.
