If there’s one chart I wish I could show to every investor, particularly young investors who have plenty of time to keep money in the markets, it would be a long-term view of the growth of broad-market index funds. Taking a historical view makes it apparent why investing in these funds over a long period of time is a wise decision. It also demonstrates how a slow, steady, and simple approach to investing is such a potent way to build wealth. Indeed, through good times and not-so-good times, over a long enough horizon the growth in these types of funds can be quite impressive.
Let’s demonstrate this time/price relationship by performing linear regression on three common stock market indices and then applying our regression models to a set of future dates to forecast where the markets could be headed next.
Imports
First, let’s import all of the Python libraries and modules we will need for this project.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
Data Acquisition and Loading
The three stock market indices we will work with are the Dow Jones Industrial Average, the S&P 500 Index, and the NASDAQ 100 Index. I downloaded all of the available historical data for all three indices, each in a CSV file, from Federal Reserve Economic Data (FRED). FRED is “an online database consisting of hundreds of thousands of economic data time series from scores of national, international, public, and private sources,” and it is maintained by the Research Department at the Federal Reserve Bank of St. Louis. When working with data, it is important to know where the data came from and that the source is reputable and authoritative in the domain of inquiry.
Now that we have the data, let’s load each file into its own Pandas DataFrame to begin our analysis.
# Create a df for the S&P 500 data
sp_df = pd.read_csv('F:/dataProject/2023/data/stockIndex/SP500.csv')
# Create a df for the NASDAQ 100 data
nas_df = pd.read_csv('F:/dataProject/2023/data/stockIndex/NASDAQ100.csv')
# Create a df for the Dow Jones data
dj_df = pd.read_csv('F:/dataProject/2023/data/stockIndex/DJIA.csv')
Data Wrangling, Exploration, and Feature Engineering
Now that we have our data in DataFrames, let’s take a look at the column structure and the first few rows of data to get a sense of what we are working with.
print(sp_df.head())
print(nas_df.head())
print(dj_df.head())

The good news is that the column structure in each DataFrame is almost the same, which is helpful. However, you’ll notice that we have the name of each stock index as a column header, and the values in those columns are an undefined variable. As I wasn’t sure whether this was the daily close price, the open price, or some other type of price, I went back to the FRED website and verified that these values are the daily close prices. While it makes sense that these values are the daily prices for the stock index indicated in the column header, we need to change this structure for our analysis. Ideally, each row of data should be a single observation, and each column should be a different feature or attribute that describes each observation. So in this case, we need a column for the close price and a separate column indicating which stock index we’re talking about. Let’s implement those changes now.
# Create a column in the S&P df called "Fund".
# Set all values to "SP500".
sp_df['Fund'] = 'SP500'
# Rename columns in the S&P df
sp_df.rename(columns = {'SP500' : 'Close_Price', 'DATE' : 'Date'}, inplace=True)
# Create a column in the NASDAQ 100 df called "Fund".
# Set all values to "NASDAQ100".
nas_df['Fund'] = 'NASDAQ100'
# Rename columns in the NASDAQ 100 df.
nas_df.rename(columns = {'NASDAQ100' : 'Close_Price', 'DATE' : 'Date'}, inplace=True)
# Create a column in the Dow Jones df called "Fund".
# Set all values to "DJIA".
dj_df['Fund'] = 'DJIA'
# Rename columns in the Dow Jones df.
dj_df.rename(columns = {'DJIA' : 'Close_Price', 'DATE' : 'Date'}, inplace=True)
Now let’s re-check the structure of each DataFrame, just as before.
print(sp_df.head())
print(nas_df.head())
print(dj_df.head())

Much better! Now that the column structure of each DataFrame is identical, let’s concatenate them together to create one big DataFrame for additional analysis and then take a look at the data types for the columns.
# Create the new df
all_df = pd.concat([sp_df, nas_df, dj_df])
# Inspect the data types in the new df
print(all_df.dtypes)

Well, there are a couple of things going on here. First, we’ll need to convert our “Date” column into a datetime object, and second, our “Close_Price” column is stored as an object rather than a float, as we would expect. We need to see what’s going on with our close prices, so let’s first create a copy of our DataFrame that we can play around with and then sort it by the “Close_Price” column. If there are any string values in that column causing the mismatch in data type, they should sort before the numeric values, so we should catch them by inspecting the first few rows of the sorted DataFrame.
# Create a copy of the df
copy_df = all_df.copy()
# Sort by the close prices
copy_df = copy_df.sort_values('Close_Price')
# Inspect the first few rows of the sorted df
print(copy_df.head(10))

As seen above, there are rows where the “Close_Price” column contains a period instead of a numerical value. The dates on these first 10 rows appear to correspond to US holidays, which suggests these were dates the markets weren’t open for trading. Let’s go ahead and drop all of these rows from our original DataFrame.
# Only keep the rows that do not have only a period
# as the value for close price
new_df = all_df.loc[all_df['Close_Price'] != '.']
# Let's check to see how many rows we dropped
print('Nice work! You have removed ' + str(len(all_df) - len(new_df)) + ' rows.')

Now let’s convert the “Close_Price” and “Date” columns to the appropriate data type.
# Convert the close prices to floats and the "Date" column to
# datetime objects. Using assign avoids chained-assignment warnings.
new_df = new_df.assign(
    Close_Price=new_df['Close_Price'].astype(float),
    Date=pd.to_datetime(new_df['Date'])
)
# Inspect the data types again to make sure they are good now.
print(new_df.dtypes)

Very nice! Next we want to sort our DataFrame by date and continue our analysis.
# Sort the df by the date field
new_df = new_df.sort_values('Date')
Now let’s check that we have the same number of observations for each stock market index, as any differences in the amount of data could lead to different levels of accuracy in our models and, ultimately, our predictions. It is good practice in data analysis to preserve apples-to-apples comparisons when we can.
new_df.Fund.value_counts()

As it turns out, we have a lot more data for the NASDAQ 100 than for the other stock market indices. In fact, the NASDAQ 100 dataset dates back more than 18 years before the other two indices start! To put them on equal footing, we will work with a 10-year history for all three indices, starting on March 4, 2013.
It is also important to note that 2519 days of data in the DJIA and SP500 datasets is, in fact, 10 years of data. The stock market is not open for trading on weekends, so we have five trading days in each week instead of seven.
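A quick back-of-the-envelope check makes that arithmetic concrete. The sketch below counts the weekdays in the 10-year window (the end date here is my assumption of roughly 10 years after March 4, 2013); the gap between that weekday count and our 2519 rows is explained by market holidays, which FRED also excludes.

```python
import pandas as pd

# Count the weekdays (Mon-Fri) in the assumed 10-year window
weekdays = pd.bdate_range(start='2013-03-04', end='2023-03-03')
print(len(weekdays))  # 2610

# 2610 weekdays minus roughly nine market holidays per year
# lands near the 2519 rows we observed for each index
```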
# Create a new df. Do not include rows that have dates earlier
# than March 4, 2013.
mod_df = new_df[new_df['Date'] >= '2013-03-04']
# Check the value counts again to make sure we have the same
# number of entries for each stock market index now
mod_df.Fund.value_counts()

Great! Let’s re-index the DataFrame now.
# re-index the df and get rid of the old index.
mod_df = mod_df.reset_index(drop=True)
Now our data are clean, appropriately structured, and ready to be visualized.
plt.rcParams['figure.figsize'] = (20,10)
sns.set(font_scale=2)
sns.lineplot(data=mod_df.loc[mod_df['Fund']== 'DJIA'], x='Date', y='Close_Price', label="Dow Jones Industrial Average", color='magenta')
sns.lineplot(data=mod_df.loc[mod_df['Fund']== 'NASDAQ100'], x='Date', y='Close_Price', label="NASDAQ 100 Index", color='blue')
sns.lineplot(data=mod_df.loc[mod_df['Fund']== 'SP500'], x='Date', y='Close_Price', label="S&P 500 Index", color='red').set(title='Historical Stock Market Index Close Prices', xlabel='Date', ylabel='Close Price (USD)')
plt.legend(facecolor='white')

Training the Linear Regression Models
Now that all of the data are cleaned and properly structured, let’s re-create the three DataFrames, one for each stock market index. This will make handling easier. We will also reset the indices within each of the new DataFrames.
# Create a df with the S&P 500 data in it.
sp500 = mod_df.loc[mod_df['Fund']== 'SP500']
# Create a df with the NASDAQ 100 data in it.
nasdaq = mod_df.loc[mod_df['Fund']== 'NASDAQ100']
# Create a df with the Dow Jones data in it.
djia = mod_df.loc[mod_df['Fund']== 'DJIA']
# Reset the indices within each of the dfs
sp500 = sp500.reset_index(drop=True)
nasdaq = nasdaq.reset_index(drop=True)
djia = djia.reset_index(drop=True)
In order to train our linear regression models, we need to create axis variables, which will be stored as NumPy arrays. Regression only works with numerical values, so we must represent our dates as an array of numbers. Just as we saw in the graph above, close prices, which differ for each stock market index, will be on the y-axis, and dates, which are the same for each stock market index, will be on the x-axis. So we will have one x-axis variable and three y-axis variables.
# Create the 3 y-axis variables as numpy arrays
y_sp500 = np.asarray(sp500['Close_Price'])
y_nasdaq = np.asarray(nasdaq['Close_Price'])
y_djia = np.asarray(djia['Close_Price'])
# Create the x-axis variable as a NumPy array
Num_Dates = [*range(1, len(sp500) + 1, 1)]
X = np.asarray(Num_Dates)
Now that we have our numerical axis variables in arrays, we can train the models.
# Train 3 linear regression models. One for each stock index.
lr_sp500 = LinearRegression().fit(X.reshape(-1,1), y_sp500)
lr_nasdaq = LinearRegression().fit(X.reshape(-1,1), y_nasdaq)
lr_djia = LinearRegression().fit(X.reshape(-1,1), y_djia)
Let’s go ahead and compute the R-squared scores for each index to know how well we performed.
# Compute the R-squared score for each model
score_sp500 = lr_sp500.score(X.reshape(-1,1), y_sp500)
score_nasdaq = lr_nasdaq.score(X.reshape(-1,1), y_nasdaq)
score_djia = lr_djia.score(X.reshape(-1,1), y_djia)
print('For the S&P 500 dataset, the R-squared score is ' + str(score_sp500) + '.')
print('For the NASDAQ 100 dataset, the R-squared score is ' + str(score_nasdaq) + '.')
print('For the Dow Jones dataset, the R-squared score is ' + str(score_djia) + '.')

R-squared, also known as the coefficient of determination, is a statistical measure commonly used in regression that tells us the proportion of variance in the dependent variable that can be explained by the independent variable. In other words, R-squared tells us the strength of the relationship between our two variables (Time and Price). Scores are between 0 and 1 and can be represented as percentages. For example, in the Dow Jones dataset, we can say that the model we just trained explains 91% of the variance in the close prices. Across the three models, we can conclude that they fit the data very well.
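To make that score concrete, here is a small sketch (on made-up toy data, not our index prices) showing that the R-squared reported by scikit-learn’s score() method is exactly 1 minus the ratio of the residual sum of squares to the total sum of squares:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data with a roughly linear trend (values invented for illustration)
x = np.arange(10).reshape(-1, 1)
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1, 8.8, 10.3])

model = LinearRegression().fit(x, y)
y_hat = model.predict(x)

# R-squared by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Matches what sklearn's .score() reports
print(np.isclose(r2_manual, model.score(x, y)))  # True
```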

Now let’s run the data through the model we just built in order to produce an array of values (the line of best fit) that we can inspect visually.
# Run the S&P 500 data
y_pred_sp500 = lr_sp500.predict(X.reshape(-1,1))
# Run the NASDAQ 100 data
y_pred_nasdaq = lr_nasdaq.predict(X.reshape(-1,1))
# Run the Dow Jones data
y_pred_djia = lr_djia.predict(X.reshape(-1,1))
And create a Seaborn Lineplot for each stock market index with lines of best fit.
# Seaborn lineplots
sns.set(font_scale=2)
sns.lineplot(x=djia.Date, y=y_djia, color='magenta', label='Dow Jones Industrial Average').set(title='Lines of Best Fit for Historical Stock Market Index Close Prices')
sns.lineplot(x=djia.Date, y=y_pred_djia, color='black')
sns.lineplot(x=nasdaq.Date, y=y_nasdaq, color='blue', label='NASDAQ 100 Index')
sns.lineplot(x=nasdaq.Date, y=y_pred_nasdaq, color='black')
sns.lineplot(x=sp500.Date, y=y_sp500, color='red', label='S&P 500 Index')
sns.lineplot(x=sp500.Date, y=y_pred_sp500, color='black', label='Line of Best Fit for Each Index').set(xlabel='Date', ylabel='Close Price (USD)')
plt.legend(facecolor='white')

Forecasting Prices
In order to use our models to predict future close prices (y-axis), we’ll need to create an array of future dates (x-axis). We can represent the work we need to do with the following equation, known as slope-intercept form:

y = mx + b

Seen in slope-intercept form, in order to solve for y (Price) we need to know x (Date), along with the slope m and the y-intercept b. This is what our model does for us: it stores the values of m and b, and when we input values for x, it solves the equation for y and outputs the results. Let’s demonstrate this by printing the values of m and b for our three models.
# Variables that store values of b and m for the S&P 500 model
sp500_b = lr_sp500.intercept_
sp500_m = lr_sp500.coef_
# Variables that store values of b and m for the NASDAQ 100 model
nasdaq_b = lr_nasdaq.intercept_
nasdaq_m = lr_nasdaq.coef_
# Variables that store values of b and m for the Dow Jones model
djia_b = lr_djia.intercept_
djia_m = lr_djia.coef_
# Print these results for further fun!
print('For the S&P 500 line, the slope (m) is ' + str(sp500_m)[1:-1] + ' and the Y-Intercept (b) is ' + str(sp500_b) + '.')
print('For the NASDAQ 100 line, the slope (m) is ' + str(nasdaq_m)[1:-1] + ' and the Y-Intercept (b) is ' + str(nasdaq_b) + '.')
print('For the Dow Jones line, the slope (m) is ' + str(djia_m)[1:-1] + ' and the Y-Intercept (b) is ' + str(djia_b) + '.')
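As a quick sanity check on the slope-intercept idea, here is a sketch on toy data (the numbers are invented for illustration, standing in for our index prices) showing that computing m * x + b by hand reproduces what predict() returns:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a tiny model on made-up data
X_toy = np.arange(1, 6).reshape(-1, 1)
y_toy = np.array([10.0, 12.0, 14.5, 16.0, 18.5])
model = LinearRegression().fit(X_toy, y_toy)

m = model.coef_[0]       # slope
b = model.intercept_     # y-intercept

# For any x, m * x + b reproduces what predict() returns
x_val = 10
manual = m * x_val + b
via_model = model.predict(np.array([[x_val]]))[0]
print(np.isclose(manual, via_model))  # True
```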

Now that we know how the models work, let’s create the Numpy arrays that will serve as our input values for x so our models can solve for y. I am going to create a two-year close price forecast, but any desired length of time would work; we would just need to modify the number of days to be passed to the models and make sure the date column stays consistent for further analysis and building the graphs.
# Count 2 years of days starting at the end of the existing data
Num_Future_Dates = [*range(len(sp500), (len(sp500) + 730))]
# Convert this list of days into the appropriate dates
## Note the origin parameter. This was trial and error
## to select a starting date that produced the correct output.
X_Future_Date = pd.to_datetime(Num_Future_Dates, origin='2016-04-10', unit='D')
# Create a df with the future dates as values in a column
future_df = pd.DataFrame(X_Future_Date, columns=['X_Future_Date'])
Okay, great. We have a DataFrame with two years’ worth of dates in it. However, the stock market isn’t open for trading every day of the week; as we noted earlier, the markets are closed on weekends, and this was reflected in our training data. For our future dataset, we will need to omit the weekends, reducing the number of input values for x while preserving the original date for each x value.
To do this, let’s use Pandas’ datetime accessor to create a series containing the day of the week, with values ranging from 0 (Monday) to 6 (Sunday). We will then place this series into our DataFrame as a new column alongside the dates column, and finally drop all of the rows in which the value in the “Day_Of_Week” column is 5 (Saturday) or 6 (Sunday).
# Create a day of week series
day_of_week = future_df.X_Future_Date.dt.dayofweek
# Add the series to future_df
future_df['Day_Of_Week'] = day_of_week
# Drop the weekend rows, please.
future_df_mod = future_df.loc[future_df['Day_Of_Week'] != 5]
future_df_mod = future_df_mod.loc[future_df_mod['Day_Of_Week'] != 6]
# Check what we've done so far.
print(future_df_mod.head(10))

As we can see in the Day_Of_Week column above, we have effectively removed weekends from the future-dates DataFrame. We aren’t quite ready to run predictions yet, though. As you’ll remember, earlier we had periods as values in the close price column, and those periods seemed to correlate with holidays, days the market would not have been open for trading. We also need to remove the dates we know the stock market will be closed for holidays over the next two years. Fortunately, this information was readily available from a variety of sources (such as the NYSE, Investopedia, MarketBeat, and AARP), and I was able to cross-reference them for accuracy, which is also a good practice in data analysis.
Let’s go ahead and create a python list with all of these weekday dates that the stock market will be closed. After that, we will remove the rows from the DataFrame if the date in those rows match a date in the holidays list.
# Create a list with all of the holiday closure dates in our df
holidays = ['2023-04-07', '2023-05-29', '2023-06-19', '2023-07-04', '2023-09-04', '2023-11-23', '2023-12-25',
'2024-01-01', '2024-01-15', '2024-01-19', '2024-03-29', '2024-05-27', '2024-06-19', '2024-07-04',
'2024-09-02', '2024-11-28', '2024-12-25', '2025-01-01', '2025-01-20', '2025-02-17']
# If a date in the df matches an item in the holidays list,
# remove that date from the df.
future_df_mod = future_df_mod.loc[~future_df_mod['X_Future_Date'].isin(holidays)]
Now our DataFrame has only the dates in it that we expect the stock market to be open for trading. Finally, we are ready to run the predictions!
# Create a list of date numbers that starts one day after
# the original dates end
Num_Future_Dates = [*range(len(sp500) + 1, (len(sp500) + len(future_df_mod)) + 1)]
# Create a Numpy array from this list of date numbers
X_Future = np.asarray(Num_Future_Dates)
# Run predictions using these new dates and all three models
y_pred_future_sp500 = lr_sp500.predict(X_Future.reshape(-1,1))
y_pred_future_nasdaq = lr_nasdaq.predict(X_Future.reshape(-1,1))
y_pred_future_djia = lr_djia.predict(X_Future.reshape(-1,1))
# Graph it up!
# Add the new prediction lines to the graph from earlier
sns.set(font_scale=2)
sns.lineplot(x=djia.Date, y=y_djia, color='magenta', label='Dow Jones Industrial Average')
sns.lineplot(x=djia.Date, y=y_pred_djia, color='black')
sns.lineplot(x=future_df_mod.X_Future_Date, y=y_pred_future_djia, color='black', linestyle='--')
sns.lineplot(x=nasdaq.Date, y=y_nasdaq, color='blue', label='NASDAQ 100 Index')
sns.lineplot(x=nasdaq.Date, y=y_pred_nasdaq, color='black')
sns.lineplot(x=future_df_mod.X_Future_Date, y=y_pred_future_nasdaq, color='black', linestyle='--')
sns.lineplot(x=sp500.Date, y=y_sp500, color='red', label='S&P 500 Index')
sns.lineplot(x=sp500.Date, y=y_pred_sp500, color='black', label='Line of Best Fit for Each Index')
sns.lineplot(x=future_df_mod.X_Future_Date, y=y_pred_future_sp500, color='black', linestyle='--', label='Predicted Close Price for Each Index').set(xlabel='Date', ylabel='Daily Close Price (USD)', title='Predicted Daily Stock Market Index Close Prices')
plt.legend(facecolor='white')

We have successfully implemented linear regression on three stock market indices and forecast the next two years of close prices! This will conclude our work on this demonstration, but if we wanted to take this one step further, we could export the arrays of predicted values into a text file and load them into Excel or some other dashboard-type visualization tool (such as Tableau or Power BI) and compare the predicted values to the real-world data as we progress through the coming months. Really, the sky’s the limit!
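For anyone who wants to try that export step, a minimal sketch might look like the following. The file name and the placeholder arrays here are hypothetical; in practice we would pair future_df_mod’s dates with the y_pred_future_* arrays we produced above.

```python
import numpy as np
import pandas as pd

# Placeholder arrays standing in for the real future dates and predictions
future_dates = pd.bdate_range(start='2023-03-06', periods=5)
pred_sp500 = np.linspace(4000, 4010, 5)

# Bundle dates and predictions, then export for Excel, Tableau, or Power BI
export_df = pd.DataFrame({'Date': future_dates, 'Predicted_Close': pred_sp500})
export_df.to_csv('predicted_sp500.csv', index=False)  # hypothetical file name
```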
Conclusion
Linear regression is a powerful tool in a data scientist’s tool belt; there are many potential applications and predictions can be made relatively simply and often quite effectively. With that said, using these results or linear regression in general to predict stock market performance is not a good idea. The market trends are fairly reliable but only when looking at the performance of these indices over a long period of time – 10 years in the case of this project. Shorter time periods of negative growth can also be seen in the graphs presented here; these models do not predict when those negative growth periods will happen.
Additionally, while these linear models do represent the historical data quite well, there is no guarantee that past performance indicates future performance. As investors, we accept some risk with every investment, and it is up to each of us to decide when and whether we will accept that risk. While I do consider index funds a relatively safe investment, one must be willing and able to leave the money in the market for a long period of time to realize any significant growth.
As a means of implementing machine learning algorithms, practicing data analysis, and gaining exposure to financial and statistical terms and concepts, however, using stock market data for data science projects is a great idea! In this project, we have done just that. The purpose of this project was to demonstrate a real-world implementation of linear regression, for me to share my knowledge on the algorithm and domain, and for my own practice.
A wise person once told me when I was a young data analyst that “good analysis often leads to more questions than answers”. Through working this project, I have new questions I’d like to answer and already have a few new ideas for future projects! This has been a fun and insightful project for me; I hope anyone reading this can say the same.
