Logistic Regression is a popular statistical method for predicting binary outcomes, such as whether an email is spam or whether a student will pass a test. In this article, we will discuss how to use Logistic Regression to predict whether a stock’s opening price on the next trading day will be a gap up, gap down, or no gap based on historical data. We will use Python’s scikit-learn library to build and evaluate the model.
The goal of logistic regression is to predict the probability of a binary outcome (such as yes/no, true/false, or 1/0) based on input features. The algorithm models this probability using a logistic function, which maps any real-valued input to a value between 0 and 1.
Since our prediction has three possible outcomes, “gap up”, “gap down”, or “no gap”, we will use Multinomial Logistic Regression, an extension of the binary Logistic Regression algorithm.
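To make this concrete, here is a minimal sketch (in plain NumPy, not part of this article’s pipeline) of the logistic (sigmoid) function used in the binary case, and the softmax function the multinomial variant uses to turn class scores into probabilities:
import numpy as np

def sigmoid(z):
    # maps any real-valued input to the (0, 1) interval: the binary case
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    # generalizes the sigmoid to several classes: exponentiate each
    # score and normalize so the probabilities sum to 1
    exp_scores = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exp_scores / exp_scores.sum()

print(sigmoid(0.0))                        # 0.5
print(softmax(np.array([2.0, 1.0, 0.1])))  # roughly [0.66, 0.24, 0.10]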
Note: Logistic regression is a classification algorithm, not a regression algorithm, despite the word “regression” in its name.
Stock Market Prediction: Classification vs. Regression, Which Model to Use?
Regression problem: If the goal is to predict the actual value of the stock price (a continuous quantity), it can be treated as a regression problem. Examples of regression algorithms for this type of problem include linear regression, support vector regression (SVR), and neural networks.
Classification problem: If the goal is to predict the direction of the stock price movement (e.g., whether the stock price will go up or down), it can be treated as a classification problem. In this case, the model is trained to predict discrete classes (e.g., “up” or “down”).
Examples of classification algorithms for this type of problem include logistic regression, decision trees, k-Nearest Neighbors (KNN), and support vector machines (SVM).
Data Preparation
First, let’s import the required libraries and load the historical data of a stock into a pandas DataFrame.
The feature inputs are “open”, “high”, “low”, and “close” daily data. The example uses historical Nifty Index data going back to the start of the exchange.
Download Historical Features Dataset of Nifty Index data
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import numpy as np
# load historical data into a pandas DataFrame
data = pd.read_csv('historical_data.csv')
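The snippets in this article assume the CSV contains one row per trading day with open, high, low, and close columns in chronological order; because the gap labels are computed with shift(), row order matters. A quick sanity check, assuming the file also has a date column (the column names are an assumption, adjust them to match your dataset):
# parse the date column and make sure rows are in chronological order
# ('date' is an assumed column name; adjust to your file)
data['date'] = pd.to_datetime(data['date'])
data = data.sort_values('date').reset_index(drop=True)
print(data[['date', 'open', 'high', 'low', 'close']].head())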
Next, we need to create a new column in the DataFrame indicating whether the opening price is a gap up, gap down, or no gap.
- Gap up: the price opens above the previous day’s high.
- Gap down: the price opens below the previous day’s low.
- No gap: the price opens between the previous day’s high and low.
# Create the target variable (label) based on the gap up/down conditions
data['gap'] = 'nogap'
data.loc[data['open'] > data['high'].shift(1), 'gap'] = 'gapup'
data.loc[data['open'] < data['low'].shift(1), 'gap'] = 'gapdown'
Next, we create the target column ‘next_gap’, indicating whether the next day’s opening price is a gap up, gap down, or no gap, and use the LabelEncoder from scikit-learn to encode it with numeric labels.
# create a new column indicating whether the next day's opening price is a gap up or gap down or no gap
data['next_gap'] = np.where(data['open'].shift(-1) > data['high'], 'gapup', np.where(data['open'].shift(-1) < data['low'], 'gapdown', 'nogap'))
# Prepare the feature matrix (X) and the target vector (y). Use LabelEncoder for the target variable
features = ['open', 'high', 'low', 'close']
X = data[features]
y = data['next_gap']
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
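One detail worth noting: LabelEncoder assigns integer labels in alphabetical order, so gapdown becomes 0, gapup becomes 1, and nogap becomes 2. This ordering determines how the classes appear in the evaluation reports later. A quick check of the mapping and the class distribution:
# classes are sorted alphabetically: gapdown -> 0, gapup -> 1, nogap -> 2
print(label_encoder.classes_)
# the classes are imbalanced: most days open inside the previous day's range
print(data['next_gap'].value_counts())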
Model Training and Evaluation
Now, let’s split the data into training and testing sets, and train a multinomial logistic regression model. 80% of the dataset is used for training the model and the remaining 20% of the dataset is used for testing the model.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the Multinomial Logistic Regression model
# (max_iter is raised because lbfgs often fails to converge on unscaled price data)
logreg = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
logreg.fit(X_train, y_train)
# Make predictions using the test set
y_pred = logreg.predict(X_test)
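Since accuracy_score was imported earlier, we can report the test accuracy directly; predict_proba also exposes the per-class probabilities that the logistic model computes internally:
# overall fraction of correct predictions on the held-out set
print('Accuracy:', accuracy_score(y_test, y_pred))
# per-class probabilities for the first few test rows
# (columns follow the order of label_encoder.classes_)
print(logreg.predict_proba(X_test[:5]))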
Here is the complete Python source code for gap up/gap down prediction using Logistic Regression:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
# Load the dataset with historical price data
data = pd.read_csv('historical_data.csv')
# Create the gap column for the current day based on the gap up/down conditions
data['gap'] = 'nogap'
data.loc[data['open'] > data['high'].shift(1), 'gap'] = 'gapup'
data.loc[data['open'] < data['low'].shift(1), 'gap'] = 'gapdown'
# create a new column indicating whether the next day's opening price is a gap up or gap down or no gap
data['next_gap'] = np.where(data['open'].shift(-1) > data['high'], 'gapup', np.where(data['open'].shift(-1) < data['low'], 'gapdown', 'nogap'))
# Prepare the feature matrix (X) and the target vector (y). Use LabelEncoder for the target variable
features = ['open', 'high', 'low', 'close']
X = data[features]
y = data['next_gap']
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the Multinomial Logistic Regression model
# (max_iter is raised because lbfgs often fails to converge on unscaled price data)
logreg = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
logreg.fit(X_train, y_train)
# Make predictions using the test set
y_pred = logreg.predict(X_test)
# Evaluate the model using classification_report and confusion_matrix
print('Classification Report:')
print(classification_report(y_test, y_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))
# Predict the next day's label
next_day_data = data[features].iloc[-1].values.reshape(1, -1)
next_day_pred = logreg.predict(next_day_data)
next_day_label = label_encoder.inverse_transform(next_day_pred)
print(f"Next day's prediction: {next_day_label[0]}")
Python Output
Classification Report:
              precision    recall  f1-score   support

           0       0.20      0.01      0.01       197
           1       0.55      0.62      0.58       359
           2       0.82      0.93      0.87      1024

    accuracy                           0.75      1580
   macro avg       0.52      0.52      0.49      1580
weighted avg       0.68      0.75      0.70      1580
Confusion Matrix:
[[ 1 117 79]
[ 2 224 133]
[ 2 67 955]]
Next day's prediction: gapup
Prediction Accuracy
The output for accuracy in our example is 0.75, which means our model is able to predict the correct outcome approximately 75% of the time.
In this confusion matrix, rows are the actual classes and columns are the predicted classes, in the order gapdown (0), gapup (1), nogap (2):
- The first row represents the gapdown class (0). The model predicted 1 instance correctly, 117 instances as gapup, and 79 instances as nogap.
- The second row represents the gapup class (1). The model predicted 224 instances correctly, 2 instances as gapdown, and 133 instances as nogap.
- The third row represents the nogap class (2). The model predicted 955 instances correctly, 2 instances as gapdown, and 67 instances as gapup.
- Next day’s prediction: The model predicts that the next day’s gap will be a gapup.
In summary, the model has an overall accuracy of 75%, which means it is correct 75% of the time when predicting the gap types. The f1-scores for gapdown, gapup, and nogap are 0.01, 0.58, and 0.87, respectively. This indicates that the model performs best on the nogap class and poorly on the gapdown class.
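That 75% figure should be read against the class imbalance: nogap alone accounts for 1,024 of the 1,580 test rows, so a model that always predicts nogap would already score about 65%. A quick baseline check, reusing the variables from the script above:
# majority-class baseline: always predict the most common class in the test set
majority_class = np.bincount(y_test).argmax()
baseline_accuracy = (y_test == majority_class).mean()
print(f'Majority-class baseline accuracy: {baseline_accuracy:.2f}')  # about 0.65 here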
One final caveat: you can’t do a simple shuffled train_test_split on time series data, because the model would be trained on information from the future (a forward, or look-ahead, bias). Instead, hold out the most recent part of the dataset for evaluation, or better, use walk-forward validation, where the model is repeatedly retrained on an expanding window of past data and evaluated on the period that follows.
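scikit-learn’s TimeSeriesSplit implements this expanding-window scheme. Here is a minimal sketch of how the evaluation above could be redone with it, assuming the same X, y, and model setup as in the complete script:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import accuracy_score

# each fold trains on an expanding window of past data and
# tests on the period that immediately follows it
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
    model.fit(X.iloc[train_idx], y[train_idx])
    accuracy = accuracy_score(y[test_idx], model.predict(X.iloc[test_idx]))
    print(f'Fold {fold}: train={len(train_idx)} test={len(test_idx)} accuracy={accuracy:.2f}')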