Predicting the stock market has been a challenging task due to its complex, dynamic, and non-linear nature. Many researchers have tried various machine learning techniques to improve the accuracy and reliability of stock market predictions. One promising approach is the use of ensembling methods, which combine multiple models to achieve better performance.
In this article, we discuss how ensembling methods, specifically bagging, boosting, and stacking, can be applied to stock market prediction. We then show how AdaBoost can be paired with three machine learning algorithms, Linear Regression (LR), K-Nearest Neighbours (KNN), and Support Vector Regression (SVR), and how the resulting AdaBoostRegressor models are combined into a single ensemble to improve overall prediction accuracy.
Bagging, or Bootstrap Aggregating, is an ensemble method that involves generating multiple models from different bootstrapped subsets of the training data. These models are trained independently, and their predictions are combined through averaging (for regression problems) or voting (for classification problems). Bagging helps reduce the variance in predictions by averaging out the errors from multiple models. In stock market prediction, bagging can be applied by training multiple models, such as decision trees or neural networks, on different subsets of historical stock data. The final prediction is obtained by aggregating the individual model predictions, resulting in a more stable and accurate forecast.
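As a rough illustration of the averaging step described above, the sketch below fits scikit-learn's BaggingRegressor on synthetic data and checks that its prediction is the mean of the individual trees. The data, the feature count, and the 240/60 split are assumptions made for the example, not real stock data.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for historical features and a target (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=300)
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

# Each tree is trained on its own bootstrap sample of the training rows.
bagger = BaggingRegressor(DecisionTreeRegressor(), n_estimators=25, random_state=0)
bagger.fit(X_train, y_train)

# Collect every tree's prediction, using the feature columns that tree was given.
per_tree = np.array([
    tree.predict(X_test[:, feats])
    for tree, feats in zip(bagger.estimators_, bagger.estimators_features_)
])
manual_mean = per_tree.mean(axis=0)   # average across the 25 trees
bagged_pred = bagger.predict(X_test)  # the ensemble's own prediction

print("Spread between trees (mean std):", per_tree.std(axis=0).mean())
print("Manual mean matches ensemble:", np.allclose(manual_mean, bagged_pred))
```

The spread between trees gives a feel for the variance that the averaging removes.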
Boosting is another ensemble method that focuses on reducing bias in the model by iteratively adjusting the weights of misclassified data points. This technique creates a sequence of weak learners, each attempting to correct the errors made by its predecessor. The final prediction is a weighted combination of the individual weak learners. For stock market prediction, boosting techniques like AdaBoost or Gradient Boosting can be employed to train a series of models on historical stock data. The boosting algorithm assigns higher importance to instances where previous models have made incorrect predictions, ensuring that subsequent models focus on these challenging cases. This results in an overall improvement in prediction accuracy.
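The idea of each learner correcting its predecessor can be sketched by hand. The example below uses the residual-fitting form of boosting (the approach behind Gradient Boosting with squared error), not AdaBoost's re-weighting. The data, learning rate, and number of rounds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (hypothetical data).
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())      # round 0: a constant model
train_mse = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residual = y - prediction               # errors left by the ensemble so far
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    prediction += learning_rate * stump.predict(X)  # add a damped correction
    train_mse.append(np.mean((y - prediction) ** 2))

print("Training MSE before boosting:", train_mse[0])
print("Training MSE after 20 rounds:", train_mse[-1])
```

Each stump is a weak learner on its own; the accuracy comes from adding many small corrections in sequence.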
Stacking, also known as Stacked Generalization, is an ensemble method that combines multiple models with different learning algorithms to exploit their complementary strengths. In stacking, base models are trained on the same dataset, and their predictions are used as input for a higher-level model, called the meta-model. The meta-model learns how to combine the base model predictions to generate the final output. For stock market prediction, one can train various base models, such as linear regression, support vector machines, and neural networks, on historical stock data. A meta-model, such as a linear regression (or a logistic regression if the task is to classify price direction) or another neural network, can then be trained on these base model predictions to achieve a more accurate and robust forecast.
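A minimal sketch of this layout, assuming scikit-learn's StackingRegressor with three base models and a Ridge meta-model. The synthetic data, the hyperparameters, and the choice of Ridge are illustrative assumptions, not a recommendation.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Synthetic data with a linear and a non-linear component (hypothetical).
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - X[:, 1] ** 2 + rng.normal(scale=0.2, size=300)
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

stack = StackingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
        ("svr", SVR(C=10.0)),
    ],
    final_estimator=Ridge(alpha=1.0),  # meta-model trained on base predictions
    cv=5,                              # base predictions come from 5-fold CV
)
stack.fit(X_train, y_train)

meta_features = stack.transform(X_test)  # one column per base model
stacked_pred = stack.predict(X_test)
r2 = stack.score(X_test, y_test)

print("Meta-model coefficients:", stack.final_estimator_.coef_)
print("R2 on held-out rows:", r2)
```

The meta-model coefficients show how much weight each base model's prediction receives in the final output.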
Ensembling methods in machine learning, such as bagging, boosting, and stacking, have shown great potential in improving the accuracy and reliability of stock market predictions. By combining multiple models and leveraging their complementary strengths, ensemble techniques can mitigate the shortcomings of individual models, resulting in a more robust and accurate prediction.
AdaBoost – Ensembling Method
AdaBoost, short for Adaptive Boosting, is an ensemble learning method that combines multiple weak learners to form a stronger, more accurate model. Initially designed for classification problems, it can be adapted for regression tasks like stock market price prediction. The algorithm works iteratively: it trains a sequence of weak learners (such as linear regression) and, after each round, increases the weights of the training instances that were predicted poorly so that the next learner concentrates on them. The final model is a weighted combination of these weak learners, with more accurate learners receiving more weight.
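To make the re-weighting concrete, here is one round of the AdaBoost.R2 update with a linear loss, which is the regression variant scikit-learn's AdaBoostRegressor is documented to implement. The four error values are made up for illustration.

```python
import numpy as np

# Hypothetical absolute errors |y - prediction| of one weak learner on 4 samples.
abs_error = np.array([0.5, 1.0, 4.0, 0.2])
weights = np.full(4, 0.25)                    # start from uniform sample weights

loss = abs_error / abs_error.max()            # linear loss, scaled to [0, 1]
avg_loss = np.sum(weights * loss)             # weighted average loss of the learner
beta = avg_loss / (1.0 - avg_loss)            # small beta means an accurate learner
learner_weight = np.log(1.0 / beta)           # this learner's say in the ensemble

new_weights = weights * beta ** (1.0 - loss)  # low-loss samples shrink the most
new_weights /= new_weights.sum()              # renormalize to sum to 1

print("Average loss:", avg_loss)
print("Learner weight:", learner_weight)
print("Updated sample weights:", new_weights)
```

The third sample, which had the largest error, ends up with the largest weight, so the next weak learner is pushed to fit it better.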
Here’s how AdaBoost can help in stock market price prediction:
- Enhancing predictive power: By combining multiple weak learners, AdaBoost can capture complex relationships in the stock market data, potentially resulting in more accurate predictions.
- Handling noisy data: Stock market data can be noisy, with many irrelevant features and outliers. AdaBoost’s adaptive learning mechanism concentrates on the examples that earlier learners predicted poorly, which helps with genuinely hard cases but can also make it chase outliers, so the benefit depends on how clean the data is.
- Interpretability: When the base learners are shallow decision trees, AdaBoost performs an implicit feature selection that can make it easier to identify the most relevant factors driving stock market price movements. With base learners that use every feature, such as linear regression or KNN, this benefit is limited.
- Versatility: AdaBoost can be combined with various base learners, making it a flexible method that can be tailored to different stock market prediction problems.
- Scalability: The boosting rounds run one after another, so they cannot be parallelized across rounds, but with simple base learners each round is cheap, which keeps training time manageable on large datasets.
Advantages of using AdaBoost for stock market price prediction include:
- Improved accuracy: The ensemble approach can provide better predictive accuracy than individual base models by capturing a broader range of patterns in the data.
- Stability: The final prediction aggregates many learners, so no single poorly fitted learner dominates it.
- Adaptive learning: AdaBoost assigns higher weights to misclassified or poorly predicted instances in each iteration, encouraging subsequent models to focus more on these challenging examples.
- Simple base learners: AdaBoost can work effectively with simple base models, such as linear regression, making the overall ensemble computationally efficient while still achieving good performance.
- Feature selection: With tree-based base learners, AdaBoost can implicitly perform feature selection by focusing on the most informative features during the learning process, resulting in a more interpretable and efficient final model.
However, because it keeps increasing the weight of poorly predicted instances, AdaBoost can be sensitive to noisy data and outliers, so it’s crucial to preprocess and clean the data carefully before using it for prediction.
Adaboost Ensembling using the combination of Linear Regression, Support Vector Regression, K Nearest Neighbors Algorithms – Python Source Code
This Python script uses several machine learning algorithms to predict the closing price of a stock from its historical data: a dataset of about 34 features (technical indicators) stored in the file "NIFTY_EOD.csv".
The code imports the libraries needed for data manipulation, visualization, and machine learning. The base algorithms used for prediction are Linear Regression, K-Nearest Neighbors (KNN), and Support Vector Regression (SVR), each wrapped in an AdaBoostRegressor.
    import os

    import joblib
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import AdaBoostRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import (mean_absolute_percentage_error,
                                 mean_squared_error, r2_score)
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVR

    np.random.seed(42)

    # Load the data
    stock = pd.read_csv("NIFTY_EOD.csv")
    data = stock

    # Split the data into features and target
    y = data['close_forecast']
    X = data.drop(columns=['Ticker', 'Date/Time', 'close_forecast'])
    X_copy = X  # keep the unscaled DataFrame for the comparison further down

    # Normalize the features
    scaler = MinMaxScaler()
    X = scaler.fit_transform(X)

    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Create and train one AdaBoost regressor per base algorithm
    lr_model = AdaBoostRegressor(estimator=LinearRegression(), n_estimators=50, random_state=42)
    lr_model.fit(X_train, y_train)

    knn_model = AdaBoostRegressor(estimator=KNeighborsRegressor(n_neighbors=3), n_estimators=50, random_state=42)
    knn_model.fit(X_train, y_train)

    svr_model = AdaBoostRegressor(estimator=SVR(kernel='rbf', C=1e3, gamma=0.3), n_estimators=50, random_state=42)
    svr_model.fit(X_train, y_train)

    # Save the trained models (create the target folder if it is missing)
    os.makedirs('./ensemble', exist_ok=True)
    joblib.dump(lr_model, './ensemble/lr_model.joblib')
    joblib.dump(knn_model, './ensemble/knn_model.joblib')
    joblib.dump(svr_model, './ensemble/svr_model.joblib')


    def ensemble_predict(X):
        lr_pred = lr_model.predict(X)
        knn_pred = knn_model.predict(X)
        svr_pred = svr_model.predict(X)

        # estimator_weights_ holds one weight per boosting round, so it cannot
        # be multiplied with the per-sample predictions directly. It is summed
        # here to get a single scalar importance weight per model.
        lr_weight = lr_model.estimator_weights_.sum()
        knn_weight = knn_model.estimator_weights_.sum()
        svr_weight = svr_model.estimator_weights_.sum()

        # Compute the weighted average of the predictions
        weighted_pred = lr_weight * lr_pred + knn_weight * knn_pred + svr_weight * svr_pred
        return weighted_pred / (lr_weight + knn_weight + svr_weight)


    # Calculate predictions
    y_pred_close = ensemble_predict(X_test)

    # Calculate accuracy metrics
    mse = mean_squared_error(y_test, y_pred_close)
    mape = mean_absolute_percentage_error(y_test, y_pred_close) * 100
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred_close)

    print("Mean Squared Error (MSE):", mse)
    print("Root Mean Squared Error (RMSE):", rmse)
    print("Mean Absolute Percentage Error (MAPE):", mape)
    print("R2 Score:", r2)

    # Actual vs predicted results on the last 20% of the rows
    train_split = int(X_copy.shape[0] * 0.8)
    actual = X_copy[train_split:]['Close']
    actual = actual.reset_index(drop=True)   # reset the index
    actual = actual.to_frame(name='Close')   # Series -> DataFrame with column 'Close'

    next_day_pred = X_copy[train_split:]
    next_day_pred = scaler.transform(next_day_pred)
    next_day_forecast = ensemble_predict(next_day_pred)
    next_day_forecastdata = pd.DataFrame(next_day_forecast, columns=['predicted'])

    # Print results
    print("Previous Day Close:", actual.iloc[-1]['Close'])
    print("Predicted Next Day Close:", next_day_forecastdata.iloc[-1]['predicted'])

    # Plot the predicted vs actual close values
    plt.figure(figsize=(10, 5))
    plt.plot(actual, label='Actual Close')
    plt.plot(next_day_forecastdata, label='Predicted Close')
    plt.xlabel('Time')
    plt.ylabel('Price')
    plt.title('Actual vs Predicted Close')
    plt.legend()
    plt.show()
The code combines predictions from three AdaBoost models, each built on a different base algorithm (Linear Regression, K-Nearest Neighbors, and Support Vector Regression). The ensemble_predict function computes a weighted average of their predictions based on the importance weights of the models. Finally, the script plots the actual and predicted closing prices so you can compare them visually.
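The weighted-average step can be seen in isolation with a few made-up numbers. The predictions and the per-model weights below are hypothetical, and each model is assumed to contribute a single scalar weight.

```python
import numpy as np

# Hypothetical predictions from three models for the same four days.
preds = np.array([
    [100.0, 102.0, 101.0, 105.0],   # model A
    [98.0, 103.0, 100.0, 107.0],    # model B
    [101.0, 101.0, 102.0, 104.0],   # model C
])
model_weights = np.array([0.5, 0.3, 0.2])   # one scalar weight per model (assumed)

# Weighted average across models, computed separately for each day.
combined = np.average(preds, axis=0, weights=model_weights)
print(combined)
```

np.average divides by the sum of the weights, so the weights do not need to add up to 1.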
    Mean Squared Error (MSE): 7696.585379615255
    Root Mean Squared Error (RMSE): 87.73018511102809
    Mean Absolute Percentage Error (MAPE): 2.105069920226974
    R2 Score: 0.9996662213566196
    Previous Day Close: 17599.15
    Predicted Next Day Close: 17715.394023985627
The Python script calculates several evaluation metrics to assess the performance of the ensemble model for predicting stock closing prices. Here’s an explanation of the output:
- Mean Squared Error (MSE): 7696.585379615255. MSE is the average of the squared differences between the actual and predicted closing prices. It’s a common measure for evaluating regression models’ performance. A lower value indicates better performance, with 0 being a perfect fit. In this case, the MSE is 7696.59.
- Root Mean Squared Error (RMSE): 87.73018511102809. RMSE is the square root of the MSE, which puts the error back in the same units as the price. The lower the RMSE, the better the model’s performance. In this case, the RMSE is 87.73, which means a typical prediction is off by roughly 87.73 index points, with large errors counting more heavily than small ones.
- Mean Absolute Percentage Error (MAPE): 2.105069920226974. MAPE is the average of the absolute percentage errors between the actual and predicted closing prices. It’s expressed as a percentage and is useful for comparing errors across different scales. A lower MAPE indicates better performance. In this case, the MAPE is 2.11%, which means the predictions deviate from the actual values by an average of 2.11%.
- R2 Score: 0.9996662213566196. R2 score, also known as the coefficient of determination, measures how well the predicted values fit the actual data. Its maximum is 1, which indicates a perfect fit; 0 means the model does no better than always predicting the mean, and negative values mean it does worse. In this case, the R2 score is 0.9997, which suggests that the ensemble model explains approximately 99.97% of the variability in the closing prices.
- Previous Day Close: 17599.15. This is the actual closing price of the stock on the last day of the dataset.
- Predicted Next Day Close: 17715.394023985627. This is the predicted closing price of the stock for the next day, as estimated by the ensemble model.
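The four metrics can be reproduced by hand on a tiny example, which may make the definitions easier to check. The prices below are made up and are unrelated to the output above.

```python
import numpy as np

# Hypothetical actual and predicted closing prices for four days.
actual = np.array([100.0, 110.0, 120.0, 130.0])
predicted = np.array([102.0, 108.0, 123.0, 129.0])

errors = actual - predicted
mse = np.mean(errors ** 2)                       # mean of the squared errors
rmse = np.sqrt(mse)                              # same units as the price
mape = np.mean(np.abs(errors / actual)) * 100    # average percentage error
ss_res = np.sum(errors ** 2)                     # unexplained variation
ss_tot = np.sum((actual - actual.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

print("MSE:", mse)
print("RMSE:", rmse)
print("MAPE:", mape)
print("R2:", r2)
```

These formulas correspond to what mean_squared_error, mean_absolute_percentage_error, and r2_score compute for a single target.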
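The low error metrics and high R2 score suggest that the ensemble model fits this dataset well. Two features of the setup tend to inflate these figures, so they are best read as optimistic: the script splits the rows at random, which interleaves test days with training days, and closing prices change little from one day to the next, which makes a high R2 on price levels easy to reach. The plot of actual vs. predicted closing prices provides a visual representation of the model’s performance over time.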
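Because the script splits its rows at random, one possible cross-check is a chronological evaluation against a naive baseline that predicts tomorrow's close as today's close. The sketch below uses a synthetic random-walk series and a plain linear model as stand-ins. Both are assumptions for illustration, not the article's dataset or ensemble.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic random-walk "closing price" series (hypothetical data).
rng = np.random.default_rng(3)
close = 17000.0 + np.cumsum(rng.normal(scale=80.0, size=500))
X = close[:-1].reshape(-1, 1)   # today's close as the only feature
y = close[1:]                   # next day's close as the target

model_mape, naive_mape = [], []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Every training row precedes every test row in time.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    model_mape.append(mean_absolute_percentage_error(y[test_idx], pred) * 100)
    # Naive baseline: tomorrow's close equals today's close.
    naive_mape.append(mean_absolute_percentage_error(y[test_idx], X[test_idx, 0]) * 100)

print("Model MAPE per fold (%):", np.round(model_mape, 3))
print("Naive MAPE per fold (%):", np.round(naive_mape, 3))
```

A forecasting model only adds value when it improves on this baseline for data it has not yet seen.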