Forecasting the trajectory of the stock market remains elusive for investors and traders alike. Many methods and algorithms have emerged over time to address this problem, with varying levels of success. In this blog post, we discuss the XGBoost algorithm and why it can perform better than linear regression for predicting market direction. We also provide a Python code example that uses XGBoost to predict the next day’s NIFTY close, from which the direction can be derived.

**XGBoost Algorithm**

**XGBoost, short for eXtreme Gradient Boosting**, is a powerful machine-learning algorithm that has been gaining significant attention in recent years. XGBoost is an ensemble technique that builds a collection of decision trees sequentially, with each new tree trained to correct the errors of the trees before it. It is particularly effective in handling large datasets and can efficiently manage missing values, outliers, and multicollinearity.

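The dataset used for the implementation of the model is the **NIFTY_EOD.csv** file, which contains the **open, high, low, close, volume, PClose, lreg5, lreg7, lreg9, hma5, hma7, hma9, trsi, atr** values for the NIFTY index. The code below also expects the **Ticker**, **Date/Time**, and **close_forecast** columns, with **close_forecast** serving as the prediction target.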

Download the XGBoost Regression Features Dataset prepared using Amibroker

Download NIFTY EOD csv data set

**Why XGBoost is Better than Linear Regression**

**Non-linearity:** Unlike linear regression, which assumes a linear relationship between features and the target variable, XGBoost can model complex, non-linear relationships. This is particularly helpful for predicting stock market direction, as the underlying relationships between variables are often non-linear.

**Robustness**: XGBoost is generally more robust than linear regression to noise and outliers in the features, because tree splits depend on the ordering of feature values rather than on their magnitude. This makes the algorithm better suited for predicting market direction, which can be influenced by various factors that are not always apparent in the historical data.

**Regularization**: XGBoost includes regularization, which helps prevent overfitting by penalizing complex models. This helps the algorithm generalize better to new data, making it more reliable for predicting market direction.

**Handling missing values**: XGBoost can automatically handle missing values, making it easier to work with incomplete datasets. In contrast, linear regression often requires imputation or other preprocessing techniques to handle missing values.

**Python Code for Predicting NIFTY Close and Direction**

```
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error, r2_score, explained_variance_score, mean_absolute_error
import matplotlib.pyplot as plt
# Load the data
stock = pd.read_csv("NIFTY_EOD.csv")
data = stock
# Split the data into features and target
y = data['close_forecast']
X = data.drop(columns=['Ticker','Date/Time','close_forecast'], axis=1)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train an XGBoost model
model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42, booster='gbtree')
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Add predicted prices to test data
predicted_prices = X_test.copy()
predicted_prices['Close'] = y_pred
# Calculate evaluation metrics
mape = mean_absolute_percentage_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # square root of MSE; avoids the squared=False argument, which newer scikit-learn releases deprecate
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
explained_var = explained_variance_score(y_test, y_pred)
# Print the evaluation metrics
print("MAPE:", mape)
print("Mean squared error:", mse)
print("Root mean squared error:", rmse)
print("Mean absolute error:", mae)
print("R-squared:", r2)
print("Explained variance:", explained_var)
# Predict the next day's close from the most recent row of features
next_day = X.tail(1)
next_day_pred = model.predict(next_day)
next_day_close = next_day_pred[0]
print("Next day predicted close:", next_day_close)
# Store the predicted vs actual values in a separate csv
# (look up the real dates; X_test.index only holds row numbers)
predictions = pd.DataFrame({'Date/Time': data.loc[X_test.index, 'Date/Time'].values, 'Actual': y_test.values, 'Predicted': y_pred})
predictions.to_csv("NIFTY_xgboost_predictions.csv", index=False)
# Plot feature importance using Matplotlib
fig, ax = plt.subplots(figsize=(10, 8))
xgb.plot_importance(model, ax=ax, importance_type='gain')
plt.title('Feature Importance')
plt.show()
```
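
The script above relies on the default behaviour of `train_test_split`, which shuffles rows before splitting. For time-ordered data such as daily prices, a chronological split is a common alternative, because it keeps every test row later than every training row. The sketch below illustrates this on a small synthetic frame. The values are made up, and it assumes (the dataset description does not confirm this) that rows are sorted by date and that `close_forecast` holds the following day's close.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the NIFTY feature table, already sorted by date
frame = pd.DataFrame({"close": np.linspace(100.0, 200.0, 101)})
frame["close_forecast"] = frame["close"].shift(-1)  # assumed: next day's close
frame = frame.dropna()  # the last row has no next-day value

X = frame[["close"]]
y = frame["close_forecast"]

# shuffle=False keeps the original order: first 80% for training, last 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print("Last training row:", X_train.index.max())
print("First test row:", X_test.index.min())
```

Metrics from a chronological split are usually less flattering than those from a shuffled split, but they are closer to how the model would be used in practice.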

The NIFTY script above imports the necessary libraries, loads the NIFTY historical data, and separates the `close_forecast` target from the feature columns. It then splits the data into training and test sets and trains an XGBoost model. The model’s performance is evaluated using mean absolute percentage error (MAPE), mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), R-squared, and explained variance. Finally, the code predicts the next day’s NIFTY close from the most recent row of features. Directional accuracy is not computed by the script as shown; the direction can be derived by comparing the predicted close with the latest actual close.
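
One way to measure direction is to check how often the predicted move and the actual move, both taken relative to the current close, point the same way. The helper below is a hypothetical sketch and not part of the original script; its name and the toy numbers are assumptions.

```python
import numpy as np

def directional_accuracy(prev_close, actual_next, predicted_next):
    """Share of rows where predicted and actual moves have the same sign."""
    prev_close = np.asarray(prev_close, dtype=float)
    actual_dir = np.sign(np.asarray(actual_next, dtype=float) - prev_close)
    predicted_dir = np.sign(np.asarray(predicted_next, dtype=float) - prev_close)
    return float(np.mean(actual_dir == predicted_dir))

# Toy numbers: actual moves are up, down, up, down; predicted are up, up, up, down
prev_close = np.array([100.0, 102.0, 101.0, 105.0])
actual_next = np.array([102.0, 101.0, 105.0, 104.0])
predicted_next = np.array([101.0, 103.0, 104.0, 103.0])

accuracy = directional_accuracy(prev_close, actual_next, predicted_next)
print("Directional accuracy:", accuracy)  # 3 of 4 directions match -> 0.75
```

With the script's variables, and assuming `close` holds the same day's close while `close_forecast` holds the following day's close, this could be called as `directional_accuracy(X_test['close'], y_test, y_pred)`. The next-day direction can be read off in the same way by comparing `next_day_close` with `next_day['close'].iloc[0]`.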

**Python Output**

```
MAPE: 0.012514069979435037
Mean squared error: 7843.5028405418425
Root mean squared error: 88.56355255149741
Mean absolute error: 52.471655232747395
R-squared: 0.9996397979447728
Explained variance: 0.9996400254819038
Next day predicted close: 17407.646
```

**Output and Interpretation**

The output metrics show how closely the XGBoost model’s predictions track the actual NIFTY close. They measure price error, not market direction. Let’s analyze these metrics in detail:

**MAPE (Mean Absolute Percentage Error):** 0.012514069979435037. MAPE measures prediction accuracy as the average of the absolute errors relative to the actual values. scikit-learn reports it as a fraction rather than a percentage, so the value of 0.0125 means that, on average, the model’s predictions deviate by about 1.25% from the actual values.

**Mean Squared Error (MSE):** 7843.5028405418425. MSE is the average of the squared differences between the predicted and actual values. A lower MSE indicates a better fit, and because the errors are squared, large misses weigh heavily. In this case, the MSE is about 7843.5, expressed in squared index points.

**Root Mean Squared Error (RMSE):** 88.56355255149741. RMSE is the square root of MSE, which puts it in the same unit as the target variable. In this case, the RMSE is 88.56, so a typical prediction error is on the order of 88.56 index points, with large errors counting for more than they do in MAE.

**Mean Absolute Error (MAE):** 52.471655232747395. MAE is the average of the absolute differences between the predicted and actual values. It is a measure of prediction accuracy that is less sensitive to large errors than MSE or RMSE. In this case, the MAE is 52.47, which means that, on average, the model’s predictions are off by about 52.47 index points from the actual values.

**R-squared:** 0.9996397979447728. R-squared, also known as the coefficient of determination, is a measure of the proportion of the variance in the target variable that is predictable from the input features. It is at most 1, with 1 indicating a perfect fit, and it can be negative when a model does worse than always predicting the mean. In this case, the R-squared value of 0.9996 suggests that the model explains approximately 99.96% of the variation in the target variable.

**Explained Variance:** 0.9996400254819038. Explained variance is a measure of how well the model captures the variance in the target variable. Its best possible value is 1, with a higher value indicating better performance. Unlike R-squared, it ignores any constant bias in the predictions, so the near-identical values here indicate that the model’s average error is close to zero. In this case, the explained variance of 0.9996 suggests that the model captures approximately 99.96% of the variance in the target variable.

**Next day predicted close:** 17407.646. This is the model’s forecast of the next NIFTY closing value, predicted from the most recent row of features.
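
To make the definitions above concrete, the sketch below recomputes the six metrics with plain NumPy on four made-up values. The function name and the numbers are assumptions for illustration. The MAPE line matches scikit-learn's result as long as no actual value is zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Recompute the metrics discussed above from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    return {
        "mape": np.mean(np.abs(err) / np.abs(y_true)),
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(err)),
        # R-squared: 1 - residual sum of squares / total sum of squares
        "r2": 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2),
        # Explained variance uses the variance of the errors, so a constant bias drops out
        "explained_variance": 1 - np.var(err) / np.var(y_true),
    }

# Toy values: errors are -2, 2, -3, 1
metrics = regression_metrics(
    [100.0, 110.0, 120.0, 130.0],
    [102.0, 108.0, 123.0, 129.0],
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these toy values the MSE is 4.5, the MAE is 2.0, R-squared is 0.964, and explained variance is 0.966. The small gap between the last two comes from the average error of -0.5.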

**Feature Importance**

Feature importance values represent the relative contribution of each feature to the model’s predictions. XGBoost can report importance in several ways, including how often a feature is used to split across all the decision trees (`weight`) and the average improvement in the training objective that its splits bring (`gain`). The plot in the code above uses `importance_type='gain'`, so features are ranked by the average loss reduction of the splits that use them.

The XGBoost algorithm offers significant advantages over linear regression for predicting the stock market and market direction, particularly in handling non-linear relationships and providing robustness against noise and outliers. The provided Python code demonstrates how to use XGBoost to predict the next day’s NIFTY close; the direction follows from comparing that forecast with the latest close. While the error metrics for the close price are low, they do not measure directional accuracy, so the model may require further optimization to improve its ability to predict market direction.