Gradient boosting is one of the most powerful techniques for building predictive models. It is a machine learning technique for regression, classification, and other tasks that produces a prediction model in the form of an ensemble of weak prediction models (weak learners), typically decision trees or linear regression.
Weak learners are models that perform only slightly better than random guessing. Strong learners are models that can achieve arbitrarily good accuracy. Both notions come from computational learning theory and provide the basis for the boosting class of ensemble methods.
A classic example of a weak learner is a shallow decision tree. A single decision tree used on its own tends to perform poorly (it has high variance), but combining many decision trees performs much better. So here you have a question: why? Why does the model perform better when more decision trees are used?
Because when many decision trees are combined, their individual errors cancel each other out, producing a strong learner.
So in a boosting algorithm, rather than having each tree predict the actual output directly, each new tree predicts the error of the ensemble so far, and we keep driving that error down.
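This residual-fitting idea can be sketched in a few lines. The snippet below is a minimal illustration (not a production implementation), assuming scikit-learn's `DecisionTreeRegressor` as the weak learner and a made-up toy dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: y = x^2 plus a little noise (illustrative only).
rng = np.random.RandomState(0)
X = np.linspace(-2, 2, 200).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.1, size=200)

# Start from a constant prediction (the mean), then repeatedly
# fit a small tree to the current residuals and add it in.
prediction = np.full_like(y, y.mean())
learning_rate = 0.1
trees = []
for _ in range(100):
    residual = y - prediction          # errors of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)              # the weak learner targets the residual
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print(np.mean((y - prediction) ** 2))  # training MSE shrinks as trees are added
```

Each tree is weak on its own, but because every new tree corrects what the ensemble still gets wrong, their combination steadily becomes a strong learner.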
Why is the algorithm called Gradient Boosting?
Gradient here refers to the gradient of the loss function, which is the target value each new tree learns to predict (you'll see this in the algorithm part).
Boosting means we boost a weak learner by adding more weak learners, making the combination a strong learner.
Deep Dive into the Algorithm
Here is the algorithm.
It looks out of this world, right? Let me explain it in a simple way with an example.
Following the gradient of the loss function means we decrease our loss at every iteration, pushing it as low as possible, as in the example.
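For reference, the standard formulation of gradient boosting (for a generic differentiable loss $L$) can be written as:

```latex
\textbf{Input:}\ \text{data } \{(x_i, y_i)\}_{i=1}^{n},\ \text{differentiable loss } L(y, F(x)),\ \text{number of trees } M \\
1.\ F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma) \\
2.\ \text{For } m = 1 \text{ to } M: \\
\quad (a)\ r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}} \quad \text{(pseudo-residuals)} \\
\quad (b)\ \text{fit a weak learner } h_m(x) \text{ to the targets } r_{im} \\
\quad (c)\ \gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\big(y_i,\, F_{m-1}(x_i) + \gamma\, h_m(x_i)\big) \\
\quad (d)\ F_m(x) = F_{m-1}(x) + \nu\, \gamma_m\, h_m(x) \quad (\nu = \text{learning rate}) \\
3.\ \text{Output } F_M(x)
```

For squared-error loss, the pseudo-residual $r_{im}$ is exactly $y_i - F_{m-1}(x_i)$, the plain residual, which is why "fit trees to the errors" and "fit trees to the negative gradient" coincide in that case.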
Hyper Parameter Tuning
- Tree-specific parameters: affect each individual tree.
- Boosting parameters: affect the boosting operation of the model.
- Miscellaneous parameters: other parameters for overall functioning.
- min_samples_split: minimum number of samples (or observations) required in a node for it to be considered for splitting.
- min_samples_leaf: minimum number of samples required in a terminal node or leaf.
- max_depth: maximum depth of the tree.
- max_leaf_nodes: maximum number of terminal nodes or leaves in tree.
- max_features: number of features to consider when searching for the best split. These are selected at random.
- learning_rate: determines the impact of each tree on the final outcome; it controls the magnitude of each update to the estimates.
- n_estimators: number of sequential trees to be modeled.
- subsample: fraction of observations to be selected for each tree. Selection is done by random sampling.
- loss: the loss function to be minimized at each stage. It can take different values for the classification and regression cases.
- random_state: random number seed so that the same random numbers are generated every time.
- verbose: type of output printed while the model fits:
  - 0: no output generated
  - 1: output generated for trees at certain intervals
  - >1: output generated for every tree
Advantages
- Handles missing values; imputation is not required.
- Requires little data preprocessing, and often works with both categorical and numerical values.
- Often provides predictive accuracy that is hard to beat.
Disadvantages
- GBMs will keep improving to minimize all errors, which can overemphasize outliers and cause overfitting; cross-validation must be used to counter this.
- Computationally expensive: it often requires many trees, which can be time-consuming and memory-exhaustive.
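One practical way to catch that overfitting is to evaluate the model on held-out data after each boosting stage. A sketch using scikit-learn's `staged_predict` (synthetic data, arbitrary settings):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic data with a held-out validation split.
X, y = make_regression(n_samples=600, n_features=10, noise=15.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1,
                                  random_state=0)
model.fit(X_tr, y_tr)

# staged_predict yields predictions after each boosting stage, so we can
# see how many trees actually help on unseen data.
val_errors = [np.mean((y_val - pred) ** 2)
              for pred in model.staged_predict(X_val)]
best_n = int(np.argmin(val_errors)) + 1
print(best_n)  # number of trees to keep; beyond this, validation error rises
```

The same idea is built in via the `n_iter_no_change` and `validation_fraction` parameters, which stop boosting early once the validation score stops improving.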
Thanks for reading this blog :)