This is a study note for myself, so I don't forget what this theory is about after a while XD
🦹 XGBoost (Extreme Gradient Boosting)
What is XGBoost?
Think of XGBoost as a team of smart tutors, each correcting the mistakes made by the previous one, gradually improving your answers step by step.
🗝 Key Concepts in XGBoost Tree Building
- Start with an initial guess (e.g., average score).
- Measure how far off the prediction is from the real answer (this is called the residual).
- The next tree learns how to fix these errors.
- Every new tree improves on the mistakes of the previous trees.
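To make that loop concrete, here is a minimal toy sketch of the idea in Python, assuming squared-error loss and using scikit-learn's DecisionTreeRegressor as a stand-in for XGBoost's own tree builder (the data and variable names are my own, not part of XGBoost):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: predict y from one feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

prediction = np.full_like(y, y.mean())      # 1. start with an initial guess: the average
learning_rate = 0.3

for _ in range(50):
    residual = y - prediction               # 2. how far off is the current prediction?
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)   # 3. next tree learns the errors
    prediction += learning_rate * tree.predict(X)                # 4. improve on the previous trees

print("mean squared error:", round(float(np.mean((y - prediction) ** 2)), 4))
```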
🥢 How to Divide the Data (Not Randomly)
- XGBoost doesn’t split data based on traditional methods like information gain.
- It uses a formula called Gain, which measures how much a split improves prediction.
- A split only happens if:
  (Left Score + Right Score) > (Parent Score + Penalty)
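Written out, the Gain formula from the XGBoost paper looks like this, where G is the sum of first-order gradients and H the sum of second-order gradients (Hessians) in each branch, and λ, γ are the regularization knobs described later in this note:

$$
\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma
$$

A split is kept only when this Gain is positive, which is exactly the "left + right beats parent + penalty" check above.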
❓ How do we know if a split is good?
- Use a value called Similarity Score
- The higher the score, the more consistent (similar) the residuals are in that group
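For plain squared-error regression the Similarity Score boils down to (sum of residuals)² / (count + λ), so a tiny sketch with made-up numbers and helper names of my own looks like:

```python
import numpy as np

def similarity_score(residuals, lam=1.0):
    # Similarity Score for squared-error loss: (sum of residuals)^2 / (count + lambda)
    return residuals.sum() ** 2 / (len(residuals) + lam)

def split_gain(left, right, lam=1.0):
    # How much better the two children explain the residuals than the parent node did.
    parent = np.concatenate([left, right])
    return (similarity_score(left, lam) + similarity_score(right, lam)
            - similarity_score(parent, lam))

left = np.array([-10.5, -9.0])          # residuals that agree with each other -> high similarity
right = np.array([7.0, 8.5, 9.0])
print(round(split_gain(left, right), 2))
```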
🐢 Two Ways to Find Splits: Accurate (Exact Greedy Algorithm)
- Try all possible features and split points
- Very accurate but very slow
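A brute-force sketch of the exact greedy search over one feature (again squared-error residuals, with hypothetical helper names of my own):

```python
import numpy as np

def similarity(res, lam=1.0):
    return res.sum() ** 2 / (len(res) + lam)

def best_split_exact(x, residuals, lam=1.0, gamma=0.0):
    # Try every distinct value of this feature as a threshold and keep the highest gain.
    parent = similarity(residuals, lam)
    best_threshold, best_gain = None, -np.inf
    for threshold in np.unique(x):
        left, right = residuals[x < threshold], residuals[x >= threshold]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = similarity(left, lam) + similarity(right, lam) - parent - gamma
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

x = np.array([1.0, 2.0, 3.0, 8.0, 9.0])
residuals = np.array([-5.0, -4.0, -6.0, 7.0, 8.0])
print(best_split_exact(x, residuals))   # the winning threshold separates the negative residuals from the positive ones
```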
🐇 Two Ways to Find Splits: Fast (Approximate Algorithm)
- Uses feature quantiles (e.g., median) to propose a few split points
- Group the data based on these splits and evaluate the best one
- Two options:
- Global Proposal: propose the candidate splits once for the whole tree and reuse them at every level
- Local Proposal: re-propose candidates at each node, using only that node's data
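A rough sketch of the quantile idea, just to show how few candidates actually get evaluated (my own function name; the real implementation is more involved):

```python
import numpy as np

def candidate_splits(x, n_buckets=4):
    # Propose only a handful of thresholds at feature quantiles instead of every value.
    quantiles = np.linspace(0, 1, n_buckets + 1)[1:-1]    # e.g. 0.25, 0.5, 0.75
    return np.unique(np.quantile(x, quantiles))

x = np.random.default_rng(1).uniform(0, 100, size=1000)
print(candidate_splits(x))   # just a few thresholds; gain is then evaluated only at these
```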
🏋 Weighted Quantile Sketch
- Some data points are more important (like how teachers focus more on students who struggle)
- Each data point has a weight: the second-order gradient (Hessian) of the loss, which reflects how much that point still needs attention
- Use these weights to suggest better and more meaningful split points
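A toy sketch of the weighted-quantile idea: the cut points are chosen so that each bucket holds roughly the same total Hessian weight, rather than the same number of rows (names and numbers are mine, not XGBoost's internals):

```python
import numpy as np

def weighted_quantile_candidates(x, hessians, n_buckets=4):
    # Pick cut points so each bucket holds roughly the same total weight (sum of Hessians).
    order = np.argsort(x)
    x_sorted, w_sorted = x[order], hessians[order]
    cum_weight = np.cumsum(w_sorted) / w_sorted.sum()     # runs from ~0 up to 1
    targets = np.linspace(0, 1, n_buckets + 1)[1:-1]      # e.g. 0.25, 0.5, 0.75
    idx = np.searchsorted(cum_weight, targets)
    return np.unique(x_sorted[idx])

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=500)
hessians = np.where(x > 7, 5.0, 1.0)        # pretend the points with x > 7 matter more
print(weighted_quantile_candidates(x, hessians))   # cut points crowd toward the heavy region
```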
🕳 Handling Missing Values
- What if some feature values are missing?
- XGBoost learns a default path for missing data
- This makes the model more robust even when the data isn’t complete
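Conceptually, learning the default path just means trying the missing rows on both sides of a split and keeping the direction with the higher gain; a toy sketch (squared-error residuals, my own helper names):

```python
import numpy as np

def similarity(res, lam=1.0):
    return res.sum() ** 2 / (len(res) + lam)

def gain_with_default_direction(x, residuals, threshold, lam=1.0):
    # Try sending the missing rows left, then right, and keep whichever gain is higher.
    missing = np.isnan(x)
    left, right = (x < threshold) & ~missing, (x >= threshold) & ~missing
    parent = similarity(residuals, lam)
    gain_left = similarity(residuals[left | missing], lam) + similarity(residuals[right], lam) - parent
    gain_right = similarity(residuals[left], lam) + similarity(residuals[right | missing], lam) - parent
    if gain_left >= gain_right:
        return "missing goes left", gain_left
    return "missing goes right", gain_right

x = np.array([1.0, 2.0, np.nan, 8.0, 9.0])
residuals = np.array([-5.0, -4.0, -6.0, 7.0, 8.0])
print(gain_with_default_direction(x, residuals, threshold=5.0))
```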
🧚‍♀️ Controlling Model Complexity: Regularization
Gamma (γ)
- Penalizes overly complex trees
- A split only happens if Gain > Gamma
- Helps stop the model from splitting when it’s not really helpful
Lambda (λ)
- Shrinks leaf node prediction values
- Prevents overconfident and overfit models
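For squared-error loss the leaf output is (sum of residuals) / (count + λ), so a quick worked example with made-up residuals shows how λ pulls predictions toward zero:

```python
import numpy as np

residuals = np.array([8.0, 9.0, 10.0])   # a small leaf where the model is still very wrong

for lam in [0.0, 1.0, 10.0]:
    leaf_value = residuals.sum() / (len(residuals) + lam)   # optimal leaf output for squared error
    print(f"lambda = {lam:>4}: leaf predicts {leaf_value:.2f}")

# lambda = 0.0  -> 9.00 (fully trusts these three points)
# lambda = 10.0 -> 2.08 (a much more cautious, shrunken prediction)
```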
✂ Pruning
- After building the tree, XGBoost may prune parts that don’t help
- If a split’s gain is less than Gamma, that branch is cut off
- This leads to simpler trees that generalize better
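A toy sketch of what bottom-up pruning looks like, assuming a simple dict-based tree of my own (not XGBoost's actual data structure):

```python
def prune(node, gamma):
    # Toy bottom-up pruning: collapse a split if its children are leaves and its gain <= gamma.
    if node["leaf"]:
        return node
    node["left"] = prune(node["left"], gamma)
    node["right"] = prune(node["right"], gamma)
    if node["left"]["leaf"] and node["right"]["leaf"] and node["gain"] <= gamma:
        return {"leaf": True, "value": node["value"]}   # this branch gets cut off
    return node

tree = {
    "leaf": False, "gain": 0.4, "value": 0.0,           # weak split: gain 0.4 < gamma 1.0
    "left": {"leaf": True, "value": -0.1},
    "right": {"leaf": True, "value": 0.1},
}
print(prune(tree, gamma=1.0))   # collapses into a single leaf
```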
🧞‍♂️ Extra Tricks: Learn Smoothly and Fast
Shrinkage (Learning Rate)
- Only take a small step with each new tree
- Makes learning slower but more stable
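With made-up numbers: if the new tree wants to add 5.0 but the learning rate is 0.1, the prediction only moves by 0.5:

```python
# Made-up numbers: the new tree says "add 5.0", but with eta = 0.1 we only take a small step.
previous_prediction = 20.0
tree_correction = 5.0
eta = 0.1

new_prediction = previous_prediction + eta * tree_correction
print(new_prediction)   # 20.5 instead of jumping straight to 25.0
```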
Column Subsampling
- Only use a subset of features for each tree
- This speeds up training and reduces overfitting
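To tie the whole note together, here is how these knobs map onto the library's scikit-learn wrapper (the parameter values and toy data are only illustrative, not recommendations):

```python
import numpy as np
import xgboost as xgb

# Toy data with some feature values knocked out, to exercise the missing-value handling too.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)
X[rng.random(X.shape) < 0.1] = np.nan          # XGBoost routes NaN along the learned default path

model = xgb.XGBRegressor(
    n_estimators=200,        # how many trees in the "team of tutors"
    learning_rate=0.1,       # shrinkage: small steps, more stable learning
    gamma=1.0,               # minimum gain a split must clear to be kept
    reg_lambda=1.0,          # lambda: shrinks leaf prediction values
    colsample_bytree=0.8,    # column subsampling: each tree sees 80% of the features
    tree_method="hist",      # the fast, histogram/quantile-based split finding
)
model.fit(X, y)
print(model.predict(X[:5]))
```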