What I talk about when I talk about trees

Seed of the trees.

Trees were among the first ideas people reached for once they realized that putting rules onto machines by hand takes way too much work (Prolog, anyone?). Wouldn't it be nice if the machine could infer the rule structure for us, given a bunch of examples?
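If you want to see that idea in miniature, here is a minimal sketch, assuming scikit-learn and its toy iris dataset (both my choices, nothing from the old texts): hand the machine examples, get rules back.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Hand the machine a bunch of examples...
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# ...and it infers the rule structure for us.
print(export_text(tree, feature_names=load_iris().feature_names))
```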

Variance of the trees and Random Forest

Trees are notoriously known for having high variance. On a complex problem with many predictors, a single tree gets unwieldy and produces erratic output. People don't like that; they want their trees to have some shape, i.e. certain properties, so they prune the trees. Pruning deliberately introduces some bias into the tree estimator to make it behave in a desirable way. We will use this same idea later on, so keep it in mind.
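To make "introducing some bias" concrete, here is a small sketch on made-up data, assuming scikit-learn; the knobs shown are the usual pruning-style constraints, each trading a little bias for a lot less variance.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 5))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=500)   # noisy toy target

# An unconstrained tree grows until it memorizes the noise (high variance).
wild = DecisionTreeRegressor(random_state=0).fit(X, y)

# A "shaped" tree: each constraint injects a little bias to tame that variance.
pruned = DecisionTreeRegressor(
    max_depth=4,          # cap how deep the tree may grow
    min_samples_leaf=20,  # forbid tiny, noise-chasing leaves
    ccp_alpha=1e-3,       # cost-complexity pruning after growing
    random_state=0,
).fit(X, y)

print(wild.get_n_leaves(), pruned.get_n_leaves())  # the wild tree has far more leaves
```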

People who love trees also learned that if they build a bunch of trees and average them, they can reduce the variance of the prediction. This is the random forest. Random forest is essentially a variance reduction technique.
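Here is a hand-rolled sketch of that averaging, on made-up data and assuming scikit-learn, just to show the mechanism; a real random forest also subsamples features at each split, which RandomForestRegressor handles for you.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 5))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=500)

# Build a bunch of full-depth trees on bootstrap resamples...
trees = []
for _ in range(200):
    idx = rng.integers(0, len(X), size=len(X))          # bootstrap sample
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# ...and average them all over: the average is much less erratic than any single tree.
X_new = rng.uniform(-3, 3, size=(100, 5))
avg_pred = np.mean([t.predict(X_new) for t in trees], axis=0)
```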

Boosted trees

And boosting, as an old master said, is one of the finest ideas to come out of the field. Boosting has deep roots (pun!) in mathematics, stemming (another pun) from the idea of integration going back to the ancient masters, Newton and Leibniz: you can sum up a function over a domain if you slice the domain thinly enough. The same idea applies in boosting. In theory, you can use many very thin-slice estimators in tandem to model some function with great accuracy. (In practice, you have to deal with real-world problems like measurement error leading to noise, so you have to know when to stop, but more on that later.)
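Here is a bare-bones sketch of those thin slices being summed, on synthetic data of my own making and assuming scikit-learn; it is the skeleton of gradient boosting for squared loss, not any library's actual implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.2 * rng.normal(size=1000)

learning_rate = 0.1                        # how thin each slice is
prediction = np.full(len(y), y.mean())     # start from a constant
trees = []

for _ in range(300):
    residual = y - prediction                          # what is still unexplained
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * stump.predict(X)     # add a thin slice of it
    trees.append(stump)

print(np.mean((y - prediction) ** 2))      # training error shrinks as the slices pile up
```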

Boosted trees take the cake nowadays over other kinds of trees, and it makes sense. A modern implementation like XGBoost is the stuff worthy of the best money, and the people who created it gave it away for free. If everyone in the world who ever made a profit or saved a dime using XGBoost paid the creators some money, they would be so fucking rich.

Boosting is like summation. We can never go wrong letting a computer do sums, or can we? I've seen people come in with their trees, tune them upside down, and fuck off claiming it could not be done.

How should you tune a boosted tree, then?

First, know the goddamn trees. Really learn them. My best advice is to learn from the old masters, and from different old masters: the one who created the algorithm, the one who translated it to code. Drink from the source. You only need to learn this once for your whole life, and if you choose a career in this field, somewhere along the way you'll need to fit a tree. This is a must!

Second: know what's available to tune and how to tune it. This depends on the implementation. For XGBoost and contemporary boosted trees, go look at https://sites.google.com/view/lauraepp/parameters. This is your navigation map.
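As a rough orientation before you open that map, these are the knobs you will meet most often in XGBoost's scikit-learn wrapper; the values below are placeholders of mine, not recommendations.

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    learning_rate=0.05,      # eta: how thin each slice is
    max_depth=4,             # depth of each individual tree
    n_estimators=1000,       # how many trees to stack up
    subsample=0.8,           # row subsampling per tree
    colsample_bytree=0.8,    # column subsampling per tree
    min_child_weight=1.0,    # minimum hessian weight allowed in a leaf
    reg_lambda=1.0,          # L2 regularization on leaf values
)
```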

Third: apply your theories. Usually, set your learning rate first. The smaller the better, but slower; you pick your balance. The next most important thing is the depth of the individual trees. Too shallow and learning takes too long. Too deep and your trees are complex and have high variance. Remember that boosting is a bias reduction technique. Big, complicated trees get too deep into the details of the function being learned, overshoot, and then the subsequent trees have to correct for it. This is why you can get negative predictions from XGBoost when your target is entirely non-negative: a single tree cannot extrapolate beyond the range it sees, so what you are seeing is the trees correcting each other.
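A sketch of that order of operations with XGBoost's native API, on made-up data: pin a small learning rate, sweep only the depth, and let early stopping decide how many trees each depth deserves.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 10))
y = X[:, 0] ** 2 + 2 * X[:, 1] + 0.1 * rng.normal(size=3000)

dtrain = xgb.DMatrix(X[:2400], label=y[:2400])
dval = xgb.DMatrix(X[2400:], label=y[2400:])

# Learning rate is pinned first; only the depth varies.
for depth in [2, 3, 4, 6, 8]:
    params = {"eta": 0.05, "max_depth": depth, "objective": "reg:squarederror"}
    booster = xgb.train(
        params,
        dtrain,
        num_boost_round=3000,
        evals=[(dval, "val")],
        early_stopping_rounds=50,
        verbose_eval=False,
    )
    print(depth, round(booster.best_score, 4), booster.best_iteration)
```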

Do not do the things below. These are sins; may you burn in hell if you do them:

  • Tune the goddamn random seed. I have seen this many times. This is for Kaggle fuckwads and crooks. Please, god, wash my eyes with your holy water.
  • Random-search the whole parameter set, especially the important things like learning rate and tree depth. Doing this can find you a combination of parameters that produces dangerous results, like a bunch of very deep trees (high variance, over-correcting each other) that just happens to minimize your CV score. Who needs you, if all you do is run random search on the very thing that is supposed to provide your value? You will corrupt your own purpose.
  • Listen to fuckwads like me. Remember, it's better to learn from first-hand knowledge.

What should you do, then? This is hard, as it depends on the problem at hand. Use the scientific method: start with a theory and validate it with your testing scheme. Random search is okay, but remember it is a very, very dangerous tool. One day your model will break, and because you random-searched the parameter set, you will not know what caused it. This is a matter of having principles. This is the science in data science. Know how to ride your bike, and know how to fix it.

[Image: bike]
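To make "validate your theory with your testing scheme" concrete, here is one way I might do it, assuming XGBoost's built-in cross-validation and made-up data: one hypothesis at a time, against a fixed, repeatable scheme.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 10))
y = X[:, 0] ** 2 + 2 * X[:, 1] + 0.1 * rng.normal(size=3000)
dtrain = xgb.DMatrix(X, label=y)

# One theory at a time, e.g. "shallower trees generalize better on this problem."
for depth in [3, 8]:
    params = {"eta": 0.05, "max_depth": depth, "objective": "reg:squarederror"}
    cv = xgb.cv(
        params,
        dtrain,
        num_boost_round=2000,
        nfold=5,                    # the fixed testing scheme
        early_stopping_rounds=50,
        seed=0,
    )
    print(depth, cv["test-rmse-mean"].min())
```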

Next up:

  • Coding up some trees
  • Imbalances and methods that use subsampling
  • The MARS algorithm, and fitting it with trees and lr
Written on February 13, 2024