Aleksey Nikolayev 2023.03.02 15:39 #29431

Stanislav Korotky #:

Please explain how the following formula is derived in the tree-based classification algorithm (a link to a PDF is fine):

In all the materials I could find on the Internet, the formula is simply pulled out of thin air.

If the sum is over classes, the denominator is the Gini index, i.e. node impurity. The smaller it is, the better. The numerator is the number of rows in the leaf.

The larger the criterion, the better: classes are separated more cleanly, but without splitting the leaves too finely.

The Gini index seems to be chosen because it is considered more sensitive than the classification error rate.
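
The sensitivity claim can be illustrated with a small sketch. The two candidate splits below are made-up toy data (a parent node with 400 records of each class), and the helper names are hypothetical. Both splits have the same weighted classification error, yet the Gini index prefers the one that produces a pure node.

```python
from collections import Counter

def class_proportions(labels):
    """Share of each class among the labels of one node."""
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]

def gini(labels):
    """Gini index of a node: 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def error_rate(labels):
    """Classification error of a node that predicts its majority class."""
    return 1.0 - max(class_proportions(labels))

def weighted(measure, children):
    """Average of a node measure over child nodes, weighted by node size."""
    total = sum(len(ch) for ch in children)
    return sum(len(ch) / total * measure(ch) for ch in children)

# Toy data (an assumption for illustration): two ways to split a 400/400 parent.
split_a = [[0] * 300 + [1] * 100, [0] * 100 + [1] * 300]
split_b = [[0] * 200 + [1] * 400, [0] * 200]   # second child is pure

for name, split in (("A", split_a), ("B", split_b)):
    print(name,
          round(weighted(error_rate, split), 3),
          round(weighted(gini, split), 3))
```

Under these assumptions the error rate is 0.25 for both splits, while the weighted Gini index is 0.375 for split A and about 0.333 for split B.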

Stanislav Korotky 2023.03.02 15:47 #29432

Aleksey Nikolayev #:

If the sum is over classes, the denominator is the Gini index, i.e. node impurity. The smaller it is, the better. The numerator is the number of rows in the leaf.

The larger the criterion, the better: classes are separated more cleanly, but without splitting the leaves too finely.

The Gini index seems to be chosen because it is considered more sensitive than the classification error rate.

No, the sum is over the records that fall into the node. The question is not about the informativeness (impurity) measure. It's about passing "residuals" between trees: there is a constant conversion from probability to logit and back again.
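
The formula under discussion is not reproduced in the thread. Assuming it is the usual leaf-output formula of gradient boosting for binary log loss, sum(residuals) / sum(p * (1 - p)) over the records in the leaf, it can be read as one Newton step: the first derivative of the log loss with respect to the log-odds is p - y, and the second derivative is p * (1 - p). The sketch below uses toy numbers, and `leaf_value` and `learning_rate` are hypothetical names, not taken from any library.

```python
import math

def sigmoid(z):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def leaf_value(y, p_prev):
    """One Newton step for log loss in a leaf (assumed form of the formula):
    sum of residuals (y - p) divided by sum of p * (1 - p)."""
    residuals = [yi - pi for yi, pi in zip(y, p_prev)]
    curvature = [pi * (1.0 - pi) for pi in p_prev]
    return sum(residuals) / sum(curvature)

# Toy leaf (made-up data): labels of three records and the probabilities
# the previous trees currently predict for those same records.
y = [1, 1, 0]
p_prev = [0.7, 0.7, 0.7]

learning_rate = 0.1                      # hypothetical shrinkage value
gamma = leaf_value(y, p_prev)            # correction in log-odds units

old_log_odds = [math.log(p / (1.0 - p)) for p in p_prev]   # probability -> logit
new_log_odds = [z + learning_rate * gamma for z in old_log_odds]
new_p = [sigmoid(z) for z in new_log_odds]                 # logit -> probability

print(round(gamma, 4), [round(p, 4) for p in new_p])
```

In this reading, p is not a class frequency: it is the probability the ensemble built so far predicts for each individual record, which is why the sum runs over records rather than classes.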

Aleksey Nikolayev 2023.03.02 16:14 #29433

Stanislav Korotky #:

No, the sum is over the records that fall into the node. The question is not about the informativeness (impurity) measure. It's about passing "residuals" between trees: there is a constant conversion from probability to logit and back again.

And how can a frequency be computed for a single record at all? For a class it is clear how.

Aleksey Nikolayev 2023.03.02 16:38 #29434

Stanislav Korotky #:

No, the sum is over the records that fall into the node. The question is not about the informativeness (impurity) measure. It's about passing "residuals" between trees: there is a constant conversion from probability to logit and back again.

Or is this about classification by logistic regression? Either way, a formula taken out of context is not enough; the whole text is needed.

Stanislav Korotky 2023.03.02 17:39 #29435

Aleksey Nikolayev #:

Or is this about classification by logistic regression? Either way, a formula taken out of context is not enough; the whole text is needed.

The logit function in the sense of ln(odds). It is needed to map the range of probability values [0,1] onto the whole real line, from minus to plus infinity; otherwise you can't train by gradient.
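
A minimal sketch of that conversion in both directions, using only the standard library (the function names are illustrative). The endpoints 0 and 1 themselves map to minus and plus infinity, so the code works on the open interval.

```python
import math

def logit(p):
    """ln(odds): maps a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit: maps a real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for p in (0.01, 0.5, 0.99):
    z = logit(p)
    # Round trip: probability -> log-odds -> probability
    print(p, round(z, 3), round(sigmoid(z), 3))
```

Because the log-odds are unbounded, a gradient step of any size still yields a valid probability after converting back with `sigmoid`.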

For example, here is the text - https://medium.com/swlh/gradient-boosting-trees-for-classification-a-beginners-guide-596b594a14ea

And here is the video - https://www.youtube.com/watch?v=hjxgoUJ_va8.

P.S. IMHO, there are errors in both materials.

Gradient Boosting Trees for Classification: A Beginner’s Guide

Aratrika Pal
medium.com

Introduction Machine learning algorithms require more than just fitting models and making predictions to improve accuracy. Nowadays, most winning models in the industry or in competitions have been using Ensemble Techniques to perform better. One such technique is Gradient...

Forester 2023.03.02 18:12 #29436

Aleksey Nikolayev #:

If the sum is over classes, the denominator is the Gini index, i.e. node impurity. The smaller it is, the better. The numerator is the number of rows in the leaf.

The larger the criterion, the better: classes are separated more cleanly, but without splitting the leaves too finely.

The Gini index seems to be chosen because it is considered more sensitive than the classification error rate.

Oh!
Finally, someone who knows about the Gini index... I was looking for code for it back in 2018. https://www.mql5.com/ru/blogs/post/723619

Do trees and forests need class balancing?

www.mql5.com

I am reading: Flach P. - Machine Learning: The Art and Science of Algorithms that Make Sense of Data - 2015. It has several pages devoted to this topic. Here is the final one: Marked

Aleksey Nikolayev 2023.03.02 18:21 #29437

Stanislav Korotky #:

The logit function in the sense of ln(odds). It is needed to map the range of probability values [0,1] onto the whole real line, from minus to plus infinity; otherwise you can't train by gradient.

Yes, it is used in logistic regression, where you are looking for the probability of belonging to a class (or rather, the logit of that probability).

Stanislav Korotky #:

For example, here is the text - https://medium.com/swlh/gradient-boosting-trees-for-classification-a-beginners-guide-596b594a14ea

It seems the author wants to explain the internals of boosting in a popular way but has picked too complicated a variant of the problem. The article mixes logistic regression, trees and boosting, none of which is easy to understand on its own. The essence of boosting cannot be laid out rigorously without functional analysis. To understand the essence of logistic regression, you need probability theory (the binomial distribution, probably).

Aleksey Nikolayev 2023.03.02 18:28 #29438

Forester #:
Oh!
Finally, someone who knows about the Gini index... I was looking for code for it back in 2018. https://www.mql5.com/ru/blogs/post/723619

There's also the Gini coefficient. It is also used in ML, but it is a different thing.)

СанСаныч Фоменко 2023.03.02 19:07 #29439

Stanislav Korotky #:

Please explain how the following formula is derived in the algorithm of classification on trees with boosting (a link to a PDF is fine):

In all the materials I could find on the Internet, the formula is simply pulled out of thin air.

Where did you get the formula from? Judging by the "out of thin air", it is the usual home-grown amateurism, most likely Soviet.

You need to use professional maths, for which there are well-established algorithms.

R has a huge number of tree-based models, and what sets R, as a professional language, apart from very many others is that references to the authors of an algorithm and to the corresponding publication are obligatory. Offhand, I can't recall any reasonably complex function from R packages that lacks such references.

Forget about everything but R. Today it is the only professional environment for statistical calculations.

mytarmailS 2023.03.02 19:15 #29440

I love R, for me it's the best language in the world, but Sanych's constant advertising of it in every one of his posts makes me really sick.

Machine learning in trading: theory, models, practice and algo-trading - page 2944