Machine learning in trading: theory, models, practice and algo-trading - page 2037

 
Rorschach:

The last column is the target; the rest are inputs.

In general, I cut the sample into three parts: 60% for training, 20% for control, and a 20% sample not involved in training at all.

It eats a lot of memory - 18 gigabytes - which surprised me. How much memory do you have?

The learning process runs with almost default settings, but I see that results on the training sample improve rapidly, while the control sample stops improving after the first tree.

So the question is: are you sure there is a pattern there?

I suspect the classes are not well balanced at all - the share of ones seems to be somewhere around 10%?
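The split and the class-share check described above can be sketched in a few lines; a minimal illustration in plain Python (the function names, seed, and toy data are mine, not from the posts):

```python
import random

def split_60_20_20(rows, seed=42):
    """Shuffle and cut a dataset into train (60%), control (20%),
    and a holdout sample never touched during training (20%)."""
    rows = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    n = len(rows)
    a, b = int(n * 0.6), int(n * 0.8)
    return rows[:a], rows[a:b], rows[b:]

def positive_share(rows):
    """Fraction of rows whose last column (the target) equals 1."""
    return sum(1 for r in rows if r[-1] == 1) / len(rows)

# Toy data with 10% ones, roughly the imbalance suspected above.
data = [[i, 1 if i % 10 == 0 else 0] for i in range(1000)]
train, control, holdout = split_60_20_20(data)
print(len(train), len(control), len(holdout))  # 600 200 200
print(positive_share(data))                    # 0.1
```

If the share of ones is far from 0.5, it is worth checking it separately on each of the three parts as well, since a shuffle can still leave the small class unevenly spread.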

 
Igor Makanu:

So we can't formalize the concept of a TS (trading system)?

So it turns out that a TS is inspiration? Or like playing a musical instrument?

As soon as we manage to formalize it and write it down in some language, some smart guys will invent a compiler for that language and traders will vanish into oblivion)

Igor Makanu :

Or let's get back to our ... - It turns out that a TS is, first of all, the analysis of market information and decision-making.

Yes, while realizing that it is impossible to formalize clearly and unambiguously what these words mean, and understanding that for this reason the results of analyzing the same information may differ from person to person, and that only the future can show who was right)

 
dr.mr.mom:

Why such global pessimism? ))) I "watched" how they were trained back before all the modern packages, in NeuroShell Day Pro. Even then I got robust results, though I didn't know how it worked inside, and it was hard, almost impossible, to port to MT4.

I agree that it would be good to bolt on GPU support.

The question is what kind of NNs they are and in what paradigm they were built/trained; mine are evolving.

Yes, the first robust version can take up to 24 hours to train (though in practice it takes about 8 hours on an ancient home laptop). But within a month it becomes necessary to return to evolving that first variant further, at the expense of its robustness. That is, even with ten instruments trading live, a new variant is prepared in advance.

Now about the architecture: the NEAT algorithm is taken as the basis, supplemented with my own features. As a result everything evolves, including the architecture itself.

So it goes like this.

At the same time, I recommend reading books/lectures on microbiology, etc.

In disputes, unfortunately, one side is a fool (arguing without knowledge) and the other a scoundrel (arguing despite knowledge); I prefer an exchange of opinions with arguments/reasoning.

After all, the main thing is that it was useful; to hell with the checkers, let's just ride)))

There is nothing to argue about, because in any normal framework you just do it and show it, with a minimum of code.

Self-made models are not particularly discussed here, only mature ones like CatBoost or modern neural networks.

It's not even interesting to discuss this mouse fuss with MQL neural networks, because the world is far ahead, and the gap doubles every year.

Suppose you tell me: "I have such-and-such a model on TensorFlow"... I say: "Fine, I can build the same model in PyTorch in 5 minutes and check it." But you tell me that you dug up something in MQL. What do I need this information for? How can I reproduce it?

 
Aleksey Vyazmikin:

In general, I cut the sample into three parts: 60% for training, 20% for control, and a 20% sample not involved in training at all.

It eats a lot of memory - 18 gigabytes - which surprised me. How much memory do you have?

The learning process runs with almost default settings, but I see that results on the training sample improve rapidly, while the control sample stops improving after the first tree.

So the question is: are you sure there is a pattern there?

I suspect the classes are not well balanced at all - the share of ones seems to be somewhere around 10%?

Tree-based systems don't need class balancing when the sample is large. Neural networks get jammed by imbalance, while trees cleanly sort everything into leaves.
That's one of the reasons I switched to trees.

https://www.mql5.com/ru/blogs/post/723619

Do trees and forests need class balancing?
  • www.mql5.com
I'm reading: Flach, P. - Machine Learning: The Science and Art of Algorithms that Extract Knowledge from Data (2015) - there are several pages devoted to this topic. Here is the summary: ...
 
Aleksey Nikolayev:

Well, yes, only while realizing the impossibility of clearly and unambiguously formalizing what these words mean) and understanding that for this reason the results of analyzing the same information can vary greatly from person to person, and that only the future can show who was right)

With the analysis of market information there is, in general, no problem... except for the greed of the researcher who believes the market gives information only to him and that all the data must be processed. That is, here the task is formalized as searching for a repeating pattern; other data should be discarded (not used).

With decision-making it is sadder: generating a TS that passes the test and the forward is possible, but finding a link between the strategy tester's statistics and the lifetime of the TS, or a way to determine whether the TS matches the market context - that is the problem.

That is, as you write, the problem is in the future.


I think that, in general, we have made a little progress in formalizing the problem.

In principle, it is not difficult to export the testing statistics and try to train a TS in Python.

Determining the market context, imho, as you wrote, is purely the trader's decision; that is, I doubt it can be formalized, algorithmized, or researched.

 
elibrarius:
Tree-based systems don't seem to need class balancing. Neural networks get jammed by imbalance, while trees cleanly sort everything into leaves.
That's one of the reasons I switched to trees.

CatBoost does require it, but it has its own balancer, which apparently cannot cope.

In general, with a strong imbalance training will still run, but statistically the leaves dominated by zeros will contain only zeros. That is, if there are a few clear rules that pull out the small class, it can work; otherwise the small class gets smeared across all the leaves.

 
Aleksey Vyazmikin:

CatBoost does require it, but it has its own balancer, which apparently cannot cope.

In general, with a strong imbalance training will still run, but statistically the leaves dominated by zeros will contain only zeros. That is, if there are a few clear rules that pull out the small class, it can work; otherwise the small class gets smeared across all the leaves.

Or, as always, there are almost no patterns in the data.

Aleksey Vyazmikin:

In general, with a strong imbalance training will still run, but statistically the leaves dominated by zeros will contain only zeros. That is, if there are a few clear rules that pull out the small class, it can work; otherwise the small class gets smeared across all the leaves.

The rule of thumb is clear: take the split that makes the leaves cleanest of impurities from the other class.

I've added a link to a blog post; with a large sample there will be enough material to form leaves of the small class, plus you can use the root of the Gini index (only I have not found a formula for it).
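The "cleanest leaves" rule above can be sketched with the Gini impurity criterion; a minimal pure-Python illustration for one feature (the function names and toy data are mine, not from CatBoost or the blog post):

```python
def gini(labels):
    """Gini impurity of a set of 0/1 labels: 1 - p0^2 - p1^2.
    0.0 means a perfectly clean leaf; 0.5 is a 50/50 mix."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split(xs, ys):
    """Pick the threshold on a single feature that minimizes the
    size-weighted Gini impurity of the two resulting leaves."""
    best = (float("inf"), None)
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[0]:
            best = (score, t)
    return best  # (weighted impurity, threshold)

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
print(best_split(xs, ys))  # (0.0, 3): splitting at 3 yields two pure leaves
```

With a strong imbalance, a split that isolates even a small pure pocket of the minority class drives the weighted impurity down, which is why a large sample helps: there is enough of the small class for such pockets to form.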

 
Aleksey Vyazmikin:

I think that for such a huge amount of data you have to make the trees deeper, so that the leaves come out cleaner.
If 10 thousand examples are left in a leaf, of course it will be smeared, but if you split down to 100, I think it will already be cleaner.

The Alglib forest splits down to 1 example per leaf - 100% separation. Only 0s or 1s remain in the leaves.
 
elibrarius:
Aleksey Vyazmikin:

Or, as always, there are almost no patterns in the data.

The rule of thumb is clear: take the split that makes the leaves cleanest of impurities from the other class.

I've added a link to a blog post; with a large sample there will be enough material to form leaves of the small class, plus you can use the root of the Gini index (only I have not found a formula for it).

But he has few predictors - the dimensionality turns out small, so there are few options for tree combinations.

I took a 1% sample - it learns to 100% on the test - so I just don't think there is a pronounced pattern there.

Also, CatBoost picks predictors somewhat randomly when building - which, in their view, reduces overfitting.

elibrarius:

I think that for such a huge amount of data you need to make the trees deeper, so the leaves come out cleaner.
If 10k examples are left in a leaf, naturally it will be smeared, but if you bring the split down to 100, I think it will already be cleaner.

The tree depth is 6, and I think more depth is needed with a larger number of predictors.

I have made a grid of 256.

 
Aleksey Vyazmikin:

The tree depth is 6, and I think more depth is needed with a larger number of predictors.

The grid is 256.

The more rows, the more depth is needed.
If there are gigabytes, then there are millions of rows. At depth 6 the final leaf holds 1/64 of the total number of examples/rows, i.e. tens of thousands if there are millions of rows.

Try depth 15 (this seems to be the maximum; a leaf will hold a 1/32768 share of the rows).
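The depth arithmetic above is easy to check directly: a full binary tree of depth d has 2**d leaves. A tiny sketch (the 5 million row count is illustrative, standing in for "millions of rows"):

```python
def examples_per_leaf(n_rows, depth):
    """A full binary tree of the given depth has 2**depth leaves,
    so on average each leaf holds n_rows / 2**depth examples."""
    return n_rows / 2 ** depth

n = 5_000_000  # illustrative: "millions of rows"
print(2 ** 6, examples_per_leaf(n, 6))    # 64 leaves, 78125.0 per leaf
print(2 ** 15, round(examples_per_leaf(n, 15)))  # 32768 leaves, ~153 per leaf
```

So going from depth 6 to depth 15 takes the average leaf from tens of thousands of examples down to a few hundred, which matches the "cleaner leaves" argument above.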
