Machine learning in trading: theory, models, practice and algo-trading - page 3522

 
It's a "symbol fri" study
 
Maxim Dmitrievsky #:
It's a "symbol fri" study

When you increase the forecast horizon, you need to increase the sub-series length (order) for PE. Then the readings even out.

It's worth reading up on why it's needed and what it is used to investigate.
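For reference, a minimal sketch of permutation entropy for a given order and delay (plain NumPy; my illustration, not the code used in this thread):

import math
import numpy as np

def permutation_entropy(x, order=3, delay=1, normalize=True):
    # Bandt-Pompe permutation entropy of a 1-D series
    x = np.asarray(x, dtype=float)
    n = len(x) - (order - 1) * delay
    # Embed: each row is a sub-series of `order` points spaced by `delay`
    emb = np.column_stack([x[i * delay : i * delay + n] for i in range(order)])
    # A row's ordinal pattern is the permutation that sorts it
    patterns = np.argsort(emb, axis=1)
    _, counts = np.unique(patterns, axis=0, return_counts=True)
    p = counts / counts.sum()
    pe = -np.sum(p * np.log2(p))
    # Normalised PE lies in [0, 1]; log2(order!) is the maximum entropy
    return pe / math.log2(math.factorial(order)) if normalize else pe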
 

Yeah, the entropy of the labels doesn't show anything. Still, the feature-label relationship is stronger.

Let's investigate mutual information between features and labels then :)

The point is to find a faster way to screen datasets than through model training. For example, if you want to test a million different labelings.
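If screening is the goal, something like this sketch could rank candidate labelings by summed feature-label mutual information without fitting a single model (screen_labelings is a hypothetical helper, not an existing API):

import numpy as np
from sklearn.feature_selection import mutual_info_classif

def screen_labelings(X, labelings, top_k=10):
    # Total feature-label MI as a cheap proxy for trainability
    scores = np.array([mutual_info_classif(X, y, random_state=0).sum()
                       for y in labelings])
    best = np.argsort(scores)[::-1][:top_k]
    return best, scores[best]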
 
Maxim Dmitrievsky #:
When you increase the forecast horizon, you need to increase the sub-series length (order) for PE.

I did a small sweep of the settings - the data is the same as last time - 100 models:

# Sweep the PE embedding order and delay
for order in range(2, 5 + 1):
    for delay in range(1, 5 + 1):
        ...  # compute PE and the metrics for each (order, delay) pair

I looked for any dependence of the exam-sample balance on the PE measured on the train sample, using popular metrics.
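Presumably each row below comes from a linear fit of the metric against PE, roughly like this sketch (reading K_R as the regression slope is my assumption):

import numpy as np
from scipy.stats import linregress

def pe_vs_metric(pe, metric):
    # Linear fit of a per-model quality metric against per-model PE
    pe, metric = np.asarray(pe), np.asarray(metric)
    fit = linregress(pe, metric)
    resid = metric - (fit.slope * pe + fit.intercept)
    return {"Pearson": fit.rvalue, "K_R": fit.slope,
            "MSE": np.mean(resid ** 2), "MAE": np.mean(np.abs(resid)),
            "R2": fit.rvalue ** 2}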

PE_train 2:1: Pearson=0.2120, K_R=10668.2320, MSE=1809110.285444, MAE=1052.3752, R²=0.0450
PE_train 2:2: Pearson=0.2130, K_R=10607.4303, MSE=1808347.306197, MAE=1052.1491, R²=0.0454
PE_train 2:3: Pearson=0.2134, K_R=10480.4923, MSE=1808023.869826, MAE=1052.2776, R²=0.0455
PE_train 2:4: Pearson=0.2137, K_R=10530.8140, MSE=1807766.732252, MAE=1052.1285, R²=0.0457
PE_train 2:5: Pearson=0.2143, K_R=10492.1156, MSE=1807255.172838, MAE=1051.9337, R²=0.0459
PE_train 3:1: Pearson=0.2105, K_R=5123.4862, MSE=1810313.083552, MAE=1052.7133, R²=0.0443
PE_train 3:2: Pearson=0.2108, K_R=5116.5434, MSE=1810128.847783, MAE=1052.6238, R²=0.0444
PE_train 3:3: Pearson=0.2106, K_R=5079.5618, MSE=1810278.256578, MAE=1052.8251, R²=0.0443
PE_train 3:4: Pearson=0.2117, K_R=5118.5520, MSE=1809366.235165, MAE=1052.4615, R²=0.0448
PE_train 3:5: Pearson=0.2111, K_R=5080.3215, MSE=1809877.017708, MAE=1052.4972, R²=0.0446
PE_train 4:1: Pearson=0.2098, K_R=3377.9085, MSE=1810918.840630, MAE=1052.7970, R²=0.0440
PE_train 4:2: Pearson=0.2099, K_R=3371.8602, MSE=1810850.553547, MAE=1052.8565, R²=0.0440
PE_train 4:3: Pearson=0.2104, K_R=3356.9281, MSE=1810406.609599, MAE=1052.7722, R²=0.0443
PE_train 4:4: Pearson=0.2107, K_R=3380.6115, MSE=1810141.272369, MAE=1052.7274, R²=0.0444
PE_train 4:5: Pearson=0.2104, K_R=3345.9473, MSE=1810389.919147, MAE=1052.5876, R²=0.0443
PE_train 5:1: Pearson=0.2092, K_R=2516.2054, MSE=1811376.593502, MAE=1052.9032, R²=0.0438
PE_train 5:2: Pearson=0.2099, K_R=2520.5883, MSE=1810831.538400, MAE=1052.8098, R²=0.0441
PE_train 5:3: Pearson=0.2100, K_R=2510.1603, MSE=1810713.252781, MAE=1052.8697, R²=0.0441
PE_train 5:4: Pearson=0.2106, K_R=2525.6652, MSE=1810221.028916, MAE=1052.7761, R²=0.0444
PE_train 5:5: Pearson=0.2105, K_R=2508.2508, MSE=1810340.775101, MAE=1052.5982, R²=0.0443

And the same, but PE on the test sample.

PE_test 2:1: Pearson=0.2673, K_R=29896.0754, MSE=1758957.667488, MAE=1034.6691, R²=0.0714
PE_test 2:2: Pearson=0.2742, K_R=29421.3679, MSE=1751891.115380, MAE=1034.1283, R²=0.0752
PE_test 2:3: Pearson=0.2697, K_R=28188.3758, MSE=1756534.838509, MAE=1035.5260, R²=0.0727
PE_test 2:4: Pearson=0.2706, K_R=28345.7328, MSE=1755524.788869, MAE=1035.3251, R²=0.0732
PE_test 2:5: Pearson=0.2700, K_R=28274.8000, MSE=1756225.386620, MAE=1035.7970, R²=0.0729
PE_test 3:1: Pearson=0.2652, K_R=14368.3983, MSE=1761077.011646, MAE=1035.4601, R²=0.0703
PE_test 3:2: Pearson=0.2713, K_R=14396.7686, MSE=1754831.979656, MAE=1034.9431, R²=0.0736
PE_test 3:3: Pearson=0.2687, K_R=14033.0994, MSE=1757484.879622, MAE=1035.6576, R²=0.0722
PE_test 3:4: Pearson=0.2681, K_R=13986.9691, MSE=1758165.233913, MAE=1036.0484, R²=0.0719
PE_test 3:5: Pearson=0.2686, K_R=14012.8619, MSE=1757640.773055, MAE=1035.8400, R²=0.0721
PE_test 4:1: Pearson=0.2660, K_R=9588.0751, MSE=1760202.147769, MAE=1035.2405, R²=0.0708
PE_test 4:2: Pearson=0.2705, K_R=9534.4856, MSE=1755664.591866, MAE=1035.1102, R²=0.0732
PE_test 4:3: Pearson=0.2685, K_R=9349.8630, MSE=1757710.526145, MAE=1035.7041, R²=0.0721
PE_test 4:4: Pearson=0.2680, K_R=9316.3301, MSE=1758269.952043, MAE=1036.0452, R²=0.0718
PE_test 4:5: Pearson=0.2683, K_R=9324.2983, MSE=1757953.343632, MAE=1035.9019, R²=0.0720
PE_test 5:1: Pearson=0.2663, K_R=7186.8870, MSE=1759966.629349, MAE=1035.2232, R²=0.0709
PE_test 5:2: Pearson=0.2696, K_R=7108.9714, MSE=1756603.174675, MAE=1035.4235, R²=0.0727
PE_test 5:3: Pearson=0.2682, K_R=7001.0514, MSE=1757992.903600, MAE=1035.8041, R²=0.0719
PE_test 5:4: Pearson=0.2679, K_R=6985.1808, MSE=1758366.487036, MAE=1036.0487, R²=0.0717
PE_test 5:5: Pearson=0.2682, K_R=7002.3867, MSE=1758019.910487, MAE=1035.9909, R²=0.0719

And Recall instead of balance.

PE_train 2:1: Pearson=0.6915, K_R=0.1802, MSE=0.000027, MAE=0.0036, R²=0.4782
PE_train 2:2: Pearson=0.6938, K_R=0.1790, MSE=0.000026, MAE=0.0036, R²=0.4813
PE_train 2:3: Pearson=0.6936, K_R=0.1764, MSE=0.000026, MAE=0.0036, R²=0.4810
PE_train 2:4: Pearson=0.6938, K_R=0.1771, MSE=0.000026, MAE=0.0036, R²=0.4813
PE_train 2:5: Pearson=0.6940, K_R=0.1760, MSE=0.000026, MAE=0.0036, R²=0.4817
PE_train 3:1: Pearson=0.6926, K_R=0.0873, MSE=0.000026, MAE=0.0036, R²=0.4797
PE_train 3:2: Pearson=0.6942, K_R=0.0873, MSE=0.000026, MAE=0.0036, R²=0.4819
PE_train 3:3: Pearson=0.6936, K_R=0.0867, MSE=0.000026, MAE=0.0036, R²=0.4810
PE_train 3:4: Pearson=0.6941, K_R=0.0869, MSE=0.000026, MAE=0.0036, R²=0.4818
PE_train 3:5: Pearson=0.6941, K_R=0.0865, MSE=0.000026, MAE=0.0036, R²=0.4818
PE_train 4:1: Pearson=0.6932, K_R=0.0578, MSE=0.000026, MAE=0.0036, R²=0.4805
PE_train 4:2: Pearson=0.6940, K_R=0.0578, MSE=0.000026, MAE=0.0036, R²=0.4816
PE_train 4:3: Pearson=0.6943, K_R=0.0574, MSE=0.000026, MAE=0.0036, R²=0.4821
PE_train 4:4: Pearson=0.6940, K_R=0.0577, MSE=0.000026, MAE=0.0036, R²=0.4817
PE_train 4:5: Pearson=0.6943, K_R=0.0572, MSE=0.000026, MAE=0.0036, R²=0.4820
PE_train 5:1: Pearson=0.6934, K_R=0.0432, MSE=0.000026, MAE=0.0036, R²=0.4808
PE_train 5:2: Pearson=0.6941, K_R=0.0432, MSE=0.000026, MAE=0.0036, R²=0.4818
PE_train 5:3: Pearson=0.6942, K_R=0.0430, MSE=0.000026, MAE=0.0036, R²=0.4819
PE_train 5:4: Pearson=0.6945, K_R=0.0431, MSE=0.000026, MAE=0.0036, R²=0.4823
PE_train 5:5: Pearson=0.6945, K_R=0.0429, MSE=0.000026, MAE=0.0036, R²=0.4823
PE_test 2:1: Pearson=0.6793, K_R=0.3936, MSE=0.000027, MAE=0.0037, R²=0.4614
PE_test 2:2: Pearson=0.6786, K_R=0.3772, MSE=0.000027, MAE=0.0037, R²=0.4605
PE_test 2:3: Pearson=0.6796, K_R=0.3680, MSE=0.000027, MAE=0.0037, R²=0.4618
PE_test 2:4: Pearson=0.6793, K_R=0.3685, MSE=0.000027, MAE=0.0037, R²=0.4614
PE_test 2:5: Pearson=0.6790, K_R=0.3684, MSE=0.000027, MAE=0.0037, R²=0.4611
PE_test 3:1: Pearson=0.6799, K_R=0.1908, MSE=0.000027, MAE=0.0037, R²=0.4622
PE_test 3:2: Pearson=0.6789, K_R=0.1866, MSE=0.000027, MAE=0.0037, R²=0.4609
PE_test 3:3: Pearson=0.6792, K_R=0.1837, MSE=0.000027, MAE=0.0037, R²=0.4613
PE_test 3:4: Pearson=0.6791, K_R=0.1835, MSE=0.000027, MAE=0.0037, R²=0.4612
PE_test 3:5: Pearson=0.6788, K_R=0.1835, MSE=0.000027, MAE=0.0037, R²=0.4608
PE_test 4:1: Pearson=0.6797, K_R=0.1269, MSE=0.000027, MAE=0.0037, R²=0.4620
PE_test 4:2: Pearson=0.6789, K_R=0.1239, MSE=0.000027, MAE=0.0037, R²=0.4609
PE_test 4:3: Pearson=0.6792, K_R=0.1225, MSE=0.000027, MAE=0.0037, R²=0.4613
PE_test 4:4: Pearson=0.6788, K_R=0.1222, MSE=0.000027, MAE=0.0037, R²=0.4607
PE_test 4:5: Pearson=0.6790, K_R=0.1222, MSE=0.000027, MAE=0.0037, R²=0.4610
PE_test 5:1: Pearson=0.6797, K_R=0.0950, MSE=0.000027, MAE=0.0037, R²=0.4620
PE_test 5:2: Pearson=0.6789, K_R=0.0927, MSE=0.000027, MAE=0.0037, R²=0.4610
PE_test 5:3: Pearson=0.6789, K_R=0.0918, MSE=0.000027, MAE=0.0037, R²=0.4610
PE_test 5:4: Pearson=0.6787, K_R=0.0917, MSE=0.000027, MAE=0.0037, R²=0.4607
PE_test 5:5: Pearson=0.6797, K_R=0.0919, MSE=0.000027, MAE=0.0037, R²=0.4621

It seems the effect is within measurement error...

 

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html

from sklearn.feature_selection import mutual_info_classif

mi = mutual_info_classif(x, y).sum()

No explicit dependence is visible here either.

Iteration: 0, Cluster: 7, MI: 1.2189377208353847
R2: 0.9406613620025471
Iteration: 0, Cluster: 2, MI: 0.8117673610913525
R2: 0.9710414538275104
Iteration: 0, Cluster: 8, MI: 0.8146994795423588
R2: 0.9614192040590535
Iteration: 0, Cluster: 5, MI: 0.774313498935937
R2: 0.9718430787767102
Iteration: 0, Cluster: 3, MI: 1.2033466468094358
R2: 0.9522122802061364
Iteration: 0, Cluster: 6, MI: 0.676973923285926
R2: 0.9641691565374338
Iteration: 0, Cluster: 0, MI: 0.7839258649684904
R2: 0.9086756811000309
Iteration: 0, Cluster: 1, MI: 1.0351451757748995
R2: 0.9418879595151826
Iteration: 0, Cluster: 4, MI: 1.84960423590218
R2: 0.8148987348464255
Iteration: 0, Cluster: 9, MI: 0.750999533271425
R2: 0.9591022504168845
 
Maxim Dmitrievsky #:
The point is to find a faster way to screen datasets than through model training.

I think it can be done with my method. It is enough to estimate the probability bias in the quantum segments and devise a metric that summarises the result over all predictors, for example the percentage of quantum segments that pass selection. If few segments pass, training will be difficult, which indirectly means the partitioning is of low quality (if we believe that).

A single dataset is enough, and you can iterate over different labelings.

However, this only tells you how easy training will be on the train set; it won't say anything about what happens afterwards.

Still, I plan to collect statistics on predictors across different samples; then it will be clearer whether a predictor can be judged successful independently of the sample, or whether such a conclusion and choice can only be made by evaluating it relative to the labeling.
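One possible reading of that idea in code (my interpretation, not Aleksey's actual implementation; the bin count and thresholds are illustrative):

import numpy as np
import pandas as pd

def segment_bias_score(feature, labels, n_bins=10, min_bias=0.05, min_frac=0.05):
    # Share of quantile bins ("quantum segments") whose class-1 rate deviates
    # from the global rate by at least min_bias; tiny bins are skipped
    df = pd.DataFrame({"f": feature, "y": labels})
    df["bin"] = pd.qcut(df["f"], q=n_bins, duplicates="drop")
    base = df["y"].mean()
    passed = total = 0
    for _, grp in df.groupby("bin", observed=True):
        if len(grp) < min_frac * len(df):
            continue
        total += 1
        if abs(grp["y"].mean() - base) >= min_bias:
            passed += 1
    return passed / total if total else 0.0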

 
Aleksey Vyazmikin #:

I think it can be done with my method. It is enough to estimate the probability bias in the quantum segments and devise a metric that summarises the result over all predictors, for example the percentage of quantum segments that pass selection. If few segments pass, training will be difficult, which indirectly means the partitioning is of low quality (if we believe that).

A single dataset is enough, and you can iterate over different labelings.

However, this only tells you how easy training will be on the train set; it won't say anything about what happens afterwards.

Still, I plan to collect statistics on predictors across different samples; then it will be clearer whether a predictor can be judged successful independently of the sample, or whether such a conclusion and choice can only be made by evaluating it relative to the labeling.

In principle, everything on my side is optimised and runs fast; I'm just tinkering with what can be improved. The tester has been rewritten and now runs quickly.
 
If anyone likes digging into the features, you can quickly look at them in this kind of breakdown (a Plotly parallel-coordinates plot).
 
Maxim Dmitrievsky #:
In principle, everything on my side is optimised and runs fast; I'm just tinkering with what can be improved. The tester has been rewritten and now runs quickly.

My basic calculation (one iteration, which is enough for a quick estimate) takes about 2 seconds for a sample of 27,000 rows by 5,000 columns.

 
Aleksey Vyazmikin #:

My basic calculation (one iteration, which is enough for a quick estimate) takes about 2 seconds for a sample of 27,000 rows by 5,000 columns.

Iteration: 0, Cluster: 7
R2: 0.9577954746167225
Iteration: 0, Cluster: 6
R2: 0.9852171635934878
Iteration: 0, Cluster: 1
R2: 0.9710216139827996
Iteration: 0, Cluster: 9
R2: 0.9684366979272643
Iteration: 0, Cluster: 0
R2: 0.9496932074276948
Iteration: 0, Cluster: 5
R2: 0.8265173798554108
Iteration: 0, Cluster: 8
R2: 0.9585784944252715
Iteration: 0, Cluster: 3
R2: 0.9684322477781597
Iteration: 0, Cluster: 4
R2: 0.9790328641882593
Iteration: 0, Cluster: 2
R2: 0.9401401207526833

Execution time: 7.3924901485443115 seconds

10 models (each is a pair: a main model and a meta model)

And a ready-made trading system right away.
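For readers unfamiliar with the main/meta pairing, a minimal sketch of how such a pair is commonly combined at inference time (my illustration, not Maxim's code; assumes sklearn-style classifiers):

def signal(x, main_model, meta_model, threshold=0.5):
    # The meta model decides whether to trade at all;
    # the main model then gives the direction
    if meta_model.predict_proba([x])[0, 1] < threshold:
        return 0                      # no trade
    p_up = main_model.predict_proba([x])[0, 1]
    return 1 if p_up > 0.5 else -1    # 1 = buy, -1 = sell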


I run batches of 20-100 retrainings with different parameters. The labeling has the biggest influence.

So I want to find a way to arrive at the most correct labeling.