Machine learning in trading: theory, models, practice and algo-trading - page 3522
It's a "symbol fri" study
When you increase the forecast horizon, you need to increase the length of the sub-series (the order) for PE. Then the readings even out.
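For reference, a minimal hand-rolled PE sketch (my own illustration, not code from this thread) showing where the sub-series length, i.e. the order, enters the calculation:

import numpy as np
from math import factorial, log

def permutation_entropy(x, order=3, delay=1, normalize=True):
    # Permutation entropy of a 1-D series for a given embedding order.
    x = np.asarray(x, dtype=float)
    n = len(x) - (order - 1) * delay
    # Each sub-series of length `order` is reduced to its ordinal (rank) pattern.
    patterns = np.array([np.argsort(x[i:i + order * delay:delay]) for i in range(n)])
    _, counts = np.unique(patterns, axis=0, return_counts=True)
    p = counts / counts.sum()
    pe = -np.sum(p * np.log(p))
    if normalize:
        pe /= log(factorial(order))  # scale to [0, 1]
    return pe

A longer forecast horizon then simply means passing a larger order, so longer ordinal patterns are compared.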
You should read up on why it is needed and what it actually measures. Yeah, the entropy of the labels doesn't show anything on its own. Still, the feature-label relationship is stronger.
Let's investigate mutual information between features and labels then :)
The point is to find a faster way to screen datasets than through model training. For example, if you want to test a million different labelings.
I made a small re-selection of settings - the data is the same as last time - 100 models
Looked for any dependence of the exam-sample balance on the PE measured on the train sample - used the popular metrics.
And the same, but with PE measured on the test sample.
And with Recall instead of balance.
It seems that the impact is within the measurement error....
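A dependence check of that sort could be sketched roughly like this (one PE value and one exam-sample metric per model; the arrays here are placeholders, not the actual results):

import numpy as np
from scipy.stats import pearsonr, spearmanr

# One entry per trained model: PE on the train sample and the balance
# (or Recall) achieved on the exam sample. Placeholder values only.
pe_train = np.random.rand(100)
exam_metric = np.random.rand(100)

r, p = pearsonr(pe_train, exam_metric)
rho, p_s = spearmanr(pe_train, exam_metric)
print(f"Pearson r={r:.3f} (p={p:.3f}), Spearman rho={rho:.3f} (p={p_s:.3f})")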
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html
from sklearn.feature_selection import mutual_info_classif
mi = mutual_info_classif(x, y).sum()
You can't see an explicit dependency either.
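To make the idea concrete, a self-contained sketch (made-up data, not the thread's dataset) of ranking candidate labelings by total feature-label mutual information instead of training a model on each one:

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))  # hypothetical feature matrix
candidates = [rng.integers(0, 2, size=5000) for _ in range(10)]  # candidate labelings to screen

# Higher total MI is taken as a cheap proxy for a stronger feature-label relationship.
scores = [mutual_info_classif(X, y, random_state=0).sum() for y in candidates]
best = int(np.argmax(scores))
print("best candidate labeling:", best, "total MI:", scores[best])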
The point of the exercise is to find a faster way to screen datasets than through model training.
I think it can be done with my method. It is enough to estimate the probability bias in the quantum segments and to devise a metric that summarises the result over all predictors, for example the percentage of quantum segments that passed selection. If there are few of them, learning will be difficult, which indirectly means that the partitioning is not of high quality (if we believe that).
A single dataset is enough, and one can enumerate different markups.
However, this will only tell you how easy it is to train on the train set; it won't tell you what happens afterwards.
Still, I plan to collect statistics on predictors across different samples; then it will be clearer whether a predictor can be considered successful independently of the sample, or whether such a conclusion and choice can only be made by evaluating it relative to the markup.
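As I read it, the idea might look roughly like this; the equal-frequency quantisation, the z-score threshold on the per-segment class-rate shift and the minimum segment size are my assumptions, not the author's actual implementation:

import numpy as np

def passed_segment_share(X, y, n_bins=10, min_frac=0.01, z_thresh=2.0):
    # Share of quantisation ("quantum") segments per predictor whose class rate
    # deviates noticeably from the overall rate; low shares suggest hard learning.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    n, p0 = len(y), y.mean()  # overall positive-class rate
    shares = []
    for j in range(X.shape[1]):
        edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1))
        idx = np.clip(np.searchsorted(edges, X[:, j], side="right") - 1, 0, n_bins - 1)
        passed = total = 0
        for b in range(n_bins):
            mask = idx == b
            m = int(mask.sum())
            if m < min_frac * n:
                continue  # skip segments that are too small to judge
            total += 1
            p = y[mask].mean()
            z = (p - p0) / np.sqrt(p0 * (1 - p0) / m)  # probability-bias z-score
            if abs(z) >= z_thresh:
                passed += 1
        shares.append(passed / total if total else 0.0)
    return np.array(shares)  # one value per predictor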
In principle, everything is optimised and works fast; I'm just playing with what can still be improved. The tester has been rewritten and now calculates quickly.
A basic calculation (one iteration, which is enough for a quick evaluation) takes about 2 seconds for a sample of 27,000 rows by 5,000 columns.
10 models (two in each: a main and a meta model).
And a ready-to-use trading system (TS) right away.
I run batches of 20-100 retrainings with different parameters. The markup has the biggest influence.
So I want to find a way to get the most correct markup.
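For context, a hedged sketch of what a "main + meta" pair could look like, in the spirit of meta-labeling; the data, the sklearn classifier and the thresholds are my assumptions, not the author's pipeline:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 20))  # hypothetical features
y_dir = rng.integers(0, 2, size=5000)  # direction labels: 1 = buy, 0 = sell

# Main model: predicts trade direction.
main = RandomForestClassifier(n_estimators=100, random_state=0)

# Meta labels: 1 where the main model's out-of-fold prediction was correct,
# so the meta model learns when the main signal can be trusted.
oof = cross_val_predict(main, X, y_dir, cv=3)
y_meta = (oof == y_dir).astype(int)

main.fit(X, y_dir)
meta = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y_meta)

def signal(x_row):
    x_row = np.asarray(x_row).reshape(1, -1)
    if meta.predict_proba(x_row)[0, 1] > 0.5:  # trade only if the meta model approves
        return int(main.predict(x_row)[0])  # 1 = buy, 0 = sell
    return None  # stay out of the market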