Machine learning in trading: theory, models, practice and algo-trading - page 3604

 
A library from Google (or not from Google; Google also had a similar one) for the same purposes (I haven't tried it, never got around to it). I like to come up with my own ideas.
GitHub - cleanlab/cleanlab: The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
cleanlab helps you clean data and labels by automatically detecting issues in a ML dataset. To facilitate machine learning with messy, real-world data, this data-centric AI package uses your existing models to estimate dataset...
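For reference, a minimal sketch of how cleanlab is typically used to flag suspicious labels (I haven't run this; the exact function and argument names should be checked against the current cleanlab docs):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# Toy placeholder data; in practice these would be your own features and labels
X = np.random.rand(500, 10)
y = np.random.randint(0, 2, 500)

# Out-of-fold class probabilities from any model, so the estimates are not overfitted
pred_probs = cross_val_predict(RandomForestClassifier(), X, y, cv=5, method='predict_proba')

# Indices of rows whose labels look inconsistent with the model's predictions
issue_idx = find_label_issues(labels=y, pred_probs=pred_probs, return_indices_ranked_by='self_confidence')
print("Rows with suspected label problems:", issue_idx[:20])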
 
Aleksey Nikolayev #:
Imho, one should also want to make a proper forum on ML in trading ))

I want it done, but I don't want to do it )

 

If you don't make the forward (out-of-sample) period too long, it works like a charm. Then retrain, taking the new data into account.

Any features, but preferably ones close to the prices. You don't need many, 10 is enough. It is desirable to normalise them for clustering if their values differ a lot.

It is desirable to make the labelling dense, that is, every change of the features should have a label. But it can also be done on a sparse dataset. The labels should correspond to profitable trades in most cases; otherwise the effect will be the opposite.
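A minimal sketch of the normalisation-before-clustering step, assuming a plain DataFrame of features (the file name and parameters here are placeholders, not anyone's actual setup):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# ~10 price-derived features; scale them so no single feature dominates the distance metric
df = pd.read_csv("features.csv")
X = StandardScaler().fit_transform(df.values)

# Cluster the scaled features; the cluster id becomes the "pattern" label for each row
clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X)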
 
Maxim Dmitrievsky #:
Apparently nobody reads, only flood in the thread.

I think I already wrote then that I have something similar....

import os
import numpy as np
import pandas as pd

# Helper functions used below (f_Load_Data, f_Normalization, clusterize_and_filter,
# f_Klaster_Load, f_Statistics, f_Creat_Viborka) are defined elsewhere in this project.

# Data source configuration
#dir_cb = "C:\\FX\\MT5_02\\MQL5\\Files\\00_Standart_50_Test"
dir_cb = "E:\\FX\\MT5_CB\\MQL5\\Files\\iDelta_D1_TP_4_SL_4_Random\\CB_Bi_Setup"

arr_viborka = ["train.csv", "test.csv", "exam.csv"]
strok_total_data = 1000
stolb_total_data = 100
Load_Strok = strok_total_data
Load_Stolb = stolb_total_data
# Clustering parameters
arr_Tree = [3,3,3] # Tree structure: the number of clusters/leaves at each level/tier
n_level = len(arr_Tree)
arr_filter = []
arr_centroid_arhiv = []

scaler = None # Holds the fitted scaler used for normalisation
Variant_Norm=3 # Normalisation variant
iteration=0 # Index variable for the enumeration of runs

# Data processing parameters
Use_Convert_CSV = True # Open the CSV files, convert them and save them in feather format
Use_Load_CSV = True # Load data from file, or use randomly generated data (for debugging the code)
Not_Bi_Data = False # True if the sample is not binary
Use_Creat_New_Viborka = True # Create a new sample by filtering on the clustering statistics
Variant_Filter_Klaster = 2 # Variant of the filtering conditions

# Check that the output directory exists
directory = dir_cb+'\\Rez_Tree_K-Means'
if not os.path.exists(directory): # If the directory does not exist, create it
    os.makedirs(directory)
    print(f"Directory {directory} has been created.")
else:
    print(f"Directory {directory} already exists.")

if Use_Load_CSV == False: # Generate random data
    np.random.seed(100)
    arr_data_load = np.random.randint(0, 2, (strok_total_data, stolb_total_data))
    arr_Target = np.random.randint(2, size=Load_Strok)

if Use_Convert_CSV == True:
    f_Load_Data(dir_cb+'\\Setup','train',False,True,Not_Bi_Data) # Open the CSV and convert it to feather
    f_Load_Data(dir_cb+'\\Setup','test',False,True,Not_Bi_Data)  # Open the CSV and convert it to feather
    f_Load_Data(dir_cb+'\\Setup','exam',False,True,Not_Bi_Data)  # Open the CSV and convert it to feather

# Create an empty DataFrame for storing the statistics
#columns = ['random_state', 'cluster',
#           'Precision_train','Recall_train',# Iteration 0
#           'Precision_test','Recall_test',  # Iteration 1
#           'Precision_exam','Recall_exam']  # Iteration 2
columns = ['random_state', 'cluster',
           'Precision_train','Precision_test','Precision_exam',
           'Recall_train','Recall_test', 'Recall_exam',
           'Balans_train','Balans_test', 'Balans_exam'
           ]  
table_stat = pd.DataFrame(columns=columns)
#Svod_Statistics

arr_Name_Pred=[]

arr_Set_Viborka=['train','test','exam']
n_Set = len(arr_Set_Viborka)
for Set in range(n_Set):   
    if Use_Load_CSV == True:
        arr_data_load,arr_Target,arr_Info=f_Load_Data(dir_cb+'\\Setup',arr_Set_Viborka[Set],True,False,Not_Bi_Data) # Open the file in feather format
        arr_Name_Pred=arr_data_load.columns.tolist()
        if Not_Bi_Data == True: # If the sample is not binary, normalise it
            arr_data_load,scaler=f_Normalization(arr_data_load, Variant_Norm, Set, scaler)
            if Variant_Norm==0:
                arr_data_load = arr_data_load.values # Convert the pandas data to a NumPy array

        else:
            arr_data_load = arr_data_load.values # Convert the pandas data to a NumPy array

        #if Set == 0: # Normalisation for a non-binary sample (old variant)
        #    scaler.fit(arr_data_load)
        #arr_data_load = scaler.transform(arr_data_load)

        #!!!arr_data_load = arr_data_load.values # Convert the pandas data to a NumPy array
        Load_Strok, Load_Stolb = arr_data_load.shape
    if Set == 0:
        # Call the clustering function
        arr_filter=clusterize_and_filter(arr_data_load, arr_Tree, Load_Stolb, Load_Strok, n_level)
        arr_filter = arr_filter.reshape(-1, n_level) # Reshape into a 2D array

        # Save the clustering result for each tree level to a CSV file
        csv_data = pd.DataFrame(arr_filter)
        csv_data.to_csv(dir_cb + "\\Rez_Tree_K-Means\\f_Klaster_" + "train" + ".csv", index=False, header=False,sep=';', decimal='.')

        # Save all centroids to a CSV file
        csv_data = pd.DataFrame(arr_centroid_arhiv)
        csv_data.to_csv(dir_cb + "\\Rez_Tree_K-Means\\f_Centroid_Arhiv.csv", index=False, header=False,sep=';', decimal='.')

        if Use_Load_CSV == False: # Only for tests without reading the sample!
            arr_data_load = arr_data_load.reshape(Load_Stolb, -1) # Reshape the 1D array into a 2D array
            arr_data_load = pd.DataFrame(arr_data_load) # Convert the array into a DataFrame

    # Apply the model to the input data
    arr_Otvet = np.full(Load_Strok, -1)
    for s in range(Load_Strok):
        arr_Get_Data = arr_data_load[s] # Take the whole row of the 2D array
        Get_Klaster_N = f_Klaster_Load(arr_Tree, arr_Get_Data, Load_Stolb,arr_centroid_arhiv)
        arr_Otvet[s]=Get_Klaster_N

    # Save the result of applying the model to a CSV file
    csv_data = pd.DataFrame(arr_Otvet)
    csv_data.to_csv(dir_cb + "\\Rez_Tree_K-Means\\f_Klaster_Load_" + arr_Set_Viborka[Set] + ".csv", index=False, header=False,sep=';', decimal='.')

    # Append the saved information to the tables
    #arr_Info = pd.concat([arr_Info, csv_data], axis=1)
    # Find the unique values and their count
    #unique_values = np.unique(arr_Otvet)
    #num_unique_values = len(unique_values)

    # Compute the response statistics and the bias of the target "1" percentage
    df_Statistics=f_Statistics(arr_Otvet,arr_Target)
    #print("df_Statistics=",df_Statistics)
    # Save the result to a CSV file
    csv_data = pd.DataFrame(df_Statistics)
    csv_data.to_csv(dir_cb + "\\Rez_Tree_K-Means\\f_Klaster_Statistics_" + arr_Set_Viborka[Set] + ".csv", index=False, header=False,sep=';', decimal='.')
    
    if Set == 0:
        # Get all unique values from the 'Номер кластера' (cluster number) column
        unique_clusters = df_Statistics['Номер кластера'].unique()
        # Create a new DataFrame
        new_data = pd.DataFrame({
            'random_state': [iteration]* len(unique_clusters),  # Fill the first column with a constant
            'cluster': unique_clusters  # Fill the second column with the unique cluster numbers
        })
        table_stat = pd.concat([table_stat, new_data], ignore_index=True) # Use pd.concat() to merge the DataFrames
    # Add the statistics for the current sample to the DataFrame
    matching_rows = table_stat[table_stat['random_state'] == iteration].index # Find the rows where 'random_state' equals the iteration variable
    for index in matching_rows: # For each matching row in table_stat
        second_column_value = table_stat.loc[index, 'cluster']
        matching_row = df_Statistics[df_Statistics['Номер кластера'] == second_column_value] # Find the row in df_Statistics whose cluster number equals the 'cluster' value of table_stat
        if not matching_row.empty: # If a matching row is found, copy its values into the corresponding columns
            table_stat.loc[index, 'Precision_'+arr_Set_Viborka[Set]] = matching_row.iloc[0]['Процент значений 1'] # 'Percentage of 1 values'
            table_stat.loc[index, 'Recall_'+arr_Set_Viborka[Set]] = matching_row.iloc[0]['Процент строк в кластере'] # 'Percentage of rows in the cluster'
    # Add the balance value for each cluster to the DataFrame
    max_value = np.max(arr_Otvet)
    arr_Klaster_Balans = np.full(max_value+1, 0.0)
    print("arr_Info=",arr_Info)
    for i in range(arr_Info.shape[0]):
        Target_P=arr_Info['Target_P'][i]
        if Target_P>0:
            arr_Klaster_Balans[arr_Otvet[i]]+=arr_Info['Target_100_Buy'][i]
        else:
            arr_Klaster_Balans[arr_Otvet[i]]+=arr_Info['Target_100_Sell'][i]
    print('arr_Klaster_Balans=',arr_Klaster_Balans)
    for index in table_stat.index: # Iterate over the rows of table_stat
        cluster_value = table_stat.loc[index, 'cluster']
        if cluster_value < len(arr_Klaster_Balans): # If the cluster index is within the bounds of the array
            table_stat.loc[index, 'Balans_'+arr_Set_Viborka[Set]] = arr_Klaster_Balans[cluster_value]
        else:
            # Cases where the index falls outside the array bounds
            pass  # Some logic or default values could be added here

    if Set == 2:
        # Coerce all columns to numeric (float) type
        table_stat = table_stat.apply(lambda col: pd.to_numeric(col, errors='coerce'))
        # Save the result to a CSV file
        csv_data = pd.DataFrame(table_stat)
        csv_data.to_csv(dir_cb + "\\Rez_Tree_K-Means\\f_Klaster_Svod_Statistics.csv", index=False, header=True,sep=';', decimal=',')

    if Use_Creat_New_Viborka==True:
        # Build the filtered sample
        arr_Data_New=f_Creat_Viborka(arr_data_load, arr_Name_Pred, arr_Info,table_stat,arr_Otvet,Load_Strok, Load_Stolb,Variant_Filter_Klaster)
        # Coerce the columns to numeric type (old variants)
        #arr_Data_New = arr_Data_New.apply(lambda col: pd.to_numeric(col, errors='coerce'))
        #arr_Data_New = arr_Data_New.apply(lambda col: pd.to_numeric(col, errors='coerce') if col.name != 'Time' else col)
        # Save the result to a CSV file
        csv_data = pd.DataFrame(arr_Data_New)
        csv_data.to_csv(dir_cb + "\\Rez_Tree_K-Means\\"+arr_viborka[Set], index=False, header=True,sep=';', decimal='.')


    del csv_data
    if Use_Load_CSV == True:
        del arr_data_load
        del arr_Target
Maxim Dmitrievsky #:
The more subset_size

What is this parameter responsible for?

Maxim Dmitrievsky #:
Parameters can be selected.

So if you search through them, you are bound to find a good result somewhere; really, there should be a good result with any parameters :).

Maxim Dmitrievsky #:
Then retrain, taking into account new data.

How to tell that it has stopped working is an eternal question.

Maxim Dmitrievsky #:
You don't need a lot, 10 is enough.

I was planning to make a variant that selects random features from my collection, but I haven't done it yet; I haven't touched Python at all for almost a month.

 
Aleksey Nikolayev #:

I had a task of finding not the average but the average maximum spread: the maximum spread within each minute was taken, and those maxima were then averaged for each minute of the day.

In the highlighted place I substituted the time-weighted average spread of each minute. There are spikes at 15:15, 15:30 and 17:00.
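For reference, that "average of per-minute maxima" can be computed in a few lines of pandas (the file layout and column names here are assumptions):

import pandas as pd

# Tick data with a datetime column 'time' and a 'spread' column (assumed layout)
ticks = pd.read_csv("ticks.csv", parse_dates=["time"], index_col="time")

minute_max = ticks["spread"].resample("1min").max()                    # max spread inside each minute
avg_max_by_minute = minute_max.groupby(minute_max.index.time).mean()   # averaged over days for each minute of the day
print(avg_max_by_minute.sort_values(ascending=False).head())           # the largest spikes, e.g. around 15:15 / 15:30 / 17:00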

 
Aleksey Vyazmikin #:

I think I posted at the time that I have something similar.....

My code fixes the labels; what does your code do?

Now we'll start going round in circles again, because we're looking for analogies where there are none :)
 
Aleksey Vyazmikin #:

How to tell that it has stopped working is an eternal question.

It is very simple to tell: it just stops working :)

The idea of the approach has already been described in the links. The dataset is grouped by features into similar clusters (patterns, if you like). Then the labels in a given number (subset_size) of clusters are fixed, i.e. all labels in those clusters become 1 or 0, depending on which is the majority in the cluster. This removes ambiguity for the final model, so it stops overfitting to noise and making unnecessary splits.

In the list of clusters sorted by "probability bias", the most biased clusters come first. These are corrected first, so they become fully unambiguous for the subsequent model training. The others, which sit in the tail with probabilities close to 0.5, are not touched at all and continue to introduce noise into the model.

By varying the number of clusters and subset_size, we find a balance between good clusters and bad clusters that satisfies the user.

The function is transparent in the sense that it gives a predictable, expected result: the more clusters are corrected, the more stable the model, but the less "beautiful" it looks, and vice versa. That is why an additional setting is added to adjust this.

As a result, this small function does almost all the work of searching for stable patterns in the data and improving the model. If there are no stable patterns, you get a predictably bad result even on the train set, whereas without this function the model would have overfitted and shown a grail on the train set.

Statistics also start to work, depending on the amount of data: the larger the train/test sets, the more confidence. Without this function that criterion is meaningless, because the trees simply grow with increasing sample length (they always overfit).
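As I read the description, the core step could be sketched roughly like this (not the author's actual function; the flat K-Means, the names and the defaults here are assumptions):

import numpy as np
from sklearn.cluster import KMeans

def fix_labels_in_biased_clusters(X, y, n_clusters=20, subset_size=10, seed=0):
    """Cluster the features and, in the subset_size most biased clusters,
    overwrite every label with that cluster's majority class."""
    y = np.asarray(y).copy()
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    # "probability bias" of a cluster = how far its share of ones is from 0.5
    bias = {c: abs(y[clusters == c].mean() - 0.5) for c in np.unique(clusters)}
    for c in sorted(bias, key=bias.get, reverse=True)[:subset_size]:
        mask = clusters == c
        y[mask] = 1 if y[mask].mean() >= 0.5 else 0  # make the cluster fully unambiguous
    return y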
 
Maxim Dmitrievsky #:

It is very simple to tell: it just stops working :)

The idea of the approach has already been described in the links. The dataset is grouped by features into similar clusters (patterns, if you like). Then the labels in a given number (subset_size) of clusters are fixed, i.e. all labels in those clusters become 1 or 0, depending on which is the majority in the cluster. This removes ambiguity for the model, so it stops overfitting to noise and making unnecessary splits.

In the list of clusters sorted by "probability bias", the most biased clusters come first. These are corrected first, so they become fully unambiguous for the subsequent model training. The others, which sit in the tail with probabilities close to 0.5, are not touched at all and continue to introduce noise into the model.

By varying the number of clusters and subset_size, we find a balance between good clusters and bad clusters that satisfies the user.

In the end, this small function does almost all the work of finding stable patterns in the data and improving the model.

 
Maxim Dmitrievsky #:

My code fixes the labels, your code does what?

Your code removes from the sample the rows belonging to clusters whose mean share of ones is (as you write) around 0.5.

My code does much the same thing. In brief (a rough sketch in code is given below):

1. Open the train sample.

2. Cluster it tree-wise, i.e. go sequentially deeper into each cluster.

3. Evaluate the metrics of each cluster on all samples.

4. Select the clusters from the train sample where the probability of meeting the target "1" is biased by more than 5%.

5. Form a new sample from the selected clusters.

But I haven't tried using that many clusters...

There is no particular stability after that; there is an improvement if the random clustering happens to partition the data well, but without peeking into new data this is not guaranteed.
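Roughly, in code (again only a sketch: flat K-Means instead of the tree-wise clustering from the script above, the 5% threshold from the description, and made-up names):

import numpy as np
from sklearn.cluster import KMeans

def filter_rows_by_cluster_bias(X, y, n_clusters=27, threshold=0.05, seed=0):
    """Keep only the rows that fall into clusters whose share of target '1'
    deviates from the overall share by more than the threshold."""
    y = np.asarray(y)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    base_rate = y.mean()
    keep = np.zeros(len(y), dtype=bool)
    for c in np.unique(clusters):
        mask = clusters == c
        if abs(y[mask].mean() - base_rate) > threshold:  # biased cluster -> keep its rows
            keep |= mask
    return X[keep], y[keep]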

 
Maxim Dmitrievsky #:
Then the labels in a given number (subset_size) of clusters are fixed, i.e. all labels in those clusters become 1 or 0, depending on which is the majority in the cluster. This removes ambiguity for the final model, so it stops overfitting to noise and making unnecessary splits.

That's something I haven't done. But it's essentially binarisation. Again, if the probabilities are preserved on new data, the effect will be there; if not, it's fucked.

I get a similar effect through quantisation.
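For illustration, quantising a feature and turning the bins into binary columns might look like this (only a sketch; the bin count is arbitrary):

import pandas as pd

s = pd.Series([0.1, 0.5, 0.7, 1.2, 3.4, 2.2, 0.9])        # toy feature values
bins = pd.qcut(s, q=4, labels=False, duplicates="drop")    # quantile-based bins 0..3
binary = pd.get_dummies(bins, prefix="q")                  # one binary column per bin
print(binary)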