Machine learning in trading: theory, models, practice and algo-trading - page 3443
Nowadays it is fashionable to attach the word "causal" to everything - it reads beautifully, with a hint of magic :)
If you're into enumerating features, here's a list of them, with formulas, for time series:
https://tsfresh.readthedocs.io/en/latest/text/list_of_features.html
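In case it helps, here is a minimal sketch of computing that feature catalogue with tsfresh; the long-format layout and the column names ("id", "time", "price") are illustrative assumptions, not from the post:

```python
# Minimal sketch, assuming tsfresh is installed; column names are illustrative.
import pandas as pd
from tsfresh import extract_features

# Long format: one row per observation, series identified by "id",
# ordered within each series by "time".
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2, 2],
    "time":  [0, 1, 2, 3, 0, 1, 2, 3],
    "price": [1.10, 1.12, 1.11, 1.15, 0.98, 0.97, 1.01, 1.00],
})

# Computes the full default feature catalogue (the list linked above),
# one column per feature, one row per series id.
features = extract_features(df, column_id="id", column_sort="time")
print(features.shape)
```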
Earlier I published graphs of probability shifts in clusters here, but there the sample was built on tree leaves; now I decided to see how the picture looks if I just take the raw sample and apply different normalisation methods (the sklearn method name is in brackets).
1. No normalisation.
2. Scales feature values to the range from 0 to 1 (MinMaxScaler).
3. Transforms feature values to a distribution with mean 0 and standard deviation 1 (StandardScaler).
4. Scales feature values in a way that is robust to the presence of outliers (RobustScaler).
I found it curious how normalisation affects clustering.
If we filter by two criteria - a probability shift of at least 5% and at least 1% of the rows in the cluster - the variant without normalisation yields nothing at all, while the others select far more:
MinMaxScaler: 4% of train sample rows selected in total
StandardScaler: 5.6% of train sample rows
RobustScaler: 8.83% of train sample rows
Granted, by my row-selection criteria this is still too small a sample for further training, except perhaps to try selection after clustering with the RobustScaler normalisation method.
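For anyone who wants to reproduce the comparison, here is a sketch of the procedure described above, under stated assumptions: KMeans as the clusterer, a binary target, synthetic data, and the 5% shift / 1% size thresholds from the post. Nothing beyond the scaler names is specified in the original.

```python
# Sketch: scale -> cluster -> keep clusters with a probability shift >= 5
# percentage points and size >= 1% of rows, then report the share of rows kept.
# Assumptions: KMeans clusterer, binary target y, synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

def selected_share(X, y, scaler=None, n_clusters=20, shift=0.05, min_share=0.01):
    Xs = scaler.fit_transform(X) if scaler is not None else X
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(Xs)
    base = y.mean()                                  # overall class-1 probability
    kept = 0
    for k in range(n_clusters):
        mask = labels == k
        if mask.mean() < min_share:                  # cluster too small
            continue
        if abs(y[mask].mean() - base) >= shift:      # probability shift big enough
            kept += mask.sum()
    return kept / len(y)                             # total share of selected rows

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
X[:50] *= 50                                         # a few heavy-outlier rows
y = (rng.random(5000) < 0.5).astype(int)

for name, s in [("none", None), ("MinMaxScaler", MinMaxScaler()),
                ("StandardScaler", StandardScaler()), ("RobustScaler", RobustScaler())]:
    print(name, f"{selected_share(X, y, s):.2%}")
```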
Here is what ChatGPT reports:
"
RobustScaler is a data normalisation method that uses the median and interquartile range to scale the data. This method is more robust to the presence of outliers in the data than the standard MinMaxScaler or StandardScaler.
Here's how RobustScaler works:
Calculating the median and interquartile range: unlike MinMaxScaler (which uses the minimum and maximum) or StandardScaler (which uses the mean and standard deviation), RobustScaler relies on the median and the interquartile range (IQR). The median is the value that splits the data distribution in half, and the IQR is the difference between the 75th and 25th percentiles.
Data normalisation: the median is then subtracted from each feature value and the result is divided by the IQR. This scales the data so that it has a median of 0 and an IQR of 1.
Benefits of RobustScaler:
Outlier robustness: using the median and interquartile range makes RobustScaler more robust to outliers, so it preserves the structure of the data better when outliers are present.
No distributional assumptions: because RobustScaler uses the median and IQR, it does not assume the data are normally distributed.
"
It looks like normalisation and scaling are done on the whole sample, and then the model is trained on the subsamples. That's peeking, and it inflates the results.
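What that peeking looks like in code, as a minimal sketch (the data is illustrative; the only point is where fit is called):

```python
# Leak vs. no leak: the scaler's statistics differ depending on whether
# it saw the test rows during fit (illustrative data).
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.arange(100, dtype=float).reshape(-1, 1)
train, test = X[:70], X[70:]

leaky = StandardScaler().fit(X)      # fitted on everything, including "future" rows
clean = StandardScaler().fit(train)  # fitted on train only

print(leaky.mean_, clean.mean_)      # [49.5] vs [34.5]: the leaky mean
                                     # already contains test information
```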
I don't think it's supposed to be that way.
Well, doesn't the agg data load contain the whole history? So the clustering is done on the whole history.
You can't fool me: you're as perceptive as Vanga.
No, the data comes from files that were split into subsamples earlier.
Then where do you apply the same scaler to the other subsamples (test, exam)?
It's kind of the same thing.
We compute it on the first subsample, then apply it to all of them in a loop.
Like here:
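(The referenced snippet is not in the thread; below is a hypothetical reconstruction of the pattern being described - fit the scaler on the first subsample once, then only transform the rest - with illustrative names:)

```python
# Hypothetical reconstruction: statistics come from "train" only,
# the same fitted scaler is then applied to every subsample in a loop.
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(1)
subsamples = {                          # illustrative stand-ins for the files
    "train": rng.normal(size=(1000, 5)),
    "test":  rng.normal(size=(300, 5)),
    "exam":  rng.normal(size=(300, 5)),
}

scaler = RobustScaler().fit(subsamples["train"])    # fit once, on train
scaled = {name: scaler.transform(X)                 # apply everywhere
          for name, X in subsamples.items()}
```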
Ah, well, let's say it's okay.
Then I don't understand why there's such a difference in the results.