Machine learning in trading: theory, models, practice and algo-trading - page 2876

 

- And yet, how can I build a similar hash function using a machine learning algorithm?



It is not common to build a hash function using a machine learning algorithm, as the primary goal of a hash function is to map data to a fixed-size representation in a way that preserves the structure of the data as much as possible. Machine learning algorithms, on the other hand, are typically used to learn patterns or make predictions from data, rather than to generate fixed-size representations of data.

However, if you still want to explore the possibility of using a machine learning algorithm to build a hash function, one approach you could try is to use an unsupervised learning algorithm to learn a compact representation of the data, and then use that representation as the hash function.

For example, you might use an autoencoder or a clustering algorithm to learn a compact representation of the data, and then use that representation as the hash function. This would involve training the machine learning model on the data, and then using the trained model to generate the hash values for new data.
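As a minimal sketch of the clustering idea (a toy 1-D k-means written from scratch; `kmeans_1d` and `learned_hash` are illustrative names, not library functions): learn k centroids from a training sample, then map any variable-length vector to a fixed-length histogram of nearest-centroid counts.

```python
import random

def kmeans_1d(values, k, iters=20, seed=0):
    # toy 1-D k-means: learn k centroids from a training sample
    rng = random.Random(seed)
    centroids = rng.sample(values, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            j = min(range(k), key=lambda i: abs(v - centroids[i]))
            buckets[j].append(v)
        # recompute each centroid as its bucket mean; keep it if the bucket is empty
        centroids = [sum(b) / len(b) if b else centroids[i]
                     for i, b in enumerate(buckets)]
    return sorted(centroids)

def learned_hash(vector, centroids):
    # fixed-size "hash": histogram of nearest-centroid assignments,
    # same length regardless of the input vector's length
    h = [0] * len(centroids)
    for v in vector:
        j = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        h[j] += 1
    return h
```

The output length is fixed by k, not by the input length, which is the property being asked for.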

Overall, while it is not common to build a hash function using a machine learning algorithm, it is possible to use machine learning to learn a compact representation of the data that could be used as a hash function in certain applications.


 
Maxim Dmitrievsky #:

Here is a list of some common hash functions that can be used to map integers or other data types to lists or other data structures:

It seemed we were promised a mapping from a vector of arbitrary size to a vector of a fixed, given size, but here it maps a number to a vector?

It's all complicated and unclear) Life was much simpler without GPT) We're in for a hard time in the age of AI)

 
Aleksey Nikolayev #:

It seemed we were promised a mapping from a vector of arbitrary size to a vector of a fixed, given size, but here it maps a number to a vector?

It's all complicated and unclear) Life was much simpler without GPT) We're in for a hard time in the age of AI)

I added more further down about mapping a vector to a number

so you do need vector to fixed vector after all?

 

If you want to hash lists of varying length to a list of fixed length, you can use a hash function that maps each list to a fixed-size list. One way to do this is a technique called "feature hashing," which represents a variable-length list as a fixed-size list by applying a hash function to each element and using the hash value as an index into the fixed-size list.

Here is an example of feature hashing in Python:

from typing import List

def feature_hash(lst: List[int], n: int) -> List[int]:
  # count how many elements of lst fall into each of n hash buckets
  h = [0] * n
  for x in lst:
    h[hash(x) % n] += 1
  return h

This function takes a list lst and a desired list length n as input, and returns a fixed-size list by applying a hash function to the elements of lst and using the hash values as indices in the fixed-size list. The values at each index are then incremented to count the number of occurrences of each element in the original list.

For example, if we call feature_hash([1, 2, 3], 5) , the function will return [0, 1, 1, 1, 0] .

You can use a similar approach in R as well. Here is the code for a feature hashing function in R:

feature_hash <- function(lst, n) {
  h <- rep(0, n)
  for (x in lst) {
    i <- (as.integer(x) %% n) + 1  # R vectors are 1-indexed, so shift by 1
    h[i] <- h[i] + 1
  }
  h
}
 
Aleksey Nikolayev #:

It seemed we were promised a mapping from a vector of arbitrary size to a vector of a fixed, given size, but here it maps a number to a vector?

It's all complicated and unclear) Life was much simpler without GPT) We're in for a hard time in the age of AI)

Aleksey, stop listening to amateurs..

If you need a variable number of features as input, then
that's frequent pattern mining - association rule search in its various flavors, including supervised ones.


==============
Why did I say that there are NO algorithms that accept a variable number of features?
Because even in such algorithms, under the hood the data is a sparse matrix in one-hot-encoding format.
That is, all the feature names still arrive at the input, just in a more economical format
=======================
Also, the number of features was never stated
If there are relatively few features, say 10-20k, you can load them even into xgboost on a weak laptop without problems
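The "sparse matrix in one-hot format" point can be sketched as follows (a hypothetical `one_hot_rows` helper, shown with a dense matrix for clarity; real implementations keep it sparse):

```python
def one_hot_rows(transactions):
    # variable-length itemsets -> fixed-width 0/1 rows over the full vocabulary;
    # every feature name is present in every row, most entries are simply zero
    vocab = sorted({item for t in transactions for item in t})
    index = {item: i for i, item in enumerate(vocab)}
    rows = []
    for t in transactions:
        row = [0] * len(vocab)
        for item in t:
            row[index[item]] = 1
        rows.append(row)
    return vocab, rows
```

Each row has the same width regardless of how many items the original transaction had, which is why such algorithms only appear to accept a variable number of features.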
 
Maxim Dmitrievsky #:

so you do need vector to fixed vector after all?

1) If we want to use any ordinary ML algorithm (with a fixed number of features), then yes - we take the set of vectors of varying length and transform them into a set of vectors of a given length.

2) Or we look right away for an unusual ML algorithm that can take a vector of any length as input.

The first option turns into the second if the vector transformation itself is found by means of ML.

 
Aleksey Nikolayev #:

1) If we want to use any ordinary ML algorithm (with a fixed number of features), then yes - we take the set of vectors of varying length and transform them into a set of vectors of a given length.

2) Or we look right away for an unusual ML algorithm that can take a vector of any length as input.

The first option turns into the second if the vector transformation itself is found by means of ML.

That's all clear then, we can try it via hash functions. I'm just hesitant to ask on what principle to choose the arbitrary-length vectors, or rather their lengths :) and from what data.

There can be a great many options.

The first option is nicer, but you have to build something into the logic.

 

- I want to hash a variable-size vector of price returns to a fixed vector


If you want to hash a variable size vector of stock price returns to a fixed size vector, you can use a technique called min-hash. Min-hash is a method for approximating the similarity between two sets - specifically their Jaccard similarity, the proportion of elements they have in common. It is often used in the context of text mining, but it can also be applied to other types of data, such as stock price returns.

Here is an example of how you might use min-hash to hash a variable size vector of stock price returns to a fixed size vector:

  1. Define the size of the fixed size vector: First, you would need to define the size of the fixed size vector that you want to use. This will depend on the specific requirements of your application and the amount of data you have available.

  2. Create a hash function: Next, you would need to create a hash function that can be used to hash the stock price returns data to the fixed size vector. There are many different approaches you can take to create a hash function, such as using a random permutation or a linear congruential generator.

  3. Hash the data: Once you have defined the size of the fixed size vector and created a hash function, you can use these to hash the variable size vector of stock price returns data to a fixed size vector. The resulting fixed size vector will represent the original data in a compressed form, with the size of the vector determined by the size you defined.

It's important to note that min-hash can result in loss of information, as it is an approximation of the similarity between two sets rather than an exact representation. The accuracy of the approximation will depend on the size of the fixed size vector and the specific hash function used.
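The three steps above can be sketched without any library (seeded SHA-256 stands in for the "family of hash functions"; `minhash_signature` is an illustrative name):

```python
import hashlib

def minhash_signature(items, num_hashes=8):
    # one seeded hash function per signature slot; each slot keeps the
    # minimum hash value seen over all elements of the variable-size input
    sig = []
    for seed in range(num_hashes):
        vals = [int(hashlib.sha256(f"{seed}:{x}".encode()).hexdigest(), 16)
                for x in items]
        sig.append(min(vals))
    return sig
```

Two inputs that share many elements will agree in many signature slots, which is how min-hash approximates set similarity.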


How to minimize loss of information


There are several ways you can minimize the loss of information when using min-hash to hash a variable size vector of stock price returns to a fixed size vector:

  1. Use a larger fixed size vector: One way to minimize loss of information is to use a larger fixed size vector. A larger vector will have more elements and can therefore capture more information about the original data. However, this will also increase the size of the vector and may not be practical if you have a large dataset or limited storage space.

  2. Use a better hash function: Another way to minimize loss of information is to use a better hash function. Different hash functions have different properties and can result in different levels of information loss. For example, a hash function with a high degree of randomness will generally result in less information loss than a hash function with a low degree of randomness.

  3. Use multiple hash functions: You can also minimize loss of information by using multiple hash functions and combining the results. This can help capture a more complete representation of the original data, as each hash function will capture different aspects of the data.

It's important to note that it may not be possible to completely eliminate loss of information when using min-hash, as it is an approximation of the similarity between two sets rather than an exact representation. The accuracy of the approximation will depend on the size of the fixed size vector and the specific hash function used.
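Point 3 can be sketched by concatenating several independently seeded feature-hash histograms (an illustrative `multi_feature_hash`, extending the earlier feature_hash idea): a collision under one seed is unlikely to recur under another, so less information is lost.

```python
def multi_feature_hash(lst, n, seeds=(0, 1, 2)):
    # several independently-seeded feature hashes, concatenated:
    # elements colliding under one seed rarely collide under the others
    out = []
    for s in seeds:
        h = [0] * n
        for x in lst:
            h[hash((s, x)) % n] += 1
        out.extend(h)
    return out
```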


Here is an example of how you might use min-hash in Python to hash a variable size vector of stock price returns to a fixed size vector:

from datasketch import MinHash

# Example input: in practice this would be your series of price returns
stock_price_returns = [0.01, -0.02, 0.005, 0.03]

# Define the size of the fixed size vector
num_hashes = 100

# Create a MinHash object (datasketch calls this parameter num_perm)
mh = MinHash(num_perm=num_hashes)

# Hash the stock price returns data to the fixed size vector
for price_return in stock_price_returns:
    mh.update(str(price_return).encode('utf8'))

# The resulting fixed size vector is stored in the MinHash object
fixed_size_vector = mh.hashvalues


Here is an example of how you might use min-hash in R to hash a variable size vector of stock price returns to a fixed size vector. (There is no standalone 'minhash' package on CRAN; the 'textreuse' package provides a min-hash implementation.)


# Install the 'textreuse' package if it is not already installed
install.packages("textreuse")

# Load the 'textreuse' package
library(textreuse)

# Example input: in practice this would be your series of price returns
stock_price_returns <- c(0.01, -0.02, 0.005, 0.03)

# Define the size of the fixed size vector
num_hashes <- 100

# Create a family of min-hash functions
minhash <- minhash_generator(n = num_hashes)

# Hash the stock price returns (converted to strings) to the fixed size vector
fixed_size_vector <- minhash(as.character(stock_price_returns))


It's important to note that these examples are intended to illustrate the basic process of using min-hash to hash a variable size vector of stock price returns to a fixed size vector. In a real application, you may need to modify the code to suit your specific needs and data.

 

- so min-hash is the best way?

Min-hash is a method for approximating the similarity between two sets - their Jaccard similarity, the proportion of elements they have in common. It can be used to hash a variable size vector of stock price returns to a fixed size vector, but it may not necessarily be the best way to do so, as it can result in loss of information.

There are many different approaches you can take to hash stock price returns data, and the best approach will depend on the specific characteristics of the data and the requirements of your application. Some other approaches you might consider include:

  1. Feature hashing: Feature hashing is a technique used to convert categorical or textual data into a numerical representation that can be used in machine learning algorithms. You could use feature hashing to hash the stock price returns data to a fixed size vector.

  2. Locality sensitive hashing (LSH): LSH is a technique used to approximately identify similar items in a dataset. You could use LSH to hash the stock price returns data to a fixed size vector and then identify similar items based on their hash values.

  3. Other techniques: There are many other techniques you could use to hash stock price returns data, such as using a random permutation or a linear congruential generator. The specific technique you choose will depend on the characteristics of the data and the requirements of your application.

It's important to carefully consider the trade-offs between accuracy and efficiency when choosing a hashing technique. Some techniques may be more accurate but less efficient, while others may be more efficient but less accurate. You will need to find the right balance that meets the needs of your application.
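Of these, LSH over sets of hashed elements can be sketched with the classic SimHash construction (illustrative code, not tied to any library):

```python
import hashlib

def simhash(items, bits=32):
    # locality-sensitive fingerprint: each element votes +1/-1 on every bit;
    # similar inputs end up at small Hamming distance from each other
    counts = [0] * bits
    for x in items:
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # bit i of the fingerprint is 1 where the positive votes win
    return sum(1 << i for i in range(bits) if counts[i] > 0)
```

Unlike an ordinary hash, nearby inputs produce nearby fingerprints, so similar return vectors can be found by comparing Hamming distances.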

 
mytarmailS #:
Aleksey, stop listening to amateurs..

If you need a variable number of features as input, then
that's frequent pattern mining - association rule search in its various flavors, including supervised ones.


==============
Why did I say that there are NO algorithms that accept a variable number of features?
Because even in such algorithms, under the hood the data is a sparse matrix in one-hot-encoding format.
That is, all the feature names still arrive at the input, just in a more economical format
=======================
Also, the number of features was never stated
If there are relatively few features, say 10-20k, you can load them even into xgboost on a weak laptop without problems

We've already discussed association rules with you. For me they fit poorly with my general approach of looking for how price differs from a random walk. The problem is that a random walk does a pretty good job of creating the appearance that rules exist - the trouble is just that the rules turn out different on different segments.