Machine learning in trading: theory, models, practice and algo-trading - page 2877

 

If you want to hash a list of different length to a list of fixed length, you can use a hash function that maps the list to a fixed-size list. One way to do this is to use a technique called "feature hashing," which allows you to represent a variable-length list as a fixed-size list by applying a hash function to the elements of the list and using the hash values as indices in the fixed-size list.

Here is an example of feature hashing in Python:

from typing import List

def feature_hash(lst: List[int], n: int) -> List[int]:
  # Fixed-size output: hash each element to one of n buckets
  # and count how many elements land in each bucket.
  h = [0] * n
  for x in lst:
    h[hash(x) % n] += 1
  return h

This function takes a list lst and a desired list length n as input, and returns a fixed-size list by applying a hash function to the elements of lst and using the hash values as indices in the fixed-size list. The values at each index are then incremented to count the number of occurrences of each element in the original list.

For example, if we call feature_hash([1, 2, 3], 5), the function will return [0, 1, 1, 1, 0] — a list of length 5, with a count of 1 at indices hash(1) % 5, hash(2) % 5 and hash(3) % 5.

You can use a similar approach in R as well. Here is the code for a feature hashing function in R:

feature_hash <- function(lst, n) {
  h <- rep(0, n)
  for (x in lst) {
    # R vectors are 1-indexed, so shift the bucket index by 1
    i <- (as.integer(x) %% n) + 1
    h[i] <- h[i] + 1
  }
  h
}
 
Aleksey Nikolayev #:

I think it promised to map a vector of arbitrary size into a vector of fixed specified size, but here it maps a number into a vector?

Everything is complicated and incomprehensible) Life was much simpler without GPT) We will have a hard time in the age of AI).

Alexey, stop listening to that amateur.

If you need a different number of features per input, that's frequent pattern mining — searching for association rules in their various forms, including supervised ones.


==============
Why did I say that there are no algorithms that accept a varying number of features?
Because even in such algorithms, under the hood the data is stored as a sparse matrix in one-hot-encoding format.
The input still contains all the feature names, just in a more compact format.
=======================
Also, the number of features was not specified.
If the number of features is relatively small (10-20k), you can load it into xgboost without any problems, even on a weak laptop.
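As a minimal illustration of the sparse one-hot format described above (the function name and the returned structure are hypothetical, not any specific library's API — real boosting libraries use their own sparse-matrix types):

```python
def to_sparse_onehot(samples):
    # samples: list of variable-length lists of feature names.
    # Returns (rows, vocab): rows[i] is the set of active column
    # indices for sample i (a sparse one-hot row), and vocab maps
    # each feature name to its column index.
    vocab = {}
    rows = []
    for feats in samples:
        row = set()
        for f in feats:
            row.add(vocab.setdefault(f, len(vocab)))
        rows.append(row)
    return rows, vocab
```

So inputs of different lengths all become rows over the same global set of feature columns; only the set of active columns differs per row.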
 
Maxim Dmitrievsky #:

vector into a fixed vector after all?

1) If we want to use any ordinary ML algorithm (with a fixed number of features), then yes — we take a set of vectors of different lengths and convert them into a set of vectors of a given length.

2) Or we look right away for an unusual ML algorithm that can take a vector of any length as input.

The first option turns into the second if the vector transformation itself is found by means of ML.

 
Aleksey Nikolayev #:

1) If we want to use any ordinary ML algorithm (with a fixed number of features), then yes — we take a set of vectors of different lengths and convert them into a set of vectors of a given length.

2) Or we look right away for an unusual ML algorithm that can take a vector of any length as input.

The first option turns into the second if the vector transformation itself is found by means of ML.

OK, that's clear — we can try it via hash functions. But I'm shy to ask: on what principle do we choose the vectors of arbitrary length, or rather their lengths :) and from what data?

there can be a lot of variants

the first option is more pleasant, but you need to put some logic into it

 

- I want to hash variable size vector of price returns to fixed vector


If you want to hash a variable size vector of stock price returns to a fixed size vector, you can use a technique called min-hash. Min-hash is a method for approximating the similarity between two sets by estimating the fraction of elements they have in common (the Jaccard similarity). It is often used in the context of text mining, but it can also be applied to other types of data, such as stock price returns.

Here is an example of how you might use min-hash to hash a variable size vector of stock price returns to a fixed size vector:

  1. Define the size of the fixed size vector: First, you would need to define the size of the fixed size vector that you want to use. This will depend on the specific requirements of your application and the amount of data you have available.

  2. Create a hash function: Next, you would need to create a hash function that can be used to hash the stock price returns data to the fixed size vector. There are many different approaches you can take to create a hash function, such as using a random permutation or a linear congruential generator.

  3. Hash the data: Once you have defined the size of the fixed size vector and created a hash function, you can use these to hash the variable size vector of stock price returns data to a fixed size vector. The resulting fixed size vector will represent the original data in a compressed form, with the size of the vector determined by the size you defined.

It's important to note that min-hash can result in loss of information, as it is an approximation of the similarity between two sets rather than an exact representation. The accuracy of the approximation will depend on the size of the fixed size vector and the specific hash function used.
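The three steps above can be sketched in plain Python (a minimal illustration, not a production implementation; the affine hash family h(x) = (a*x + b) mod p and the choice of prime are arbitrary):

```python
import random

def minhash_signature(items, num_hashes=8, seed=0):
    # Step 1: num_hashes is the size of the fixed-size signature.
    # Step 2: build a family of random affine hash functions
    #         h_i(x) = (a_i * x + b_i) mod p.
    # Step 3: the signature keeps, for each h_i, the minimum hash
    #         value over all elements of the input set.
    rng = random.Random(seed)
    p = 2**61 - 1  # a large Mersenne prime
    params = [(rng.randrange(1, p), rng.randrange(0, p))
              for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % p for x in items)
            for (a, b) in params]
```

Note the signature depends only on the set of values, not their order, which is exactly why min-hash approximates set similarity rather than preserving the sequence.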


- How to minimise loss of information


There are several ways you can minimize the loss of information when using min-hash to hash a variable size vector of stock price returns to a fixed size vector:

  1. Use a larger fixed size vector: One way to minimize loss of information is to use a larger fixed size vector. A larger vector will have more elements and can therefore capture more information about the original data. However, this will also increase the size of the vector and may not be practical if you have a large dataset or limited storage space.

  2. Use a better hash function: Another way to minimise loss of information is to use a better hash function. Different hash functions have different properties and can result in different levels of information loss. For example, a hash function with a high degree of randomness will generally result in less information loss than a hash function with a low degree of randomness.

  3. Use multiple hash functions: You can also minimise loss of information by using multiple hash functions and combining the results. This can help capture a more complete representation of the original data, as each hash function will capture different aspects of the data.

It's important to note that it may not be possible to completely eliminate loss of information when using min-hash, as it is an approximation of the similarity between two sets rather than an exact representation. The accuracy of the approximation will depend on the size of the fixed size vector and the specific hash function used.


Here is an example of how you might use min-hash in Python to hash a variable size vector of stock price returns to a fixed size vector:

from datasketch import MinHash

# Define the size of the fixed size vector
# (datasketch calls this parameter num_perm)
num_perm = 100

# Create a MinHash object
mh = MinHash(num_perm=num_perm)

# Hash the stock price returns data to the fixed size vector
# (stock_price_returns is assumed to be an iterable of floats)
for price_return in stock_price_returns:
    mh.update(str(price_return).encode('utf8'))

# The resulting fixed size vector is stored in the MinHash object
fixed_size_vector = mh.hashvalues


Here is an example of how you might use min-hash in R to hash a variable size vector of stock price returns to a fixed size vector:


# A sketch using the 'textreuse' package; the original reply referenced
# a 'minhash' package whose API could not be verified
install.packages("textreuse")

# Load the 'textreuse' package
library(textreuse)

# Define the size of the fixed size vector
num_hashes <- 100

# Create a family of min-hash functions
minhash <- minhash_generator(n = num_hashes, seed = 1)

# Hash the stock price returns (as character tokens) to the fixed size vector
fixed_size_vector <- minhash(as.character(stock_price_returns))


It's important to note that these examples are intended to illustrate the basic process of using min-hash to hash a variable size vector of stock price returns to a fixed size vector. In a real application, you may need to modify the code to suit your specific needs and data.

 

- so min-hash is the best way?

Min-hash is a method for approximating the similarity between two sets by estimating the fraction of elements they have in common (the Jaccard similarity). It can be used to hash a variable size vector of stock price returns to a fixed size vector, but it may not necessarily be the best way to do so, as it can result in loss of information.

There are many different approaches you can take to hash stock price returns data, and the best approach will depend on the specific characteristics of the data and the requirements of your application. Some other approaches you might consider include:

  1. Feature hashing: Feature hashing is a technique used to convert categorical or textual data into a numerical representation that can be used in machine learning algorithms. You could use feature hashing to hash the stock price returns data to a fixed size vector.

  2. Locality sensitive hashing (LSH): LSH is a technique used to approximately identify similar items in a dataset. You could use LSH to hash the stock price returns data to a fixed size vector and then identify similar items based on their hash values.

  3. Other techniques: There are many other techniques you could use to hash stock price returns data, such as using a random permutation or a linear congruential generator. The specific technique you choose will depend on the characteristics of the data and the requirements of your application.

It's important to carefully consider the trade-offs between accuracy and efficiency when choosing a hashing technique. Some techniques may be more accurate but less efficient, while others may be more efficient but less accurate. You will need to find the right balance that meets the needs of your application.
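As a rough sketch of the feature-hashing alternative applied to returns (the bin width, vector size, and function name are arbitrary illustrative choices — continuous returns must first be discretized into categorical labels before they can be feature-hashed):

```python
def returns_to_fixed_vector(returns, n_features=8, bin_width=0.01):
    # Discretize each continuous return into a coarse bin label,
    # then feature-hash the labels into a fixed-size count vector.
    # bin_width = 0.01 (i.e. 1%) is an arbitrary choice.
    vec = [0] * n_features
    for r in returns:
        label = int(r // bin_width)  # coarse bin index
        vec[hash(label) % n_features] += 1
    return vec
```

The output length is always n_features regardless of how many returns are passed in, and the vector entries sum to the number of input returns.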

 
mytarmailS #:
Alexey, stop listening to that amateur.

If you need a different number of features per input, that's frequent pattern mining — searching for association rules in their various forms, including supervised ones.


==============
Why did I say that there are no algorithms that accept a varying number of features?
Because even in such algorithms, under the hood the data is stored as a sparse matrix in one-hot-encoding format.
The input still contains all the feature names, just in a more compact format.
=======================
Also, the number of features was not specified.
If the number of features is relatively small (10-20k), you can load it into xgboost without any problems, even on a weak laptop.

We have already discussed association rules. For me they don't fit well with my general approach of looking for differences between price and a random walk (SB). The problem is that a random walk is quite good at making it look like rules exist — only they will be different on different segments of the data.

 
Aleksey Nikolayev #:

We have already discussed association rules. For me they don't fit well with my general approach of looking for differences between price and a random walk (SB). The problem is that a random walk is quite good at making it look like rules exist — only they will be different on different segments of the data.

Use boosting then, but to prepare the data the way you want (without structure) you need to understand how to do it properly; for that, it's worth studying how data is prepared for association rules.
 

Good work — I even took away something interesting for myself in the context of varying the window length.

If you have any more questions, sketch them out and I'll ask after the New Year.

 
Maxim Dmitrievsky #:

OK, that's clear — we can try it via hash functions. But I'm shy to ask: on what principle do we choose the vectors of arbitrary length, or rather their lengths :) and from what data?

there can be a lot of variants

the first option is more pleasant, but you need to put something in the logic

This is a very important question, I think about it all the time) Let's just talk about the length of the history used. You need a reasonable compromise between relevance and accuracy of calculation. The shorter the history, the more relevant it is; the longer, the more accurate the calculations. Sometimes a good compromise is unattainable in principle.
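The accuracy half of that trade-off can be illustrated with a small Monte Carlo sketch (stationary Gaussian "returns"; all parameters are arbitrary illustrative choices):

```python
import random
import statistics

def mean_estimate_error(window, true_mu=0.0, sigma=0.01,
                        trials=2000, seed=1):
    # Average absolute error of a sample-mean estimate as a
    # function of window length, on stationary Gaussian returns.
    # Longer windows give more accurate estimates — but on real
    # prices, older data may no longer be relevant.
    rng = random.Random(seed)
    errs = []
    for _ in range(trials):
        sample = [rng.gauss(true_mu, sigma) for _ in range(window)]
        errs.append(abs(statistics.fmean(sample) - true_mu))
    return statistics.fmean(errs)
```

On stationary data the error shrinks roughly as 1/sqrt(window); the catch in trading is precisely that the data is not stationary, so the "relevance" side of the compromise works against window length.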
