Generic Class Library - bugs, description, questions, usage and suggestions
About hash functions
From the previous examples, it is clear that the hash function takes the brunt of the load. A hash function can be made extremely fast, but then it will most likely be very collision-prone. Conversely, it can be made as collision-resistant as possible, but then its computational cost will be inadequate to the task at hand. Let's consider two extreme examples. Example 1, the fastest possible hash function:
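Its code is trivial; a minimal sketch, assuming the function simply ignores its argument (the name is illustrative):

//--- degenerate hash: ignores the word entirely
int GetConstantHash(const string word)
  {
   return 0; // every word maps to index 0
  }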
It will always return the same number. If our dictionary stores only one word, that is enough: the word will sit at index 0. If, however, we have 10 words, then all ten of them will end up at index 0 (remember, we have an array of arrays).
For the second example, let's turn to reversible encryption, e.g. based on a Feistel network with N rounds. Strictly speaking this is not a hash function, and it is not subject to collisions at all: for two different strings we are guaranteed to get different numbers. The length of the resulting number equals the length of the string, so for a 1000-character string we get a uchar array of size 1000. If we need to store this string in a dictionary of three elements, you must admit it is strange to compute such a complex code just to find a word among only three elements.
Consequently, we should always look for the golden mean (as rightly pointed out by fxsaber): we need a function that computes hashes quickly and rarely causes collisions. Let's estimate the probability of collisions for our previous hash function:
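A minimal sketch of that function, assuming it simply maps the first letter of a lowercase Latin word to an index 0..25 (the name is illustrative):

//--- hash by the first letter only
int GetFirstLetterHash(const string word)
  {
   ushort c = StringGetCharacter(word, 0); // first character, 'a'..'z' assumed
   return (int)(c - 'a');                  // 'a' -> 0, 'b' -> 1, ..., 'z' -> 25
  }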
There are 26 characters in the alphabet. Let's assume the number of words beginning with 'a' equals the number of words beginning with any other letter. Then, if our dictionary contains 988 words, 38 of them begin with 'a', 38 with 'b', 38 with 'c', and so on, up to 38 with 'z'. The probability that a word falls into any particular index is 1/26, or 3.8461%; with 988 words that gives 988 * 3.8461% ≈ 38 words per index.
It is clear that the more words share the same first letter, the harder it is to find a particular word. In that case we not only have to calculate the index from the first letter, but also check every word that starts with that letter. To avoid this, we can make our hash function a little more complex:
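A hedged sketch of such a version, assuming it combines the first two characters (name illustrative, lowercase Latin words of at least two letters assumed):

//--- hash by the first two letters
int GetTwoLetterHash(const string word)
  {
   int first  = (int)(StringGetCharacter(word, 0) - 'a');
   int second = (int)(StringGetCharacter(word, 1) - 'a');
   return first * 26 + second; // 26 * 26 = 676 possible values
  }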
We take the first two characters of the word. Then there are 26^2 = 676 possible combinations. Note that the computational cost of the second variant has grown linearly, roughly doubling, while its resistance to collisions has grown non-linearly, quadratically.
For the second variant, there will be on average one collision per 676 words. In our case of 988 words, that gives 988/676 = 1.4615 words per index. That is, each index will hold on average one or two words, so on average there will either be no search at all (one comparison) or a very short one (two comparisons).
If the ratio of dictionary size to the number of the hash function's possible combinations is close to 1.0, then on average no brute-force search will be needed, because each element will sit at its own index by itself. If the hash function provides a very large range of values, the ratio of dictionary size to the number of possible combinations will always be below 1.0. For example, if a hash function produces 2^32 combinations, then for any size N less than 2^32 the ratio N/2^32 < 1.0 holds. If N is very small, we simply normalize the number obtained from the hash function so that it always falls within the range [0, N):
int index = GetHashCode(word)%ArraySize(m_array);
If the hash function distributes its results uniformly, then on average we get a ratio of 1.000, i.e. there will be only one word at each index of the array. Thus we can say that a dictionary is a special case of an array, in which access is by key instead of by index.
The size of the array can easily be changed to match the size of the dictionary.
I don't take infinite variants into account.
That's the thing: the size of the dictionary is often unknown. A simple example: suppose we have an Expert Advisor that trades and monitors the deals it performs. When a deal appears in the history, we need to associate it with the Expert Advisor's magic number. For this it is logical to use a dictionary, where the deal ticket serves as the key (unique identifier) and the EA's magic number as the value. The problem is that at the start of the EA we cannot know in advance whether we will have 100 deals, 1000, or none at all. No matter how much memory you allocate beforehand, it will still be either too little or too much.
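A hedged sketch of how this could look with the library's CHashMap, assuming its usual Add/ContainsKey/TryGetValue methods (the helper functions and ticket values are illustrative, not part of the library):

#include <Generic\HashMap.mqh>

CHashMap<ulong, long> deal_magic; // key: deal ticket, value: EA magic number

// call this when a new deal appears in the history (hypothetical helper)
void RememberDeal(const ulong deal_ticket, const long magic)
  {
   if(!deal_magic.ContainsKey(deal_ticket))
      deal_magic.Add(deal_ticket, magic); // no need to know the final count in advance
  }

// later: which EA does this deal belong to?
long MagicOfDeal(const ulong deal_ticket)
  {
   long magic = -1;
   deal_magic.TryGetValue(deal_ticket, magic); // leaves -1 if the deal is unknown
   return magic;
  }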
Well, clearly, there are different tasks.
I was solving a task in which I needed to create a handy dictionary for a single human language. The size of my array is not exact, but what is the problem with adjusting it to the number of words of a particular language? They can easily fit in there.
Take the Russian language, for example. Suppose it contains 10,000 basic words. All of these words fit nicely into my array.
The first dimension is 66 letters (lowercase and uppercase), the second is the number of letters in the word (up to 25, say), and the third is the number of words sharing both the first letter and the length (up to 50).
The whole language will fit.
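A rough sketch of such fixed-size storage, with the dimensions taken from the post above (purely illustrative):

// 66 first letters (lower- and uppercase), word length up to 25,
// up to 50 words sharing both the first letter and the length
string dictionary[66][25][50];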
//----------------------------
Initially, I was solving exactly this problem. You are now replacing the original problem with another one and saying that the solution is no good.
No matter how much memory you allocate beforehand, it will still be either too little or too much.
Right.
It's also true that there is no universal solution for all tasks.
True.
Just as it is true that there is no one-size-fits-all solution to every problem.
Completely indisputable.
Keep your voice down, comrade. No one is immune from mistakes, not even you. So be kind enough to point out the mistakes of others in a softer tone. I'll correct them now.
And the right option will remain a secret???
There can't be a correct variant of hash tables and hashes in principle - it all depends on the specific purpose and application conditions.
It's like making a sandwich. Maybe someone doesn't eat mayonnaise or ketchup, or he's a vegan....
But putting ketchup on the table and laying the bread on top of it - it's pretty clear that is not how it's done...
Basics on the subject for the lazy:
https://www.slideshare.net/mkurnosov/6-32402313
In reality, everything is much more complicated and is discussed in the relevant literature or in good "algorithms and data structures" courses.
In short, a hash function should, first of all, work fast; second, there is no need to invent anything very complicated, because identical values are normally resolved inside the dictionary by a plain linear search. The main thing is that collisions should not be too frequent, that's all. In Generic, the set of functions in the file HashFunction.mqh is responsible for hashing the base types. To make sure everything there really is simple, let's look at how it is organized:
Integer types are either returned as is, or, for the short integer types, simple bitwise conversions are applied. For types that do not fit into 32 bits, the upper and lower halves are combined with an exclusive OR. For strings, a likewise uncomplicated hash is computed in a loop over the characters. For real numbers, the bit representation is obtained through a union and used as the hash.
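The exact code can be seen in HashFunction.mqh itself; the following is only an illustrative approximation of the ideas described above, not the library's verbatim source:

//--- 64-bit integers: fold the upper and lower halves with XOR
int HashSketch(const long value)
  {
   return (int)value ^ (int)(value >> 32);
  }

//--- doubles: reinterpret the bit pattern through a union, then hash it as an integer
union DoubleBits { double d; long bits; };
int HashSketch(const double value)
  {
   DoubleBits u;
   u.d = value;
   return HashSketch(u.bits);
  }

//--- strings: a simple loop over the characters
int HashSketch(const string value)
  {
   int hash = 0;
   for(int i = 0; i < StringLen(value); i++)
      hash = 31 * hash + StringGetCharacter(value, i);
   return hash;
  }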
Dictionary-type collections are conventionally divided into two kinds:
What is a CHashSet? It is a collection (essentially an array) in which every element is unique. For example, such a collection can contain the numbers 2, 5, 1, 7, 10, 15, 1024, but no two of them may be equal. It can also store strings, real numbers, and even user-defined complex types (more on that later). For any type, however, the rule is the same: no two identical elements may exist in the collection.
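A small hedged usage sketch, assuming the usual Add/Contains methods of CHashSet (script-style example):

#include <Generic\HashSet.mqh>

void OnStart()
  {
   CHashSet<int> numbers;
   numbers.Add(2);
   numbers.Add(5);
   numbers.Add(1024);
   bool added_again = numbers.Add(5);      // presumably false: 5 is already there
   bool has_five    = numbers.Contains(5); // true
   Print("added again: ", added_again, ", contains 5: ", has_five);
  }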
What is a CHashMap? It is a collection (again, essentially an array) in which each element is a compound key-value type:
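Conceptually, each stored element is a pair of two fields; a hedged illustration with concrete types (the library has its own pair type, the struct below is not it):

// illustrative key-value element
struct SDealMagicPair
  {
   ulong key;    // e.g. a deal ticket, unique within the dictionary
   long  value;  // e.g. the magic number associated with that deal
  };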
It is arranged almost the same way as CHashSet, but it lets you map one set onto another, so that an element of set 1 corresponds to an element of set 2. This is very handy stuff, allowing you to write very efficient lookup algorithms with an average execution time of O(1).
Later on, we'll use examples to break down how such a correspondence can be built.
Interestingly, any key-value dictionary naturally establishes such a non-trivial relation as one-to-many. For example, we have a set of historical orders and a set of deals executed on their basis. Each deal corresponds to exactly one order, but one order can correspond to several deals. A dictionary for such a relation can store the deal ticket as the key and, as the value, the ticket of the order that opened it.
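A hedged sketch of such a one-to-many dictionary, assuming the usual CHashMap methods (the tickets are made up for the example):

#include <Generic\HashMap.mqh>

CHashMap<ulong, ulong> deal_to_order; // key: deal ticket, value: order ticket

void Example()
  {
   // three deals, two of which were opened by the same order 500
   deal_to_order.Add(1001, 500);
   deal_to_order.Add(1002, 500);
   deal_to_order.Add(1003, 501);

   ulong order = 0;
   if(deal_to_order.TryGetValue(1002, order))
      Print("deal 1002 was opened by order ", order); // order == 500
  }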