Generic Class Library - bugs, description, questions, usage and suggestions - page 10

 

About hash functions

From the previous examples, it is clear that the hash function takes the brunt of the load. A hash function can be made as fast as possible, but it will most likely be very collision prone. And vice versa, a hash function may be made as stable to collisions as possible, but its computational complexity will be inadequate to the task being performed. Let's consider two extreme examples. Example 1, the fastest possible hash function:

int GetHashCode(string word)
{
   return 0;
}

Will always return the same number. If our dictionary stores only one word, it will be enough, because the word will be at index 0. If, however, we have 10 words, then all ten of them will be at index 0 (don't forget we have an array of arrays).

For the second example, we turn to a reversible encryption, e.g. based on a Feistel network with number of rounds N. This is not a hash function, and it is not subject to collisions at all. I.e. for two different strings we will get guaranteed different numbers. The length of obtained numbers will be equal to the string length. Consequently, if string is 1000 characters long, we'll get uchar array of size 1000. If we need to store this string in a dictionary of size 3 elements, then you must admit that it's strange to calculate such a complex code, to find a word with only three elements.

Consequently we should always look for the golden mean (as rightly pointed out by fxsaber): we need a function which quickly calculates hashes and usually doesn't cause collisions. Let's estimate the probability of collisions for our previous hash function:

int GetHashCode(string word)
{
   int index = StringGetCharacter(word, 0)-'a';
}

There are 26 characters in the alphabet. Let's assume that the number of words beginning with symbol 'a' is equal to the number of words beginning with any other symbol. Then if our dictionary contains 988 words, 38 of them will begin with a, 38 with b, 38 with c, ... 38 to the letter w and 38 to the letter - z. The probability of a collision occurring in each case would be 1/26 or 3.8461%. If we have 988 words, then we obtain 988*3.8461 = 37.999 words per index.

It is clear that the more words for one and the same letter, the harder it is to find a particular word. In our case we not only have to calculate the index of the first letter, but also try all the words, which also start with the same letter. To avoid this we can complicate our hash function:

int GetHashCode(string word)
{
   uchar i1 = (uchar)word[0]-'a';
   uchar i2 = (uchar)word[1]-'a';
   int hash = i1<<(sizeof(uchar)*8);
   hash += i2;
   return hash;
}

We take the first two characters of the word. Then there will be 26^2 = 676 possible combinations. Note that the complexity of the second variant of hash function calculation has increased linearly, roughly speaking twice, while its resistance to collisions has increased non-linearly, squared.

For the second variant, the average will be one collision for every 676 words. In our case, for 988 words, there would be 988/676 = 1.4615 collisions per index. That is, each index will contain on average either one word or two words. I.e. on average there won't be any search at all (one comparison), or it will be very short (two comparisons).

If the ratio of dictionary size to hash function's cobmins is close to 1.0000000, then on average no brute force will be needed, because each element will be located at its own index all by itself. If a hash function will provide a very large range of values, the ratio of dictionary size to possible combinations of the hash function will always be even lower than 1.0. For example, if a hash function produces 2^32 combinations, then for any size N less than 2^32 the ratio N/2^32 < 1.0 will be true. If N is very small, then we simply normalize the number obtained by the hash function, so that it would always be in the limit of N:

int index = GetHashCode(word)%ArraySize(m_array);

If the hash function outputs results uniformly, then on average, we will get a ratio of 1.000. i.e. there will be only one word in one index of the array. Thus we can say that dictionary is a special case of array, in which key is specified instead of index.

 
Tag Konow:

The size of the array can easily be changed to match the size of the dictionary.

I don't take infinite variants into account.

The point is that the size of the dictionary is often unknown. A simple example, let's say we have an Expert Advisor that trades. It monitors the performed deals. When a trade appears in the history, we need to connect this trade with the Expert Advisor's Medjik, for example. For this it is logical to use the dictionary. Where the deal number is used as a key (unique identifier), and the magic number of the Expert Advisor is used as a value. The problem is that at the start of the EA we cannot determine in advance if we will have 100 trades, 1000 or none. No matter how much memory you allocate beforehand, it will either be too little or too much.

 
Vasiliy Sokolov:

That's the thing: the size of the dictionary is often unknown. A simple example, let's say we have an Expert Advisor that trades. It tracks performed trades. When a trade appears in the history, we need to associate this trade with the EA's "Medjic". For this it is logical to use the dictionary. Where the deal number is used as a key (unique identifier), and the magic number of the Expert Advisor is used as a value. The problem is that at the start of the EA we cannot determine in advance if we will have 100 trades, 1000 or none. No matter how much memory you allocate beforehand, it will still be either too little or too much.

Well, clearly, there are different tasks.

I was solving a task, in which I needed to create a handy dictionary for a single human language. The size of my array is not accurate, but what's the problem with adjusting it to the number of words of a particular language? It can easily fit in there.

For example the Russian language. Suppose it contains 10,000 basic words. All of these words can fit nicely into my array.

The first dimension - 66 letters (small and large), the second dimension - the number of letters in the word (up to 25 for example), the third dimension - the coincidence of the first letter and the number of letters - 50.

The whole language will fit.

//----------------------------

Initially, I was solving exactly this problem. You are now replacing the original problem with another one and saying that the solution is no good.

 
Vasiliy Sokolov:

No matter how much memory you allocate beforehand, it will still be either too little or too much.

Right.

It's also true that there is no universal solution for all tasks.

 
Tag Konow:

True.

Just as it is true that there is no one-size-fits-all solution to every problem.

Completely indisputable.

 
Sergey Dzyublik:

Keep your voice down, comrade. No one is immune from mistakes, not even you. So be kind enough to point out the mistakes of others in a softer tone. I'll correct them now.

 
Alexey Viktorov:

And the right option will remain a secret???


There can't be a correct variant of hash tables and hashes in principle - it all depends on the specific purpose and application conditions.
It's like making a sandwich. Maybe someone doesn't eat mayonnaise or ketchup, or he's a vegan....
But putting ketchup on the table and laying bread on it - it's kind of clear that's not the case here....


Basics on the subject for the lazy:
https://www.slideshare.net/mkurnosov/6-32402313

In reality, everything is much more complicated and is discussed in the relevant literature or in good "algorithms and data structures" courses.

Лекция 6: Хеш-таблицы
Лекция 6: Хеш-таблицы
  • 2014.03.17
  • Mikhail Kurnosov, Associate Professor (Docent) Follow
  • www.slideshare.net
Лекция 6 Хеш-таблицы Курносов Михаил Георгиевич к.т.н. доцент Кафедры вычислительных систем Сибирский государственный университет телекоммуникаций и информатик…
 

In short, hash function should work firstly: quickly, secondly - there is no need to invent something very complicated, because the appearance of the same values is normally resolved within the dictionary by direct brute force. The main thing is not to have too frequent collisions, that's all. In Generic, the set of functions in the file HashFunction.mqh is responsible for the hash of functions of base types. To make sure everything is simple there, let's see how it is organized:

//+------------------------------------------------------------------+
//|                                                 HashFunction.mqh |
//|                   Copyright 2016-2017, MetaQuotes Software Corp. |
//|                                             https://www.mql5.com |
//+------------------------------------------------------------------+
//+------------------------------------------------------------------+
//| Unioun BitInterpreter.                                           |
//| Usage: Provides the ability to interpret the same bit sequence in|
//|        different types.                                          |
//+------------------------------------------------------------------+
union  BitInterpreter
  {
   bool              bool_value;
   char              char_value;
   uchar             uchar_value;
   short             short_value;
   ushort            ushort_value;
   color             color_value;
   int               int_value;
   uint              uint_value;
   datetime          datetime_value;
   long              long_value;
   ulong             ulong_value;
   float             float_value;
   double            double_value;
  };
//+------------------------------------------------------------------+
//| Returns a hashcode for boolean.                                  |
//+------------------------------------------------------------------+
int GetHashCode(const bool value)
  {
   return((value)?true:false);
  }
//+------------------------------------------------------------------+
//| Returns a hashcode for character.                                |
//+------------------------------------------------------------------+
int GetHashCode(const char value)
  {
   return((int)value | ((int)value << 16));
  }
//+------------------------------------------------------------------+
//| Returns a hashcode for unsigned character.                       |
//+------------------------------------------------------------------+
int GetHashCode(const uchar value)
  {
   return((int)value | ((int)value << 16));
  }
//+------------------------------------------------------------------+
//| Returns a hashcode for short.                                    |
//+------------------------------------------------------------------+
int GetHashCode(const short value)
  {
   return(((int)((ushort)value) | (((int)value) << 16)));
  }
//+------------------------------------------------------------------+
//| Returns a hashcode for unsigned short.                           |
//+------------------------------------------------------------------+
int GetHashCode(const ushort value)
  {
   return((int)value);
  }
//+------------------------------------------------------------------+
//| Returns a hashcode for color.                                    |
//+------------------------------------------------------------------+
int GetHashCode(const color value)
  {
   return((int)value);
  }
//+------------------------------------------------------------------+
//| Returns a hashcode for integer.                                  |
//+------------------------------------------------------------------+
int GetHashCode(const int value)
  {
   return(value);
  }
//+------------------------------------------------------------------+
//| Returns a hashcode for unsigned integer.                         |
//+------------------------------------------------------------------+ 
int GetHashCode(const uint value)
  {
   return((int)value);
  }
//+------------------------------------------------------------------+
//| Returns a hashcode for datetime.                                 |
//+------------------------------------------------------------------+
int GetHashCode(const datetime value)
  {
   long ticks=(long)value;
   return(((int)ticks) ^ (int)(ticks >> 32));
  }
//+------------------------------------------------------------------+
//| Returns a hashcode for long.                                     |
//+------------------------------------------------------------------+
int GetHashCode(const long value)
  {
   return(((int)((long)value)) ^ (int)(value >> 32));
  }
//+------------------------------------------------------------------+
//| Returns a hashcode for unsigned long.                            |
//+------------------------------------------------------------------+  
int GetHashCode(const ulong value)
  {
   return(((int)value) ^ (int)(value >> 32));
  }
//+------------------------------------------------------------------+
//| Returns a hashcode for float.                                    |
//+------------------------------------------------------------------+
int GetHashCode(const float value)
  {
   if(value==0)
     {
      //--- ensure that 0 and -0 have the same hash code
      return(0);
     }
   BitInterpreter convert;
   convert.float_value=value;
   return(convert.int_value);
  }
//+------------------------------------------------------------------+
//| Returns a hashcode for string.                                   |
//+------------------------------------------------------------------+
int GetHashCode(const double value)
  {
   if(value==0)
     {
      //--- ensure that 0 and -0 have the same hash code
      return(0);
     }
   BitInterpreter convert;
   convert.double_value=value;
   long lvalue=convert.long_value;
   return(((int)lvalue) ^ ((int)(lvalue >> 32)));
  }
//+------------------------------------------------------------------+
//| Returns a hashcode for string.                                   |
//| The hashcode for a string is computed as:                        |
//|                                                                  |
//| s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]                     |
//|                                                                  |
//| using int arithmetic, where s[i] is the ith character of the     |
//| string, n is the length of the string, and ^ indicates           |
//| exponentiation. (The hash value of the empty string is zero.)    |
//+------------------------------------------------------------------+
int GetHashCode(const string value)
  {
   int len=StringLen(value);
   int hash=0;
//--- check length of string
   if(len>0)
     {
      //--- calculate a hash as a fucntion of each char
      for(int i=0; i<len; i++)
         hash=31*hash+value[i];
     }
   return(hash);
  }
//+------------------------------------------------------------------+
//| Returns a hashcode for custom object.                            |
//+------------------------------------------------------------------+
template<typename T>
int GetHashCode(T value)
  {
//--- try to convert to equality comparable object  
   IEqualityComparable<T>*equtable=dynamic_cast<IEqualityComparable<T>*>(value);
   if(equtable)
     {
      //--- calculate hash by specied method   
      return equtable.HashCode();
     }
   else
     {
      //--- calculate hash from name of object
      return GetHashCode(typename(value));
     }
  }
//+------------------------------------------------------------------+

Integer types are either returned as is, or for short integer types not elaborate bitwise conversion operations are used. For types that don't fit into 32 digits, an exclusive or followed by a left-right join is used. For strings, also not complicated hash is calculated through direct brute force. For real numbers the bit representation is taken with union and it is used as a hash.

 

Dictionary-type algorithms are conventionally divided into two parts:

  • Sets of key-value pairs, where the key is unique and the value is not. In Generic it is CHashMap.
  • Sets of unique values. In Generic, this is CHashSet.

What is a CHashSet? It is a collection (an array in its essence) where each element is unique. For example, such a collection can contain numbers 2,5,1,7,10,15,1024. But no two numbers can be the same. It can also contain strings, real numbers and even user-defined complex types (more about that later). But for any type the rule is the same: there can not be another of the same type in the dictionary.

What is a CHashMap? It's a collection (also an array in its essence), where each element is a complex type key-value:

template<typename TKey,typename TValue>
class CHashMap: public IMap<TKey,TValue>
  {
protected:
   int               m_buckets[];
   Entry<TKey,TValue>m_entries[];
   int               m_count;
   int               m_free_list;
   int               m_free_count;
   IEqualityComparer<TKey>*m_comparer;
   bool              m_delete_comparer;
};

It's arranged almost the same way as CHashMap, but allows you to match two sets such that an element of set 1 matches an element of set 2. This is very handy stuff, allowing you to write very efficient query algorithms with an average execution time of O(1).

Later on, we'll use examples to break down the construction of some kind of correspondence.

Interestingly, any key-value dictionary naturally establishes such a non-trivial relation asone to many. For example, we have a bunch of historical orders, and a bunch of trades that were executed based on them. Each trade corresponds to only one order, but one order can correspond to multiple trades. The dictionary of such a relation can store as a key the number of a deal, and as a value, the number of the order that served to open it.
 
Please do not litter the thread off-topic.