Need help! Can't solve the problem, I'm hitting hardware limitations - page 7

 
YuraZ:

the idea is clear...

and yet this methodology (with a quick search) is not available on industry bases

there must be a reason

Because the database already perfectly copes with the task of search without loading all data into RAM.
 
Integer:

No one is talking about searching in compressed data. We're talking about comparing two compressed sequences.

Suppose an array: "aaa", "bbb", "ccc". Each of the three array elements is compressed by itself, independently of the rest. Suppose we compress them and get the array "a", "b", "c".

We have the sought string "bbb" that we need to find in the array. Before searching, we compress it as well and get "b". Now we search and find it.
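The lookup idea described above can be sketched in Python. This is a minimal illustration, not anything from the thread: the `rle` encoder and the sample data are hypothetical, and (anticipating the correction below) the compressed form keeps the repeat counts:

```python
def rle(s):
    """Run-length encode a string: 'aaa' -> '3a' (count, then character, per run)."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append(f"{j - i}{s[i]}")
        i = j
    return "".join(out)

# Each array element is compressed independently of the others.
data = ["aaa", "bbb", "ccc"]
compressed = [rle(x) for x in data]   # ['3a', '3b', '3c']

# To search, compress the needle the same way and compare compressed forms.
needle = rle("bbb")                   # '3b'
print(compressed.index(needle))       # prints 1
```

The comparison works only because every element and the needle go through the same deterministic encoder, so equal inputs always produce equal compressed outputs.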

Let's be clear: in your case it should be "3a", "3b" and "3c" in compressed form, since you are omitting the number of repetitions.

If you think that such an algorithm will give 70-80% compression, you are mistaken. Even on Russian-language text (not to mention numbers), this approach will only inflate the data.

For example, the word "Expert" will be recoded as "1E1x1p1e1r1t" and will not be compressed even a bit; on the contrary, it will double in size. If you omit the "1", there will still be no compression.

The date and time 2014.08.18 13:45:45 will not compress your way either.

Not to mention the quotes... So the efficiency of such transcoding is close to 0 in this case. This is the so-called RLE algorithm, used in the PCX image format.
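The inflation described here is easy to reproduce with a naive run-length encoder (a Python sketch; the `rle` helper is hypothetical). On data with no repeated runs, every character costs two output characters:

```python
def rle(s):
    """Run-length encode a string: 'aaa' -> '3a' (count, then character, per run)."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append(f"{j - i}{s[i]}")
        i = j
    return "".join(out)

print(rle("Expert"))               # prints 1E1x1p1e1r1t (double the length)
print(rle("2014.08.18 13:45:45"))  # no adjacent repeats, so output is 2x the input
```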

 
elugovoy:

1. Let's be clear: in your case it should be "3a", "3b" and "3c" in compressed form, since you are omitting the number of repetitions.

If you think that such an algorithm will give 70-80% compression, you are mistaken. Even on Russian-language text (not to mention numbers), this approach will only inflate the data.

For example, the word "Expert" will be recoded as "1E1x1p1e1r1t" and will not be compressed even a bit; on the contrary, it will double in size. If you omit the "1", there will still be no compression.

The date and time 2014.08.18 13:45:45 will not compress your way either.

Not to mention the quotes... So the efficiency of such transcoding is close to 0 in this case. This is the so-called RLE algorithm, used in the PCX image format.

1. Not a fact. Maybe the data is such that everything repeats three times, which is why the compression algorithm is so simple.

The rest... Oh, thank you, you've opened my eyes to the world; I didn't know that compressing short sequences of data increases their size.

 
Integer:
Because the database already perfectly copes with the task of search without loading all data into RAM.

In fact, a good and correctly configured SQL server (given hardware with, say, 64 GB or 128 GB of RAM)

occupies practically all of those 64 GB (or 128 GB), minus the needs of the operating system...

and a search over 20 GB of (pre-cached) data would take place practically in memory...

That's why there is no sense in compressing it.

--

 
Integer:

You're kind of hinting:

How to search, I wrote earlier.

First of all, "hint" and "assert" are different concepts.

Secondly, there's not even a hint of that in my words. I'll say it again: the source file will have one tree, while the portion of data to be searched for will have its own, quite different tree.

And using a more or less serious compression algorithm (any classical one, even Huffman's), you won't be able to do the search. Not even theoretically.
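The point about differing trees can be illustrated in Python. Huffman codes are derived from symbol frequencies, so a large file and a short search string generally produce different code tables; the `huffman_codes` helper and the sample strings below are hypothetical, a minimal sketch only:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table (symbol -> bit string) from the frequencies in `text`."""
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)  # unique tie-breaker so dicts are never compared
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

big = huffman_codes("abracadabra " * 100 + "needle")   # the "huge file"
small = huffman_codes("needle")                        # the separately compressed needle
# The code table depends on symbol frequencies, so the same symbol
# is encoded differently in the two compressions:
print(big["e"], small["e"])
```

Because "e" is rare in the large text but dominant in the short one, its bit string differs between the two tables, so the compressed needle cannot be matched byte-for-byte against the compressed file.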

 
Integer:
Because the database already perfectly copes with the task of search without loading all data into RAM.
That's why it makes sense to put the 20 gigabytes into an SQL database.
 
elugovoy:

First of all, a "hint" and an "assertion" are different concepts.

Secondly, there's not even a hint of that in my words. I'll say it again: the source file will have one tree, while the portion of data to be found will have its own tree.

And using a more or less serious compression algorithm (any classical one, even Huffman's), you won't be able to do the search. Not even theoretically.

Why would a portion of data have a different tree if you're compressing the same data? If the data is different, let it have a different tree. The important thing is that the compressed data match when the same data is compressed.
 
Integer:
Why would there be a different tree for a portion of the data if the same data was compressed? If the data is different, then let it have a different tree. What matters for us is that the compressed data coincide when the same data is compressed.

Dmitry, if that were possible,

an industrial SQL database would have been created long ago, with (FAST) search in well-compressed (80-90%) data...

and with Insert/Update/Delete as well.

 
Integer:

1. Not a fact. Maybe the data is such that everything repeats three times, which is why the compression algorithm is so simple.

The rest... Thank you for opening my eyes to the world; I just didn't know that compressing short sequences of data increases their size.

And one more small argument, to keep my eyes open: any encoding has a purpose.

There is expanding encoding, whose purpose is to transmit binary data over communication channels that support only text (e.g. pictures by e-mail); usually Base64, a radix-64 encoding, is used.

There is redundant encoding for error correction (as on CD Audio), whose purpose is to protect the data as much as possible against damage to the physical medium.

And there is compression encoding, whose purpose is data storage/archiving.
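The first category works against compression: Base64, the standard transport encoding mentioned above, expands data rather than shrinking it, as a quick Python check shows (the sample bytes are arbitrary):

```python
import base64

raw = bytes(range(16))         # 16 arbitrary binary bytes
text = base64.b64encode(raw)   # text-safe form for mail/teletype channels
print(len(raw), len(text))     # prints 16 24: Base64 inflates data by roughly 4/3
```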

 
YuraZ:

Dmitry, if that were possible,

an industrial SQL database would have been created long ago, with (FAST) search in well-compressed (80-90%) data...

Going for a second round? Start re-reading from pages 5-6. Read the posts carefully.

Don't attribute to me things I wasn't suggesting. I suggested comparing compressed sequences that are compressed independently of each other, not searching for a separately compressed small text inside a separately compressed huge file.