Need help! Can't solve the problem, I'm hitting hardware limitations

 

I'm trying again.

We have three files.

1. L. Tolstoy. War and Peace, Volume 1.

2. L. Tolstoy. War and Peace, Volume 2.

3. F. Dostoevsky. Crime and Punishment.

We compress each of them.

We now have three compressed files with no names (just don't ask me how I imagine a file with no name). We also have one uncompressed file; let it be "Crime and Punishment".

How do we find this file among the three compressed ones in the most economical way?

Option 1: I decompress all three files and find the one I am looking for.

Option 2: Compress the file I am looking for and find the exact same file among the three compressed ones.
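
A minimal sketch of Option 2 in Python, under a strong assumption: all three archives were produced by the same deterministic compressor at the same settings (here zlib at one level; the file names are hypothetical, since the archives in the problem have no names). If the compressor, level, or container metadata differ, identical content can still yield different bytes and the comparison fails:

    import zlib

    # Hypothetical names for the three unnamed archives.
    archives = ["packed_1.bin", "packed_2.bin", "packed_3.bin"]

    with open("crime_and_punishment.txt", "rb") as f:
        # Assumption: compress exactly as the archives were compressed.
        packed_target = zlib.compress(f.read(), 6)

    for name in archives:
        with open(name, "rb") as f:
            if f.read() == packed_target:
                print("match:", name)

The appeal is that nothing gets decompressed; the cost is that the method is brittle, which is what the replies below are getting at.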

 
YuraZ:

Mm-hmm, so the option you propose is no good.

That wasn't my suggestion. In any case, it's an interesting idea.
 
Integer:
Yes, I know that if the data is short, archiving increases its size.
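
That quoted point is easy to check; a tiny Python sketch with zlib shows fixed overhead outweighing any savings on a short input:

    import zlib

    s = b"short"
    # Compressed output is longer than the input (e.g. 5 bytes -> ~13),
    # because header and checksum overhead dominate tiny payloads.
    print(len(s), len(zlib.compress(s)))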

If you want to continue in this direction, you can also use hashing or a checksum for the search; you don't have to encode with compression. Create a hash index and search it by dichotomy (binary search).

But that only works if the source portion is available in full.
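
A hedged sketch of that suggestion in Python (all file names are hypothetical): hash each stored file once, keep the hashes sorted, and find the full source file by dichotomy, i.e. binary search:

    import bisect
    import hashlib

    def file_hash(path):
        # Stream the file so nothing large sits in memory at once.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Build the index once over the stored files.
    files = ["volume1.txt", "volume2.txt", "crime_and_punishment.txt"]
    index = sorted((file_hash(p), p) for p in files)
    hashes = [h for h, _ in index]

    # Dichotomy: binary search for the hash of the sought file.
    needle = file_hash("crime_and_punishment.txt")
    i = bisect.bisect_left(hashes, needle)
    if i < len(hashes) and hashes[i] == needle:
        print("found:", index[i][1])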

For example, in such cases I use a DBMS without any tricks. I spend less time on development, and the product is stable.
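
And a minimal sketch of the plain-DBMS route, here with Python's built-in sqlite3 and made-up hash values; the PRIMARY KEY index does the searching, so there is no custom lookup code at all:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE files (hash TEXT PRIMARY KEY, name TEXT)")
    # Hypothetical hash/name pairs; PRIMARY KEY gives an indexed lookup.
    rows = [("h1", "War and Peace, Volume 1"),
            ("h2", "War and Peace, Volume 2"),
            ("h3", "Crime and Punishment")]
    con.executemany("INSERT INTO files VALUES (?, ?)", rows)
    hit = con.execute("SELECT name FROM files WHERE hash = ?", ("h3",)).fetchone()
    print(hit)  # ('Crime and Punishment',)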

 

You guys are saying all the right things, and this once again emphasises that the compression option should be justified for the task.

You have to rely on the problem statement.

 
elugovoy:

If you want to continue in this direction, you can also use hashing or a checksum for the search; you don't have to encode with compression. Create a hash index and search it by dichotomy (binary search).

But that only works if the source portion is available in full.

For example, in such cases I use a DBMS without any tricks. I spend less time on development, and the product is stable.

Good idea.
 
Integer:
That wasn't my suggestion. In any case, it's an interesting idea.

>>> Talking about comparing two compressed sequences.

Dima! Let me remind you: this is what we were talking about.

>>> that's exactly what we discussed in practice.

>>> Yes, I know that if the data is short, archiving increases its size.

>> that's why it's no good

--

That's why industrial databases don't take this approach either.

...

 
elugovoy:

If you want to continue in this direction, you can also use hashing or a checksum for the search; you don't have to encode with compression. Create a hash index and search it by dichotomy (binary search).

But that only works if the source portion is available in full.

For example, in such cases I use a DBMS without any tricks. I spend less time on development, and the product is stable.

Integer:

Good idea.

You may try it.

>>> For example, in such cases I use a DBMS without any tricks. I spend less time on development, and the product is stable.

But it's better to use a ready-made industrial SQL database.

 
YuraZ:

>>> Talking about comparing two compressed sequences.

Dima! Let me remind you: this is what we were talking about.

>>> that's exactly what we discussed in practice.

>>> Yes, I know that if the data is short, archiving increases its size.

>> that's why it's no good

--

That's why industrial databases don't take this approach either.

...

I think it's for another reason: the problem of loading big data into RAM is solved differently there. The data is not loaded; it is read directly from disk. (Probably.)
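
For what it's worth, a small Python sketch of that read-straight-from-disk behaviour, using mmap so the operating system pages data in on demand (the file name is hypothetical):

    import mmap

    with open("war_and_peace_vol1.txt", "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The file is never loaded into RAM up front; pages are
            # fetched from disk as the search touches them.
            print("found at offset", mm.find(b"Bolkonsky"))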
 
YuraZ:

You may try it.

>>> For example, in such cases I use a DBMS without any tricks. I spend less time on development, and the product is stable.

But it's better to use a ready-made industrial SQL database.

Yurichik, I mean without any twists and turns with file processing, compression, and so on. I meant just working with SQL plus the robot/indicator logic. I've worked with many databases; the only problem was making MQL and SQL work together )). I ended up with a neat solution without arrays and structures.

In general, I prefer not to reinvent the wheel and to solve problems with the most suitable tools.

 
Integer:
I think it's for another reason: the problem of loading big data into RAM is solved differently there. The data is not loaded; it is read directly from disk. (Probably.)

the server does it... very efficiently.