Need help! Can't solve the problem, I'm hitting hardware limitations

 

There is a large amount of information (about 20 GB in a text file).

The information consists of the same kind of sequences, about a million of them.

It is necessary to go through all the sequences repeatedly and make some calculations.

The first thing that comes to mind is to read all the contents of the file, fill the array of structures with it and work with them in memory.
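Schematically it's something like this (the record layout, the parsing and the file name are placeholders just to illustrate; the real sequences are more complex):

```cpp
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical record; the real sequence layout is different.
struct Record
{
    double value;
    int    index;
};

int main()
{
    std::vector<Record> all;                    // everything lives in RAM
    std::ifstream in("sequences.txt");          // file name is made up
    std::string line;

    while (std::getline(in, line))
    {
        Record r;
        if (sscanf(line.c_str(), "%lf;%d", &r.value, &r.index) == 2)
            all.push_back(r);                   // on ~20 GB of input this is where a
                                                // 32-bit process runs out of address space
    }

    // ... all the repeated passes are then done over `all`, in memory ...
    return 0;
}
```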

But it goes wrong: on the next array resize, MT complains "Memory handler: cannot allocate 5610000 bytes of memory".

Task Manager shows that terminal.exe is using 3.5 GB of RAM (out of 16 GB physical). I assume this is because a 32-bit process can only get 4 GB.

Before it even gets started:

Read 2%

Read 6%

Read 12%

Read 15%

That's it...

EA says "Not enough memory(4007 Mb used, 88 Mb available, 4095 Mb total)!!!".

And this is only 15.3% of the required amount (and I would like to increase it in the future as well).


Option 2: re-read the file every time. Find the necessary piece, save it to a structure, read the next piece, compare the results, overwrite the structure.
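In code it comes out roughly like this (only a sketch: the "piece" and its scoring are placeholders for the real calculation):

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical "piece" and score; the real comparison is something else entirely.
struct Piece
{
    std::string data;
    double      score = -1e308;
};

static double evaluate(const std::string &line)
{
    return (double)line.size();                 // stand-in for the real calculation
}

int main()
{
    // Every pass means re-reading the whole 20 GB file from disk;
    // each pass would actually start from a slightly shifted position.
    for (int pass = 0; pass < 100; ++pass)
    {
        std::ifstream in("sequences.txt");      // file name is made up
        std::string   line;
        Piece         best;

        while (std::getline(in, line))
        {
            double s = evaluate(line);
            if (s > best.score)                 // better piece found:
            {                                   // overwrite the one structure we keep
                best.data  = line;
                best.score = s;
            }
        }
        printf("pass %d: best score %.2f\n", pass, best.score);
    }
    return 0;
}
```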

And if I had to go through these sequences once, that's what I would do. But you have to go through them many times, shifting forward a bit each time.

So you have to read a lot of times, which is:

  • very, very slow.
  • wearing a hole in the hard drive.
I'm not sure I'm ready to wait a few days for the results...

It's also frustrating how much information there is... If it were 10 gigs, I'd move it to a RAM disk (essentially, into memory) and read from it as much as I like. Right?

That's all I can think of.

Maybe repackage these sequences, so that there are many, many pieces, but each one contains only the information needed at a given moment?

Also try to compress the data (I've already switched to floats and char types everywhere I can)? But that will give me 10-20% more at most, and I need to reduce the volume by an order of magnitude...

Any advice, friends? I'll figure it out )

 

komposter:

Any advice, friends? I'll figure it out )

As options...

1. Make your own cache. In this case, you control what's in memory. You know the algorithm, so you can make the cache efficient.

2. Use file mapping. Windows itself will cache what it needs, and it won't thrash the hard drive.

 
TheXpert:

As options...

1. Make your own cache. In this case, you control what's in memory. You know the algorithm, so you can make the cache efficient.

2. Use file mapping. Windows itself will cache what it needs, and it won't thrash the hard drive.

1. That is essentially the cache... Or I don't understand what you mean. My option of constantly reading the necessary chunks?

2. Can you elaborate a bit more? What will mapping do and which way to approach it?

 
I'm starting to get the hang of mapping. I'll study the manuals a bit more, and then it's off to the mines.)
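From the manuals it comes out to roughly this (just a sketch, not tested yet):

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    // Open the data file (the name is made up).
    HANDLE file = CreateFileA("sequences.txt", GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return 1;

    // Create a read-only mapping object over the whole file.
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (!mapping) { CloseHandle(file); return 1; }

    // Map a view: the file contents appear as ordinary memory,
    // and Windows pages pieces in and out of RAM by itself.
    const char *data = (const char *)MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    if (data)
    {
        printf("first bytes: %.16s\n", data);   // read it like a normal array
        UnmapViewOfFile(data);
    }

    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}
```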
 

Oh, shit...

32-bit architectures (Intel 386, ARM 9) cannot create mappings larger than 4 GB.

Same thing, just from a different angle. Reading might speed up, but it doesn't solve the problem globally.

 

Another idea is to move everything into a database (MySQL?) and work with that. The idea is that databases are designed for exactly these volumes and for constant digging through them.
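Very roughly, what I have in mind (going through the MySQL C API; the table layout and connection parameters are pure guesswork):

```cpp
#include <mysql/mysql.h>
#include <cstdio>

int main()
{
    // Connection parameters are placeholders.
    MYSQL *db = mysql_init(NULL);
    if (!mysql_real_connect(db, "localhost", "user", "password", "testdata", 0, NULL, 0))
        return 1;

    // Hypothetical table: one row per sequence element, indexed so that
    // the "necessary piece" can be pulled out without scanning all 20 GB.
    mysql_query(db,
        "CREATE TABLE IF NOT EXISTS seq ("
        "  seq_id INT, pos INT, value DOUBLE,"
        "  PRIMARY KEY (seq_id, pos))");

    // Each pass then becomes a query instead of a full re-read of the file.
    if (mysql_query(db, "SELECT value FROM seq WHERE seq_id = 12345 ORDER BY pos") == 0)
    {
        MYSQL_RES *res = mysql_store_result(db);
        MYSQL_ROW  row;
        while ((row = mysql_fetch_row(res)) != NULL)
            printf("%s\n", row[0]);
        mysql_free_result(res);
    }

    mysql_close(db);
    return 0;
}
```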

Are there any experts? Who has something to say?

 

1) Is there any way to redo the algorithm? Load a block (2 GB), process it, save the result (which is much shorter), release the memory, load the next block ...

and at the end, go through all the results once more (a rough sketch of this is below the list).

2) When there's a lot of work with memory, the usual associations are hash-based solutions, B-trees (and their modifications), and offloading to a database.
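The first option, very schematically (block size, file names and the "reduction" itself are placeholders):

```cpp
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

// Phase 1: read the big file in blocks that fit into memory, reduce each block
// to a much smaller intermediate result, drop the block, read the next one.
// Phase 2: process only the small results.
int main()
{
    std::ifstream       in("sequences.txt");
    std::ofstream       out("partial_results.txt");
    std::string         line;
    std::vector<double> block;
    const size_t        kBlockLines = 50000000;   // ~2 GB of text at ~40 bytes per line

    while (std::getline(in, line))
    {
        block.push_back(atof(line.c_str()));       // stand-in for real parsing
        if (block.size() == kBlockLines)
        {
            double sum = 0;                        // stand-in for the real processing
            for (size_t i = 0; i < block.size(); ++i) sum += block[i];
            out << sum << "\n";                    // a short result instead of the raw block
            block.clear();                         // reuse the buffer for the next block
        }
    }
    if (!block.empty())                            // flush the last, partially filled block
    {
        double sum = 0;
        for (size_t i = 0; i < block.size(); ++i) sum += block[i];
        out << sum << "\n";
    }
    out.close();

    // Phase 2: the results file is small enough to go through in one pass.
    std::ifstream results("partial_results.txt");
    double total = 0, part;
    while (results >> part) total += part;
    printf("final result: %f\n", total);
    return 0;
}
```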

 
ALXIMIKS:

1) Is there any way to redo the algorithm? Load a block (2 GB), process it, save the result (which is much shorter), release the memory, load the next block ...

and at the end, go through all the results once more.

2) When there's a lot of work with memory, the usual associations are hash-based solutions, B-trees (and their modifications), and offloading to a database.

1. I wrote about that - it's possible, but the problem is that the data would have to be processed many times over. It would be very slow.

2. Tomorrow I'll google it myself, but I'd be grateful for a short description.

 
komposter:

1. I wrote about that - it's possible, but the problem is that the data would have to be processed many times over. It would be very slow.

2. Tomorrow I'll google it myself, but I'd be grateful for a short description.

I remembered a site where a similar problem and variants of its solution in C++ were discussed.

Sort the lines in a 3 GB file | FulcrumWeb
  • www.fulcrumweb.com.ua
You need to write an algorithm that can sort the lines of a large file (2 to 4 gigabytes). The result must be another file. For a good implementation the guaranteed prize is a flash drive for storing such files and, possibly, a job offer at our company.
 
Sorry, but what if you try 64-bit? Or does MT only run in 32-bit?
I naively thought that such a math-heavy thing ought to be running on 64 bits.
Take aerodynamics calculations for aircraft - they don't run on 32-bit at all.
I know the basic argument that 32-bit is faster for the machine, but this really is a hardware problem, IMHO.
 

1. Naturally, use an x64 system.

2. Rent a more powerful machine in the Amazon EC2 cloud and do the calculations on it.

3. Use compressed data and decompress it in memory on the fly. Real data compresses better if you split it into streams (sign/exponent/mantissa); you can also use a 12-bit float at the expense of accuracy (a rough sketch of that is below the list).

4. Do the calculation outside the Expert Advisor, with something that can handle big data (MATLAB/R/etc.).
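Point 3, very roughly (just truncating the float's mantissa; how exactly to split the bits is up to you, and splitting the data into separate sign/exponent/mantissa streams before the compressor is not shown here):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Keep the sign, the full 8-bit exponent and only the top 3 mantissa bits.
// Lossy: with 3 mantissa bits left, the relative error can be on the order of 10%.
uint16_t pack12(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);     // reinterpret the float's bit pattern
    return (uint16_t)(bits >> 20);           // top 12 bits: s eeeeeeee mmm
}

float unpack12(uint16_t p)
{
    uint32_t bits = (uint32_t)p << 20;       // dropped mantissa bits come back as zeros
    float f;
    std::memcpy(&f, &bits, sizeof bits);
    return f;
}

int main()
{
    float x = 1.2345f;
    printf("%f -> %f\n", x, unpack12(pack12(x)));   // shows the accuracy you give up
    return 0;
}
```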