Need help! Can't solve the problem, I'm hitting hardware limitations - page 19

 
komposter:

So you would have to go through several million deals from other sequences! That's exactly what I mean.

Well, I'm talking about checking one single number (the sequence number); a million isn't much for such an elementary operation. If we compute the criterion recursively, it's only one pass over the file, and we'll have to make that pass anyway. The other thing to consider is that the recursion data map will contain the same several million elements (multiplied by the number of parameters per sequence).

Another alternative is to fully recalculate and store criteria before sorting. Again, the possibility of using recursion is critical.

It is clear that there will be a lot of operations either way, but are there fewer in the original version?
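Schematically, the "recalculate and store the criteria before sorting" variant could look something like this. The file layout, the Deal record and the criterion below are just assumptions to make the idea concrete, not the actual format:

// Pass 1: read the file once sequentially, compute a criterion per sequence
// and remember where the sequence starts. Sort by criterion. Pass 2 would
// then seek only to the few sequences that are actually needed.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

#pragma pack(push, 1)
struct Deal {                 // assumed fixed-size deal record
    int64_t open_time;
    int64_t close_time;
    double  profit_pips;
};
#pragma pack(pop)

struct Entry {
    double   criterion;
    uint64_t offset;          // where the sequence starts in the file
};

int main() {
    std::FILE* f = std::fopen("sequences.bin", "rb");    // hypothetical file name
    if (!f) return 1;

    std::vector<Entry> entries;
    uint64_t pos = 0;
    uint32_t count = 0;
    while (std::fread(&count, sizeof(count), 1, f) == 1) {
        std::vector<Deal> deals(count);
        if (std::fread(deals.data(), sizeof(Deal), count, f) != count) break;
        double criterion = 0;
        for (const Deal& d : deals) criterion += d.profit_pips;  // placeholder criterion
        entries.push_back({criterion, pos});
        pos += sizeof(count) + (uint64_t)count * sizeof(Deal);
    }
    std::fclose(f);

    std::sort(entries.begin(), entries.end(),
              [](const Entry& a, const Entry& b) { return a.criterion > b.criterion; });

    // Second read: reopen the file and seek (with a 64-bit seek) to
    // entries[i].offset for the best candidates only.
    return 0;
}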

 
ALXIMIKS:

In C++ you can do it without a parser.

I'm pushing this idea for the tenth time: create a second file holding the start positions of the sequences in the first file; then you don't even need to store the number of deals in the sequence structure.

You would get an ordinary index, which has already been mentioned before.
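A minimal sketch of such an index file, assuming the data file stores each sequence as a 32-bit deal count followed by fixed-size deal records; the Deal layout and file names are placeholders, not the real format:

// Scan the data file once and write an index of sequence start offsets.
#include <cstdint>
#include <cstdio>
#include <vector>

#pragma pack(push, 1)
struct Deal {                 // assumed fixed-size deal record
    int64_t open_time;
    int64_t close_time;
    double  profit_pips;
};
#pragma pack(pop)

int main() {
    std::FILE* data = std::fopen("sequences.bin", "rb");   // hypothetical names
    std::FILE* idx  = std::fopen("sequences.idx", "wb");
    if (!data || !idx) return 1;

    std::vector<uint64_t> offsets;
    uint64_t pos = 0;
    uint32_t count = 0;
    while (std::fread(&count, sizeof(count), 1, data) == 1) {
        offsets.push_back(pos);                             // start of this sequence
        pos += sizeof(count) + (uint64_t)count * sizeof(Deal);
        // Skip the deals of this sequence (the relative offset is small).
        std::fseek(data, (long)((uint64_t)count * sizeof(Deal)), SEEK_CUR);
    }
    std::fwrite(offsets.data(), sizeof(uint64_t), offsets.size(), idx);
    std::fclose(idx);
    std::fclose(data);
    return 0;
}

To read sequence i later, you would just seek to offsets[i] in the data file.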

 

I implemented the algorithm I described above.

Processing my data with X = 20 (criterion calculated over 20 deals) took about 112 minutes (I didn't time it exactly); the lion's share was eaten by the array shift (40% according to the profiler).

After making the arrays circular (ring buffers instead of shifting; see the sketch after the list below) and some other optimization, the processing became faster:

  • at X = 20: 48 minutes
  • at X = 50: 60 minutes
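If "making the arrays circular" means replacing the full array shift with a ring buffer, the idea is roughly this. An illustrative sketch only, not the actual code; the Ring type is made up for the example:

// Ring buffer over the last X deals: adding a new deal overwrites the oldest
// one instead of shifting the whole array (the shift was eating ~40% of time).
#include <cstddef>
#include <vector>

template <typename T>
class Ring {
    std::vector<T> buf_;
    std::size_t head_ = 0;   // index of the oldest element
    std::size_t size_ = 0;
public:
    explicit Ring(std::size_t capacity) : buf_(capacity) {}
    void push(const T& v) {                       // O(1), no shifting
        std::size_t tail = (head_ + size_) % buf_.size();
        buf_[tail] = v;
        if (size_ < buf_.size()) ++size_;
        else head_ = (head_ + 1) % buf_.size();   // overwrite the oldest
    }
    std::size_t size() const { return size_; }
    const T& operator[](std::size_t i) const {    // i = 0 is the oldest deal
        return buf_[(head_ + i) % buf_.size()];
    }
};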

There was only enough memory for 65 deals per sequence (times a million sequences). That, of course, is not enough; I was counting on at least 100.

The next step was to discard all unnecessary information about the deals. Keeping only the opening and closing times and the profit in pips, I managed to get up to X = 185.

Next is simply speeding up work with the file: FileReadXXX now takes 30% of the time (and Task Manager shows no disk activity, yet the CPU is loaded).

It is exactly the first FileRead after a FileSeek that is slow, i.e. sequential reading itself works fast.

I will let you know about new results.

kazakov.v:
In C++ it's done with fread into a 64K-128K buffer; parsing is better done with your own parser, because sscanf is very slow.
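Roughly like this, for example; the buffer size and file name are placeholders, and the parser itself is left as a stub:

// Read a file in 128 KB chunks with fread and hand each chunk to a parser.
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t kChunk = 128 * 1024;             // 64K-128K as suggested
    std::FILE* f = std::fopen("sequences.bin", "rb");  // hypothetical name
    if (!f) return 1;

    std::vector<unsigned char> buf(kChunk);
    std::size_t got = 0;
    while ((got = std::fread(buf.data(), 1, buf.size(), f)) > 0) {
        // parse_chunk(buf.data(), got);   // your own parser instead of sscanf
    }
    std::fclose(f);
    return 0;
}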

I will try working with the files through WinAPI; that is what the service desk advised me.
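For reference, a bare-bones WinAPI variant with the sequential-access hint, purely as a sketch; error handling is trimmed and the file name is a placeholder:

// Sequential read through WinAPI with FILE_FLAG_SEQUENTIAL_SCAN,
// which tells the Windows cache manager to read ahead aggressively.
#include <windows.h>
#include <vector>

int main() {
    HANDLE h = CreateFileA("sequences.bin", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    std::vector<char> buf(256 * 1024);
    DWORD got = 0;
    while (ReadFile(h, buf.data(), (DWORD)buf.size(), &got, NULL) && got > 0) {
        // process buf[0..got)
    }
    CloseHandle(h);
    return 0;
}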

ALXIMIKS:

I'm pushing this idea for the tenth time: make another file with the start positions of the sequences in the first file; then the sequence structure won't even need to store the number of deals.

I don't understand what the index would give me.

Writing down the number of deals doesn't change anything; sequential reading works fast, the problem is precisely reading a block after jumping around the file.

Please write a suggested algorithm.

C-4:

The problem is non-trivial, but there is not a single line of code yet. Andrey, a lot of people here are interested: formulate the problem and provide test data. Let's organize a bit of competitive programming.

We need test data + pseudocode, with general principles of working with data.

The task is formulated from start to finish.

And the test data can be generated with a slightly modified script, which I posted earlier.

I'll do that a bit later.

joo:
Why go through the variants from the base at all? Wouldn't it be better to generate the trades on history according to the criterion? Wouldn't that be the same as what is required?

If for tests, of course, yes, but if for solving the problem, of course, no )


Candid:

Well, we're talking about checking one single number (the sequence number); a million isn't much for such an elementary operation. If we compute the criterion recursively, it's only one pass over the file, and we'll have to make that pass anyway. The other thing to consider is that the recursion data map will contain the same several million elements (multiplied by the number of parameters per sequence).

Another alternative is to fully recalculate and store the criteria before sorting. Again, the possibility of using recursion is critical.

It is clear that there will be a lot of operations in any case, but are there fewer in the original version?

In the initial variant, the criterion can be calculated once, when the last deal from the passed history is found.

And you would need to read a fragment of the file large enough to contain X deals of every sequence. It will be much bigger than X * number of sequences, because the sequences contain different numbers of deals and they may not be evenly distributed.


2 all:

Anyway, thanks for the support.

If you don't mind, please run the test script and share the results.

 
komposter:

And the test data can be generated with the slightly modified script I posted earlier.

I am attaching the updated script; now it does not just record identical trades, but generates random sequences with the specified settings (lifetime from/to, profit from/to).

To get a file comparable to mine, run the script with default settings:

2014.08.28 04:12:36.044 sTest_ReadWriteBIN EURUSD,M15: 140,000 sequences written in 57.1 sec
2014.08.28 04:13:20.688 sTest_ReadWriteBIN EURUSD,M15: 6632 Mb loaded in 44.0 sec (150.7 MB/sec)

After that, you can work out the algorithm on it.

For the laziest, there is an archive of the file on Google Drive.


komposter:

1. In the original variant, the criterion can be calculated once, when the last deal from the passed history is found.

In your variant, you need to read a piece of the file large enough to contain X deals of every sequence. It will be much larger than X * number of sequences, as the sequences contain different numbers of deals and they may not be evenly distributed.

1. This is not quite clear: if we are looking for the maximum (or minimum) of the criterion, we still have to compute the criterion for all candidates in the end. It doesn't even matter whether a local or a global extremum is being sought.

2. Clearly more, but the question is whether the amount of memory required is acceptable. Moreover, a bigger block is even better in terms of disk read speed. It is certainly not a problem for RAM yet. What is of fundamental importance is that the reading is sequential and takes place only once.

The variant of calculating the criteria before sorting, however, requires reading the file twice. But it has its own advantages, especially considering the possibility of splitting the data into several smaller files and processing them jointly afterwards.

Without implementation, however, all arguments would remain abstract. So at this point in the discussion I will just have to take my leave :)

 
Urain:

What about stitching the files together, with an index of which byte is the start and the end of which file?

For example, pack the files into 1000 metafiles, and if the criteria are known, order them statistically so that similar files end up in the same metafile.

Better yet, experiment with FileOpen: what works faster, opening a large file or a small one? In the sense of opening a large file (equal in size to 1000 small ones) 1000 times, versus opening a small file 1,000,000 times.

It may turn out that the files should not be merged, but split.
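If I read the idea correctly, the packing could look roughly like this; the file names, the 1000-files-per-metafile grouping and the index layout are all placeholders:

// Concatenate many small files into one metafile and write an index
// of (offset, length) so each original file can be located by number.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

#pragma pack(push, 1)
struct IndexEntry {
    uint64_t offset;   // where the file's data starts inside the metafile
    uint64_t length;   // its size in bytes
};
#pragma pack(pop)

int main() {
    std::FILE* meta = std::fopen("sequences.meta", "wb");     // hypothetical names
    std::FILE* idx  = std::fopen("sequences.metaidx", "wb");
    if (!meta || !idx) return 1;

    std::vector<IndexEntry> index;
    uint64_t pos = 0;
    char buf[1 << 16];

    for (int i = 0; i < 1000; ++i) {                          // e.g. 1000 small files
        std::string name = "seq_" + std::to_string(i) + ".bin";
        std::FILE* in = std::fopen(name.c_str(), "rb");
        if (!in) continue;
        uint64_t len = 0;
        std::size_t got;
        while ((got = std::fread(buf, 1, sizeof(buf), in)) > 0) {
            std::fwrite(buf, 1, got, meta);
            len += got;
        }
        std::fclose(in);
        index.push_back({pos, len});
        pos += len;
    }
    std::fwrite(index.data(), sizeof(IndexEntry), index.size(), idx);
    std::fclose(idx);
    std::fclose(meta);
    return 0;
}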

 
Urain:

What about stitching the files together, with an index of which byte is the start and the end of which file?

For example, pack the files into 1000 metafiles, and if the criteria are known, order them statistically so that similar files end up in the same metafile.

Better yet, experiment with FileOpen: what works faster, opening a large file or a small one? In the sense of opening a large file (equal in size to 1000 small ones) 1000 times, versus opening a small file 1,000,000 times.

It may turn out that the files should not be merged, but split.

I'm currently working with one big file, but I wanted to switch to a million small ones.

Firstly, they can be appended to, and secondly, they can be accessed by name (no need to store the "start of each sequence").

I asked the service desk about opening a million different files (that's how my cache is implemented). The answer was clear and straightforward.

I will test the API functions both on one big file and on a million small files, and I will publish the results.
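A trivial harness for such a test might look like this; the file names and counts are placeholders, only the timing pattern matters:

// Time N open/read/close cycles to compare one big file (reopened and read a
// block each time) against many small files (each opened and read once).
#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

static double seconds_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    std::vector<char> buf(64 * 1024);

    // Variant A: reopen one big file many times and read a block each time.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000; ++i) {
        std::FILE* f = std::fopen("big.bin", "rb");             // hypothetical
        if (!f) break;
        std::fread(buf.data(), 1, buf.size(), f);
        std::fclose(f);
    }
    std::printf("big file: %.3f s\n", seconds_since(t0));

    // Variant B: open each of a million small files once.
    t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000000; ++i) {
        std::string name = "small_" + std::to_string(i) + ".bin"; // hypothetical
        std::FILE* f = std::fopen(name.c_str(), "rb");
        if (!f) continue;
        std::fread(buf.data(), 1, buf.size(), f);
        std::fclose(f);
    }
    std::printf("small files: %.3f s\n", seconds_since(t0));
    return 0;
}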

Candid:

Without implementation, however, all arguments would remain abstract. So at this point in the discussion I will just have to take my leave :)

I should have started with it ;)

But in any case, thanks for your participation, ideas without code are valuable too.

 

With a million files you will ruin NTFS so badly that you will weep over this insane brainchild of Microsoft, which has locked everyone forever into its insanely slow implementation of a file system.

Everything MS has done in its file systems is monstrously heavy, sluggish and stupid. These idiots have made no effort towards optimization or speed, and the API is simply primitive. In 2014 this can be stated unequivocally. So weep.

 

Anyone who wants to handle a bunch of files in Windows efficiently goes to chunking first: grouping the data into their own structures inside a single file.

Once you start storing more than a thousand (let alone tens or hundreds of thousands) of disparate files in Windows, you're screwed.

 

File operations in MQL4/5 go through our very efficient caching mechanisms, which allow us to achieve high read and write speeds.

You can stream data byte by byte: all of it will be quickly collected in an internal buffer and written to disk only in large chunks. The number of WinAPI calls is minimal, which gives high-speed operation. Reading works similarly: it reads ahead in large chunks, calls WinAPI very rarely, and returns data from its cache very quickly with minimal overhead.
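Purely as an illustration of that kind of chunked buffering (not the actual MQL implementation), the idea is roughly:

// Accumulate small writes in a RAM buffer and flush to disk only in large
// chunks, so the number of OS calls stays minimal.
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

class BufferedWriter {
    std::FILE* f_;
    std::vector<char> buf_;
    std::size_t used_ = 0;
public:
    BufferedWriter(std::FILE* f, std::size_t chunk = 1 << 20) : f_(f), buf_(chunk) {}
    ~BufferedWriter() { flush(); }
    void write(const void* data, std::size_t n) {
        const char* p = static_cast<const char*>(data);
        while (n > 0) {
            std::size_t space = buf_.size() - used_;
            std::size_t take = n < space ? n : space;
            std::memcpy(buf_.data() + used_, p, take);
            used_ += take; p += take; n -= take;
            if (used_ == buf_.size()) flush();    // one big write per full buffer
        }
    }
    void flush() {
        if (used_) { std::fwrite(buf_.data(), 1, used_, f_); used_ = 0; }
    }
};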

Roughly speaking, compared with build 509 we have increased the speed of file operations in MQL4/5 by an order of magnitude (if we mean regular chunk-based operation rather than "writing in megabyte-sized blocks to maximize the speed measurements").