How to find the best deals in a 20 gb database - General

Mykola Demko 2014.08.19 10:04 #111

komposter:

Does the whole database fit into 10 lakh lines or not? all files together

Mykola Demko 2014.08.19 10:05 #112

Candid:

But the sequences are independent of each other, aren't they? Then why can't we loop through the dates on a single-loaded sequence at once? Here, by the way, it may be possible to go to some efficient recurrence algorithm, but it depends on your luck. The size of a million on a million will remain, and the file will be read once.

Of course, a problem where the number of steps remains the same at the next iteration (i.e. the search area does not get narrower as computation proceeds) does not look very robust. But this is subjective, of course.

Recursion on such dimensions will fall when cache limit is exceeded.

Sergey Dzyublik 2014.08.19 10:06 #113

There are many single-type sequences of transactions, each sequence is sorted by time.

Сделки в разных последовательностях разные, по времени распределены неравномерно (и в каждой последовательности по своему). Сделок разное количество. Но все - в интервале от Даты1 до Даты2.

The task is to move from D1 to D2 with M minutes step (or better - exactly by points of making trades of all sequences), find the sequence that is better than others by criterion K (a separate task - not just find the best sequence, but sort the whole set by criterion and output the top 10 - but it is an optional, not necessary yet).

Criterion K is calculated based on X previous trades of the corresponding sequence, while calculating almost all information about each of X trades (only profit, for example, is not enough).

This is where we should have started.

Have I understood correctly the following:

1) A 20 gb file consists of about one million sequences sorted by time

2) The size of each sequence may vary and depends on the number of deals the sequence contains

3) The average size of a sequence is 20/10^6 = 20 Mb, so what can we guarantee to download completely one sequence?

4) The K coefficient depends only on the trades within the given sequence

5) For each sequence we have to find K (in total 10^6 pieces) and select the top 10

A If we create another file with values of distances between sequences.

1) See how many sequences we can download and sum up the distance between them (keeping the intermediate values of the sums)

2) We read the distance from file into RAM

3) Run for each sequence of the search algorithm to find K (we know where the beginning of sequences, because we keep the subtums calculated in step 1)

4) Again start point 1 moving a little ahead

As for the top 10:

n is the total values of K, m is the number of best.

1) you can find all K and then, with the help of the heap data structure, choose the needed number of best values (Get heap O(n), choose the best O(log n) number of times m, space in memory - n)

2) count the required number - m instances (for example, 10), sort them and use a binary search to find the insertion point for the remaining K instances.

(initial sorting O(m*log m), search for insertion O(log m) number of times (n-m), insertion O(1), memory space occupied - m).

Real and Generated Ticks Optimization Types - Algorithmic How to Purchase an

Nikolay Likhovid 2014.08.19 10:20 #114

Urain:
Recursion on these dimensions will drop when the cache limit is exceeded.

In classical recursion the cache size is fixed.

Nikolay Likhovid 2014.08.19 10:23 #115

ALXIMIKS:

3) The average sequence size is 20/10^6 = 20 MB, what will fully load one sequence we can guarantee?

By the way, yes, you can load batches of sequences at once.

Sergey Dzyublik 2014.08.19 10:34 #116

I feel like I can't get the gist of what's needed and what's given (((

А потом "нужная дата" сдвигается на точку закрытия сделки из выбранной последовательности и алгоритм повторяется.

and yes20/10^6 = 20kb because 1gb = 1000mb = 10^6kb

Mykola Demko 2014.08.19 10:34 #117

YuraZ:

Going towards SQL

Relatively new to me (haven't worked closely, just basic queries);

This can be quite slow at the learning stage.

If you choose this option, it is better to build all business logic with stored procedures

leaving the Expert Advisor only two functions, send a request to the server and get a completely finished result

all calculations on the server

The complexity of installation on a single client machine (stand-alone version);

In fact, on the web you can find many descriptions of how to put the SQL server

( ORACL, SYBASE + CENTOS for example ) ORACL, SYBASE, MS SQL + WINDOWS stand alone machine

ORACL is a bit more complicated to learn - less experts, less literature.

MS SQL - perhaps the biggest amount of information and more literature on the web.

there will be no difficulties - there are many descriptions on the web and more books in the shop.

MSSQL 2012 by its parameters is very close to ORACL - already in 2014.

SQL + LINUX is usually chosen for production purposes and if you don't know anything about LINUX it's better to use WINDOWS.

MSSQL Expres balloon but restrictions use 1 core, 1Gb of memory, and 10Gb base

others are paid.

Andrew Petras 2014.08.19 10:43 #118

komposter:
...

There are many similar sequences of deals, each sequence is sorted by time.

Transactions in different sequences are different, distributed unevenly in time (and in a different way in each sequence). The number of deals is different. But all of them are in the interval from Date1 to Date2.

The task - moving from D1 to D2 with M minutes step (or better - specifically by points of making deals of all sequences), find a sequence, that is better than others by criterion K (a separate task - not just find the best sequence, but sort the whole set by criterion and output the top 10 - but it is an optional, not required yet).

...

I don't understand where.

There is a criterion - all - in the interval from Date1 to Date2.

komposter:

Everything is like this.

And then the "right date" is shifted to the deal closing point of the selected sequence and the algorithm repeats.

And so on a million times =)

I.e., the next one is read.

Why not split the file into many interval ones from Date1 to Date2? There will be spent sequences that can be closed, right?

Fibonacci Time Zones - Fibonacci Time Zones - Fibonacci Time Zones -

Nikolay Likhovid 2014.08.19 10:47 #119

Silent:

I don't understand where.

Here's the criterion - everything is between Date1 and Date2.

I.e., it reads as follows.

Why not divide the file into many intervals from Date1 to Date2? There will be used-up sequences that can be closed, right?

Apparently one of the results of a single date pass is a new date.

Sergey Dzyublik 2014.08.19 10:59 #120

if the problem is this:

given a row 1 2 3 4 5 6 7 8 9

given a width of e.g. 4, you have to go with this width through the whole row to find some value inside the width (e.g. a minimum):

first you need to find in 1 2 3 4 then 2 3 4 5 then 3 4 5 6 then 4 5 6 7 then .... etc.

then the task is solved by maintaining the minimum in the data structure of the queue:

1) implementation of this was suggested in the mails.ru video course through four stack data structures

2) I have verbally invented an implementation through queue data structure and dec data structure, most likely someone has already done it once and it is named after him, I just need to find it.

Fibonacci Tools - Objects Fibonacci Tools - Objects Real and Generated Ticks

Need help! Can't solve the problem, I'm hitting hardware limitations - page 12