Machine learning in trading: theory, models, practice and algo-trading - page 2808

 
Maxim Dmitrievsky #:

It's all there. Speed will suffer catastrophically: dataframes are the slowest beasts, with big overhead.

It's not about video cards, it's about understanding that nobody in their right mind computes such things through dataframes.

What is meant by "dataframes"? Explain it to someone who does not know this language.

 
mytarmailS #:

Tip: Is it necessary to use vectors of 100,000 observations to see the correlation between them?

I am looking for highly correlated vectors, i.e. with correlation greater than 0.9.

I don't know whether it is necessary or not; one has to experiment. The sample is not stationary: for half of the sample there was no correlation, and then, bang, it appeared.

Besides, I have already tried all the coefficient values in steps of 0.1.
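One quick way to see that kind of instability is to compute the correlation on each half of the sample separately. A minimal sketch on synthetic vectors (the data and the 50/50 split are made up purely for illustration):

#  synthetic example: the correlation appears only in the second half of the sample
set.seed(1)
n <- 100000
x <- rnorm(n)
y <- c(rnorm(n / 2),                                   #  first half: independent noise
       0.95 * x[(n / 2 + 1):n] + 0.3 * rnorm(n / 2))   #  second half: strongly tied to x

half <- seq_len(n / 2)
cor(x[half],  y[half])    #  close to 0
cor(x[-half], y[-half])   #  close to 0.95
cor(x, y)                 #  the full-sample figure hides the switch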

mytarmailS #:
You're welcome.

Is that a cry from the heart?

 
Vladimir Perervenko #:

It depends on the hardware and the sample size. If the processor is multi-core, parallelise the execution. Below is a variant of parallel execution.

It is 4 times faster than serial execution on my hardware and software.

Good luck

So parallelism will not increase RAM consumption?

Although mytarmailS's code is more RAM-hungry, it is 50 times faster. Maybe there are some limitations in the libraries you use - the script ran for more than 30 hours and did not create a single file.

Thanks for the code examples, though they are rather complicated for me - in R I am just a consumer, and I can't figure out what to change in the main script.
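For what it is worth, a minimal sketch of one way to parallelise the pairwise-correlation pass with the base parallel package (this is not the code from Vladimir's post, which is not reproduced here; the data size, core count and 0.9 threshold are illustrative). On the RAM question: forked workers on Linux share memory copy-on-write, while a PSOCK cluster, as below, copies the data to every worker, so memory use does grow with the number of processes.

library(parallel)

#  illustrative data; in the real task the matrix is far larger
set.seed(1)
m <- matrix(rnorm(10000 * 200), ncol = 200)

n_cores    <- max(1L, detectCores() - 1L)
chunk_size <- ceiling(ncol(m) / n_cores)
chunks     <- split(seq_len(ncol(m)), ceiling(seq_len(ncol(m)) / chunk_size))

cl <- makeCluster(n_cores)      #  PSOCK cluster: portable, but m is copied to each worker
clusterExport(cl, "m")

#  each worker correlates its block of columns against all columns
cor_blocks <- parLapply(cl, chunks, function(idx) cor(m[, idx, drop = FALSE], m))
stopCluster(cl)

cor_mat <- do.call(rbind, cor_blocks)                                        #  full correlation matrix
hits    <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)    #  column pairs above 0.9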

 
mytarmailS #:
Do you mean that for each data type there should be a method to calculate corr?

matrix is a data type built into R; it should have something like its own vectorised matrix.corr().

 
Aleksey Vyazmikin #:

What is meant by "dataframes"? Explain it to someone who does not know this language.

That was more a message to those who write in R :) Dataframes are tables for convenient display of data and for typical manipulations on it, such as extracting subsamples (as in SQL).

They are not designed to be hammered in loops over data as large as yours; that will be 20-100 times slower than arrays. As for memory, you have already seen that for yourself.
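For a rough feel of that overhead, a minimal sketch on synthetic data (size and column count are arbitrary), timing element-by-element access through a matrix versus a data.frame:

set.seed(1)
m  <- matrix(rnorm(100000 * 10), ncol = 10)
df <- as.data.frame(m)

#  sum one column element by element, first through the matrix, then through the data.frame
system.time({ s <- 0; for (i in seq_len(nrow(m)))  s <- s + m[i, 1]  })
system.time({ s <- 0; for (i in seq_len(nrow(df))) s <- s + df[i, 1] })

On most setups the second loop is slower by one to two orders of magnitude, which is exactly the overhead being discussed.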

I think it's fine here:

#  to read how the function works, with examples:  ?caret::findCorrelation
#  find the columns that are not correlated, using a correlation cutoff of 0.9 ("cutoff = 0.9")
not_corr_colums <- caret::findCorrelation(as.matrix(df), cutoff = 0.9, exact = F, names = F)

I don't know how fast the built-in "matrix" type is, but this goes through caret, which can also slow things down. The built-in type has no vectorised operation for calculating correlation, or something like that.
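For comparison, the built-in route without a data frame: base R's cor() works directly on a numeric matrix and returns the full pairwise correlation matrix, and caret::findCorrelation() is documented to take that correlation matrix (not the raw data) as its first argument. A minimal sketch, with df standing for the same numeric data as in the snippet above:

m  <- as.matrix(df)        #  numeric columns only
cm <- cor(m)               #  built-in, vectorised correlation of every column pair

#  indices of columns suggested for removal because they correlate above 0.9 with another column
drop_cols <- caret::findCorrelation(cm, cutoff = 0.9, exact = FALSE, names = FALSE)
reduced   <- if (length(drop_cols) > 0) m[, -drop_cols, drop = FALSE] else m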

 
Where do these thoughts come from?
 
mytarmailS #:
Where do these thoughts come from?

Why slow down a built-in type with some third-party lib, when it should have its own corr calculation that is as fast as possible for it?

 
Maxim Dmitrievsky #:

Why slow down a built-in type that should have its own corr calculation that is as fast as possible for it?

Doesn't the lib take the data type into account? A data type is, in essence, data laid out for the cheapest possible calculations. A matrix, of all things, should be designed for calculations.

 
mytarmailS #:
How to get smarter in the future without getting stupider in the past? Algorithmically... without creating terabytes of knowledge.

You don't.

 
Valeriy Yastremskiy #:

Doesn't the lib take the data type into account? A data type is, in essence, data laid out for the cheapest possible calculations. A matrix, of all things, should be designed for calculations.

I have not found an analogue of numpy for R; matrices there are not that fast, and R itself consumes a lot of memory because of its paradigm.

Of course, a third-party lib can be slow, who would check it?

I don't know what to compare it with, and I don't want to load a gigabyte dataset just to compare the speed.
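If the point is only to compare speeds, a synthetic matrix is enough, no gigabyte dataset needed. A minimal sketch (sizes are arbitrary) timing the built-in vectorised cor() against a naive pairwise loop over the same columns:

set.seed(42)
n_rows <- 100000
n_cols <- 100
m <- matrix(rnorm(n_rows * n_cols), ncol = n_cols)   #  ~80 MB of doubles

#  vectorised: one call for all column pairs
t_vec <- system.time(cm <- cor(m))

#  naive R loop over all pairs of columns, for comparison
t_loop <- system.time({
  cm2 <- diag(1, n_cols)
  for (i in 1:(n_cols - 1)) {
    for (j in (i + 1):n_cols) {
      cm2[i, j] <- cm2[j, i] <- cor(m[, i], m[, j])
    }
  }
})

t_vec["elapsed"]
t_loop["elapsed"]
all.equal(cm, cm2)   #  same result either way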