Machine learning in trading: theory, models, practice and algo-trading - page 194

 
Well. I started the week, before the market opened, by actively testing version 14. I want to note the following. The longer I train, the more inputs get involved in the TS, i.e. more predictors. I had up to 8-9 inputs at maximum. However, in that case the generalization ability is usually not high, and such TSs work only with a stretch; in other words, they barely reach the mark of 3. Models with 4-6 inputs, on the other hand, work satisfactorily. I increased the number of entries from 50 to 150, and it has been training for a third hour now. Still, I think that this time some inputs will also be selected. So let's see...
 
Again I have noticed the following. My data set has 12 predictors, followed by their lags, lag1 and lag2. Previously, the selected inputs came mostly from the beginning of the set, that is, lags were used rarely and then no deeper than lag1, with lag2 appearing only occasionally. Now it is the opposite: the original columns are practically not used at all, while lag1 and, most regrettably, lag2 have started to appear more often. So before, generalization was built mainly on the initial columns, and now mainly on the final ones... Draw your own conclusions...
 
Mihail Marchukajtes:
Again I have noticed the following. My data set has 12 predictors, followed by their lags, lag1 and lag2. Previously, the selected inputs came mostly from the beginning of the set, that is, lags were used rarely and then no deeper than lag1, with lag2 appearing only occasionally. Now it is the opposite: the original columns are practically not used at all, while lag1 and, most regrettably, lag2 have started to appear more often. So before, generalization was built mainly on the initial columns, and now mainly on the final ones... Draw your own conclusions...

So you need to roll back to previous versions.

Everything is going fine on my side. Maybe it's because there are no lags in the sample?

 
Dr.Trader:

It looks good in general, I wonder what will happen in the end.

About the committee - I posted some examples earlier, but some models use regression with rounding for classification, and there it is not so straightforward. I tried two different ways of combining the votes:

1) Round everything to classes first, then take the class with the most votes.
I.e., having a 4-bar forecast from three models
c(0.1, 0.5, 0.4, 0.4) c(0.6, 0.5, 0.7, 0.1) c(0.1, 0.2, 0.5, 0.7) I would further round it up to classes
c(0, 1, 0, 0) c(1,1,1,0) c(0,0,1,1) , and the final vector with predictions would be c(0, 1, 1, 0) by number of votes.

2) Another option is to find the average result first, and only then round it to classes:
the result would be c((0.1+0.6+0.1)/3, (0.5+0.5+0.2)/3, (0.4+0.7+0.5)/3, (0.4+0.1+0.7)/3)
or (0.2666667, 0.4000000, 0.5333333, 0.4000000), or
c(0, 0, 1, 0)

You can see that the result differs depending on at which step you round. I don't know which way is more standard, but I think the second one works better on new data.
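To make the two schemes concrete, here is a minimal sketch in R, assuming a plain 0.5 threshold for turning a regression output into a class (values of 0.5 and above go to class 1, as in the example above):

# the same three model outputs for a 4-bar forecast as above
preds <- list(c(0.1, 0.5, 0.4, 0.4),
              c(0.6, 0.5, 0.7, 0.1),
              c(0.1, 0.2, 0.5, 0.7))
to_class <- function(x) as.integer(x >= 0.5)

# 1) round each model's output to classes, then take a majority vote per bar
votes   <- sapply(preds, to_class)           # 4 x 3 matrix of 0/1 votes
method1 <- as.integer(rowSums(votes) >= 2)   # c(0, 1, 1, 0)

# 2) average the raw outputs first, round only once at the end
avg     <- Reduce("+", preds) / length(preds)
method2 <- to_class(avg)                     # c(0, 0, 1, 0)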

The tsDyn package, the SETAR function.

It turns out that the threshold value (there can be two thresholds, like in RSI) is variable. It gives amazing results.

Also, let's not forget the calibration algorithms in classification. The point is that the class prediction is in reality not a nominal value: the algorithm calculates the probability of the class, which is a real number. Then this probability is split, for example in half, and you get two classes. But if the probabilities are 0.49 and 0.51, are those really two different classes? What about 0.48 and 0.52? Is this a division into classes? This is where SETAR would divide into two classes, between which would lie the Reshetov-style examples "on the fence".
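As an illustration of the idea only (not of SETAR itself, which estimates its thresholds from the data), here is a minimal R sketch with two assumed thresholds of 0.4 and 0.6 that leave the doubtful middle zone "on the fence":

prob <- c(0.05, 0.48, 0.52, 0.49, 0.51, 0.95)  # hypothetical class probabilities

# classify with two thresholds; the middle zone stays NA, i.e. "on the fence"
split_calibrated <- function(p, lo = 0.4, hi = 0.6) {
  cls <- rep(NA_integer_, length(p))
  cls[p <= lo] <- 0L
  cls[p >= hi] <- 1L
  cls
}

split_calibrated(prob)  # 0 NA NA NA NA 1 - only the confident examples get a class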

 
Dr.Trader:

It looks good in general, I wonder what will happen in the end.

About the committee - I posted some examples earlier, but some models use regression with rounding for classification, and there it is not so straightforward. I tried two different ways of combining the votes:

1) Round everything to classes first, then take the class with the most votes.
I.e., having a 4-bar forecast from three models
c(0.1, 0.5, 0.4, 0.4) c(0.6, 0.5, 0.7, 0.1) c(0.1, 0.2, 0.5, 0.7) I would further round it up to classes
c(0, 1, 0, 0) c(1,1,1,0) c(0,0,1,1) , and the final vector with predictions would be c(0, 1, 1, 0) by number of votes.

2) Another option is to find the average result first, and only then round it to classes:
the result would be c((0.1+0.6+0.1)/3, (0.5+0.5+0.2)/3, (0.4+0.7+0.5)/3, (0.4+0.1+0.7)/3)
or (0.2666667, 0.4000000, 0.5333333, 0.4000000), or
c(0, 0, 1, 0)

You can see that the result differs depending on at which step you round. I don't know which way is more standard, but I think the second one works better on new data.
This is the GBPUSD pair, which means the model is yet to be tested by Brexit. I haven't even processed last year's data yet... It could turn out to be a loss...

Depending on the result of the final test I will set the tone of the article. It is always a bit of a surprise to see a model that works, and the norm to see one that loses.

I will assemble the committee as follows (a rough sketch in R follows after these steps):

On the training data I build n vectors of numeric predictions, one per model (regression of the price increment).

I average the responses over the selected models.

I compute the 0.05 and 0.95 quantiles.

At validation I repeat steps 1 and 2.

I select only those examples where the average is outside the quantiles.

I multiply the response by the prediction sign and subtract the spread.

On the obtained vector I build m subsamples with random inclusion at the rate of 1-4 deals per day depending on the forecast horizon.

The committee has already shown a threefold increase in the expected payoff (MO) compared to single models, because the models are diverse...
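A rough sketch of these steps in R, on purely synthetic data and with assumed names (train_pred and valid_pred as matrices of model forecasts with one column per model, valid_ret as the realised price increments, spread as the cost per deal - none of these names or numbers come from the original post):

# synthetic data for illustration only: 1000 examples, 5 models
set.seed(1)
n_obs    <- 1000
n_models <- 5
train_pred <- matrix(rnorm(n_obs * n_models, sd = 0.001), ncol = n_models)
valid_pred <- matrix(rnorm(n_obs * n_models, sd = 0.001), ncol = n_models)
valid_ret  <- rnorm(n_obs, sd = 0.001)   # realised increments on validation
spread     <- 0.0002

# 1-2) average the responses of the selected models on the training data
train_avg <- rowMeans(train_pred)
# 3) the 0.05 and 0.95 quantiles of the averaged training response
q <- quantile(train_avg, probs = c(0.05, 0.95))
# 4-5) on validation, average again and keep only examples outside the quantiles
valid_avg <- rowMeans(valid_pred)
keep <- valid_avg < q[1] | valid_avg > q[2]
# 6) trade in the direction of the forecast sign and pay the spread
trade_result <- sign(valid_avg[keep]) * valid_ret[keep] - spread
mean(trade_result)  # expected payoff (MO) per deal of the committee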

 
How to extract certain groups of data by a condition
  • ru.stackoverflow.com
We need to find rows that repeat at least 10 times in the whole sample, and in each of the found groups of identical rows the number of "1" in target.label must exceed 70% relative to the number of "0"; here are the found identical rows, with more ones than zeros...
 

I'll answer it here, then.

# a couple of rows from that table - I won't copy the whole thing as text; the first row is then repeated two more times
dat <- data.frame(cluster1=c(24,2,13,23,6), cluster2=c(5,15,13,28,12), cluster3=c(18,12,16,22,20), cluster4=c(21,7,29,10,25), cluster5=c(16,22,24,4,11), target.label=c(1,1,0,1,0))
dat <- rbind(dat, dat[1,], dat[1,])
# the target of the last row is changed to 0 for the experiment
dat[7,"target.label"]=0

library(sqldf)
# for sqldf there must be no dots in column names
colnames(dat)[6] <- "target"

# group identical rows, counting how many times each occurs and the share of "1" in target
dat1 <- sqldf( "select cluster1, cluster2, cluster3, cluster4, cluster5, avg(target) as target_avg, count(target) as target_count from dat group by cluster1, cluster2, cluster3, cluster4, cluster5" )
dat1
dat1[ dat1$target_count>=10 & dat1$target_avg>0.63 , ]
dat1[ dat1$target_count>=10 & ( dat1$target_avg<0.37 | dat1$target_avg>0.63 ), ] # in case either "0" or "1" occurs more often than 70% of the time
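On this toy data the group-by collapses the three identical rows into a single row with target_count = 3 and target_avg of about 0.67, so the target_count >= 10 filter only starts to make sense on the full data set from the question.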
 
SanSanych Fomenko:

The tsDyn package, the SETAR function.

Does SETAR refer specifically to committee calibration, or is that a separate topic about building financial models?

I flipped through the package's manual and didn't see what I need... Here is the situation: I have a training table with 10000 examples and 100 models that were trained on these examples. To test the models I can use them to predict on the same input data, and I get 100 vectors, each with 10000 predictions. Can SETAR be used to somehow combine these 100 vectors into one?
And then, for a forecast on new data, there would again be 100 forecasts that I would have to merge into one (not 100 vectors this time, just 100 single forecasts). Can SETAR do that too, using the committee parameters obtained from the training data?

 
Dr.Trader:

Does SETAR refer specifically to committee calibration, or is that a separate topic about building financial models?

I flipped through the package's manual and didn't see what I need... Here is the situation: I have a training table with 10000 examples and 100 models that were trained on these examples. To test the models I can use them to predict on the same input data, and I get 100 vectors, each with 10000 predictions. Can SETAR be used to somehow combine these 100 vectors into one?
And then, for a forecast on new data, there would again be 100 forecasts that I would have to merge into one (not 100 vectors this time, just 100 single forecasts). Can SETAR do that too, using the committee parameters obtained from the training data?

As I understand it, it has nothing to do with committees.
 
Yury Reshetov:

So you need to roll back to previous versions.

Everything is going fine on my side. Maybe it's because there are no lags in the sample?

Well, yes, I added the lags because in previous versions they increased the generalization ability; now, with the improved input preselection algorithm, they are not required, so I'm trying to train without them. Let's see how it goes. I will write about today's result later...