Dependency statistics in quotes (information theory, correlation and other feature selection methods)

 

Thank you! Downloaded it, took a look.

So, I'll discretise the series by quantiles, so that the resulting distribution is uniform. I'll measure the mutual information for 500 lags and post a graph.
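A minimal sketch of such a quantile discretisation, assuming NumPy (the function name is mine, not from the thread):

```python
import numpy as np

def discretize_by_quantiles(x, n_bins=5):
    """Assign each value the index (0..n_bins-1) of its quantile bin.

    Bin edges sit at the empirical quantiles, so every bin receives
    roughly the same number of observations and the discretised
    distribution is approximately uniform."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(x, edges)
```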

 

And for starters:

The shape of the probability density function of your raw data:

It corresponds to a normal distribution.

Next, an autocorrelogram of the original series of your values up to lag 50:

You can see that, overall, the correlations are not significant, although at a few lags some correlation does slip through.

Finally, I squared the values of your series and plotted an autocorrelogram to look solely at the dynamics of "volatility":

Note that volatility depends on its recent past values. This is all very similar to daily stock-index quotes and somewhat similar to daily EURUSD quotes (I will post the calculations for them later).
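For reference, both autocorrelograms can be reproduced with a plain sample-ACF estimate; a sketch with illustrative names:

```python
import numpy as np

def autocorr(x, max_lag=50):
    """Sample autocorrelation of x for lags 1..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom
                     for k in range(1, max_lag + 1)])

# acf_raw = autocorr(returns)       # autocorrelogram of the raw series
# acf_vol = autocorr(returns ** 2)  # squared series: volatility memory
```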

We await the results of the I(X,Y) calculation.

 
alexeymosc: We await the results of the I(X,Y) calculation.

Great, we are waiting, Alexey.

After your results for I(X,Y) I can load the data into my chi-square calculation script. I don't believe that something useful will come out (this is my a priori assumption).

 

I apologise for the delay; the internet was down.

I'll start with the methodological part. I discretised the series into 5 values (quantiles). Why? When you compute the cross-frequencies for the target and a dependent variable, you get 25 cells; dividing the 10,000 observations by 25 gives about 400 per cell, which is a statistically sufficient sample. One could use anywhere from 3 to 7 bins; I took the middle ground.
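Spelling out that cell-count arithmetic, with the values taken straight from the paragraph above:

```python
n_bins = 5             # size of the quantile alphabet
cells = n_bins ** 2    # 5 x 5 = 25 cross-frequency cells
print(10_000 / cells)  # -> 400.0 observations per cell on average
```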

This is how the average information of the receiver (the target variable) is calculated:
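The screenshot with the formula is not reproduced here; presumably it is the standard Shannon entropy, H(Y) = -Σ p(y) log2 p(y). A minimal sketch of that calculation (the function name is mine):

```python
import numpy as np

def entropy_bits(labels):
    """Shannon entropy H(Y) = -sum(p * log2(p)) of a discrete sample, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# With 5 equal-frequency quantile bins, p is about 1/5 per bin,
# so H(Y) is close to log2(5), roughly 2.32 bits.
```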


Note that for any lag, calculating the average information gives the same value (unless, of course, the independent variables are discretised over an alphabet of a different length).

This is the calculation of cross-entropy for the target and dependent variables:
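Assuming the usual identity I(X,Y) = H(X) + H(Y) - H(X,Y) behind that calculation, here is a sketch that estimates I(X,Y) directly from the 5 x 5 joint frequencies (names are illustrative):

```python
import numpy as np

def mutual_information_bits(y, x, n_bins=5):
    """Estimate I(X,Y) in bits from discretised samples y and x (values 0..n_bins-1)."""
    joint = np.zeros((n_bins, n_bins))
    np.add.at(joint, (y, x), 1)        # 5 x 5 cross-frequency table
    pxy = joint / joint.sum()          # joint probabilities
    py = pxy.sum(axis=1)               # marginal of the target
    px = pxy.sum(axis=0)               # marginal of the lagged variable
    nz = pxy > 0
    indep = py[:, None] * px[None, :]  # product of the marginals
    return np.sum(pxy[nz] * np.log2(pxy[nz] / indep[nz]))

# MI profile over 500 lags of a discretised series d:
# mi = [mutual_information_bits(d[lag:], d[:-lag]) for lag in range(1, 501)]
```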

Histogram of mutual information values over the original time series:

Only the first few lags stand out from the overall picture; it is hard to say anything about the rest.

I also did the following. Since the data are normal, I generated 10,000 random numbers with the same mean and standard deviation in Excel and computed the mutual information for 500 lags. This is what came out:


You can see by eye that the first lags are no longer particularly informative.
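For reference, the same surrogate can be generated outside Excel; a NumPy sketch under the same assumptions (`data` is a placeholder for the original series):

```python
import numpy as np

rng = np.random.default_rng(0)
# i.i.d. normal surrogate matching the data's first two moments,
# with no dependence structure at all
surrogate = rng.normal(loc=data.mean(), scale=data.std(), size=10_000)
# then discretise and compute MI over 500 lags exactly as for the data
```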

Next, the remaining statistics for the two resulting samples of mutual-information values need to be computed and compared. So:

The sum of mutual information over 500 lags for the original series: 0.62. For the random series: 0.62. This means the sample means are equal as well. That is the first check mark in favour of the assumption that the original series does not differ much from the random one (even taking the volatility dependence into account).

Let's carry out nonparametric tests to check the hypothesis that the differences between the two experimental samples are insignificant.

Kolmogorov-Smirnov test (for samples without regard to the order of the variables, with a priori unknown probability density functions): p > 0.1 at the 0.05 significance level. We cannot reject the hypothesis that the two samples come from the same distribution, i.e. the differences are insignificant. Place the second check mark.
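Both nonparametric tests used in this thread are available in SciPy; a sketch, assuming `mi_orig` and `mi_rand` hold the two samples of 500 mutual-information values:

```python
from scipy import stats

ks_stat, ks_p = stats.ks_2samp(mi_orig, mi_rand)
mw_stat, mw_p = stats.mannwhitneyu(mi_orig, mi_rand, alternative="two-sided")
# Large p-values mean we cannot reject the hypothesis that both
# samples come from the same distribution.
```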

As a result we have: the original series differs insignificantly from the random series, as shown by the mutual information statistics.

In this case, the volatility dependence did not strongly affect the appearance of the histogram. However, it must be remembered that I did the sampling differently for the DJI.

 
Mathemat:

Great, we are waiting, Alexey.

After your results for I(X,Y) I can load the data into my chi-square calculation script. I don't believe that something useful will come out (this is my a priori assumption).

I, too, am a priori keeping quiet about the Bayesian likelihood...

Look at it from different angles.

:)

Noise, just as it looked from the start.

And your research, Alexei, is wiser.

But Poisson is my friend.

 
The Mann-Whitney test gave a p-value of 0.46. Again we cannot reject the hypothesis that the samples are the same: the differences between them are insignificant.
 
Guys, I will now analyse the EURUSD dailies in a similar vein. Let's see!
 

Thank you, Dougherty!

You are the right one!

Pleased to meet you.

 
alexeymosc:
Guys, I will now analyse the EURUSD dailies in a similar vein. Let's see!

Try the hourlies instead. There is little mutual information in the daily chart.

P.S. The preliminary summary is as follows: GARCH(1,1) showed some volatility clustering, similar to, eh... heteroscedasticity, but, as expected, it does not provide any information. Maybe we should increase the orders of the model, i.e. its arguments?
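Raising the orders would mean raising p and q. A minimal sketch with the `arch` package (not the poster's own script; `returns` is a placeholder for the return series):

```python
from arch import arch_model

# returns are scaled to percent, which the arch package recommends numerically
am = arch_model(returns * 100, vol="Garch", p=1, q=1)  # try larger p, q here
res = am.fit(disp="off")
print(res.summary())
```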

 

Data from the A-ri server, EURUSD D1. I took the increments between neighbouring Close prices and discretised them into 5 quantiles.

Let's see what the mutual information calculation yielded:

We can see that the nearest 100-200 lags carry more information than the others.

Now let's shuffle the increments randomly to get a random series, and calculate the MI:

Wow. No information is visible at the nearest lags anymore.
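A random permutation keeps the marginal distribution of the increments exactly while destroying their temporal order; a one-line sketch (`increments` is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)
shuffled = rng.permutation(increments)  # same values, random order
```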

Let us compare the results visually:

At the near lags the original (blue) series clearly dominates.

I took a moving average with a window of 22 (one trading month) over the I values for the original and random series:

Clearly, at the near lags, up to about 200 bars, the original (blue) series does have an information memory different from the random one (let's leave the discussion of the nature of this information for dessert).
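The smoothing step is an ordinary rolling mean over the MI-by-lag curves; a pandas sketch with the same assumed sample names as above:

```python
import pandas as pd

window = 22  # one trading month
smooth_orig = pd.Series(mi_orig).rolling(window).mean()
smooth_rand = pd.Series(mi_rand).rolling(window).mean()
```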

What do the non-parametric tests say?

Kolmogorov-Smirnov test:

p < 0.001

Mann-Whitney test:

p = 0.0000.

We reject the hypothesis that the differences between the samples are insignificant. In other words, the EURUSD D1 return series is very different from random data with a similar mean and spread.

Ugh. I'm going to have a smoke break.