Dependency statistics in quotes (information theory, correlation and other feature selection methods)

 
Candid:

You don't need to comment; you need to try to answer my questions. I'll tell you a secret: they're designed so that, by trying to answer them, you understand something.

I read the discussion, by the way. Do you seriously want to discuss a 17-page mixture of flies and cutlets?

Am I even correct in guessing what you call the two processes?

I don't know where on page 17 you saw a mixture of cutlets and flies. It came up earlier...

As for understanding, I recommend looking at Alexey's table and answering: under the assumption of which theoretical distribution is it built?

;)

 

and the two processes are theoretical (the null hypothesis) and real.

You should know.

 
joo:

I don't understand half the words in this thread, but even I understood that distributions have nothing to do with it.

The distribution of a process in which there are dependencies between individual samples does not have to be either uniform or normal. That's obvious.

Example: Pushkin's poems. If the text mentions the words "oak" and "chain", then somewhere nearby there's "cat". This relationship between the words has nothing to do with the distribution of the word "cat", or of any other word, across the paragraphs.

Do you know how primitively authorship is verified?

Exactly like that: from the frequency of "oak-chain-cat" combinations in the "reference" texts and in the text being checked, a conclusion is drawn.

Because there is always a basis for comparison.
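To make that comparison concrete, here is a minimal sketch of such a frequency check in Python. The paragraph splitting, the function name, and the input texts are my illustrative assumptions; only the "oak-chain-cat" marker combination comes from the post above.

```python
MARKERS = ("oak", "chain", "cat")   # the "oak-chain-cat" combination

def marker_rate(text: str) -> float:
    """Fraction of paragraphs that contain all marker words together."""
    paragraphs = [p.lower() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    hits = sum(all(w in p for w in MARKERS) for p in paragraphs)
    return hits / len(paragraphs)

# Authorship check: compare marker_rate(reference_text) against
# marker_rate(checked_text); a large gap argues against shared authorship.
```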

But here I do not understand what is being compared with what.

Where is the theoretical frequency? Or rather, whose is it?

Maybe Candid is right and we just need to emigrate to Greece, and everything will fall into place?

;)

 
avatara:

I don't know where on page 17 you saw a mixture of cutlets and flies. It came up earlier...

Here's the trouble with word endings again: "17 pages" turned into "page 17". Would you care to reread those 17 pages for other "typos" of perception?
and the two processes are theoretical (the null hypothesis) and real.
Actually, my first post quoted the topic starter, so it would be more logical to assume that I was referring to his version in the first place, especially since he, unlike Alexey, described it in great detail. But I am not sure that identifying hypotheses with processes contributes to the clarity of the presentation.
As for understanding, I recommend looking at Alexey's table and answering: under the assumption of which theoretical distribution is it built?

Frankly speaking, I do not know. I would build it on an empirical distribution.
 
avatara:

and the two processes are theoretical (the null hypothesis) and real.

You should know.

No, wrong. I'm interpreting this test. Its statistic is the same, by the way; it just applies to different quantities.

Now for the two variables whose independence is being tested. In the block table I posted, these are the returns of two bars spaced 310 bars apart (309 bars between them). The statistic is computed over the entire population of such pairs of bars in the history. If there are 60000 bars in the history, then there are 60000 - 310 = 59690 such pairs.

The bar further in the past is the source S. Its paired bar, closer to the present, is the receiver R. The returns S and R are the variables whose independence is tested; more precisely, not the returns themselves, but the numbers of the quantiles they fall into. Why divide into quantiles was explained earlier: to make chi-squared valid (expected frequencies of at least 10).
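As a concrete illustration of the procedure just described, here is a minimal sketch in Python. The array name `returns`, the 8-quantile split, and the use of scipy's `chi2_contingency` are my assumptions, not the author's actual code.

```python
import numpy as np
from scipy.stats import chi2_contingency

def lag_independence_test(returns, lag=310, n_quantiles=8):
    """Chi-squared independence test between returns `lag` bars apart."""
    s = returns[:-lag]   # source S: the earlier bar of each pair
    r = returns[lag:]    # receiver R: the later bar of each pair
    # Bin each variable by its own quantiles so that expected cell
    # frequencies stay large enough for chi-squared (at least ~10).
    edges_s = np.quantile(s, np.linspace(0, 1, n_quantiles + 1))[1:-1]
    edges_r = np.quantile(r, np.linspace(0, 1, n_quantiles + 1))[1:-1]
    i = np.digitize(s, edges_s)   # quantile number of S, 0..n_quantiles-1
    j = np.digitize(r, edges_r)   # quantile number of R
    table = np.zeros((n_quantiles, n_quantiles), dtype=int)
    np.add.at(table, (i, j), 1)   # contingency table of quantile pairs
    chi2, p, dof, _ = chi2_contingency(table)
    return chi2, p

# With 60000 bars there are 60000 - 310 = 59690 (S, R) pairs at lag 310.
rng = np.random.default_rng(0)
returns = rng.standard_normal(60000)   # illustrative stand-in data
print(lag_independence_test(returns, lag=310))
```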

About vola as the main source of the phenomenon - I'll think about it. Something is not so simple here... But Candid's suggestion makes sense: test it (remove the vola).

 

I have had a quick look at the author's article. I suspect that the author did not find a correlation between the variables current bar <-> past bar, but only the fact of volatility clustering. Of course, even on that basis the chart is interesting, as a confident correlation of volatility up to 50-60 lags is something new. Naturally, when the data are shuffled Monte Carlo style, the clustering breaks down, which was evident in the charts.

To understand what has been found, it is necessary to test the proposed formula on non-normal and obviously independent distributions, above all on the classic GARCH(1,1), or better yet on GARCH(3,3). If dependence is found there as well, the formula gives nothing new; it simply characterizes a special case of a martingale in one more way.

If the author wishes, I can provide him with synthetic GARCH returns.

 

Thank you. Give me some artificial data and I will test it over the weekend.

As for the formula: yes, there's nothing particularly wonderful about it; it's stochastic analysis from a different angle.

Regarding volatility, a lot has already been said here and I agree with the views expressed. But the number of lags at which the independent variables carry volatility information about the zero bar is indicated quite clearly. And the depth of the lag dive differs across financial instruments while the information remains relevant.

 
I generally think that if you can't predict returns from past returns, there is always, for me personally, the option of going back to the problem of selecting independent variables (various indicators) for prediction. The topic is called feature selection, and I'd be glad to discuss other methods, such as principal component analysis, neural networks with auto-associative memory, analysis of a trained network (its weights), cluster analysis, chi-square; there's also the Lipschitz exponent (correction: constant). All in all, people, it's a big topic...
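For the chi-square item on that list, here is a minimal sketch of filter-style feature selection with scikit-learn. The random data, the target construction, and the choice of `SelectKBest` are my illustrative assumptions, not a method from this thread.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.random((500, 20))   # 20 candidate indicators, non-negative as chi2 requires
y = (X[:, 3] + 0.1 * rng.random(500) > 0.65).astype(int)  # target driven by feature 3

selector = SelectKBest(score_func=chi2, k=5)  # keep the 5 highest-scoring features
selector.fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```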
 
C-4: Of course, even on that basis the chart is interesting, as a confident correlation of volatility up to 50-60 lags is something new.

Thank you for noticing. That's what's so alarming. Probably, yes, vola explains a significant part of the phenomenon, but it doesn't seem to explain all of it. And on the hourlies, that correlation goes even further back... hundreds of bars deep.

By the way, there are significantly fewer correlations on the dailies than on H4, which, in turn, has far fewer correlations than H1.

 
Mathemat:

Thank you for noticing. That's what's so alarming. Probably, yes, vola explains a significant part of the phenomenon, but it doesn't seem to explain all of it. And on the hourlies, that correlation goes even further back... hundreds of bars deep.

By the way, there are significantly fewer correlations on the dailies than on H4, which, in turn, has far fewer correlations than H1.


If it is about volatility again, then it is explained very well by a clear cyclicality that depends on the time of day:

[Intraday volatility chart from the original post.]

You don't need to be Einstein to notice, even with the naked eye, the clustering of vola around 16:30. So on intraday scales such "correlations" are of course much more pronounced. And of course this still gives us nothing. We just know that strong movements occur around 16:30 (as the chart shows), driven by an inflow of volatility, but we still don't know the direction of the movement or its targets.
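A minimal sketch of how such an intraday volatility profile can be computed, assuming a pandas DataFrame `bars` of H1 bars with a DatetimeIndex and a 'close' column; the data below are a synthetic stand-in, not real EURUSD quotes.

```python
import numpy as np
import pandas as pd

def intraday_vola_profile(bars: pd.DataFrame) -> pd.Series:
    """Mean absolute return for each time of day: the daily vola cycle."""
    returns = bars["close"].pct_change().abs()
    return returns.groupby(bars.index.time).mean()

# Illustrative stand-in data: ~250 days of H1 bars as a small random walk.
idx = pd.date_range("2011-01-03", periods=6000, freq="h")
close = 1.3 + np.random.default_rng(0).standard_normal(6000).cumsum() * 1e-3
bars = pd.DataFrame({"close": close}, index=idx)
print(intraday_vola_profile(bars).round(5))
# On real data, a spike in the profile around 16:30 would correspond to
# the clustering visible in the chart above.
```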

As promised, I am attaching a synthetic series: GARCH(1,1) with the standard parameters offered by MATLAB: garchset('P',1,'Q',1,'C', 0.0001, 'K', 0.00005, 'GARCH', 0.8, 'ARCH', 0.1). I didn't manage to produce GARCH(3,3) or higher - I know the program poorly, and a simple change from 'P',1,'Q',1 to 'P',3,'Q',3 didn't work. The series contains 10,000 samples, which I think will be quite enough. [Price chart of the GARCH(1,1) series shown in the original post.]
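For readers without MATLAB, here is a minimal hand-rolled sketch of the same GARCH(1,1) recursion in Python, using the parameters quoted above (C = 0.0001, K = 0.00005, GARCH = 0.8, ARCH = 0.1). It is an illustrative reimplementation, not the attached series itself.

```python
import numpy as np

def simulate_garch11(n=10_000, c=0.0001, k=0.00005,
                     garch=0.8, arch=0.1, seed=0):
    """Simulate returns r_t = c + e_t with e_t ~ N(0, h_t),
    where h_t = k + garch * h_{t-1} + arch * e_{t-1}^2."""
    rng = np.random.default_rng(seed)
    r = np.empty(n)
    h = k / (1.0 - garch - arch)   # start at the long-run variance
    e = 0.0
    for t in range(n):
        h = k + garch * h + arch * e ** 2   # conditional variance update
        e = np.sqrt(h) * rng.standard_normal()
        r[t] = c + e
    return r

returns = simulate_garch11()
prices = 1.3 * np.exp(np.cumsum(returns))   # a toy "price chart"
```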

It would also be interesting to generate a random walk based on hourly volatility data of the same EURUSD. It would have the same volatility character as EURUSD, but the chart itself would consist of 100% noise. If the method detects dependence in it, it is not suitable for price forecasting; but if it reveals no dependence, we will witness the birth of a new indicator, able to tell whether we are dealing with meaningless abstruse synthetics or the real market.
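A minimal sketch of the proposed volatility-preserving random walk, assuming a numpy array `eurusd_returns` of hourly EURUSD returns (the name is mine): keep each bar's magnitude and randomize its sign.

```python
import numpy as np

def vola_preserving_walk(returns, seed=0):
    """Keep each bar's magnitude but randomize its sign: the result has
    the source's volatility profile while being pure directional noise."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=len(returns))
    return np.cumsum(np.abs(returns) * signs)

# Usage (eurusd_returns is an assumed array of hourly EURUSD returns):
# synthetic_path = vola_preserving_walk(eurusd_returns)
```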

Files:
garch.zip  91 kb