Discussing the article: "Developing a robot in Python and MQL5 (Part 1): Data preprocessing"

 

Check out the new article: Developing a robot in Python and MQL5 (Part 1): Data preprocessing.

Developing a trading robot based on machine learning: a detailed guide. The first article in the series covers collecting and preparing data and features. The project is implemented using the Python programming language and libraries, together with the MetaTrader 5 platform.

The market is becoming increasingly complex. Today it is turning into a battle of algorithms. Over 95% of trading turnover is generated by robots. 

The next step is machine learning. These models are not strong AI, but they are not simple linear algorithms either. A machine learning model is capable of making a profit in difficult conditions. It is interesting to apply machine learning to building trading systems. Thanks to neural networks, the trading robot will analyze big data, find patterns and predict price movements.

We will look at the development cycle of a trading robot: data collection, processing, sample expansion, feature engineering, model selection and training, creating a trading system via Python, and monitoring trades.
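For concreteness, here is a minimal sketch of the first step, data collection, using the MetaTrader5 Python package. The symbol, timeframe and bar count are assumptions, chosen to match the EURUSD H1 data discussed in the comments below:

```python
import MetaTrader5 as mt5
import pandas as pd

# Connect to the running MetaTrader 5 terminal
if not mt5.initialize():
    raise RuntimeError(f"MetaTrader 5 initialization failed: {mt5.last_error()}")

# Pull the last 10,000 H1 bars of EURUSD (illustrative values)
rates = mt5.copy_rates_from_pos("EURUSD", mt5.TIMEFRAME_H1, 0, 10000)
mt5.shutdown()

df = pd.DataFrame(rates)
df["time"] = pd.to_datetime(df["time"], unit="s")  # epoch seconds -> timestamps
```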

Working in Python has its advantages: speed in machine learning tasks, as well as flexibility in selecting and generating features. Exporting a model to ONNX requires reproducing exactly the same feature generation logic as in Python, which is not easy. That is why I have chosen live trading via Python.

Author: Yevgeniy Koshtenko

 
A sensible approach 👍 interesting way of selecting features.
 
Thank you very much for this interesting article. I haven't used Python before, but you got me interested in learning this powerful tool. I am looking forward to new publications and will be following the series!
 

In the EURUSD forecasting task, we added a binary "labels" column indicating whether the next price change exceeded the spread and commission.
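For illustration, a minimal sketch of such a labelling step (the "close" column name and the spread/commission values are assumptions, not the article's exact code):

```python
import pandas as pd

def add_labels(df: pd.DataFrame, spread: float = 0.00010, commission: float = 0.00002) -> pd.DataFrame:
    # Next bar's price change: close[t+1] - close[t]
    future_change = df["close"].diff().shift(-1)
    # Label is 1 if the move is large enough to cover spread + commission
    df = df.copy()
    df["labels"] = (future_change.abs() > (spread + commission)).astype(int)
    return df.iloc[:-1]  # the last bar has no "next" change
```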

By the way, out of over 700,000 data points, the price changed by more than the spread in only 70,000 cases.

EURUSD has ~0 spread 90% of the time. You are working with H1 data. How did you get this result?

 
By the way, a reader experienced in machine learning has long since realized that we will end up developing a classification model rather than a regression one. I like regression models more; I see a bit more forecasting logic in them than in classification models.
It seems like the second sentence contradicts the first.
 

Feature engineering is the transformation of raw data into a set of features for training machine learning models. The goal is to find the most informative features. There is a manual approach (a human selects the features) and an automatic one (using algorithms).

We will use the automatic approach. We will apply a feature creation method to automatically extract candidate features from our data, and then select the most informative ones from the resulting set.
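As a hedged illustration of what such a pipeline can look like (the article's exact method is not reproduced here; rolling statistics and a mutual-information ranking stand in for it):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def generate_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for w in (5, 20, 50):
        out[f"ma_{w}"] = out["close"].rolling(w).mean()   # moving averages
        out[f"std_{w}"] = out["close"].rolling(w).std()   # rolling volatility
    out["ret_1"] = out["close"].pct_change()              # one-bar price increment
    return out.dropna()

def rank_features(df: pd.DataFrame, label_col: str = "labels") -> pd.Series:
    # Rank the numeric features by mutual information with the labels
    X = df.drop(columns=[label_col])
    scores = mutual_info_classif(X, df[label_col])
    return pd.Series(scores, index=X.columns).sort_values(ascending=False)
```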

The best feature for price prediction turned out to be the opening price itself. Features based on moving averages, price increments, standard deviation, and daily and monthly price changes made it into the top. The automatically generated features turned out to be uninformative.

This raises a question about the quality of the feature generation algorithms, or rather, their complete lack of it.


A human generated all the features from OHLCT data, five columns in total. You are feeding the feature generation algorithm a much larger number of initial features. It is hard to imagine that a feature generation algorithm could not reproduce the simplest MA feature.

 
I liked the language, style and presentation of information in the article. Thank you to the author!
 

Feature clustering combines similar features into groups to reduce their number. This helps get rid of redundant data, reduce correlation, and simplify the model without overfitting. The best feature for price prediction turned out to be the opening price itself.
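A hedged sketch of one common way to do this, hierarchical clustering on a correlation-based distance (the article's exact procedure may differ):

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_features(X: pd.DataFrame, threshold: float = 0.2) -> list:
    corr = X.corr().abs()
    distance = 1.0 - corr                        # highly correlated features -> small distance
    np.fill_diagonal(distance.values, 0.0)
    condensed = squareform(distance.values, checks=False)
    clusters = fcluster(linkage(condensed, method="average"), t=threshold, criterion="distance")
    # Keep the first feature encountered in each cluster as its representative
    keep = {}
    for col, c in zip(X.columns, clusters):
        keep.setdefault(c, col)
    return list(keep.values())
```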

Did the clustering throw out the HLC prices because they fell into the same cluster as the O-price?

If the price turned out to be the best feature for predicting itself (and the other features are its derivatives), does that mean we should forget about the other features, and that it is reasonable to add more input data by moving to a lower timeframe and taking the prices of other symbols as features?

 

Prices should, of course, be removed from the training sample, as the ML model will not perform adequately on new data, especially if prices fall outside the training range.

The high informativeness of prices arises from the uniqueness of their values: it is easier for the algorithm to memorize prices and match them to labels.

In ML practice, one removes not only uninformative features but also suspiciously over-informative ones, which is exactly what raw prices are.

In an ideal scenario, there should be several features that are roughly equally informative, with no clear leaders or outsiders. That way, none of the features clutter the training or hog all the importance.
 
Maxim Dmitrievsky #:

Prices should, of course, be removed from the training sample, as the ML model will not perform adequately on new data, especially if prices fall outside the training range.

If we switch to returns, the feature generation algorithm is bound to generate a cumulative sum, which gives back the very same prices. Only this time it will not be known that these are prices.
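A minimal sketch of this point: the cumulative sum of returns reconstructs the price path from its starting value, so "prices" can reappear under another name (values here are hypothetical):

```python
import numpy as np

prices = np.array([1.0850, 1.0862, 1.0855, 1.0871])   # hypothetical EURUSD closes
returns = np.diff(np.log(prices))                      # log returns

# The cumulative sum of returns plus the first log price recovers the series exactly
reconstructed = np.exp(np.log(prices[0]) + np.concatenate(([0.0], np.cumsum(returns))))
assert np.allclose(reconstructed, prices)
```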

 
fxsaber #:

If we switch to returns, the feature generation algorithm is bound to generate a cumulative sum, which gives back the very same prices. Only this time it will not be known that these are prices.

I don't get it

All features should be pseudo-stationary, like increments. Raw prices should be removed from training.
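A minimal sketch of this check, using a synthetic random-walk series and the ADF stationarity test from statsmodels (the data and parameters are illustrative, not from the article):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Synthetic random-walk "prices" stand in for real quotes here
close = pd.Series(1.08 + np.cumsum(np.random.normal(0.0, 1e-4, 5000)))
increments = close.diff().dropna()

print("ADF p-value, raw prices:", adfuller(close)[1])       # typically large: non-stationary
print("ADF p-value, increments:", adfuller(increments)[1])  # typically near 0: stationary
```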