Specific test agent doesn't stay connected to the cloud

 
Shalem Loritsch #:

Oh, this is good news!  I had written (in a post that was deleted for some reason) that while the bug is 100% reproducible, they would not be able to reproduce it unless they were on a sufficiently fast machine, and that there are only a few CPUs fast enough to trigger this issue, and they are expensive.  So I'm glad they didn't give up after their initial finding of "unable to reproduce" and proceeded to find the bug.  I just had it manifest again (my PR, which had been hovering in the upper 240s, crept up to 251 and it acted up again), and I was just about to post more information.  It seems that there are two issues: an instant issue in the MetaTester program itself, and a separate issue on the server.  The one I had posted about was the instant version, where the PR is shown as under 250 on my Agents dashboard, but when raising clock speeds, the agents immediately start cancelling all incoming jobs (and they keep coming).  Another user later posted the other version, where the PR updates on the server/dashboard to above 250 and jobs stop coming in, and you have to somehow get that number down before it will start receiving jobs again; this can take time, or require uninstalling after lowering the clock speed, since speed changes do not take effect immediately with this version of the issue.

When I saw the comment "As we can see, you do receive payments for tasks, so do the other users that reported problems", my response to that would be, "Well, yes; I have reduced the clock speed on my computer below stock to work around the issue; I can raise it back up and all earnings from that PC will immediately cease."  Thank you for your efforts pursuing and pressing the issue, and to the others who were also able to confirm the issue and provide logs.  Speaking of bugs and logs, they just released version 3630, and it writes corrupted log files (random characters here and there, partially repeated lines, on multiple machines); I've never seen anything like it before.  Anyway, I will tentatively wait for the next version, push my clocks higher, and report back my findings...

This is not a release, it's a beta. So don't use it unless you want to be a beta-tester.

 
Shalem Loritsch #:

Oh, this is good news!  I had written (in a post that was deleted for some reason) that while the bug is 100% reproducible, they would not be able to reproduce it unless they were on a sufficiently fast machine, and that there are only a few CPUs fast enough to trigger this issue, and they are expensive.  So I'm glad they didn't give up after their initial finding of "unable to reproduce" and proceeded to find the bug.  I just had it manifest again (my PR, which had been hovering in the upper 240s, crept up to 251 and it acted up again), and I was just about to post more information.  It seems that there are two issues: an instant issue in the MetaTester program itself, and a separate issue on the server.  The one I had posted about was the instant version, where the PR is shown as under 250 on my Agents dashboard, but when raising clock speeds, the agents immediately start cancelling all incoming jobs (and they keep coming).  Another user later posted the other version, where the PR updates on the server/dashboard to above 250 and jobs stop coming in, and you have to somehow get that number down before it will start receiving jobs again; this can take time, or require uninstalling after lowering the clock speed, since speed changes do not take effect immediately with this version of the issue.

When I saw the comment "As we can see, you do receive payments for tasks, so do the other users that reported problems", my response to that would be, "Well, yes; I have reduced the clock speed on my computer below stock to work around the issue; I can raise it back up and all earnings from that PC will immediately cease."  Thank you for your efforts pursuing and pressing the issue, and to the others who were also able to confirm the issue and provide logs.  Speaking of bugs and logs, they just released version 3630, and it writes corrupted log files (random characters here and there, partially repeated lines, on multiple machines); I've never seen anything like it before.  Anyway, I will tentatively wait for the next version, push my clocks higher, and report back my findings...

It did take some insisting with support, but I am glad they finally found it.

About the PR > 250 on the dashboard issue that you mentioned - I am not sure that's the case.

Yesterday, while fiddling around with overclock/stock clock/underclock settings to provide logs to support, I noticed my PR moved from 237 to 251 when I overclocked, but it didn't move back down to 237 when I underclocked again. I am not sure how often (or when) the PR score is refreshed on the dashboard, though.

So right now it still displays as 251 on the dashboard, but my agents are receiving and processing cloud jobs normally.

My guess is that after aborting so many jobs, the server might put you on a kind of blacklist and avoid sending you jobs for a while - and that will be the case most of the time you have a "correct" PR > 250.

 
Alain Verleyen #:

This is not a release, it's a beta. So don't use it unless you want to be a beta-tester.

It automatically updated itself when I started MT5; I did not ask to be on the beta update channel.  Regardless, a few days later, version 3640 (with the same bug) was automatically pushed to all stand-alone MetaTester Agents without the MT5 Terminal.  It appears that this logging corruption bug has been fixed in version 3644, which automatically updated when I started MT5 yesterday.  I don't mind being a beta tester as long as there is a clear and active mechanism through which to provide feedback to the development team.  My hope is that they are following this thread, which is one of the reasons I mentioned it earlier.

While we're on the topic of logging and they're playing with that code (again, I'm hoping a developer is following this thread and will see this), it would be nice if MetaTester would flush buffered log entries to the file more frequently.  During long testing jobs, MANY HOURS (I've seen it go 18+ hours before) can go by with NOTHING new getting written to the log file since the last flush, which ended on "tester agent shutdown finished" for the previous (finished) job.  If the computer/tester crashes during this time while running test passes, there's absolutely nothing in the log file indicating that it was even working on a job, as all that information only gets flushed to the file at the END of the job.  I think a good way of addressing this would be to remember the GetTickCount64 value each time the log is flushed to disk, and then, every time there's an incoming log entry, check the current GetTickCount64 against the stored value and trigger a flush to the file if more than, say, 15 seconds have elapsed since the last flush.
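
A minimal C++ sketch of that time-based flush idea (this is not MetaTester's actual code; the Logger class, its members, and the 15-second threshold are assumptions made for illustration):

// Time-based log flushing using GetTickCount64 (illustration only).
#include <windows.h>
#include <cstdio>
#include <string>
#include <vector>

class Logger
{
public:
   explicit Logger(const char *path)
      : m_file(std::fopen(path, "a")), m_last_flush(GetTickCount64()) {}
   ~Logger() { Flush(); if(m_file) std::fclose(m_file); }

   void Write(const std::string &line)
   {
      m_buffer.push_back(line);
      // flush if more than ~15 seconds have elapsed since the last flush
      if(GetTickCount64() - m_last_flush >= 15000ULL)
         Flush();
   }

   void Flush()
   {
      if(!m_file)
         return;
      for(const std::string &line : m_buffer)
         std::fprintf(m_file, "%s\r\n", line.c_str());
      std::fflush(m_file);
      m_buffer.clear();
      m_last_flush = GetTickCount64();   // remember when we last flushed
   }

private:
   std::FILE                *m_file;
   std::vector<std::string>  m_buffer;      // buffered, not-yet-written entries
   ULONGLONG                 m_last_flush;  // GetTickCount64() at the last flush
};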

Additionally, to fine-tune the above idea a bit: at the initial "4412 bytes of account info loaded" stage, it should preemptively update this remembered tick count value to the current GetTickCount64 value, so the log doesn't flush with just that one line and impose a 15+ second delay on the important lines that quickly follow.  Then, immediately after logging the hardware info ("Intel Core i7-10700K  @ 3.80GHz, 48983 MB"), it should clear that remembered tick count to 0, so that the next line (which should be "optimization pass 40552 started") will flush itself and everything before it directly to the file.  I suggest it this way because we don't want the start of every optimization pass to force a flush to the log file, as some passes are only milliseconds long and there can be thousands of them.

This way, the start of a new job will be logged to the file right away with full information; as passes are completed, that progress information will be periodically flushed to the file (e.g. "40552 : passed in 0:00:10.100", no more frequently than every 15 seconds); and if the job consists of a single, hours-long pass, at least the job ID and initial pass information will have already been flushed to the log file at the beginning of the process.

This information could come in handy for detecting PCs that have stability issues (as opposed to being restarted by the user): the tester agent could look through the log file at startup, see that it started a job but find no finish/stop/cancel event, realize there was a crash, and report that finding to the server for a PR demotion/disconnect timeout.  Presently, all this job/pass detail is only flushed to the log file at the END of a job, so such crash detection is not possible and can only be guessed at server side (which I believe has been happening for about 10 months now, but often incorrectly for whatever reason).
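
Continuing the hypothetical Logger sketch above, those two tweaks could look like this (the method names and the quoted log lines are illustrative assumptions, not the agent's real API or output):

// Two hypothetical additions to the Logger class sketched earlier.
// DeferFlush() postpones the next time-based flush so a lone line
// (e.g. the account-info line) doesn't trigger one all by itself;
// ForceFlushOnNextWrite() makes the very next Write() flush everything.
void DeferFlush()            { m_last_flush = GetTickCount64(); }
void ForceFlushOnNextWrite() { m_last_flush = 0; }   // 0 means "flush on the next entry"

// Illustrative usage at the start of a job:
//   logger.DeferFlush();                               // at the "account info loaded" stage
//   logger.Write("4412 bytes of account info loaded");
//   logger.Write("Intel Core i7-10700K  @ 3.80GHz, 48983 MB");
//   logger.ForceFlushOnNextWrite();                    // right after the hardware info line
//   logger.Write("optimization pass 40552 started");   // flushes itself and everything above it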


Emerson Gomes #:

About the PR > 250 on the dashboard issue that you mentioned - I am not sure that's the case.

Yesterday, while fiddling around with overclock/stock clock/underclock settings to provide logs to support, I noticed my PR moved from 237 to 251 when I overclocked, but it didn't move back down to 237 when I underclocked again. I am not sure how often (or when) the PR score is refreshed on the dashboard, though.

So right now it still displays as 251 on the dashboard, but my agents are receiving and processing cloud jobs normally.

My guess is that after aborting so many jobs, the server might put you on a kind of blacklist and avoid sending you jobs for a while - and that will be the case most of the time you have a "correct" PR > 250.

The behavior appears to have changed right about the time I wrote that post and you wrote your reply.  Somewhere around version 3630 or 3640, the issue of "cancellation of all incoming jobs immediately upon experiencing a very fast processor, and resuming normal operation immediately upon loading the processor down or dropping its speed by 100 MHz" seems to have gone away, and I have since restored normal boost clocks on my i9-13900K CPU with no ill effects (PR seems to be pegged at 249, though; I expected it to quickly increase above that like it did before, but it has held stable for a week now on this PC).  So one part of this bug appears to have been fixed.

However, I believe the longer-term issue with the updated PR on the website still exists, although it appears to act differently now: with a reported PR above 250, jobs now come in, but many of them still quickly abort (at least with version 3646; it's hard to test this bug because it involves starting over again to get the PR to update higher than 249, then I'm stuck with it not working correctly, have to drop clocks, and start over yet again to lock in the PR below 249 so it will go back to working normally, which is rather tedious).  It's like they fixed whatever was causing the immediate problem when starting a job on a faster CPU with an old PR, but it still doesn't like the higher PR value when one exists.  To be clear: I just started over with version 3646 and got a PR of 260; a dozen jobs came in, and every single one of them spontaneously aborted as before.  But at least now I can go back, slow the CPU down, start over again, get my 249 PR back, and then restore my CPU to the same high speed as before, and it continues to work.  Previously, with a higher CPU speed it would immediately stop working, every time, regardless of the shown PR value.  I'll try again on a later version.  Some progress appears to have been made; hopefully, all will be resolved shortly!

 

I have not seen a changelog for the latest agent builds, but apparently the issue has been fixed. Currently my CPU can process jobs at stock clock.

I was curious to know what the fix was exactly.

 
Comments that do not relate to this topic have been moved to "Off-topic posts".
 

I have still had the same problem for 2 weeks. My CPU is a 7950X and it is disconnected all the time. The interesting thing is that I have two other systems with slower Intel CPUs (PR is about 150), but they are disconnected too!

I am using agent build 3661.

 
b.mohammadi #:

I have still had the same problem for 2 weeks. My CPU is a 7950X and it is disconnected all the time. The interesting thing is that I have two other systems with slower Intel CPUs (PR is about 150), but they are disconnected too!

I am using agent build 3661.

I don't think it's the same issue.
 
Emerson Gomes #:
I don't think it's the same issue.

I think it is the same issue.

For testing purposes, I routed the MetaTester through a proxy, and now I see this error over and over again:

- agent3.mql5.net(127.216.0.12):443 error : Could not connect to proxy 127.0.0.1(127.0.0.1):9091 - connection attempt failed with error 10061

The proxy works fine, with no problems.

Is there any problem with agentX.mql5.net?
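
For what it's worth, Winsock error 10061 (WSAECONNREFUSED) in that message means the agent's connection attempt to the local proxy at 127.0.0.1:9091 was refused, not that agentX.mql5.net itself is unreachable. One quick way to check whether anything is actually listening on that port is a small connect test like the sketch below (a hypothetical standalone check, not part of MetaTester):

// Minimal Winsock connect test for 127.0.0.1:9091 (illustration only).
#include <winsock2.h>
#include <ws2tcpip.h>
#include <cstdio>
#pragma comment(lib, "ws2_32.lib")

int main()
{
   WSADATA wsa;
   if(WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
      return 1;

   SOCKET s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
   sockaddr_in addr = {};
   addr.sin_family = AF_INET;
   addr.sin_port   = htons(9091);                      // proxy port
   inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);    // proxy address

   if(connect(s, (sockaddr*)&addr, sizeof(addr)) == SOCKET_ERROR)
      std::printf("connect failed with error %d\n", WSAGetLastError());   // 10061 = connection refused
   else
      std::printf("something is listening on 127.0.0.1:9091\n");

   closesocket(s);
   WSACleanup();
   return 0;
}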

 
b.mohammadi #:

I think it is the same issue.

For testing purposes, I routed the MetaTester through a proxy, and now I see this error over and over again:

- agent3.mql5.net(127.216.0.12):443 error : Could not connect to proxy 127.0.0.1(127.0.0.1):9091 - connection attempt failed with error 10061

The proxy works fine, with no problems.

Is there any problem with agentX.mql5.net?

I re-installed the MetaTester. Now it works with the proxy and is connected to the cloud.