SSH task is running with much varying response times since the new... - VisualCron - Forum


dchris
2011-05-27T20:15:24Z
Hello

We have noticed that since we installed the 5.7.4 server, one particular task,

"SSH to remote server with a touch command"

has been running with widely varying response times every minute.

With the previous version, 5.6.9, on the server we had an absolutely stable response time of 3.1 seconds to open the SSH connection and touch a file on the remote server.

Now the response times vary between 3 and 90 seconds.

We could see from the VisualCron log files that this behaviour began exactly when we started VisualCron with the new version.

The SSH connection itself, running over a VPN connection to the remote server, is very stable, and we can open an SSH session with another client on the same server within half a second. The touch command also takes less than a second if it is issued manually within the SSH session.

The task runs every minute. Is there different behaviour, are there known issues in the new version, or have there been changes to the internal SSH handling since version 5.6.9?

Thank you for your help

Dennis


dchris attached the following image(s):
dchris
2011-05-27T21:00:24Z
Hello,

Some more information that I just found out.

It seems that the number of concurrent SSH tasks is the problem.

If just one SSH task is running, it is very fast. If we start 10 at the same time, or if we have SSH tasks that take longer so that 10 or more end up running concurrently, they all become really slow until they finish one by one and the overall number of running tasks drops back below 3 or 4.

The SSH job with the touch command that I mentioned in my previous post is simply the one whose response times we monitor most closely, and it runs every minute.

I hope that will allow you to reproduce this problem on your end and find a solution to get back to the performance of previous versions.

Best regards
Dennis
Support
2011-05-30T21:44:58Z
It sounds natural that concurrent Tasks could cause a slowdown, either on the Windows computer that hosts VisualCron or on the Linux machine.

I trust you when you say that it might be slower than the previous version, but it would still be interesting if you could see where the problem is. Let's say you have one core on your server and 100% CPU is used when there are many concurrent connections. Then we could say that the problem is definitely in the new version.

We are waiting for a new test environment to test some more ourselves.
Henrik
Support
http://www.visualcron.com 
Please like  VisualCron on facebook!
Support
2011-05-30T22:51:04Z
We got our environment up again and did a very simple test. We called "sleep 10" on our SSH server and edited the Job so that it can start even if it is already running.

In quick succession we started the Job 20 times. All Jobs ran for about 10-11 seconds, which seems normal. With this test we can see that:

1. the number of connections is not a problem
2. calling sleep requires very little from the remote system, so we know that we are not being affected by the remote system

One idea that we had was that the problem lay in handling large output from the remote system, either in our component or in VisualCron. So we modified the script to "cat" a 500 kB file after sleeping 10 seconds. Same result here.

Our conclusion is that your specific command is CPU intensive or is affected by simultaneous work on the server. I suggest you open 10 windows against your server, run your script in each and measure the time as well. Or create a script that forks off 10 processes and measures the time of each.
Henrik
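
A minimal sketch of the kind of test suggested in the post above: start several copies of one command at the same moment and time each copy. It is written in Python rather than as a shell script, it is not part of VisualCron, and the function names are made up for this illustration.

```python
import shlex
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def time_command(command: str) -> float:
    """Run one command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    # Output is discarded; only the elapsed time matters for this test.
    subprocess.run(
        shlex.split(command),
        check=False,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def time_concurrently(command: str, copies: int) -> list[float]:
    """Start `copies` instances of the command at once and time each one."""
    # One worker thread per copy, so every copy starts without queueing.
    with ThreadPoolExecutor(max_workers=copies) as pool:
        return list(pool.map(time_command, [command] * copies))
```

As a usage example, `time_concurrently("ssh user@remote.example sleep 10", 10)` would time ten simultaneous connections, where `user@remote.example` is a placeholder host and key-based login is assumed so that no password prompt blocks the run. If every copy finishes in roughly the same time as a single run, the remote server and the network are unlikely to be the source of the delay.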
dchris
2011-05-31T12:31:26Z
Hello, I just checked our server again.

When there are 10 SSH Jobs running, the processor load on the VisualCron server does not go higher than 3-5%.
As the processor-intensive parts are called remotely on the SSH server, I would not expect the type of remotely called task to affect the VisualCron processor load.

I can see again that the more of them are running concurrently, the longer the response time for each of them. It seems as if there is some kind of internal thread handling that slows down the processing.

For example, I just had a situation with 15 concurrent SSH tasks. For one of them I checked the log of the remotely called task, which took only 1 second. The overall time to call this process from VisualCron was 75 seconds! So we have 74 seconds of overhead.

The amount of log data coming back from the SSH commands is not very high, maybe 2-3 lines of text.

I would suggest that you try a setup where you clone the SSH job 10 times and call an SSH command that sleeps for 30 seconds.

Set the trigger to let them start every minute at the same second.

That would be close to our situation here.

Best regards

Dennis
Support
2011-05-31T12:37:46Z
dchris wrote:

As the processor-intensive parts are called remotely on the SSH server, I would not expect the type of remotely called task to affect the VisualCron processor load.



Correct, but if the SSH server machine takes longer to process items (because they run concurrently), then it affects the time it takes to run the SSH command.

Your clone example that you want us to run is not harder on VisualCron than the Job we ran. We ran 20 of the same Job within 5 seconds, so there were 20 instances of the Job running on the SSH server. We could try 20 in the same second, but I don't think there would be a difference, as each connection is processed in its own thread and the connections have nothing to do with each other.

Also, I wonder about your setup. By default you cannot run more than one instance of the same Job at a time. So you have unchecked this limitation? Is there any reason you want it to run more than one instance at a time?



Henrik
Support
2011-05-31T12:43:49Z
dchris wrote:

For example, I just had a situation with 15 concurrent SSH tasks. For one of them I checked the log of the remotely called task, which took only 1 second. The overall time to call this process from VisualCron was 75 seconds! So we have 74 seconds of overhead.



Are you sure that all 15 took 1 second to run? What is that based on: your script, your SSH server connection? All I am saying is that there could be a lot of factors here, and I am sure we will find one test that will make this clear.

I think you need to test the same script as us. Just call a sleep command. If there is no problem with running 15 sleep commands at the same time, we can rule out the connection and internal VisualCron processing.
Henrik
dchris
2011-05-31T12:48:06Z
Hello

Please have a look at my last post. I had added some more details there; you were just too fast with your response.😁

See especially my note about the overhead time.

Maybe there is some misunderstanding regarding the setup.

We have 40 different jobs that we call on remote servers via SSH, not all of them running on the same server. I am not complaining that these jobs take too long; it is the overhead time to call them that is giving us a hard time here.

With version 5.6.9 we had the same number of concurrent SSH jobs running remotely, but there was no overhead at all: if the job finished on the remote server within 5 seconds, I always had a response within 5.5 seconds.

Best regards

Dennis
Support
2011-05-31T12:50:10Z
Ok, to sum my two above posts up:

I am not saying there isn't a problem with VisualCron, but if you do my test (and turn off your existing SSH Tasks) we will get closer to the problem.

We will test your suggested setup as well.
Henrik
dchris
2011-05-31T13:04:55Z
Support wrote:


I think you need to test the same script as us. Just call a sleep command. If there is no problem with running 15 sleep commands at the same time we can rule out the connection and internal VisualCron processing.




Please have a look at my initial post in this topic:

We have a very simple job, "heartbeat", that opens an SSH connection and just runs a touch command to change the timestamp of one file on the SSH server. This job runs every minute.
Please see the screenshot of processing times.
There you can see that response times go up and down in waves.
These waves correspond to the number of concurrently running SSH jobs.

The heartbeat job goes to a different server than all the other concurrently running jobs, so it cannot be affected by any processor load from the other SSH tasks.

Best regards

Dennis
Support
2011-05-31T13:11:44Z
I don't know what your heart beat ssh Task does. That is why I wanted you to try a simple sleep command.

Also, in our discussions we focus on many Jobs at once. For that reason I would like to isolate the simple sleep command, running many at once, to see if that is a problem for you. During this test we need to stop any other SSH task so that the test is clean.

Sorry for going back to this test all the time, but we cannot reproduce the exact same environment you have and we do not have the insight to determine anything from here. That is why we need to do simple, isolated tests that we can also do here.

Based on the test results we will decide how to proceed.
Henrik
dchris
2011-05-31T14:43:09Z
Hello

I did the test as suggested.

I prepared 11 jobs, all of them running an SSH task with sleep 30 on one SSH server and starting between second 15 and second 25 of every minute.

If I let just one job run, it takes 32 seconds.

If I activate the other 10 jobs, I get response times of up to 39 seconds.

The overhead here does not seem to be as big as when we run our regular SSH jobs in parallel.

For some time during the test I also started the regular SSH tasks in parallel with the test jobs, and response times went up to 60 seconds for the "sleep 30" jobs.

Could it maybe be some response content monitoring of the SSH tasks?
Maybe the tasks behave differently depending on whether they bring back response logs or not.

Best regards

Dennis
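
The figures reported in the post above can be turned into overhead per scenario by subtracting the 30 seconds that the remote sleep itself takes. The short sketch below only restates that arithmetic; the labels are paraphrases, and connection setup is counted as overhead here, which is an assumption about how the totals were measured.

```python
SLEEP_SECONDS = 30  # duration of the remote "sleep 30" command

# Wall-clock totals reported in the post, in seconds (worst case where a range was given).
observed = {
    "single job": 32,
    "11 test jobs in parallel": 39,
    "test jobs plus regular SSH tasks": 60,
}


def overhead(total_seconds: float, remote_seconds: float = SLEEP_SECONDS) -> float:
    """Time spent outside the remote command itself."""
    return total_seconds - remote_seconds


for label, total in observed.items():
    print(f"{label}: {overhead(total)} s overhead")
```

The same function covers the earlier observation in this thread: `overhead(75, 1)` gives the 74 seconds reported for the 15-task case.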
Support
2011-05-31T15:01:48Z
dchris wrote:

Hello

I did the test as suggested.

I prepared 11 jobs, all of them running an SSH task with sleep 30 on one SSH server and starting between second 15 and second 25 of every minute.

If I let just one job run, it takes 32 seconds.

If I activate the other 10 jobs, I get response times of up to 39 seconds.

The overhead here does not seem to be as big as when we run our regular SSH jobs in parallel.

For some time during the test I also started the regular SSH tasks in parallel with the test jobs, and response times went up to 60 seconds for the "sleep 30" jobs.

Could it maybe be some response content monitoring of the SSH tasks?
Maybe the tasks behave differently depending on whether they bring back response logs or not.

Best regards

Dennis



We did test with large output as well, about 500 kB. That did not make a difference. Still, it is very strange that you get up to 39 seconds. This certainly cannot be reproduced here.

We need to think about how we can reproduce this. Is it possible for us to connect to your server?

Henrik
dchris
2011-05-31T17:04:19Z
Hi Henrik

We just got to the point where the number of SSH tasks piling up due to very slow processing seems to be simply too high.
The client could not connect to the server any more; the only solution was to restart the service, and we even had to kill the VisualCronService process in the Windows Task Manager.

We are considering going back to version 5.6.9 until this problem is solved.
This would allow us to determine whether it is a problem with the new version 5.7.4 or not.

Should this be possible, or will we run into problems when we try to go back to the old version?

Best regards

Dennis
Support
2011-05-31T17:36:27Z
Export your settings to a file first. Then downgrade. For the next version we will probably prepare some kind of logging of what the Tasks are doing at any given moment. Maybe it will be easier to debug then.
Henrik
dchris
2011-06-06T18:42:59Z
Hi Henrik

We just discovered that the problem with task response times is not limited to SSH tasks.

We have installed a Job with a web service task that calls a web service that always finishes within 3 seconds. This job's response times varied, as we saw with the SSH jobs, from 3 seconds to 50 seconds, and the variation was related to the number of tasks executing in parallel at the same time.

Maybe that helps you to find the problem or a potential cause specific to our server.

I have not yet rolled back the server version to 5.6.9.

But this test will be next.

Best regards

Dennis
Support
2011-06-06T18:52:57Z
The only general thing that is new should be the database logging. You could try going to Server settings->Log->Database settings and unchecking database logging. It should not make a difference, but it is worth a try.
Henrik
dchris
2011-06-06T19:22:28Z
Hi Henrik

I just rolled back the server to 5.6.9.

There is a massive change: execution times are really fast now.

With version 5.7.4 we had an average overhead of 20-40 seconds, going up to peaks of 400 seconds, to call various types of jobs when running 20 jobs concurrently.
Now I see no overhead time at all. If the SSH job runs for 5 seconds on the remote server, I get the result back in 5.2 seconds in all cases, regardless of how many Jobs are running concurrently.
In fact, the number of concurrent jobs has decreased because all jobs finish much faster with this old version.

For the moment there is no need to upgrade to a higher version than 5.6.9, but I would not like to get stuck there. What can I do to help you solve this problem?

One important thing I noticed: I have the feeling that the Windows Task Manager shows much more activity in terms of processor load with version 5.6.9 than with the new version 5.7.4.
Maybe this can be a hint for you.

Did the logging change somehow, so that it could be an I/O problem on the server or some other hard-to-find cause such as thread pool handling?

Best regards

Dennis
dchris
2011-06-06T20:02:33Z
Hello Henrik

While you were suggesting in your post that I switch off database logging, I had the same idea while writing my own post at the same time.
So I thought it was worth upgrading back to 5.7.4 and giving it a try.

BINGO😁 It seems to be the database logging.

The moment I switched database logging off, the jobs suddenly finished as fast as in version 5.6.9.

The web service I mentioned earlier has a response time below 1 second now, and the SSH task with the touch command on the remote server takes 1.3 seconds now.

The database logging is a very useful new feature. I hope you can find a fix for this so that I can reactivate it in the next server version. Maybe it has to do with the locking and insert strategy?

So far thanks for your great support.

Best regards

Dennis
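
The question about locking and insert strategy can be illustrated with a generic pattern in which tasks hand their log rows to a queue and a single background thread performs the slow inserts. This is only a sketch of that general technique, not a description of how VisualCron writes its log; `BackgroundLogWriter` and `write_row` are hypothetical names invented for this example.

```python
import queue
import threading


class BackgroundLogWriter:
    """Accept log rows immediately and write them on a separate thread."""

    def __init__(self, write_row):
        self._write_row = write_row      # hypothetical function doing the slow insert
        self._rows = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, row) -> None:
        """Called by a task; returns without waiting for the insert."""
        self._rows.put(row)

    def _drain(self) -> None:
        # Write rows one by one, in the order they were queued.
        while True:
            row = self._rows.get()
            if row is None:              # sentinel: stop the worker
                break
            self._write_row(row)

    def close(self) -> None:
        """Write the remaining rows and stop the worker thread."""
        self._rows.put(None)
        self._worker.join()
```

With this arrangement a slow or temporarily blocked insert delays only the writer thread, so the time a task spends in `log` stays short no matter how many tasks run concurrently.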
Support
2011-06-06T21:06:31Z
Great, thanks for the testing!

What we will do is to add some debugging and send that version to you. We have not, so far, been able to reproduce this ourselves so we are not sure exactly where the lock is. We hope that you can test this new version for us and come back with a log file. We will create this new version later this week.
Henrik
Support
2011-06-07T10:23:12Z
Another idea, which builds on the first, is that we do a database cleanup every 15 minutes. During that time inserts are locked. Do you feel that this happens about every 15 minutes, or more often?
Henrik
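
The hypothesis in the post above, that a cleanup holding a lock delays every task trying to insert a log row, can be simulated in a few lines. The sketch assumes a single lock shared by cleanup and inserts; that is an assumption made for the simulation, not a confirmed detail of VisualCron, and all names are illustrative.

```python
import threading
import time

db_lock = threading.Lock()  # stand-in for whatever guards the log database


def cleanup(hold_seconds: float, started: threading.Event) -> None:
    """Simulated periodic cleanup: holds the lock for its whole duration."""
    with db_lock:
        started.set()             # signal that the lock is now held
        time.sleep(hold_seconds)


def run_task(work_seconds: float) -> float:
    """Simulated task: do the work, then write one log row under the lock."""
    start = time.perf_counter()
    time.sleep(work_seconds)      # the actual job, e.g. the remote command
    with db_lock:                 # blocks for as long as cleanup holds the lock
        pass                      # the insert itself is treated as instant
    return time.perf_counter() - start


def measure(work_seconds: float, cleanup_seconds: float) -> float:
    """Return the task's total time when a cleanup starts just before it."""
    started = threading.Event()
    cleaner = threading.Thread(target=cleanup, args=(cleanup_seconds, started))
    cleaner.start()
    started.wait()                # make sure the cleanup owns the lock first
    elapsed = run_task(work_seconds)
    cleaner.join()
    return elapsed
```

In this model a task that needs 0.05 seconds of work takes about as long as the cleanup itself whenever the cleanup is the longer of the two, which is the pattern of a fast job reporting a long overall time.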
dchris
2011-06-07T12:03:03Z
Have a look at the screenshot in the first post.

There you see that the slow execution times happen about every 5 minutes and they change like waves from fast execution to slower until the execution time reaches a peak and becomes fast again after 2 minutes.

Best regards

Dennis
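
Whether slow runs recur every 5 minutes or every 15 minutes can also be checked from logged run times instead of a screenshot. The sketch below assumes the start time and duration of each run have already been extracted into pairs; the function name and the threshold parameter are illustrative and not tied to any VisualCron log format.

```python
from datetime import datetime


def slow_run_gaps(runs: list[tuple[datetime, float]], threshold: float) -> list[float]:
    """Return the gaps, in minutes, between the starts of slow episodes.

    `runs` holds (start time, duration in seconds) pairs in time order.
    A slow episode begins at a run whose duration exceeds `threshold`
    when the previous run did not.
    """
    episode_starts = []
    previous_slow = False
    for started, duration in runs:
        slow = duration > threshold
        if slow and not previous_slow:
            episode_starts.append(started)
        previous_slow = slow
    # Pair each episode start with the next one and convert the gap to minutes.
    return [
        (later - earlier).total_seconds() / 60
        for earlier, later in zip(episode_starts, episode_starts[1:])
    ]
```

Gaps clustering around 15 minutes would be consistent with the cleanup interval mentioned by support, while gaps around 5 minutes would point to some other recurring cause.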
Support
2011-06-07T12:11:05Z
Please install this version:

http://neteject.com/down...on/VisualCron5.7.5-2.exe 

You can send us the file log_serverDATE.txt once the first abnormally long execution has appeared. Also, please tell us at what time this happened and for which job and task name.
Henrik
Support
2011-06-17T09:07:46Z
Were you able to test the new file? The log file would be very valuable for us.
Henrik
dchris
2011-06-19T11:35:09Z
As I understand it, the new Client version will force us to upgrade all of our Clients and Servers.
Or is the new Client compatible with the old server?

If I upgrade temporarily, will I be able to go back to the 5.7.4 Server without problems?

Regards Dennis