SSH tasks raise processor load until Visualscron server stops working - VisualCron - Forum

Community forum

dchris
2009-12-09T22:53:35Z
We run Visualcron 5.3.1 Server on Windows XP SP3 .

We recently added some more ssh tasks to the settings.
These ssh tasks call scripts on a remote Ubuntu server using password for authentication.

The new tasks are executed about one every minute and do usually not take more than 30 seconds to finish on the remote server. Some few of them take a few minutes more.

After some time ( 1-3 hours) we notice that the Visualcron server takes about 50% of processor load and the task are executed slower and slower.
The only help is to restart the Visualcron service.

Additionally we noticed that the "running jobs" display shows ssh tasks running that have definately finished hours ago. Looking at the Visualcron client list of tasks these same tasks are displayed as successfully finished. Even after a server restart these tasks still show up as running in the "running jobs" display.

We also notice that the client does not update the status of the tasks. Sometimes a task still shows running minutes after it was finished on the server. Only a disconnect and connect refeshes the task list and shows correct status.

We do have the feeling that Visualcron keeps some process in the process list even when they are done on the remote server and this causes the system to raise processor time usage.

Please advise what could cause this behavior and how we could get Visualcron running properly.
Support
2009-12-09T23:26:44Z
Thanks for the report - we will examine this right away.
Henrik
Support
http://www.visualcron.com 
Please like  VisualCron on facebook!
Support
2009-12-09T23:44:38Z
Can you test the latest version. We fixed a disconnection problem in 5.3.3.
Henrik
Support
http://www.visualcron.com 
Please like  VisualCron on facebook!
Support
2009-12-09T23:46:53Z
By the way, are you uploading or downloading or just execute commands?
Henrik
Support
http://www.visualcron.com 
Please like  VisualCron on facebook!
dchris
2009-12-10T14:03:58Z
We are just executing tasks remotely.

Installing Version 5.4.7 seems to have fixed the problem.

The last 3 hours I did not see the processor load going up.
Tasks in client list display are also updating as far as I can see.
And the "running jobs" display shows only the really running jobs now.

Thank you for the quick answers!

Support
2009-12-10T14:04:38Z
Great, thanks for the feedback.
Henrik
Support
http://www.visualcron.com 
Please like  VisualCron on facebook!
dchris
2010-01-06T11:19:28Z
Unfortunaely the problem was not really solved.
Installing version 5.4.7 seemed to have helped for some time but we still have problems with the visualcron server (currently installed 5.4.9)

Every few hours the processor load goes up to 100% and the server stops working.

Looking at the protocols (visualcron server_log) and what tasks where executed right before the load goes up we might have found a possible reason for this behavior.

We run ssh commands that create std output.
Due to debugging on our end this can be only a few lines but also up to 3 MB.
Some tasks return these 3 MB within 2 hours some others generate these 3MB within 40 seconds.

I guess that the server can not cope with too many data coming in too fast.

Is this a know issue?
Could you verify that this could be the problem?

It would be great if this problem could be fixed as we would like to have the possibility to see the std out response within visualcron task output fields.


Support
2010-01-06T11:22:49Z
Thanks for the report,

we will try reproducing with same amount of data and get back to you.
Henrik
Support
http://www.visualcron.com 
Please like  VisualCron on facebook!
Support
2010-01-07T21:03:31Z
1. How often do you run your Task?
2. Is it possible to provide a test account for us which we can test to reproduce this?

We have done some further cleanup in the SSH Task. Please test this version:

http://www.visualcron.co....aspx?g=posts&t=1022 
Henrik
Support
http://www.visualcron.com 
Please like  VisualCron on facebook!
dchris
2010-01-08T17:58:45Z
We have about 100 tasks running, some of them every 10 minutes some every day.

The ones with the big amount of data where running about once every hour.

We had removed the std output of the tasks that seemed to cause the problem.

It looked again as if the problem was gone but after some time I noticed the same problem and this time it was not caused by a task with many output.

Al least it has not only to do with the amount of std output, maybe that is just one of the things which causes to let this happen more often.

Before we reduced the output of ssh whe had this problem every 2 hours now it happend once a day.

We have installed 5.5.0 beta and will let you know if the problem will still occur.

....

Update:

Version 5.5.0 has not helped.


After 1 hour we had the same problem again.
This time I see two processes running in the seperate "Running Jobs" window for more than an hour that usuallly finish within seconds.
It seemed as if they are still running on the server somehow but the VC scheduler has started and finished both in the meantime. ( executing is only allowed in the process settings if process is not running yet)

see "026 fti-pnrmailer-glo-1zx" and "026 ostSorter-glo-1zx-16"
(see attachment)

The jobs even still show as running in the Running Jobs Window after I restarted the service.

regards

File Attachment(s):
visualcron_client_547.jpg (307kb) downloaded 41 time(s).
Support
2010-01-09T15:19:01Z
We will let it run every minute for a couple of hours. In your case it seems that the SSH server does not disconnect - can you try adding a timeout on each Job?

You say that you have 100 Tasks but is it the same Tasks that keep hanging or could it be different?
Henrik
Support
http://www.visualcron.com 
Please like  VisualCron on facebook!
Support
2010-01-10T11:59:08Z
We had a Task running every minute (downloading 3 mb each time) for 4 hours. This did not affect memory, processor usage or left any leaks behind. Either it is not related to SSH at all - or something special is happening against your servers. Is it possible to test this against your servers (from here)?
Henrik
Support
http://www.visualcron.com 
Please like  VisualCron on facebook!
dchris
2010-01-11T17:36:56Z
I could not really find the one task that caused the load to go up.
It rather seems to be some special constellation that causes this to happen.

Maybe you could try to run not one task every minute but maybe 5 tasks with different or random start times every 45 seconds to 1 minute. Maybe the problem occurs if more than 1 SSH task is running at a time.
The processor load looks like the server is running in an endless loop.

Maybe you could try to disconnect the network just when a process is running. I noticed that we had a massive occurence of the problem when one of the servers we are adressing by ssh had a diskspace problem and ths ssh tasks finished with failure messages.

It would be very difficult to give you acccess to our proceses as each of the processes is running on life data and affects our customers.
Unfortunately we do not have a test environment for these processes.

Is there maybe a switch to set the debugging to a higher level?

dchris
2010-01-13T11:11:13Z
Here some more information regarding our own investigation of the problem:

Please see the attached files.

The screenshot shows that the memory usage is 155MB.
The processor load of th VC Server is 50% wich means 100% of one core ((we are running on a dualcore) is blocked by VC.


The logfile output stopped right at 10:12:16 just while it was writing one more line to the logfile. Later I saw that for some reason the log output seemd to be postponed for some minutes as the logfile showed all lines after restart.


I see many attempts in the log to reduce the memory consumption before, but it seemed to stay around 155MB even when the logiles always says after cleanup only 2-7MB are in usage.
I know that the memory is controlled by .NET which could be a reason for this. This post is supposed to be just one more hint to find out what causes the processor load to go up and stop VC to run the Jobs.

I also noticed the logfile shows a sudden jump regarding memory usage within 5 minutes:

13.01.2010 09:19:10 Debug Saving output files and clearing output memory.
13.01.2010 09:19:10 Debug Memory usage before cleanup: 32718848
13.01.2010 09:19:10 Debug Memory usage after cleanup: 2703360
13.01.2010 09:19:10 Debug Number of running Threads: 65

....

13.01.2010 09:24:10 Debug Saving output files and clearing output memory.
13.01.2010 09:24:10 Debug Memory usage before cleanup: 167104512
13.01.2010 09:24:10 Debug Memory usage after cleanup: 2826240
13.01.2010 09:24:10 Debug Number of running Threads: 65

The only solution to fix the high processor load was to restart the service.

After the restart of VC the memory usage in the task manger shows only 23MB and seems to be stable.
File Attachment(s):
visualcron_memory.jpg (414kb) downloaded 35 time(s).
Support
2010-01-13T11:49:13Z
Yes, memory handling in .NET is a little special. We have done some tests with disconnects and many SSH command running at the same time. But we have not reproduced the error yet.

I am not 100% that this error is related to SSH Task.

What happens if you try to stop a running Job (that should have finished a while ago)?
Henrik
Support
http://www.visualcron.com 
Please like  VisualCron on facebook!
Support
2010-01-13T11:52:18Z
It seems that you use a File Trigger in polling mode. Any reason for using polling method (Is this a Samba share?).
Henrik
Support
http://www.visualcron.com 
Please like  VisualCron on facebook!
dchris
2010-01-13T12:42:54Z
We have recently added a file trigger in polling mode as the file is located on a linux server and we are running a samba connection there.

The problem should not be related to this as it occured many times before we had that trigger added.

For some time we have no jobs showing as running that in fact have already finished in the regular Job Grid. (status column)
I assume that one of the last updates has fixed this.

But as mentioned in a post before we noticed that the seperate "running jobs" window shows jobs running that have finished minutes or hours ago and that show up as finished in the regular grid. See screenshots in post from 8 jan 17:58
If I use the stop button there nothing is happening.

Support
2010-01-13T13:13:11Z
Did you try to stop Job recently? It was fixed in the latest version we sent to you.
Henrik
Support
http://www.visualcron.com 
Please like  VisualCron on facebook!
dchris
2010-01-13T16:30:38Z
We just had the problem again and after restart of the server I checked the running jobs.

It looks like the ghost jobs that only show as running in the running jobs window where created when I killed the service for restart of the server.
The server was restarted at 15:40 and the jobs #464 #465 #469 are all marked to have started that time. Could it be some kind of race condition when two ormore ssh jobs start the same second?

Maybe these jobs where the ones that caused the processor load problem initially.
Not shure about that as I can not see any hints in the server log regarding this.

As you can see in the screenshot the jobs show up as completed normally and had been started in the meantime.

By the way using the "Stop" button did not help. The ghost jobs kept running in the "running jobs" window.

Server is running on 5.5.0 and the cleint 5.4.7.
Was this fixed on server end or client end?
dchris attached the following image(s):
dchris
2010-01-16T17:50:36Z
I think we are getting closer:

I recently installed temporarily a second server to have the possibility of moving jobs on two different instances.
Idea was to find out if a special job causes the processor load issue.

I found out that we do have the problem on both servers now. Both run SSH Jobs but they are all adressing different servers and call different tasks. Therefore it seems to be not a special Job that causes the behavior.

On one of them I got a out of memory issue today. Please see attached logfile with the according message 15:45 to send you the OOM Report and trace information.

It looked like the system tried alreay to send the log to visualcron.com. The restart that was triggered due to the server log seemed to fail as the log stops a few lines later and starts again 1 hour later when I started the service manually.

I hope this helps you to find out where in the code the memory leak comes from and this will solve the problem we have all the time.

By The way I downloaded Beta 5.5.0 but the server info and also the logfile still shows
Server version: 5.4.9.37386 - (2009-12-22)
Is this because of the Beta status or went somethin wrong with the update.
I uninstalled the server 5.4.9 before installing 5.5.0
but still it shows server version 5.4.9
Support
2010-01-17T23:31:26Z
Thanks for your feedback. We looked at the log file. The error is probably not the reason for the problem but rather a result of it.

The only thing that looks strange in your logs is these lines:

SSHClient (OnError) OnData = Error 106 (1683)

These appear multiple times. What does all these OnError lines mean to you. Sometimes they are more verbose.

It would be interesting, on your new Server, if you could add one Job at a time and see which one of the affect memory.

I maybe asked this before, but are you only using SSH Tasks on your new server?
Henrik
Support
http://www.visualcron.com 
Please like  VisualCron on facebook!
Support
2010-01-18T11:32:43Z
Another idea is to add a Performance counter on the process VisualCron service (cpu utilization). Then we would know exactly when cpu utilization is increased and maybe can do some matching against the log file.
Henrik
Support
http://www.visualcron.com 
Please like  VisualCron on facebook!
dchris
2010-01-18T15:33:26Z
Do you mean these DEBUG entries like the ones below?

The data after the "=" is line by line what the remotely called ssh command produces via the std output. It is the data that is shown in the first response column of the task grid in the visualcron client.
For me it looked like Visualcron has some functionality that can be set to analyse the output in order to trigger taks depending on certain patterns.

e.g.:


13.01.2010 07:10:07 Debug SSH (OnError) OnData = running launcher.xml with target run-OstScheduleChange-GLO-57NC
(921)
13.01.2010 07:10:07 Debug SSH (OnError) OnData = running launcher.xml with target run-OstScheduleChange-GLO-501D
(924)
13.01.2010 07:10:07 Debug SSH (OnError) OnData = running launcher.xml with target run-OstPnrMailer-GLO-1ZX
(923)
13.01.2010 07:10:09 Debug SSH (OnError) OnData = Runing FTI Schedulechange Automation... (924)
13.01.2010 07:10:09 Debug SSH (OnError) OnData =
(924)
13.01.2010 07:10:09 Debug SSH (OnError) OnData = Runing FTI Schedulechange Automation... (921)
13.01.2010 07:10:09 Debug SSH (OnError) OnData =
(921)
13.01.2010 07:10:09 Debug SSH (OnError) OnData = Runing Ostteam PNR Mailer Process... (923)
13.01.2010 07:10:09 Debug SSH (OnError) OnData =
(923)
13.01.2010 07:10:11 Debug SSH (OnError) OnData = INFO [OstScheduleChangeJob]: Ostteam Schedule Change Job success:
result info:
success: true
protocol: [process number: 86587]. 86587 --- PNRs on 57NC Q22CD count:1

(921)
13.01.2010 07:10:11 Debug SSH (OnError) OnData = INFO [OstScheduleChangeJob]: Ostteam Schedule Change Job success:
result info:
success: true
protocol: [process number: 86586]. 86586 --- PNRs on 501D Q22CD count:1

(924)
13.01.2010 07:10:11 Debug SSH (OnError) OnData = INFO [OstPnrMailerJob]: Ostteam PnrMailer Job success:
result info:
success: true
protocol: 86588

(923)


we do not need to analyse the output. If this check can be switched off it might help to solve the problem.



Regarding the performance counter:

I had this idea already some weeks ago.

One of the task in the logfile is the heartbeat task.

This job runs once a minute and calls a ssh task on a romte server which simply touches a file there.

This job takes exactly 5,5 seconds if everything is fine. Whenever it takes longer: e.g. 12 seconds we run a second task to restart the server.

This was the only way we still could use visualcron even with this problem.

Unfortunately it did not give me a clear hint om wich task was the reason for the processor load as it where always different tasks that were called right before the load went up.

We are running mainly ssh tasks, there is one file trigger task but this was added when we had the load problem already and there is one backup task to backup the VC settings daily.


One more thing regarding memory consumption:

We notice that the windwos task manager shows us a constant grow of the memory in use.

e.g.: When we restart the VC service the overall memory usage is 450MB and 6 hours later it is 1,35GB. Restarting VC servie brings it down to 450MB again. So the growing memory usage is definately related to the VC service.
Support
2010-01-18T19:38:18Z
I believe that the CPU problem is caused by the memory problem. One way is to track memory and cpu using the built in performance counter in Windows. Since we do not have access to your server (and cannot reproduce it here) we need help tracking down which Task that causes this problem, and maybe then, we can try to mimic this Task here.

We have done tests here and we know that we have a lot of other users that are running the SSH Task without problems so there must be something specific here.

One idea is that one SSH Task is not exited in a correct way and that this Task generates output in the background. Do you have a Task that keeps outputting data all the time or do every Task have a fixed output (just output and then stop)?
Henrik
Support
http://www.visualcron.com 
Please like  VisualCron on facebook!
dchris
2010-01-18T21:50:19Z
As we do have VNC running on the server that also runs Visualcron I could arrange that you get access to the server and see yourself what is going on.

Would that help somehow?

I added two performace counter for the visualcron service.
One for processor time used.
As we are running on a dualcore server the max. proc time visualcron can reach as we notices in the windwos task manager is 50%. The trigger is set to write a line to a logfile whenever the performance counter measure a proc usage of more then 40% for more than 20 seconds.

The other one for memory used in swap file:
This one will write a line to the logfile if more than 150MB of swapfile is used by the visualcron service for more than 1 minute

We will have this running now for some time and finally can send you the logfile as well als the new performance trigger file.

All our tasks are finishing with an exit code and the visualcron client grid shows that they finished.
As posted some time ago we also had the feeling that a lot of ssh std output could cause the problem.
Your response was that you have tested this for some hours with many ssh tasks and response amount of 3MB and could not reproduce the problem.
If you got back 3MB of response in ssh std out you should have seen the debug lines in your server log file as well otherwise I wonder why only our version is producing "debug ssh (onerr) =" lines.
Scroll to Top