We are not large - we have just 8 VisualCron servers running.
Between them they hold over 400 active and 250 inactive/manual jobs across the PRD/QA/DEV environments. PRD runs 24 hours a day. Reliability is good, as we are on SSD-backed VMs with Veeam migration/backup. We plan to improve this a lot by the end of Q1 '19.

We regularly run processing on one system that triggers processing on one or more other servers, then split or aggregate the results onto yet another. VisualCron is the single point of failure (SPOF). When the protocol version jumps, or a client cannot communicate with all the environments, we have to stand up a VDI for each version so that we can manage each of the three environments, which are often orchestrating three to four subordinate environments. Jobs sometimes contain more than 40 tasks and often call segments of other jobs as modules.

For compliance reasons we need to configure our PRD and QA environments for HA/LB clustering, and we would appreciate any technical conversation to that end, so that we can eliminate the single points of failure we have already run into.

This is not a whine - we know we are blessed - we need an adult conversation with experts who have been there.

Sorry for the late reply. It seems this post was missed for some reason. I think we should continue the discussion by email, as we do have some general plans for failover. Please email us.
