Channel: THWACK: All Content - Server & Application Monitor

SAM Polling Frequency/Timeout vs Node Polling Intervals - Best Practices?


Hey everyone,

 

I'm reaching out to the community in hopes that somebody can point us in the right direction and possibly provide a "best practice" on SAM application polling frequency vs. node polling intervals, and how the two tie together.  We've been running Orion for about 6 months now, and I'm still not sure we have a handle on the best way to set up our application and node polling intervals so that the alert manager notifies us in a reasonable amount of time (that's a mouthful).  To give everyone a quick overview, we currently monitor 523 nodes.  Our original design was based on rating the criticality of each application (or group of applications) and setting the polling frequency for the critical ones as low as we could (60 seconds).  The idea was that we'd be notified within 1 minute that, say, our SQL Server services were stopped on a production server and could take immediate action.  For comparison, we're currently migrating away from a different monitoring solution that polls every 15 seconds, which is why we want that instant satisfaction of knowing immediately.

 

As I started configuring the templates with individual/group checks and setting the polling frequency that low, I noticed our SAM application polling rate skyrocket to 89%, and we're only 75% done re-creating the templates/checks from our existing product.  Seeing how powerful Orion is, I'd like to expand those checks well beyond what the other application is capable of.  The "Learn More" help section suggests dropping the polling interval of the nodes themselves to something less frequent, but that doesn't seem to apply to our situation.  Before I go stand up another polling engine, split all the nodes between the two engines, and restructure our environment: is anyone else doing their monitoring this way, or are we thinking about this backwards?
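To put some rough numbers on why the polling rate climbs so fast, here's a back-of-envelope sketch.  The 523-node count is ours; the components-per-node figure is a made-up assumption (SolarWinds doesn't publish the exact capacity formula behind the percentage), so treat this as illustration only:

```python
# Back-of-envelope polling load estimate.
# 523 is our real node count; everything else is an assumed/illustrative number,
# NOT SolarWinds' actual capacity math.
NODES = 523
COMPONENTS_PER_NODE = 10     # hypothetical average component monitors per node
INTERVAL_SECONDS = 60        # the aggressive frequency described above

load_60s = NODES * COMPONENTS_PER_NODE / INTERVAL_SECONDS
load_120s = NODES * COMPONENTS_PER_NODE / 120  # doubling the interval halves the load

print(f"~{load_60s:.1f} component polls/sec at 60 s")
print(f"~{load_120s:.1f} component polls/sec at 120 s")
```

The point being: the sustained poll rate scales linearly with both component count and 1/interval, so relaxing the interval on less-critical apps buys back headroom quickly.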

 

Would somebody be able to explain the "flow" of how an alert is triggered?  My working theory is that the application is scanned based on the template's polling frequency (60 seconds, for example); if the service is down, the condition evaluates as false, and the alert manager takes over and goes through its process of emailing us, etc.  But I can't find anything that confirms this theory, and honestly I haven't found any other forum posts where people describe how their environments are set up so I can compare.
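Just to make my working theory concrete, the sequence I'm imagining looks roughly like the sketch below.  This is purely illustrative pseudocode of my mental model (poll on the template's interval, evaluate the condition, hand off to the alert manager), not SolarWinds' actual implementation; the function names are mine:

```python
# Hypothetical sketch of my understanding of the poll -> condition -> alert flow.
# None of these names come from SolarWinds -- this is just my mental model.

def component_is_up(get_status):
    """One component poll: the check passes while the service is 'Running'."""
    return get_status() == "Running"

def poll_and_alert(get_status, send_email, polls):
    """Simulate a few polling cycles at the template's frequency.

    On the first poll where the condition is met (service down),
    the 'alert manager' step fires the notification action.
    """
    for _ in range(polls):
        if not component_is_up(get_status):
            send_email("ALERT: SQL Server service is down")  # alert manager action
            return "triggered"
    return "ok"

# Example: service reports 'Stopped', so the very first poll triggers the alert.
sent = []
result = poll_and_alert(lambda: "Stopped", sent.append, polls=3)
print(result, sent)
```

If that's accurate, then the worst-case time-to-email is roughly one polling interval plus whatever the alert manager adds, which is why I was pushing the frequency down to 60 seconds in the first place.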

 

We're looking at attending a week-long training course on Orion where I might get some of these questions answered, but I was hoping to hear from people who run this setup in a production environment rather than a lab.  I'm open to ideas/thoughts/insight on these issues, and also open to seeing how other people have structured their environments so I have a basis for comparison.  As I said before, we're after the instant gratification of knowing when something is down so our team can take immediate action before we get complaints from end users.

 

Thanks   

