On 07/12/2013 10:28 AM, Brenda J. Butler wrote:
> I don't know oswatcher, but based on your description the following
> would be useful for you:
>
> munin (keeps a constant-sized database, which thins out as you look
> back in time).

I took a 10-second look and it looks like overkill, but I will look at
it more.

> nagios

Definitely overkill. I'm using nagios for other things, but what I'm
after is not monitoring so much as a tool to use after the monitoring
has alerted that something is bad. At that point I want to know what led
up to all the memory being used up, or which process consumed all the
CPU/IO, since once the alert happens it often gets resolved with a big
shotgun like a reboot (like when they accidentally started 40 instances
of a java app on a server designed for 4) and we are left to tell what
happened without logs.

On 07/12/2013 01:36 PM, Jeffrey Moncrieff wrote:
> You can also try zenoss.

Will check on that later.

> In both cases, if there is some test they don't already do, you can
> write your own and have them use it.

Well, google did find https://github.com/stephenlang/scrutiny and that's
about the closest I've seen to what I'm looking for, but it's a bit too
basic. Since after all there's not that much to it, I started writing
something that I will try out over the weekend. I know one challenge
will be actually collecting anything when the system is crawling, but
anything is better than what we have now, which is nothing (besides
1-minute sar data, which tends to stop before the system dies).

/ps