
Re: [OCLUG-Tech] oswatcher alternative, collector of top/ps/iostat/vmstat/... info

  • Subject: Re: [OCLUG-Tech] oswatcher alternative, collector of top/ps/iostat/vmstat/... info
  • From: "Peter Sjöberg" <lpaseen [ at ] gmail [ dot ] com>
  • Date: Sun, 14 Jul 2013 10:46:10 -0400
On 07/13/2013 10:55 PM, Brenda J. Butler wrote:
> 
> I'm curious why nagios/munin are overkill.  I think they exactly match
> your requirements.
My requirement is not monitoring; that is handled in a different way.
My problem is that something happened and I need to find out what and
why. While nagios can alert that the load is high on a server, it wouldn't
say exactly why, and by the time I get to the system the cause may be gone.
Having a program that collects the output of "top -b -c -n 2 -i" and
other similar commands to a file every minute would help me see what was
going on, and that's about all I need.
Besides that, there are other issues with nagios in our env. Nagios is a
centralized app with a web interface. There's no way we can install nagios
locally on every server, and network-wise no single server can
talk to all of them. On top of that, we already have other ways to monitor
production servers.
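
To make the idea concrete, here is a minimal sketch in Python of that kind of collector. It is only an illustration under assumptions, not the actual tool: the function name snapshot, the "collect-YYYYMMDD.log" file naming, and the 30 second timeout are all made up for the example. Only the "top -b -c -n 2 -i" command line comes from the description above.

```python
import subprocess
from datetime import datetime
from pathlib import Path


def snapshot(commands, log_dir, timeout=30):
    """Run each command once and append its output to a per-day log file.

    The file naming and the timeout value are assumptions for this sketch.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    now = datetime.now()
    log_file = log_dir / f"collect-{now:%Y%m%d}.log"
    with log_file.open("a") as out:
        for cmd in commands:
            # A greppable header line so each sample can be located later.
            out.write(f"=== {now:%Y-%m-%d %H:%M:%S} {' '.join(cmd)}\n")
            try:
                result = subprocess.run(
                    cmd, capture_output=True, text=True, timeout=timeout
                )
                out.write(result.stdout)
            except (OSError, subprocess.TimeoutExpired) as exc:
                # On a crawling system a command may hang or fail to start;
                # record that and move on instead of blocking the collector.
                out.write(f"(failed: {exc})\n")
    return log_file
```

Calling snapshot([["top", "-b", "-c", "-n", "2", "-i"]], "/var/tmp/collect") once a minute, from cron or from a loop with a sleep, would give one growing file per day that can be searched with grep afterwards. The directory path here is hypothetical.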

> Scheduling the tests and keeping track of the result in a scalable way
> can be a bit complicated - the actual tests are basically plugins.
> nagios and munin come with a few built-in tests (basically, the ones
> you want to see) and the rest are plugins, probably in separate
> packages.
I'm using nagios+nrpe in the lab to keep an eye on some non-prod servers for
ourselves, and have even written some small plugins to add monitoring of some
in-house apps.

> It's a bit annoying to learn nagios config language though, I have to
> admit.
I have managed to figure out some of it but it takes a while to get used to.
> Munin is way less complicated
Looked a little more at it and while it's not for the original issue I
think I will implement it at home.
>, but the thinning of data as
> time goes by annoys me.  Then again, it was one of your requirements.
Actually, I simplified it and just drop it altogether after 2 days.
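
A "drop everything after 2 days" rule can be sketched in a few lines of Python. This is an assumption about how such a cleanup could be done, not a description of the real tool; the function name prune and the use of file modification time as the age are choices made for the example.

```python
import time
from pathlib import Path


def prune(log_dir, max_age_days=2, now=None):
    """Delete regular files in log_dir last modified more than max_age_days ago.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for path in Path(log_dir).iterdir():
        # Age is judged by modification time, which is an assumption here.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Running this once per collection cycle keeps the disk usage bounded without any thinning logic.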
> The graphs are a bonus. 
Had some fun trying to interpret graphs from some collected SAR data
from a server that had something like 300 SAN LUNs over shared
paths, so every "disk" showed up 3 times, and the output graphs were one
per page in a huge PDF file. At that point "grep" on the raw data was so much
easier to handle.
> You don't have to look at them if you don't
> want to.
> 
> I haven't looked at zenoss, but will keep an eye open for it.
https://github.com/lpaseen/nyss
> 
> bjb
> 
> 
> On Fri, Jul 12, 2013 at 11:49:23PM -0400, Peter Sjöberg wrote:
>> On 07/12/2013 10:28 AM, Brenda J. Butler wrote:
>>>
>>>
>>> I don't know oswatcher, but based on your description the following
>>> would be useful for you:
>>>
>>>
>>> munin (keeps a constant sized database, which thins out as you look back
>>> in time).
>> 10sec look and it looks like overkill but I will look at it more.
>>
>>>
>>> nagios
>> Definitely overkill. I'm using nagios for other things, but what I'm after is
>> not monitoring so much as a tool to use after the monitoring has alerted
>> that something is bad. At that point I want to know what led up to
>> all memory being used up, or which process consumed all the cpu/io, since
>> once the alert happens it often gets resolved with a big shotgun
>> like a reboot (like when they accidentally started 40 instances of a
>> java app on a server designed for 4) and we are left to tell what
>> happened without logs.
>>
>>
>> On 07/12/2013 01:36 PM, Jeffrey Moncrieff wrote:
>>> You can also try zenoss.
>>>
>> Will check on that later
>>
>>>
>>> In both cases, if there is some test they don't already do, you can
>>> write your own and have them use it.
>>>
>> Well, google did find https://github.com/stephenlang/scrutiny and that's
>> about the closest I've seen to what I'm looking for, but it's a bit too basic.
>>
>> Since after all there's not that much to it, I started writing something
>> that I will try out over the weekend. I know one challenge will be
>> actually collecting anything when the system is crawling, but
>> anything is better than what we have now, which is nothing (besides 1
>> minute sar data, which tends to stop before the system dies).
>>
>> /ps
>>
> 
> 
> 
>> _______________________________________________
>> Linux mailing list
>> Linux [ at ] lists [ dot ] oclug [ dot ] on [ dot ] ca
>> http://oclug.on.ca/mailman/listinfo/linux
> 
> ---end quoted text---
> 

