Akamit Blog

Enterprise sysadmin's blog

Monitoring server overheat with ipmitool

Posted on June 22nd, 2010

Recently one of our Sun x4600 servers shut down unexpectedly during the night. Investigation revealed that the air conditioning had failed and the machine powered itself off. I dug into the logs and found that the shutdown was initiated by the server’s system controller. An excerpt from the log follows:

System ACPI Power State : sys.acpi : S5/G2: soft-off
Power Supply : ps1.pwrok : State Deasserted
Power Supply : ps3.pwrok : State Deasserted
Power Supply : ps2.pwrok : State Deasserted
Temperature : p2.t_amb : Upper Non-recoverable going high : reading 46 > threshold 45 degrees C
Hot removal of /SYS/PS0
Entity Presence : ps0.prsnt : Device Absent
Processor : p0.cardfail : State Asserted
Temperature : p0.t_amb : Upper Critical going high : reading 39 > threshold 38 degrees C
Processor : p3.cardfail : State Asserted
Temperature : p3.t_amb : Upper Critical going high : reading 39 > threshold 38 degrees C
Processor : p1.cardfail : State Asserted
Temperature : p1.t_amb : Upper Critical going high : reading 39 > threshold 38 degrees C
Processor : p2.cardfail : State Asserted
Temperature : p2.t_amb : Upper Critical going high : reading 39 > threshold 38 degrees C

This Nagios plugin for monitoring server overheat with IPMI will alert me to this kind of event in the future and help prevent unexpected downtime.
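To illustrate the idea (this is not the plugin linked above), here is a minimal Python sketch that scans text in the format of the log excerpt and turns temperature threshold events into a Nagios-style exit code. The function name `check_overheat` and the severity mapping are assumptions made for this example; how the log text is fetched from the service processor is left out.

```python
import re

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL = 0, 1, 2

# Matches lines in the format of the excerpt above, for example:
#   Temperature : p2.t_amb : Upper Critical going high : reading 39 > threshold 38 degrees C
EVENT_RE = re.compile(
    r"^Temperature\s*:\s*(?P<sensor>\S+)\s*:\s*(?P<event>Upper [\w-]+) going high"
    r"\s*:\s*reading (?P<reading>\d+) > threshold (?P<threshold>\d+)"
)

# Assumed mapping from event severity to Nagios state (hypothetical policy).
SEVERITY = {
    "Upper Non-critical": WARNING,
    "Upper Critical": CRITICAL,
    "Upper Non-recoverable": CRITICAL,
}


def check_overheat(log_text):
    """Return (exit_code, message) for the worst temperature event found."""
    worst, details = OK, []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if not match:
            continue  # not a temperature threshold event
        # Unknown "Upper ..." event names are treated as WARNING.
        state = SEVERITY.get(match["event"], WARNING)
        worst = max(worst, state)
        details.append(
            f"{match['sensor']}={match['reading']}C (limit {match['threshold']}C)"
        )
    if worst == OK:
        return OK, "TEMP OK - no threshold events"
    label = "CRITICAL" if worst == CRITICAL else "WARNING"
    return worst, f"TEMP {label} - " + ", ".join(details)
```

A real plugin would print the message and exit with the returned code, so that Nagios can raise the alert before the system controller powers the machine off.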

Filed under Plugins, Sun Hardware | No Comments »

Monitoring HDS AMS storage with nagios using SNMP protocol

Posted on June 16th, 2010

We are going to get rid of the proprietary HDS Hi-Track software and use the industry-standard SNMP protocol to monitor Hitachi midrange storage systems. The NMS of choice is Nagios. In general there are two ways to monitor over SNMP, active polling and passive trap handling, and Nagios supports both. First you are required to configure your storage system’s SNMP agent, specifying the SNMP community and the SNMP trap destination. When done, check that it works by running the check_snmp plugin.

# ./check_snmp -H ams-ctl0 -C public -o sysDescr.0 -P 1
SNMP OK - HITACHI  DF600F           Ver 0781/A-M |

OK, the SNMP agent configuration is done, so let's configure the SNMP manager, i.e. Nagios.

1. Active monitoring, using a plugin that issues an SNMP GET request for some OID. Here is a small Hitachi AMS storage monitoring Nagios plugin which will do the task. It is ready to run under the Nagios embedded Perl interpreter (ePN).

2. Passive monitoring, using SNMP trap handling.
First, we need to install the snmptrapd daemon from the Net-SNMP project and point it at the program that will handle all incoming traps. We chose to run snmptt on every trap event. The snmptrapd configuration looks like this:

traphandle default /usr/sbin/snmptt
disableAuthorization yes
donotlogtraps  yes

Next, we configure snmptt itself so that it knows what to do when a trap arrives. Open the snmptt.ini config file and create a [TrapFiles] section:

[TrapFiles]
snmptt_conf_files = <<END
/usr/local/etc/snmptt.conf.AMS500
END

Next, create snmptt.conf.AMS500 by running the snmpttconvertmib tool on dfraid.mib, which resides on the AMS500 SNMP CD.

# export PATH=$PATH:/usr/local/bin
# snmpttconvertmib --in=dfraid.mib \
> --out=/usr/local/etc/snmptt.conf.AMS500 \
> --exec='/usr/local/nagios/libexec/eventhandlers/submit_check_result $r TRAP 1'

snmpttconvertmib calls snmptranslate from the Net-SNMP package without a full path, so make sure your PATH includes the directory in which snmptranslate resides. Next, we define the TRAP service.
Nagios templates.cfg:

define service {
   name                    snmptrap
   use                     generic-service
   register                0
   service_description     TRAP
   is_volatile             1
   max_check_attempts      1
   normal_check_interval   1
   retry_check_interval    1
   passive_checks_enabled  1
   check_period            none
   check_command           check-host-alive
   notification_interval   31536000
}

And we use this template when defining actual services like this:

define service {
   use                     snmptrap,alltime_sms
   host_name               amsctl0
}

alltime_sms is a template that defines contact groups with SMS targets in them.

Summary: the storage system sends a trap, the snmptrapd daemon handles it by calling the snmptt trap handler, and snmptt then calls the submit_check_result script to submit a passive check result to Nagios. Nagios dispatches this submission to the corresponding host's service and takes the appropriate action.
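The last hop of this chain can be sketched in Python. Nagios accepts passive results as external commands of the form `[timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output`, written to its command file. The helper names below and the command file path are assumptions for illustration (the path depends on how Nagios was installed); this is not the submit_check_result script itself.

```python
import time


def passive_result_line(host, service, return_code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    # A newline ends the command, so flatten the plugin output to one line.
    clean = output.replace("\n", " ")
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{clean}\n"
    )


def submit(line, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write the command to the Nagios command file (path is an assumption)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

With the configuration above, the host would be the name snmptt substitutes for `$r`, the service would be `TRAP`, and the return code `1` maps to WARNING, for example `passive_result_line("amsctl0", "TRAP", 1, "trap received")`.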

Filed under Nagios, Plugins, Storage | No Comments »

Solaris SVM metareplace within resyncing metadevice

Posted on June 15th, 2010

Our SVM metadevice is built from 6 disks as a 3 + 3 mirror (RAID 1+0). Then one day, shit happened: 2 disks went out of service. While trying to fix the situation, after metareplacing the first disk, we got stuck here:

# metastat d100
d100: Mirror
    Submirror 0: d101
      State: Needs maintenance 
    Submirror 1: d102
      State: Resyncing    
    Resync in progress: 14 % done
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 429803712 blocks (204 GB)
 
d101: Submirror of d100
    State: Needs maintenance 
    Size: 429803712 blocks (204 GB)
    Stripe 0: (interlace: 32 blocks)
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t1d0s0          0     No     Maintenance   Yes 
        c0t2d0s0      10176     No            Okay   Yes 
        c0t3d0s0      10176     No            Okay   Yes 
 
 
d102: Submirror of d100
    State: Resyncing    
    Size: 429803712 blocks (204 GB)
    Stripe 0: (interlace: 32 blocks)
        Device     Start Block  Dbase        State Reloc Hot Spare
        c1t5d0s0          0     No            Okay   Yes 
        c1t6d0s0      10176     No            Okay   Yes 
        c1t7d0s0      10176     No       Resyncing   Yes 
 
 
Device Relocation Information:
Device   Reloc  Device ID
c0t1d0   Yes    id1,ssd@w20000000871535bd
c0t2d0   Yes    id1,ssd@w20000004cf9b5527
c0t3d0   Yes    id1,ssd@w2000000087140df1
c1t5d0   Yes    id1,ssd@w500000e010796100
c1t6d0   Yes    id1,ssd@w2000000c506e16bb
c1t7d0   Yes    id1,ssd@w20000000871f1fa3

So far so good… now we try to bring the second disk back into the volume. But oops…

# metareplace -e  d100 c0t1d0s0
metareplace: srv01: d100: resync in progress

So it seems we need to wait for the first resync to complete?
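If waiting is indeed required, as the error message suggests, the wait can be scripted instead of watched. Below is a minimal Python sketch that reads metastat output in the format shown above and extracts the resync progress and the slices in a given state. The function names are assumptions for this example, and running `metastat` and `metareplace` is left to the caller.

```python
import re

RESYNC_RE = re.compile(r"Resync in progress:\s*(\d+)\s*% done")
SLICE_RE = re.compile(r"c\d+t\d+d\d+s\d+")


def resync_percent(metastat_output):
    """Return the resync percentage, or None if no resync is reported."""
    match = RESYNC_RE.search(metastat_output)
    return int(match.group(1)) if match else None


def components_in_state(metastat_output, state):
    """List the slices whose State column equals `state`, e.g. 'Maintenance'."""
    found = []
    for line in metastat_output.splitlines():
        fields = line.split()
        # Component rows look like: c0t1d0s0  0  No  Maintenance  Yes
        if len(fields) >= 4 and SLICE_RE.fullmatch(fields[0]) and fields[3] == state:
            found.append(fields[0])
    return found
```

A wrapper could call `metastat d100` periodically until `resync_percent` returns None, and only then run `metareplace -e` for each slice returned by `components_in_state(output, "Maintenance")`.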

Filed under Solaris | 2 Comments »