Monitoring server overheat with ipmitool
Posted on June 22nd, 2010
Recently I was faced with a sudden overnight shutdown of a Sun x4600 server. Investigation revealed that the air conditioning had failed and the machine had shut itself down. I dug into the logs and found that the shutdown was initiated by the server's system controller. An excerpt from the log follows:
System ACPI Power State : sys.acpi : S5/G2: soft-off
Power Supply : ps1.pwrok : State Deasserted
Power Supply : ps3.pwrok : State Deasserted
Power Supply : ps2.pwrok : State Deasserted
Temperature : p2.t_amb : Upper Non-recoverable going high : reading 46 > threshold 45 degrees C
Hot removal of /SYS/PS0
Entity Presence : ps0.prsnt : Device Absent
Processor : p0.cardfail : State Asserted
Temperature : p0.t_amb : Upper Critical going high : reading 39 > threshold 38 degrees C
Processor : p3.cardfail : State Asserted
Temperature : p3.t_amb : Upper Critical going high : reading 39 > threshold 38 degrees C
Processor : p1.cardfail : State Asserted
Temperature : p1.t_amb : Upper Critical going high : reading 39 > threshold 38 degrees C
Processor : p2.cardfail : State Asserted
Temperature : p2.t_amb : Upper Critical going high : reading 39 > threshold 38 degrees C
This Nagios plugin for monitoring server overheating via IPMI will keep me informed about this kind of event in the future and help avoid unexpected downtime.
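To illustrate the idea (this is not the plugin itself), here is a minimal Python sketch that scans event lines in the format of the log excerpt above and maps them to Nagios exit codes. The function name `check_overheat` and the severity mapping are my own assumptions, and the "Non-critical" level is a standard IPMI threshold name that does not appear in the excerpt.

```python
import re

# Matches threshold events in the form shown in the log excerpt, e.g.
# "Temperature : p2.t_amb : Upper Critical going high : reading 39 > threshold 38 degrees C"
EVENT_RE = re.compile(
    r"Temperature\s*:\s*(?P<sensor>\S+)\s*:\s*"
    r"Upper\s+(?P<level>Non-recoverable|Critical|Non-critical)\s+going high\s*:\s*"
    r"reading\s+(?P<reading>\d+)\s*>\s*threshold\s+(?P<threshold>\d+)"
)

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Assumed mapping from threshold level to Nagios state (hypothetical choice).
SEVERITY = {
    "Non-critical": WARNING,
    "Critical": CRITICAL,
    "Non-recoverable": CRITICAL,
}


def check_overheat(log_lines):
    """Return (exit_code, message) for an iterable of event log lines."""
    status = OK
    events = []
    for line in log_lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue  # not a temperature threshold event
        status = max(status, SEVERITY[match.group("level")])
        events.append(
            "%s %sC > %sC" % (
                match.group("sensor"),
                match.group("reading"),
                match.group("threshold"),
            )
        )
    if status == OK:
        return OK, "TEMP OK - no threshold events"
    label = "CRITICAL" if status == CRITICAL else "WARNING"
    return status, "TEMP %s - %s" % (label, ", ".join(events))
```

A real plugin would read the events from the service processor (for example through ipmitool), print the message, and exit with the returned code.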
Monitoring HDS AMS storage with nagios using SNMP protocol
Posted on June 16th, 2010
We are going to get rid of the proprietary HDS Hi-Track software and use the industry-standard SNMP protocol to monitor Hitachi midrange storage systems. The NMS of choice will be Nagios. There are two general approaches to SNMP monitoring, and both are supported by Nagios. You are required to configure your storage system's SNMP agent and specify the SNMP community and the SNMP trap destination. When done, check that it is working by running the check_snmp plugin.
# ./check_snmp -H ams-ctl0 -C public -o sysDescr.0 -P 1
SNMP OK - HITACHI DF600F Ver 0781/A-M |
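If you want to script this sanity check, the plugin output above follows the usual Nagios convention of status text, then a pipe, then optional performance data. A minimal Python sketch for splitting such a line; the helper name `parse_plugin_output` is hypothetical:

```python
# Recognised Nagios state keywords.
STATES = ("OK", "WARNING", "CRITICAL", "UNKNOWN")


def parse_plugin_output(line):
    """Split 'SERVICE STATE - text | perfdata' into a small dict."""
    text, _, perfdata = line.partition("|")
    head, _, message = text.strip().partition(" - ")
    words = head.split()
    # The state keyword is expected to be the last word before " - ".
    status = words[-1] if words and words[-1] in STATES else "UNKNOWN"
    return {
        "status": status,
        "message": message.strip(),
        "perfdata": perfdata.strip(),
    }
```

For the output shown above this yields status OK with the agent's sysDescr string as the message.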
OK, you are done with the SNMP agent configuration; let's begin configuring the SNMP manager, i.e. Nagios.
1. Active monitoring, using a plugin that executes an SNMP GET request for some OID. Here is a small Hitachi AMS storage monitoring Nagios plugin which will do the task. It is ready to run with the Nagios embedded Perl interpreter (ePN).
2. Passive monitoring, using SNMP trap handling.
First, we need to install the NetSNMP project's snmptrapd daemon and point it to the program which will handle all incoming traps. We chose to run snmptt on every trap event. The snmptrapd configuration will look like:
traphandle default /usr/sbin/snmptt
disableAuthorization yes
donotlogtraps yes
Next, we configure snmptt itself to tell it what to do when it receives traps. Open the snmptt.ini config and create a [TrapFiles] section:
[TrapFiles]
snmptt_conf_files = <<END
/usr/local/etc/snmptt.conf.AMS500
END
Next, create snmptt.conf.AMS500 by running the snmpttconvertmib tool on dfraid.mib, which resides on the AMS500 SNMP CD.
# export PATH=$PATH:/usr/local/bin
# snmpttconvertmib --in=dfraid.mib \
> --out=/usr/local/etc/snmptt.conf.AMS500 \
> --exec='/usr/local/nagios/libexec/eventhandlers/submit_check_result $r TRAP 1'
snmpttconvertmib calls snmptranslate from the NetSNMP package without using its full path, so you should adjust your PATH to include the directory in which snmptranslate resides. Next, we define the TRAP service.
nagios templates.cfg:
define service {
name snmptrap
use generic-service
register 0
service_description TRAP
is_volatile 1
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
passive_checks_enabled 1
check_period none
check_command check-host-alive
notification_interval 31536000
}
And we use this template when defining actual services like this:
define service{
use snmptrap,alltime_sms
host_name amsctl0
}
alltime_sms is a template that defines contact groups containing SMS targets.
Summary: the storage system sends a trap, the snmptrapd daemon handles it by calling the snmptt trap handler, and snmptt then calls the submit_check_result script to submit a passive check result to Nagios. Nagios dispatches this submission to the corresponding host service and takes the appropriate action.
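For the last hop of that chain, a passive result reaches Nagios as a PROCESS_SERVICE_CHECK_RESULT external command written to the Nagios command file. Here is a minimal Python sketch of how such a line is formed; the function name `passive_result_line` and the sample output text are hypothetical, and the command file location depends on your installation.

```python
import time


def passive_result_line(host, service, return_code, output, now=None):
    """Build one Nagios external command line for a passive service check."""
    timestamp = int(time.time() if now is None else now)
    # The command occupies a single line, so flatten any embedded newlines.
    flat_output = " ".join(output.splitlines())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        timestamp, host, service, return_code, flat_output
    )


if __name__ == "__main__":
    # Return code 1 (WARNING) for the TRAP service on host amsctl0.
    print(passive_result_line("amsctl0", "TRAP", 1, "sample trap text"))
```

Appending the resulting line, followed by a newline, to the command file makes Nagios update the matching service. Because the template sets is_volatile to 1, every submission with a non-OK state is treated as a new event.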
Solaris SVM metareplace within resyncing metadevice
Posted on June 15th, 2010
Our SVM metadevice is built from 6 disks as a 3 + 3 mirror (RAID 1+0). Then one day trouble struck: two disks went out of service. So we set about fixing the situation. After metareplacing the first disk, we got stuck here:
# metastat d100
d100: Mirror
Submirror 0: d101
State: Needs maintenance
Submirror 1: d102
State: Resyncing
Resync in progress: 14 % done
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 429803712 blocks (204 GB)
d101: Submirror of d100
State: Needs maintenance
Size: 429803712 blocks (204 GB)
Stripe 0: (interlace: 32 blocks)
Device Start Block Dbase State Reloc Hot Spare
c0t1d0s0 0 No Maintenance Yes
c0t2d0s0 10176 No Okay Yes
c0t3d0s0 10176 No Okay Yes
d102: Submirror of d100
State: Resyncing
Size: 429803712 blocks (204 GB)
Stripe 0: (interlace: 32 blocks)
Device Start Block Dbase State Reloc Hot Spare
c1t5d0s0 0 No Okay Yes
c1t6d0s0 10176 No Okay Yes
c1t7d0s0 10176 No Resyncing Yes
Device Relocation Information:
Device Reloc Device ID
c0t1d0 Yes id1,ssd@w20000000871535bd
c0t2d0 Yes id1,ssd@w20000004cf9b5527
c0t3d0 Yes id1,ssd@w2000000087140df1
c1t5d0 Yes id1,ssd@w500000e010796100
c1t6d0 Yes id1,ssd@w2000000c506e16bb
c1t7d0 Yes id1,ssd@w20000000871f1fa3
So far so good… now trying to bring the second disk back into the volume. But oops…
# metareplace -e d100 c0t1d0s0
metareplace: srv01: d100: resync in progress
So it seems we need to wait for the first resync to complete?
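If waiting really is the only option, the wait can at least be scripted. Here is a minimal Python sketch that reads metastat output in the format shown above and reports the resync progress and the slices still in Maintenance state. Both function names are hypothetical, and the parsing relies only on the layout in this particular listing.

```python
import re

RESYNC_RE = re.compile(r"Resync in progress:\s*(\d+)\s*%\s*done")
SLICE_RE = re.compile(r"c\d+t\d+d\d+s\d+")


def resync_progress(metastat_output):
    """Return the resync percentage, or None if no resync is running."""
    match = RESYNC_RE.search(metastat_output)
    return int(match.group(1)) if match else None


def slices_in_maintenance(metastat_output):
    """Return slice names whose component state is 'Maintenance'."""
    found = []
    for line in metastat_output.splitlines():
        fields = line.split()
        # Component rows look like: c0t1d0s0 0 No Maintenance Yes
        if (len(fields) >= 4
                and SLICE_RE.fullmatch(fields[0])
                and fields[3] == "Maintenance"):
            found.append(fields[0])
    return found
```

A wrapper could poll `metastat d100` until resync_progress returns None and only then run `metareplace -e` for each slice reported by slices_in_maintenance.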