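Inspecting RAID arrays with MegaCli
1. Background
This wiki page is copied almost verbatim from this article; our company's environment is also very similar to the author's.
Our production environment has a large number of physical servers, used mainly for Hadoop clusters with demanding hardware requirements. These servers are typically fitted with a RAID card and 16 disks of at least 3 TB each. Because Hadoop workloads are I/O-intensive, disks frequently fail under the load, so regularly checking the health of the RAID disks is essential.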
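2. Configuration
The script works as follows:
- Use MegaCli64 to collect the counters for each abnormal state, typically Degraded, Offline, Critical, and Failed;
- Aggregate the abnormal states and extract the slot information of the faulty disks.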
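The two steps above can be sketched against canned text instead of a live controller. Both samples below are assumptions: the first is modelled on the "Device Present" summary of `MegaCli64 -AdpAllInfo -aAll`, the second on a fragment of `MegaCli64 -CfgDsply -aALL`, and real output varies with firmware and MegaCli version. The helpers `count_of` and `last_before` are hypothetical names used for illustration only; they are not part of MegaCli or of the script on this page.

```shell
#!/bin/bash
# Illustrative sample only: shaped like the "Device Present" summary of
# `MegaCli64 -AdpAllInfo -aAll`; real output depends on firmware and version.
adp_sample='Virtual Drives    : 2
  Degraded        : 0
  Offline         : 1
Physical Devices  : 18
  Disks           : 16
  Critical Disks  : 0
  Failed Disks    : 1'

# Illustrative fragment shaped like `MegaCli64 -CfgDsply -aALL` output.
cfg_sample='Adapter: 0
Span Reference: 0x02
Physical Disk: 0
Enclosure Device ID: 2
Slot Number: 1
Firmware state: Online, Spun Up
Physical Disk: 1
Enclosure Device ID: 2
Slot Number: 7
Firmware state: Failed'

# count_of TEXT LABEL (hypothetical helper):
# print the value after the colon on the first line whose label matches LABEL.
count_of() {
    printf '%s\n' "$1" | awk -F':' -v label="$2" '
        $1 ~ label { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# last_before TEXT LINE_NO KEY (hypothetical helper):
# print the value of the last "KEY: value" line at or above LINE_NO,
# i.e. the nearest preceding match when searching backwards from LINE_NO.
last_before() {
    printf '%s\n' "$1" | awk -F':' -v n="$2" -v key="$3" '
        NR <= n && index($0, key) { v = $2 }
        END { gsub(/[ \t]/, "", v); print v }'
}

# Step 1: collect the abnormal-state counters.
degraded=$(count_of "$adp_sample" "Degraded")
offline=$(count_of "$adp_sample" "Offline")
critical=$(count_of "$adp_sample" "Critical Disks")
failed=$(count_of "$adp_sample" "Failed Disks")

# Step 2: for every line mentioning "Failed", look backwards for its enclosure and slot.
bad_drives=""
for line_no in $(printf '%s\n' "$cfg_sample" | grep -n "Failed" | cut -d':' -f1); do
    encl=$(last_before "$cfg_sample" "$line_no" "Enclosure Device ID")
    slot=$(last_before "$cfg_sample" "$line_no" "Slot Number")
    bad_drives="${bad_drives:+${bad_drives}, }{enclID: ${encl}, slot: ${slot}}"
done

echo "Virtual Drives: {Degraded: ${degraded}, Offline: ${offline}}, Physical Disks: {Critical: ${critical}, Failed: ${failed}}, Bad Drives: [${bad_drives}]"
```

The backward lookup relies on the assumption that each disk's enclosure and slot lines appear before its state line in the configuration dump, so the nearest preceding match belongs to the failed disk.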
3. Example run
/usr/local/nagios/libexec/check_megaraid_status
- check_megaraid_status.sh
#!/bin/bash
#
# Script to check MegaRaidCLI Failed drives
# Works on servers with ONE RAID controller
#
# Example:
# CRIT. Virtual Drives: {Degraded: 0, Offline: 2}, Physical Disks: {Critical: 0, Failed: 2},
# Bad Drives: [{adapter: 0, enclID: 2, slot: 7, Span ref: 8, Row: 0}, {adapter: 0, enclID: 2, slot: 1, Span ref: 2, Row: 0}]

# Locate the MegaCli binary (64-bit build preferred)
if [[ -x /opt/MegaRAID/MegaCli/MegaCli64 ]]; then
    megaraid_bin="sudo /opt/MegaRAID/MegaCli/MegaCli64"
elif [[ -x /opt/MegaRAID/MegaCli/MegaCli ]]; then
    megaraid_bin="sudo /opt/MegaRAID/MegaCli/MegaCli"
else
    echo "ERROR. No such MegaCli command"
    exit 1
fi

# Query the adapter summary once and reuse it for every counter
adp_info=$(${megaraid_bin} -AdpAllInfo -aAll)

# Number of Degraded/Failed/Offline summary lines with a non-zero count
anyissue=$(echo "${adp_info}" | grep -E 'Degrade|[[:space:]][[:space:]]Failed|[[:space:]][[:space:]]Offline' | awk '/[1-9]/ {print $0}' | wc -l)
degrade=$(echo "${adp_info}" | grep -E 'Degrade' | awk '/[0-9]/ {print $3}')
critical=$(echo "${adp_info}" | grep -E 'Critical' | awk '/[0-9]/ {print $4}')
offline=$(echo "${adp_info}" | grep -E '[[:space:]][[:space:]]Offline' | awk '/[0-9]/ {print $3}')
failed=$(echo "${adp_info}" | grep -E '[[:space:]][[:space:]]Failed' | awk '/[0-9]/ {print $4}')

if [[ ${anyissue} -ge 1 ]]; then
    ${megaraid_bin} -CfgDsply -aALL > /tmp/Cfgdsply.txt
    failed_lines=$(grep -n "Failed" /tmp/Cfgdsply.txt | cut -d':' -f1)

    for failed_line in ${failed_lines}; do
        # Keep everything up to the "Failed" line, reversed, so that
        # grep -m 1 returns the nearest preceding match for each field
        sed -n "1,${failed_line}p" /tmp/Cfgdsply.txt > /tmp/Cfgdsply_tofailed.txt
        tac /tmp/Cfgdsply_tofailed.txt > /tmp/backw_Cfgdsply.txt

        fadpt=$(grep -m 1 "Adapter" /tmp/backw_Cfgdsply.txt | cut -d':' -f2 | sed -e "s/ //g")
        enclID=$(grep -m 1 "Enclosure Device ID" /tmp/backw_Cfgdsply.txt | cut -d':' -f2 | sed -e "s/ //g")
        slot=$(grep -m 1 "Slot Number" /tmp/backw_Cfgdsply.txt | cut -d':' -f2 | sed -e "s/ //g")
        spanref=$(grep -m 1 "Span Reference" /tmp/backw_Cfgdsply.txt | cut -d':' -f2 | sed -e "s/ //g" | cut -d'x' -f2 | cut -c 2)
        row=$(grep -m 1 "Physical Disk:" /tmp/backw_Cfgdsply.txt | cut -d':' -f2 | sed -e "s/ //g")

        # Prepend the newest entry to the list of bad drives
        drive_info="{adapter: ${fadpt}, enclID: ${enclID}, slot: ${slot}, Span ref: ${spanref}, Row: ${row}}"
        if [[ -z "${bad_drives_info}" ]]; then
            bad_drives_info="${drive_info}"
        else
            bad_drives_info="${drive_info}, ${bad_drives_info}"
        fi
    done

    echo "CRIT. Virtual Drives: {Degraded: ${degrade}, Offline: ${offline}}, Physical Disks: {Critical: ${critical}, Failed: ${failed}}, Bad Drives: [${bad_drives_info}]"

    # clean up temp files and the logs MegaCli leaves in the working directory
    rm -f /tmp/Cfgdsply.txt /tmp/Cfgdsply_tofailed.txt /tmp/backw_Cfgdsply.txt
    rm -f MegaSAS.log* CmdTool.log*
    exit 2
else
    if [[ -z "${degrade}" ]] || [[ -z "${critical}" ]] || [[ -z "${offline}" ]] || [[ -z "${failed}" ]]; then
        echo "OK. No disk issue"
    else
        echo "OK. No disk issue. Virtual Drives: { Degraded: ${degrade}, Offline: ${offline} }, Physical Disks: {Failed: ${failed}}"
    fi

    # clean up the logs MegaCli leaves in the working directory
    rm -f MegaSAS.log* CmdTool.log*
    exit 0
fi
Output:
CRIT. Virtual Drives: {Degraded: 0, Offline: 2}, Physical Disks: {Critical: 0, Failed: 2}, Bad Drives: [{adapter: 0, enclID: 2, slot: 7, Span ref: 8, Row: 0}, {adapter: 0, enclID: 2, slot: 1, Span ref: 2, Row: 0}]
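When following up on an alert, it can help to pull the enclosure and slot of each bad drive out of the status line. The snippet below is a minimal sketch, assuming the `enclID: N, slot: N` layout shown in the output above; `bad_slots` is a hypothetical helper, not part of the check script. It prints each pair as `[enclID:slot]`, which should match the `[E:S]` form that MegaCli's `-PhysDrv` argument expects, though that is worth confirming against your MegaCli version.

```shell
#!/bin/bash
# bad_slots LINE (hypothetical helper):
# print one "[enclID:slot]" pair per bad drive found in a status line.
bad_slots() {
    printf '%s\n' "$1" \
        | grep -oE 'enclID: [0-9]+, slot: [0-9]+' \
        | sed -E 's/enclID: ([0-9]+), slot: ([0-9]+)/[\1:\2]/'
}

# Sample status line in the format the check script prints.
status_line='CRIT. Virtual Drives: {Degraded: 0, Offline: 2}, Physical Disks: {Critical: 0, Failed: 2}, Bad Drives: [{adapter: 0, enclID: 2, slot: 7, Span ref: 8, Row: 0}, {adapter: 0, enclID: 2, slot: 1, Span ref: 2, Row: 0}]'

bad_slots "${status_line}"
```

For the sample line this prints `[2:7]` and `[2:1]`, one per line, in the order the drives appear in the status line.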
4. REFERENCE
- hardware/lsi/利用megacli巡检raid阵列.txt
- Last modified: 2019/04/16 18:31
- (external edit)