Inspecting RAID arrays with MegaCli

This wiki page is copied almost verbatim from another article; our company's environment is very similar to that author's.

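Our production environment has a large number of physical servers, mainly used for Hadoop clusters with demanding hardware requirements. These servers are usually fitted with a RAID card and 16 disks of at least 3 TB each. Because Hadoop clusters are IO-intensive, disks frequently fail under the load, so checking the health of the RAID disks is essential.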

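The script works as follows:

  1. Use MegaCli64 to collect the count for each abnormal state, typically Degraded, Offline, Critical, and Failed;
  2. Aggregate those counts and extract the slot information of the problem disks.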
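
The two steps can be tried without a RAID controller by parsing a canned excerpt. The sketch below is only an illustration: the sample text imitates the "Device Present" section of `MegaCli64 -AdpAllInfo -aAll` (its exact layout is an assumption and may vary between MegaCli versions), and `sample_adp_info` and `get_count` are hypothetical helper names, not part of MegaCli or of the script on this page.

```shell
#!/bin/bash
# Illustrative excerpt of the "Device Present" section; the layout is assumed.
sample_adp_info() {
  cat <<'EOF'
                Device Present
                ================
Virtual Drives    : 2
  Degraded        : 0
  Offline         : 1
Physical Devices  : 18
  Disks           : 16
  Critical Disks  : 0
  Failed Disks    : 1
EOF
}

# Step 1: pull the number out of one "  <Label> : <n>" line.
# $1 = label pattern (extended regex); the adapter info is read from stdin.
get_count() {
  grep -E "$1" | awk -F':' '{gsub(/[ \t]/, "", $2); print $2}'
}

degraded=$(sample_adp_info | get_count '^[[:space:]]+Degraded')
offline=$(sample_adp_info | get_count '^[[:space:]]+Offline')
critical=$(sample_adp_info | get_count '^[[:space:]]+Critical Disks')
failed=$(sample_adp_info | get_count '^[[:space:]]+Failed Disks')

# Step 2: any non-zero counter means at least one drive needs attention.
issues=$(( degraded + offline + critical + failed ))
echo "degraded=${degraded} offline=${offline} critical=${critical} failed=${failed} issues=${issues}"
```

Splitting on the colon instead of counting whitespace-separated fields keeps the helper independent of how many words the label has.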

/usr/local/nagios/libexec/check_megaraid_status

Script location

check_megaraid_status.sh
#!/bin/bash
#
# Script to check MegaRaidCLI Failed drives
# Works on servers with ONE RAID controller
#
# Example output (printed as a single line):
#   CRIT. Virtual Drives: {Degraded: 0, Offline: 2}, Physical Disks: {Critical: 0, Failed: 2}, Bad Drives: [{adapter: 0, enclID: 2, slot: 7, Span ref: 8, Row: 0}, {adapter: 0, enclID: 2, slot: 1, Span ref: 2, Row: 0}]
 
if [[ -x /opt/MegaRAID/MegaCli/MegaCli64 ]]; then
  megaraid_bin="sudo /opt/MegaRAID/MegaCli/MegaCli64"
elif [[ -x /opt/MegaRAID/MegaCli/MegaCli ]]; then
  megaraid_bin="sudo /opt/MegaRAID/MegaCli/MegaCli"
else
  echo "ERROR. No such MegaCli command"
  exit 3  # Nagios UNKNOWN: the check could not run (exit 1 would be read as WARNING)
fi
 
# Query the adapter once and reuse the output for every counter below.
adp_info=$(${megaraid_bin} -AdpAllInfo -aAll)

# Number of Degraded/Failed/Offline counter lines that contain a digit 1-9.
anyissue=$(echo "${adp_info}" | grep -E 'Degrade|[[:space:]][[:space:]]Failed|[[:space:]][[:space:]]Offline' | awk '/[1-9]/ {print $0}' | wc -l)

# Individual counters; the awk field picked is the value after the colon.
degrade=$(echo "${adp_info}" | grep -E 'Degrade' | awk '/[0-9]/ {print $3}')
critical=$(echo "${adp_info}" | grep -E 'Critical' | awk '/[0-9]/ {print $4}')
offline=$(echo "${adp_info}" | grep -E '[[:space:]][[:space:]]Offline' | awk '/[0-9]/ {print $3}')
failed=$(echo "${adp_info}" | grep -E '[[:space:]][[:space:]]Failed' | awk '/[0-9]/ {print $4}')
 
if [[ ${anyissue} -ge 1 ]]; then
  # Dump the full configuration and find every line that reports a Failed state.
  ${megaraid_bin} -CfgDsply -aALL > /tmp/Cfgdsply.txt
  failed_lines=$(grep -n "Failed" /tmp/Cfgdsply.txt | cut -d':' -f1)

  for failed_line in ${failed_lines}; do
    # Keep everything up to the failed line and reverse it, so that the first
    # match of each field below is the one nearest above the failed entry.
    sed -n "1,${failed_line}p" /tmp/Cfgdsply.txt > /tmp/Cfgdsply_tofailed.txt
    tac /tmp/Cfgdsply_tofailed.txt > /tmp/backw_Cfgdsply.txt

    fadpt=$(grep -m 1 "Adapter" /tmp/backw_Cfgdsply.txt | cut -d':' -f2 | sed -e "s/ //g")
    enclID=$(grep -m 1 "Enclosure Device ID" /tmp/backw_Cfgdsply.txt | cut -d':' -f2 | sed -e "s/ //g")
    slot=$(grep -m 1 "Slot Number" /tmp/backw_Cfgdsply.txt | cut -d':' -f2 | sed -e "s/ //g")
    # Span Reference looks like 0x08; keep only the last hex digit.
    spanref=$(grep -m 1 "Span Reference" /tmp/backw_Cfgdsply.txt | cut -d':' -f2 | sed -e "s/ //g" | cut -d'x' -f2 | cut -c 2)
    row=$(grep -m 1 "Physical Disk:" /tmp/backw_Cfgdsply.txt | cut -d':' -f2 | sed -e "s/ //g")
 
    if [[ -z "${bad_drives_info}" ]]; then
      bad_drives_info="{adapter: ${fadpt}, enclID: ${enclID}, slot: ${slot}, Span ref: ${spanref}, Row: ${row}}"
    else
      bad_drives_info="{adapter: ${fadpt}, enclID: ${enclID}, slot: ${slot}, Span ref: ${spanref}, Row: ${row}}, ${bad_drives_info}"
    fi
  done

  echo "CRIT. Virtual Drives: {Degraded: ${degrade}, Offline: ${offline}}, Physical Disks: {Critical: ${critical}, Failed: ${failed}}, Bad Drives: [${bad_drives_info}]"
 
  # clean up temp files
  rm -f /tmp/Cfgdsply.txt
  rm -f /tmp/Cfgdsply_tofailed.txt
  rm -f /tmp/backw_Cfgdsply.txt
  rm -f MegaSAS.log*
  rm -f CmdTool.log*
 
  exit 2
else
  if [[ -z "${degrade}" ]] || [[ -z "${critical}" ]] || [[ -z "${offline}" ]] || [[ -z "${failed}" ]]; then
    echo "OK. No disk issue"
  else
    echo "OK. No disk issue. Virtual Drives: { Degraded: ${degrade}, Offline: ${offline} }, Physical Disks: {Failed: ${failed}}"
  fi
 
  # clean up temp files
  rm -f MegaSAS.log*
  rm -f CmdTool.log*
 
  exit 0
fi

Output:

CRIT. Virtual Drives: {Degraded: 0, Offline: 2}, Physical Disks: {Critical: 0, Failed: 2}, Bad Drives: [{adapter: 0, enclID: 2, slot: 7, Span ref: 8, Row: 0}, {adapter: 0, enclID: 2, slot: 1, Span ref: 2, Row: 0}]
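
Each Bad Drives entry is found by scanning the configuration dump backwards from a Failed line, so the nearest preceding value of each field wins. Here is a minimal sketch of that lookup on canned text, using a pipeline in place of temp files. The sample imitates `-CfgDsply` output in heavily shortened form (the real dump has many more lines, and its layout here is an assumption), and `sample_cfg` and `last_field_before` are hypothetical helper names.

```shell
#!/bin/bash
# Shortened, illustrative stand-in for `MegaCli64 -CfgDsply -aALL` output.
sample_cfg() {
  cat <<'EOF'
Adapter: 0
Span Reference: 0x02
Physical Disk: 0
Enclosure Device ID: 2
Slot Number: 1
Firmware state: Online, Spun Up
Physical Disk: 1
Enclosure Device ID: 2
Slot Number: 7
Firmware state: Failed
EOF
}

# Print the value of the nearest "<label>: <value>" line at or above a line.
# $1 = line number of the failed entry, $2 = label to search for
last_field_before() {
  sample_cfg | head -n "$1" | tac | grep -m 1 "$2" | cut -d':' -f2 | tr -d ' '
}

# Line number of the (single) failed entry in the sample.
failed_line=$(sample_cfg | grep -n 'Firmware state: Failed' | cut -d':' -f1)

adapter=$(last_field_before "${failed_line}" 'Adapter')
enclID=$(last_field_before "${failed_line}" 'Enclosure Device ID')
slot=$(last_field_before "${failed_line}" 'Slot Number')
row=$(last_field_before "${failed_line}" 'Physical Disk:')

echo "adapter=${adapter} enclID=${enclID} slot=${slot} row=${row}"
```

Reversing the text with `tac` lets `grep -m 1` stop at the first hit, which is the closest field above the failed drive rather than the first one in the file.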

  • Last modified: 2019/04/16 18:31