This is my little handbook for getting SMART information directly from drives attached to an HPE Smart Array controller. It has been tested on a P420i, and I'll update it as soon as I test other devices.
The device type smartctl needs is “cciss,N”, where N is the number of the disk to poll data from and issue commands to, counting from 0.
The device /dev/sg0 may vary but, with a single controller (slot=0 in ssacli), it will probably be sg0.
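The numbering can be sketched as a small shell helper. It assumes bays are numbered from 1 and map in order onto N (so bay 4 becomes cciss,3, as in the example that follows). The function names are my own and not part of smartctl, and /dev/sg0 is only a default. The helper prints the command instead of running it, so it is safe to try anywhere.

```shell
#!/bin/sh
# Hypothetical helper: turn a 1-based bay number into the 0-based
# "cciss,N" device type that smartctl expects.
# Assumption: bays map to N in order (bay 1 -> cciss,0).
bay_to_cciss() {
    echo "cciss,$(( $1 - 1 ))"
}

# Build (but do not run) the smartctl command line for a bay.
# $1 = bay number, $2 = SCSI generic device (defaults to /dev/sg0).
smartctl_cmd_for_bay() {
    echo "smartctl -a -d $(bay_to_cciss "$1") ${2:-/dev/sg0}"
}

smartctl_cmd_for_bay 4
```

If the bay order on your controller differs, check the mapping against the serial numbers before trusting it.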
In this example I'll poll SMART data from the disk in bay 4 of my ProLiant DL385p G8 server. Since numbering starts at 0, that is cciss,3.
~# smartctl -a -d cciss,3 /dev/sg0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.114-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               TOSHIBA
Product:              AL13SXB600N
Revision:             0101
Compliance:           SPC-3
User Capacity:        600,127,266,816 bytes [600 GB]
Logical block size:   512 bytes
Rotation Rate:        15000 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x5000039978139990
Serial number:        6960A0DTF6TA
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Aug 18 21:38:35 2021 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: FIRMWARE IMPENDING FAILURE TOO MANY BLOCK REASSIGNS [asc=5d, ascq=64]

Current Drive Temperature:     40 C
Drive Trip Temperature:        65 C

Accumulated power on time, hours:minutes 5723:15
Manufactured in week 23 of year 2019
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  14
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  694
Elements in grown defect list: 9981

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0    14260       244       213          0     128578.681          31
write:         0        0         0         0          0        972.518           0

Non-medium error count:       45

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       3    5723                 - [0x6 0x5d 0x64]
# 2  Background short  Failed in segment -->       3    5723                 - [0x6 0x5d 0x64]

Long (extended) Self-test duration: 3772 seconds [62.9 minutes]
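When checking several disks, most of that output is noise. Here is a minimal sketch that keeps only the two lines that gave this disk away: the health status and the grown defect count. The function name summarize_smart is my own invention, and it assumes the SAS-style field labels shown above (SATA disks report different fields). The demo feeds it two lines copied from the output instead of calling smartctl.

```shell
#!/bin/sh
# Hypothetical filter: read `smartctl -a` output for a SAS disk on stdin
# and print only the health status and the grown defect count.
# Assumption: field labels match the SAS output shown above.
summarize_smart() {
    awk -F': *' '
        /^SMART Health Status:/           { print "health=" $2 }
        /^Elements in grown defect list:/ { print "defects=" $2 }
    '
}

# Demo on two lines taken from the output above.
printf '%s\n' \
    'SMART Health Status: FIRMWARE IMPENDING FAILURE TOO MANY BLOCK REASSIGNS [asc=5d, ascq=64]' \
    'Elements in grown defect list: 9981' | summarize_smart
```

On a real system the input would come from something like `smartctl -a -d cciss,3 /dev/sg0 | summarize_smart`.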
In this case the disk is defective and the controller refuses to do anything with it. The SAN it came from, an MSA2324i, gave me the following errors in its logs:
There is a problem with a FRU. (FRU type: disk, enclosure: 1, device ID: 4, vendor: TOSHIB, product ID: AL13SXB600N , SN: 6960A0DTF6TA, version: 0101, related event serial number: A18466, related event code: 62)

An error was detected by a disk drive. (disk: channel: 0, ID: 4, SN: 6960A0DTF6TA, enclosure: 1, slot: 5)(Key,Code,Qual:0x1,0x5D,0x64)(CDB:Rd 0000003f 0001)(CmdSpc:0x0, FRU:0x0, SnsKeySpc:0x0)(Recovered Error, firmware impending failure too many block reassigns)

A disk drive reported a SMART event. (disk: channel: 0, ID: 4, SN: 6960A0DTF6TA, enclosure: 1, slot: 5) sense key:Recovered Error(0x01) ASC:0x5D ASCQ:0x64 firmware impending failure too many block reassigns Info:0x00000000
So I decided to put it in my server to get a more accurate diagnosis, which confirmed that the SAN was right.
Once the mapping of drives in the array is known, it is simple to configure the smartd daemon to send emails about failing devices.
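As a starting point, here is a rough sketch that prints one smartd.conf line per disk. I have not run this exact setup, so treat it as an assumption to verify: the function name is hypothetical, the disk count, device and mail recipient are placeholders, and it relies on smartd accepting the same `-d cciss,N` type as smartctl, together with the `-a` (default checks) and `-m` (mail address) directives.

```shell
#!/bin/sh
# Hypothetical generator: print one smartd.conf line per disk index.
# $1 = number of disks, $2 = device (default /dev/sg0),
# $3 = mail recipient (default root). All three are placeholders.
smartd_conf_lines() {
    count=$1
    dev=${2:-/dev/sg0}
    mail=${3:-root}
    i=0
    while [ "$i" -lt "$count" ]; do
        # -a enables the default set of checks, -m sets the mail address
        echo "$dev -d cciss,$i -a -m $mail"
        i=$((i + 1))
    done
}

smartd_conf_lines 2
```

The printed lines can be reviewed and then appended to /etc/smartd.conf by hand. The file location may differ on your distribution.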