This is my little handbook for getting SMART information directly from drives attached to an HPE Smart Array controller. It has been tested on a P420i; as soon as I test other devices, I’ll update this handbook.
The device type smartctl needs is “cciss,N”, where N is the number of the disk to poll data from and issue commands to, counting from 0.
The device /dev/sg0 may vary but, in a single-controller setup (slot=0 in ssacli), it will probably be sg0.
In this example I’ll poll SMART data from the disk in bay 4 (hence cciss,3, since the numbering starts at 0) on my ProLiant DL385p G8 server.
~# smartctl -a -d cciss,3 /dev/sg0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.114-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: TOSHIBA
Product: AL13SXB600N
Revision: 0101
Compliance: SPC-3
User Capacity: 600,127,266,816 bytes [600 GB]
Logical block size: 512 bytes
Rotation Rate: 15000 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000039978139990
Serial number: 6960A0DTF6TA
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Wed Aug 18 21:38:35 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported
=== START OF READ SMART DATA SECTION ===
SMART Health Status: FIRMWARE IMPENDING FAILURE TOO MANY BLOCK REASSIGNS [asc=5d, ascq=64]
Current Drive Temperature: 40 C
Drive Trip Temperature: 65 C
Accumulated power on time, hours:minutes 5723:15
Manufactured in week 23 of year 2019
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 14
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 694
Elements in grown defect list: 9981
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0    14260       244       213          0     128578.681          31
write:         0        0         0         0          0        972.518           0
Non-medium error count: 45
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       3    5723                 - [0x6 0x5d 0x64]
# 2  Background short  Failed in segment -->       3    5723                 - [0x6 0x5d 0x64]
Long (extended) Self-test duration: 3772 seconds [62.9 minutes]
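The lines that matter most in a report like the one above are the health status, the grown defect list and the non-medium error count. As a rough sketch, a small filter can pull just those out of the smartctl text output. The function name `summarize_smart` is my own invention, and it assumes the field labels are spelled exactly as in the output shown here.

```shell
#!/bin/sh
# Sketch: reduce "smartctl -a" text output (read on stdin) to three key fields.
# Assumes the SCSI/SAS field labels shown in the report above.
summarize_smart() {
    awk -F': *' '
        /^SMART Health Status/           { health = $2 }
        /^Elements in grown defect list/ { defects = $2 }
        /^Non-medium error count/        { nonmedium = $2 }
        END {
            printf "health=%s\n", health
            printf "defects=%s\n", defects
            printf "non_medium=%s\n", nonmedium
        }
    '
}
```

It would be used as `smartctl -a -d cciss,3 /dev/sg0 | summarize_smart`.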
In this case the disk is defective and the controller refuses to do anything with it. The SAN it came from, an MSA2324i, gave me the following errors in its logs:
There is a problem with a FRU. (FRU type: disk, enclosure: 1, device ID: 4, vendor: TOSHIB, product ID: AL13SXB600N , SN: 6960A0DTF6TA, version: 0101, related event serial number: A18466, related event code: 62)
An error was detected by a disk drive. (disk: channel: 0, ID: 4, SN: 6960A0DTF6TA, enclosure: 1, slot: 5)(Key,Code,Qual:0x1,0x5D,0x64)(CDB:Rd 0000003f 0001)(CmdSpc:0x0, FRU:0x0, SnsKeySpc:0x0)(Recovered Error, firmware impending failure too many block reassigns)
A disk drive reported a SMART event. (disk: channel: 0, ID: 4, SN: 6960A0DTF6TA, enclosure: 1, slot: 5) sense key:Recovered Error(0x01) ASC:0x5D ASCQ:0x64 firmware impending failure too many block reassigns Info:0x00000000
So I decided to put it in my server to get a more accurate diagnosis, and found that the SAN was right.
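To work out which cciss index belongs to which physical drive, one option is to print the serial number reported at each index and compare it with the label on the disk (or with the SN in logs like the ones above). This is only a sketch: `list_serials` is a name I made up, the disk count has to be supplied by hand, and the `SMARTCTL` variable exists just so the command can be swapped out for a dry run.

```shell
#!/bin/sh
# Sketch: print "cciss,N <serial>" for each disk index behind one controller.
# SMARTCTL may be overridden; by default the real smartctl is called.
SMARTCTL="${SMARTCTL:-smartctl}"

list_serials() {
    dev="$1"      # SCSI generic node of the controller, e.g. /dev/sg0
    disks="$2"    # how many indices to probe, starting at 0 (supplied by you)
    n=0
    while [ "$n" -lt "$disks" ]; do
        # -i prints only the information section, which contains the serial
        serial=$("$SMARTCTL" -i -d "cciss,$n" "$dev" |
            awk -F': *' '/^Serial number/ { print $2 }')
        echo "cciss,$n ${serial:-<no answer>}"
        n=$((n + 1))
    done
}
```

For example, `list_serials /dev/sg0 8` would probe indices 0 to 7 on /dev/sg0.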
Once the mapping of drives in the array is known, it is simple to configure the smartd daemon to send emails about failing devices.
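As a starting point, smartd.conf accepts the same `-d cciss,N` device type as smartctl, so one line per disk is enough. The helper below only prints candidate lines and writes nothing; `gen_smartd_conf` is a hypothetical name, and the disk count and mail address are values you would pick yourself. I have not tested the resulting configuration on controllers other than the one in this post.

```shell
#!/bin/sh
# Sketch: print one smartd.conf line per disk index (nothing is written to disk).
gen_smartd_conf() {
    dev="$1"      # controller node, e.g. /dev/sg0
    disks="$2"    # number of disk indices, starting at 0
    mail="$3"     # address that smartd should notify
    n=0
    while [ "$n" -lt "$disks" ]; do
        # -a: smartd's default set of checks, -m: mail this address on problems
        printf '%s -d cciss,%d -a -m %s\n' "$dev" "$n" "$mail"
        n=$((n + 1))
    done
}
```

Running `gen_smartd_conf /dev/sg0 8 root` prints eight lines that can be reviewed and then added to smartd.conf (often /etc/smartd.conf, depending on the distribution) before restarting the daemon.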