I don’t know whether the issue lies in the PVE kernel or in the mainline Linux kernel, but upgrading from a 5.3.x to a 5.4.x kernel makes QLogic adapters stop working on point-to-point Fibre Channel links.
The qla2xxx module changed in the 5.4.x kernel shipped with PVE, making links that use the point-to-point protocol unavailable.
The scenario
A cluster of four PVE hosts using an HP MSA 2050 FC SAN; each host (various rack-mount ProLiant servers) has an HP QLogic adapter,
“QLogic HPAJ764A – HPE 82Q 8Gb Dual Port PCI-e FC HBA”
connected directly to the SAN, one HBA port to controller A, the other to controller B.
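The HBAs and their link state can be checked from each host through the fc_host sysfs class (a minimal sketch; the grep pattern and the host numbers are examples that vary per machine):
# list the FC HBAs on the PCI bus
lspci -nn | grep -i "fibre channel"
# one fc_host entry per HBA port; port_type should report point-to-point
grep . /sys/class/fc_host/host*/port_type
grep . /sys/class/fc_host/host*/port_state
grep . /sys/class/fc_host/host*/speed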
The upgrade from Proxmox PVE 6.1-8 to Proxmox PVE 6.2-4 was planned and carried out in the usual way, host by host, with the following steps (a command sketch follows the list):
- migrate all VMs off the host to be upgraded;
- update the package repositories;
- upgrade all packages;
- reboot;
- test that all features work;
- migrate all VMs back to the freshly upgraded host.
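Roughly, the steps above map to the following commands on each host (a sketch, assuming CLI migration; VM ID 100 and target host pve2 are placeholders, and the same can be done from the web GUI):
# live-migrate each running VM off the host, one by one
qm migrate 100 pve2 --online
# refresh the package lists and upgrade all packages
apt update
apt dist-upgrade
# reboot into the new kernel
reboot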
When the first machine came back up after the reboot, its console showed messages about timeouts on the FC ports.
rport-2:0-0: blocked FC remote port time out: removing target and saving binding
rport-4:0-0: blocked FC remote port time out: removing target and saving binding
These errors were also present in dmesg, together with others.
[ 872.821885] scsi host4: qla2xxx
[ 872.832162] qla2xxx [0000:0b:00.1]-00fb:4: QLogic HPAJ764A - HPE 82Q 8Gb Dual Port PCI-e FC HBA.
[ 872.832173] qla2xxx [0000:0b:00.1]-00fc:4: ISP2532: PCIe (5.0GT/s x8) @ 0000:0b:00.1 hdma+ host#=4 fw=8.07.00 (90d5).
[ 873.243009] qla2xxx [0000:0b:00.1]-500a:4: LOOP UP detected (8 Gbps).
[ 886.545479] rport-2:0-0: blocked FC remote port time out: removing target and saving binding
[ 887.569457] rport-4:0-0: blocked FC remote port time out: removing target and saving binding
[ 915.975383] qla2xxx [0000:0b:00.1]-5039:4: Async-tmf error - hdl=6 completion status(28).
[ 915.975638] qla2xxx [0000:0b:00.1]-8030:4: TM IOCB failed (102).
[ 918.945929] qla2xxx [0000:0b:00.0]-5039:2: Async-tmf error - hdl=6 completion status(28).
[ 918.946193] qla2xxx [0000:0b:00.0]-8030:2: TM IOCB failed (102).
Nothing helped, not even issuing a LIP (loop initialization procedure) to reset the interface:
echo "1" > /sys/class/fc_host/host4/issue_lip
This brought the links down and up again, but did not bring the storage back online.
[ 925.844633] qla2xxx [0000:0b:00.1]-500b:4: LOOP DOWN detected (2 3 0 0).
[ 926.697302] qla2xxx [0000:0b:00.1]-500a:4: LOOP UP detected (8 Gbps).
[ 929.055264] qla2xxx [0000:0b:00.0]-500b:2: LOOP DOWN detected (2 3 0 0).
[ 929.908041] qla2xxx [0000:0b:00.0]-500a:2: LOOP UP detected (8 Gbps).
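At this stage it can help to compare the state of the local and remote FC ports in sysfs (a minimal sketch; the exact host and rport numbers depend on the machine, host4 and rport-2/rport-4 being the ones from the logs above):
# link state of the local HBA ports
grep . /sys/class/fc_host/host*/port_state
# state of the remote ports (the SAN controllers); here they remained blocked
grep . /sys/class/fc_remote_ports/rport-*/port_state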
After a lot of research, a page was found on the IBM website explaining every configuration option of IBM’s version of the QLogic adapter. This started an investigation into the adapters, which until then had been used as-is because everything had worked fine out of the box; their settings were reviewed and optimized, but none of these operations solved the issue.
At this point attention switched to the SAN, which showed no log entries at all about any refusal to communicate with the hosts. The problem turned out to be the SAN’s default port configuration: the FC ports are set to automatic speed negotiation but with the protocol fixed to point-to-point, since the first suggested deployment scenario is, like the one described here, hypervisors directly connected to the SAN.
The solution
The solution was quite simple: set each port to automatic protocol negotiation in the SAN’s settings. No change in speed or responsiveness of the systems was observed after the protocol change.
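For reference, on MSA-family arrays the port protocol can be changed from the web interface (SMU) or from the CLI. The sketch below is from memory of the MSA CLI reference, so treat the exact command syntax as an assumption and verify it with the built-in help; ports A1 and B1 are placeholders:
# show the current FC host-port settings (speed, connection mode)
show ports
# switch the given ports from fixed point-to-point to auto-negotiation
set host-parameters fibre-connection-mode auto ports A1,B1
# confirm the syntax available on your firmware revision
help set host-parameters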
Can you please post a reference to the IBM link you mentioned in the article?
Here is the link.
Thanks for asking; I believed the link was already in the text, so I’ll now correct the article.