Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

ata: libata: avoid long timeouts on hot-unplugged SATA DAS

When a SATA DAS enclosure is connected behind a Thunderbolt PCIe
switch, hot-unplugging the whole enclosure causes pciehp to tear down
the PCI hierarchy before the SCSI layer issues SYNCHRONIZE CACHE and
START STOP UNIT for the disks.

libata still queues these commands and the AHCI driver tries to access
the HBA registers even though the PCI channel is already offline. This
results in a series of timeouts and error recovery attempts, e.g.:

[ 824.778346] pcieport 0000:00:07.0: pciehp: Slot(14): Link Down
[ 891.612720] ata8.00: qc timeout after 5000 msecs (cmd 0xec)
[ 902.876501] ata8.00: qc timeout after 10000 msecs (cmd 0xec)
[ 934.107998] ata8.00: qc timeout after 30000 msecs (cmd 0xec)
[ 936.206431] sd 7:0:0:0: [sda] Synchronize Cache(10) failed:
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
...
[ 1006.298356] ata1.00: qc timeout after 5000 msecs (cmd 0xec)
[ 1017.561926] ata1.00: qc timeout after 10000 msecs (cmd 0xec)
[ 1048.791790] ata1.00: qc timeout after 30000 msecs (cmd 0xec)
[ 1050.890035] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed:
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

With this patch applied, the same hot-unplug looks like:

[ 59.965496] pcieport 0000:00:07.0: pciehp: Slot(14): Link Down
[ 60.002502] sd 7:0:0:0: [sda] Synchronize Cache(10) failed:
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
...
[ 60.103050] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed:
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

In this test setup with two disks, the hot-unplug sequence shrinks from
about 226 seconds (~3.8 minutes) between the Link Down event and the
last SYNCHRONIZE CACHE failure to under a second. Without this patch the
total delay grows roughly with the number of disks, because each disk
gets its own SYNCHRONIZE CACHE and qc timeout series.

If the underlying PCI device is already gone, these commands cannot
succeed anyway. Avoid issuing them by introducing
ata_adapter_is_online(), which checks pci_channel_offline() for
PCI-based hosts. It is used from ata_scsi_find_dev() to return NULL,
causing the SCSI layer to fail new commands with DID_BAD_TARGET
immediately, and from ata_qc_issue() to bail out before touching the
HBA registers.

Since such failures would otherwise trigger libata error handling,
ata_adapter_is_online() is also consulted from ata_scsi_port_error_handler().
When the adapter is offline, libata skips ap->ops->error_handler(ap) and
completes error handling using the existing path, rather than running
a full EH sequence against a dead adapter.

With this change, SYNCHRONIZE CACHE and START STOP UNIT commands
issued during hot-unplug fail quickly once the PCI channel is offline,
without qc timeout spam or long libata EH delays.

Suggested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Henry Tseng <henrytseng@qnap.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>

authored by

Henry Tseng and committed by
Damien Le Moal
151cabd1 e8fe0981

+30 -1
+24
drivers/ata/libata-core.c
··· 2358 2358 return false; 2359 2359 } 2360 2360 2361 + bool ata_adapter_is_online(struct ata_port *ap) 2362 + { 2363 + struct device *dev; 2364 + 2365 + if (!ap || !ap->host) 2366 + return false; 2367 + 2368 + dev = ap->host->dev; 2369 + if (!dev) 2370 + return false; 2371 + 2372 + if (dev_is_pci(dev) && 2373 + pci_channel_offline(to_pci_dev(dev))) 2374 + return false; 2375 + 2376 + return true; 2377 + } 2378 + 2361 2379 static int ata_dev_config_ncq(struct ata_device *dev, 2362 2380 char *desc, size_t desc_sz) 2363 2381 { ··· 5090 5072 5091 5073 qc->flags |= ATA_QCFLAG_ACTIVE; 5092 5074 ap->qc_active |= 1ULL << qc->tag; 5075 + 5076 + /* Make sure the device is still accessible. */ 5077 + if (!ata_adapter_is_online(ap)) { 5078 + qc->err_mask |= AC_ERR_HOST_BUS; 5079 + goto sys_err; 5080 + } 5093 5081 5094 5082 /* 5095 5083 * We guarantee to LLDs that they will have at least one
+2 -1
drivers/ata/libata-eh.c
··· 736 736 spin_unlock_irqrestore(ap->lock, flags); 737 737 738 738 /* invoke EH, skip if unloading or suspended */ 739 - if (!(ap->pflags & (ATA_PFLAG_UNLOADING | ATA_PFLAG_SUSPENDED))) 739 + if (!(ap->pflags & (ATA_PFLAG_UNLOADING | ATA_PFLAG_SUSPENDED)) && 740 + ata_adapter_is_online(ap)) 740 741 ap->ops->error_handler(ap); 741 742 else { 742 743 /* if unloading, commence suicide */
+3
drivers/ata/libata-scsi.c
··· 2982 2982 { 2983 2983 struct ata_device *dev = __ata_scsi_find_dev(ap, scsidev); 2984 2984 2985 + if (!ata_adapter_is_online(ap)) 2986 + return NULL; 2987 + 2985 2988 if (unlikely(!dev || !ata_dev_enabled(dev))) 2986 2989 return NULL; 2987 2990
+1
drivers/ata/libata.h
··· 94 94 extern void swap_buf_le16(u16 *buf, unsigned int buf_words); 95 95 extern bool ata_phys_link_online(struct ata_link *link); 96 96 extern bool ata_phys_link_offline(struct ata_link *link); 97 + bool ata_adapter_is_online(struct ata_port *ap); 97 98 extern void ata_dev_init(struct ata_device *dev); 98 99 extern void ata_link_init(struct ata_port *ap, struct ata_link *link, int pmp); 99 100 extern int sata_link_init_spd(struct ata_link *link);