[PATCH stable 5.15,5.10 0/4] Fix EBS volume attach on AWS ARM instances
From: Luiz Capitulino @ 2022-11-28 17:08 UTC
  To: stable, maz; +Cc: tglx, lcapitulino, Luiz Capitulino

Hi,

[ Marc, can you help review this, especially the first patch? ]

This series backports upstream fixes to stable 5.15 and 5.10. It addresses an
issue we're seeing on AWS ARM instances: attaching an EBS volume (which is an
NVMe device) to the instance after offlining CPUs causes the device to take
several minutes to show up, and eventually nvme kworkers and other threads get
stuck.

This series fixes the issue for 5.15.79 and 5.10.155. I can't reproduce it
on 5.4. Also, I couldn't reproduce this on x86, even with affected kernels.

An easy reproducer is:

1. Start an ARM instance with 32 CPUs
2. Once the instance is booted, offline all CPUs but CPU 0, e.g.:
   # for i in $(seq 1 31); do chcpu -d $i; done
3. Once the CPUs are offline, attach an EBS volume
4. Watch lsblk and dmesg in the instance
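
The offlining step can also be scripted against sysfs rather than chcpu. The
following is a minimal sketch and not part of the original report: it assumes
the standard /sys/devices/system/cpu/online list and the per-CPU "online"
files, and the helper names (parse_cpulist, cpus_to_offline, offline_cpus) are
hypothetical. It defaults to a dry run; actually writing the files needs root.

```python
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")


def parse_cpulist(text):
    """Parse a kernel cpulist string such as "0-3,8,10-11" into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def cpus_to_offline(online_list, keep=0):
    """Return every online CPU except the one to keep (CPU 0 by default)."""
    return [cpu for cpu in parse_cpulist(online_list) if cpu != keep]


def offline_cpus(dry_run=True):
    """Offline all CPUs but CPU 0; with dry_run=True only return the paths."""
    online = (SYSFS_CPU / "online").read_text()
    paths = [SYSFS_CPU / f"cpu{cpu}" / "online" for cpu in cpus_to_offline(online)]
    if not dry_run:
        for path in paths:
            path.write_text("0")  # needs root; assumed equivalent to `chcpu -d N`
    return paths


if __name__ == "__main__":
    # Only touch sysfs when it is actually present (i.e. on Linux).
    if (SYSFS_CPU / "online").exists():
        for path in offline_cpus(dry_run=True):
            print(f"would write 0 to {path}")
```

On a 32-CPU instance the online list is normally "0-31", so the sketch would
target CPUs 1 through 31 and leave CPU 0 alone.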

Eventually, you get this stack trace:

[   71.842974] pci 0000:00:1f.0: [1d0f:8061] type 00 class 0x010802
[   71.843966] pci 0000:00:1f.0: reg 0x10: [mem 0x00000000-0x00003fff]
[   71.845149] pci 0000:00:1f.0: PME# supported from D0 D1 D2 D3hot D3cold
[   71.846694] pci 0000:00:1f.0: BAR 0: assigned [mem 0x8011c000-0x8011ffff]
[   71.848458] ACPI: \_SB_.PCI0.GSI3: Enabled at IRQ 38
[   71.850852] nvme nvme1: pci function 0000:00:1f.0
[   71.851611] nvme 0000:00:1f.0: enabling device (0000 -> 0002)
[  135.887787] nvme nvme1: I/O 22 QID 0 timeout, completion polled
[  197.328276] nvme nvme1: I/O 23 QID 0 timeout, completion polled
[  197.329221] nvme nvme1: 1/0/0 default/read/poll queues
[  243.408619] INFO: task kworker/u64:2:275 blocked for more than 122 seconds.
[  243.409674]       Not tainted 5.15.79 #1
[  243.410270] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.411389] task:kworker/u64:2   state:D stack:    0 pid:  275 ppid:     2 flags:0x00000008
[  243.412602] Workqueue: events_unbound async_run_entry_fn
[  243.413417] Call trace:
[  243.413797]  __switch_to+0x15c/0x1a4
[  243.414335]  __schedule+0x2bc/0x990
[  243.414849]  schedule+0x68/0xf8
[  243.415334]  schedule_timeout+0x184/0x340
[  243.415946]  wait_for_completion+0xc8/0x220
[  243.416543]  __flush_work.isra.43+0x240/0x2f0
[  243.417179]  flush_work+0x20/0x2c
[  243.417666]  nvme_async_probe+0x20/0x3c
[  243.418228]  async_run_entry_fn+0x3c/0x1e0
[  243.418858]  process_one_work+0x1bc/0x460
[  243.419437]  worker_thread+0x164/0x528
[  243.420030]  kthread+0x118/0x124
[  243.420517]  ret_from_fork+0x10/0x20
[  258.768771] nvme nvme1: I/O 20 QID 0 timeout, completion polled
[  320.209266] nvme nvme1: I/O 21 QID 0 timeout, completion polled
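
To make the timing in the log easier to see, the timeout lines can be pulled
out and the gaps between them computed. This is only an illustrative sketch:
the regular expression is written against the message format shown above, and
the function names (nvme_timeouts, timeout_gaps) are hypothetical.

```python
import re

# Matches lines such as:
# [  135.887787] nvme nvme1: I/O 22 QID 0 timeout, completion polled
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvme (?P<ctrl>nvme\d+): "
    r"I/O (?P<tag>\d+) QID (?P<qid>\d+) timeout"
)


def nvme_timeouts(log_text):
    """Return (timestamp, controller, tag, qid) for each NVMe timeout line."""
    events = []
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            events.append((float(m["ts"]), m["ctrl"], int(m["tag"]), int(m["qid"])))
    return events


def timeout_gaps(events):
    """Seconds elapsed between consecutive timeout events."""
    stamps = [ts for ts, *_ in events]
    return [round(b - a, 3) for a, b in zip(stamps, stamps[1:])]


# The four timeout lines from the trace above.
SAMPLE = """\
[  135.887787] nvme nvme1: I/O 22 QID 0 timeout, completion polled
[  197.328276] nvme nvme1: I/O 23 QID 0 timeout, completion polled
[  258.768771] nvme nvme1: I/O 20 QID 0 timeout, completion polled
[  320.209266] nvme nvme1: I/O 21 QID 0 timeout, completion polled
"""

if __name__ == "__main__":
    events = nvme_timeouts(SAMPLE)
    print(len(events), "timeouts, gaps (s):", timeout_gaps(events))
```

In the trace above, the four timeouts are all on QID 0 and are each about
61.4 seconds apart.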

For completeness, I ran the same test case on x86 with this series applied
on 5.15.79 and 5.10.155 as well. It works as expected.

Thanks,

Marc Zyngier (4):
  genirq/msi: Shutdown managed interrupts with unsatifiable affinities
  genirq: Always limit the affinity to online CPUs
  irqchip/gic-v3: Always trust the managed affinity provided by the core
    code
  genirq: Take the proposed affinity at face value if force==true

 drivers/irqchip/irq-gic-v3-its.c |  2 +-
 kernel/irq/manage.c              | 31 +++++++++++++++++++++++--------
 kernel/irq/msi.c                 |  7 +++++++
 3 files changed, 31 insertions(+), 9 deletions(-)
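
The patch titles above concern interrupts whose affinity refers to CPUs that
are no longer online. As a hedged diagnostic sketch, not taken from this
series, one could compare each IRQ's effective affinity with the online CPU
set. It assumes /proc/irq/<N>/effective_affinity_list is available (it is
absent on some kernel configurations); the function names are hypothetical,
and IRQs with an empty effective affinity are ignored.

```python
from pathlib import Path


def parse_cpulist(text):
    """Parse a kernel cpulist string such as "0,2-3" into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def irqs_without_online_target(affinity_by_irq, online_cpus):
    """Return IRQ numbers whose non-empty effective affinity has no online CPU."""
    online = set(online_cpus)
    return sorted(
        irq for irq, cpus in affinity_by_irq.items()
        if cpus and not (set(cpus) & online)
    )


def read_effective_affinities(proc_irq=Path("/proc/irq")):
    """Collect {irq: set of CPUs} from /proc/irq/<N>/effective_affinity_list."""
    result = {}
    for entry in proc_irq.iterdir():
        if not entry.name.isdigit():
            continue  # skip non-IRQ entries such as default_smp_affinity
        try:
            text = (entry / "effective_affinity_list").read_text()
        except OSError:
            continue  # file missing or unreadable on this kernel
        result[int(entry.name)] = parse_cpulist(text)
    return result


if __name__ == "__main__":
    online_file = Path("/sys/devices/system/cpu/online")
    if online_file.exists() and Path("/proc/irq").is_dir():
        online = parse_cpulist(online_file.read_text())
        print(irqs_without_online_target(read_effective_affinities(), online))
```

An empty result only means that no IRQ is currently targeted exclusively at
offline CPUs; it does not by itself show that a kernel is unaffected.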

-- 
2.37.1


