linux-nvme.lists.infradead.org archive mirror
* [PATCHv4 0/3] nvme: NSHEAD_DISK_LIVE fixes
@ 2024-09-06 10:16 Hannes Reinecke
  2024-09-06 10:16 ` [PATCH 1/3] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Hannes Reinecke @ 2024-09-06 10:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

Hi all,

I have a testcase which repeatedly deletes namespaces on the target
and creates new ones, aggressively re-using NSIDs for the new
namespaces.
To add to the fun, these namespaces are created on different nodes
in the cluster, where only the paths local to the respective cluster
node are active and all other paths are inaccessible.

Essentially it's doing something like:

echo 0 > ${ns}/enable
rmdir ${ns}
<random delay>
mkdir ${ns}
echo "<dev>" > ${ns}/device_path
echo "<grpid>" > ${ns}/ana_grpid
uuidgen > ${ns}/device_uuid
echo 1 > ${ns}/enable

repeatedly with several namespaces and several ANA groups.
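For reference, the loop above can be fleshed out into a dry-runnable
sketch. This is not the actual test script: CFS, SUBSYS and DEVS are
placeholders, and by default the script writes into a scratch directory
so it can be exercised anywhere; point CFS at the nvmet configfs mount
(/sys/kernel/config/nvmet) on a real target host.

```shell
#!/bin/sh
# Sketch of the namespace churn loop. All names below are placeholders,
# not taken from the real test harness.
CFS=${CFS:-$(mktemp -d)/nvmet}          # real target: /sys/kernel/config/nvmet
SUBSYS=${SUBSYS:-testnqn}               # hypothetical subsystem NQN
DEVS=${DEVS:-"/dev/nullb0 /dev/nullb1"} # hypothetical backing devices
ITERATIONS=${ITERATIONS:-1}

i=0
while [ "$i" -lt "$ITERATIONS" ]; do
    nsid=1
    for dev in $DEVS; do
        ns="$CFS/subsystems/$SUBSYS/namespaces/$nsid"
        if [ -d "$ns" ]; then
            echo 0 > "$ns/enable"       # disable the namespace first
            # configfs directories are removed with rmdir; the rm -rf
            # fallback only matters for scratch-directory dry runs
            rmdir "$ns" 2>/dev/null || rm -rf "$ns"
        fi
        sleep "$((nsid % 2))"           # stand-in for the random delay
        mkdir -p "$ns"                  # re-create, aggressively re-using the NSID
        echo "$dev" > "$ns/device_path"
        echo "$((nsid % 2 + 1))" > "$ns/ana_grpid"   # alternate ANA groups
        # fresh UUID each time; fall back if uuidgen is not installed
        { uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid; } \
            > "$ns/device_uuid"
        echo 1 > "$ns/enable"
        nsid=$((nsid + 1))
    done
    i=$((i + 1))
done
```

The UUID rewrite on every pass is what makes the host see a "new"
namespace under a recycled NSID, which is the condition the patches
below are exercising.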
This leads to an unrecoverable system where the scanning processes
are stuck in the partition scanning code triggered via
'device_add_disk()', waiting for I/O which will never complete.
There are two parts to fixing this:
we need to ensure that NSHEAD_DISK_LIVE is set when the ns_head
is live and cleared when the last path is gone, and we need to
kick the requeue list after NSHEAD_DISK_LIVE has been cleared
to flush all outstanding I/O.

Turns out there's another corner case: when running the same test
but changing the UUID without removing the namespaces, we end up
with I/O constantly being retried, and we are unable to even
disconnect the controller. To fix this we set the 'failfast'
flag for the controller when disconnecting, to ensure that all
I/O is aborted.

With these patches (and the queue freeze patchset from hch) the problem
is resolved and the testcase runs without issues.
I'll see to getting the testcase added to blktests.

As usual, comments and reviews are welcome.

Changes to v3:
- Update patch description as suggested by Sagi
- Drop patch to requeue I/O after ANA state changes

Changes to v2:
- Include reviews from Sagi
- Drop the check for NSHEAD_DISK_LIVE in nvme_available_path()
- Add a patch to requeue I/O if the ANA state changed
- Set the 'failfast' flag when removing controllers

Changes to the original submission:
- Drop patch to remove existing namespaces on ID mismatch
- Combine patches updating NSHEAD_DISK_LIVE handling
- requeue I/O after NSHEAD_DISK_LIVE has been cleared

Hannes Reinecke (3):
  nvme-multipath: fixup typo when clearing DISK_LIVE
  nvme-multipath: avoid hang on inaccessible namespaces
  nvme: 'nvme disconnect' hangs after remapping namespaces

 drivers/nvme/host/core.c      |  7 +++++++
 drivers/nvme/host/multipath.c | 14 +++++++++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

-- 
2.35.3



^ permalink raw reply	[flat|nested] 14+ messages in thread
* [PATCHv5 0/3] nvme: NSHEAD_DISK_LIVE fixes
@ 2024-09-09  7:19 Hannes Reinecke
  2024-09-09  7:19 ` [PATCH 1/3] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
  0 siblings, 1 reply; 14+ messages in thread
From: Hannes Reinecke @ 2024-09-09  7:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

Hi all,

I have a testcase which repeatedly deletes namespaces on the target
and creates new ones, aggressively re-using NSIDs for the new
namespaces.
To add to the fun, these namespaces are created on different nodes
in the cluster, where only the paths local to the respective cluster
node are active and all other paths are inaccessible.

Essentially it's doing something like:

echo 0 > ${ns}/enable
rmdir ${ns}
<random delay>
mkdir ${ns}
echo "<dev>" > ${ns}/device_path
echo "<grpid>" > ${ns}/ana_grpid
uuidgen > ${ns}/device_uuid
echo 1 > ${ns}/enable

repeatedly with several namespaces and several ANA groups.
This leads to an unrecoverable system where the scanning processes
are stuck in the partition scanning code triggered via
'device_add_disk()', waiting for I/O which will never complete.
There are two parts to fixing this:
we need to ensure that NSHEAD_DISK_LIVE is set when the ns_head
is live and cleared when the last path is gone, and we need to
kick the requeue list after NSHEAD_DISK_LIVE has been cleared
to flush all outstanding I/O.

Turns out there's another corner case: when running the same test
but changing the UUID without removing the namespaces, we end up
with I/O constantly being retried, and we are unable to even
disconnect the controller. To fix this we disable retries for
failed commands when the controller state is 'DELETING', as there
really is no point in retrying a command when the controller is
being shut down.

With these patches (and the queue freeze patchset from hch) the problem
is resolved and the testcase runs without issues.
I'll see to getting the testcase added to blktests.

As usual, comments and reviews are welcome.

Changes to v4:
- Disable command retries when the controller is removed instead of
  (ab-)using the failfast flag

Changes to v3:
- Update patch description as suggested by Sagi
- Drop patch to requeue I/O after ANA state changes

Changes to v2:
- Include reviews from Sagi
- Drop the check for NSHEAD_DISK_LIVE in nvme_available_path()
- Add a patch to requeue I/O if the ANA state changed
- Set the 'failfast' flag when removing controllers

Changes to the original submission:
- Drop patch to remove existing namespaces on ID mismatch
- Combine patches updating NSHEAD_DISK_LIVE handling
- requeue I/O after NSHEAD_DISK_LIVE has been cleared

Hannes Reinecke (3):
  nvme-multipath: fixup typo when clearing DISK_LIVE
  nvme-multipath: avoid hang on inaccessible namespaces
  nvme: 'nvme disconnect' hangs after remapping namespaces

 drivers/nvme/host/core.c      |  7 ++++++-
 drivers/nvme/host/multipath.c | 14 +++++++++++---
 2 files changed, 17 insertions(+), 4 deletions(-)

-- 
2.35.3



* [PATCHv6 0/3] nvme: NSHEAD_DISK_LIVE fixes
@ 2024-09-11  9:51 Hannes Reinecke
  2024-09-11  9:51 ` [PATCH 1/3] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
  0 siblings, 1 reply; 14+ messages in thread
From: Hannes Reinecke @ 2024-09-11  9:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

Hi all,

I have a testcase which repeatedly deletes namespaces on the target
and creates new ones, aggressively re-using NSIDs for the new
namespaces.
To add to the fun, these namespaces are created on different nodes
in the cluster, where only the paths local to the respective cluster
node are active and all other paths are inaccessible.

Essentially it's doing something like:

echo 0 > ${ns}/enable
rmdir ${ns}
<random delay>
mkdir ${ns}
echo "<dev>" > ${ns}/device_path
echo "<grpid>" > ${ns}/ana_grpid
uuidgen > ${ns}/device_uuid
echo 1 > ${ns}/enable

repeatedly with several namespaces and several ANA groups.
This leads to an unrecoverable system where the scanning processes
are stuck in the partition scanning code triggered via
'device_add_disk()', waiting for I/O which will never complete.
There are two parts to fixing this:
we need to ensure that NSHEAD_DISK_LIVE is set when the ns_head
is live and cleared when the last path is gone, and we need to
kick the requeue list after NSHEAD_DISK_LIVE has been cleared
to flush all outstanding I/O.

Turns out there's another corner case: when running the same test
but changing the UUID without removing the namespaces, we end up
with I/O constantly being retried. If this happens during the
partition scan kicked off from device_add_disk() the system gets
stuck, as the scan_mutex will never be released. To fix this I've
introduced an NVME_NSHEAD_DISABLE_QUEUEING flag to inhibit queueing
during the scan, such that device_add_disk() can make progress.

With these patches (and the queue freeze patchset from hch) the problem
is resolved and the testcase runs without issues.
I'll see to getting the testcase added to blktests.

As usual, comments and reviews are welcome.

Changes to v5:
- Introduce NVME_NSHEAD_DISABLE_QUEUEING flag instead of disabling
  command retries

Changes to v4:
- Disable command retries when the controller is removed instead of
  (ab-)using the failfast flag

Changes to v3:
- Update patch description as suggested by Sagi
- Drop patch to requeue I/O after ANA state changes

Changes to v2:
- Include reviews from Sagi
- Drop the check for NSHEAD_DISK_LIVE in nvme_available_path()
- Add a patch to requeue I/O if the ANA state changed
- Set the 'failfast' flag when removing controllers

Changes to the original submission:
- Drop patch to remove existing namespaces on ID mismatch
- Combine patches updating NSHEAD_DISK_LIVE handling
- requeue I/O after NSHEAD_DISK_LIVE has been cleared

Hannes Reinecke (3):
  nvme-multipath: fixup typo when clearing DISK_LIVE
  nvme-multipath: avoid hang on inaccessible namespaces
  nvme-multipath: skip inaccessible paths during partition scan

 drivers/nvme/host/multipath.c | 22 +++++++++++++++++++---
 drivers/nvme/host/nvme.h      |  1 +
 2 files changed, 20 insertions(+), 3 deletions(-)

-- 
2.35.3



* [PATCH 0/3] nvme: NSHEAD_DISK_LIVE fixes
@ 2024-09-11 12:46 Hannes Reinecke
  2024-09-11 12:46 ` [PATCH 1/3] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
  0 siblings, 1 reply; 14+ messages in thread
From: Hannes Reinecke @ 2024-09-11 12:46 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

Hi all,

I have a testcase which repeatedly deletes namespaces on the target
and creates new ones, aggressively re-using NSIDs for the new
namespaces.
To add to the fun, these namespaces are created on different nodes
in the cluster, where only the paths local to the respective cluster
node are active and all other paths are inaccessible.

Essentially it's doing something like:

echo 0 > ${ns}/enable
rmdir ${ns}
<random delay>
mkdir ${ns}
echo "<dev>" > ${ns}/device_path
echo "<grpid>" > ${ns}/ana_grpid
uuidgen > ${ns}/device_uuid
echo 1 > ${ns}/enable

repeatedly with several namespaces and several ANA groups.
This leads to an unrecoverable system where the scanning processes
are stuck in the partition scanning code triggered via
'device_add_disk()', waiting for I/O which will never complete.
There are two parts to fixing this:
we need to ensure that NSHEAD_DISK_LIVE is set when the ns_head
is live and cleared when the last path is gone, and we need to
kick the requeue list after NSHEAD_DISK_LIVE has been cleared
to flush all outstanding I/O.

Turns out there's another corner case: when running the same test
but changing the UUID without removing the namespaces, we end up
with I/O constantly being retried. If this happens during the
partition scan kicked off from device_add_disk() the system gets
stuck, as the scan_mutex will never be released. To fix this I've
introduced an NVME_NSHEAD_FAIL_ON_LAST_PATH flag to inhibit
queueing during the scan when only one path is left, such that
device_add_disk() can make progress.

With these patches (and the queue freeze patchset from hch) the problem
is resolved and the testcase runs without issues.
I'll see to getting the testcase added to blktests.

As usual, comments and reviews are welcome.

Changes to v6:
- Rename flag to NVME_NSHEAD_FAIL_ON_LAST_PATH and fail I/O only
  on the last path (Suggested by Sagi)
- Retrigger pending I/O on every ANA state change

Changes to v5:
- Introduce NVME_NSHEAD_DISABLE_QUEUEING flag instead of disabling
  command retries

Changes to v4:
- Disable command retries when the controller is removed instead of
  (ab-)using the failfast flag

Changes to v3:
- Update patch description as suggested by Sagi
- Drop patch to requeue I/O after ANA state changes

Changes to v2:
- Include reviews from Sagi
- Drop the check for NSHEAD_DISK_LIVE in nvme_available_path()
- Add a patch to requeue I/O if the ANA state changed
- Set the 'failfast' flag when removing controllers

Changes to the original submission:
- Drop patch to remove existing namespaces on ID mismatch
- Combine patches updating NSHEAD_DISK_LIVE handling
- requeue I/O after NSHEAD_DISK_LIVE has been cleared

Hannes Reinecke (3):
  nvme-multipath: fixup typo when clearing DISK_LIVE
  nvme-multipath: avoid hang on inaccessible namespaces
  nvme-multipath: stuck partition scan on inaccessible paths

 drivers/nvme/host/multipath.c | 29 ++++++++++++++++++++++++++---
 drivers/nvme/host/nvme.h      |  1 +
 2 files changed, 27 insertions(+), 3 deletions(-)

-- 
2.35.3




end of thread, other threads:[~2024-09-12  9:37 UTC | newest]

Thread overview: 14+ messages
2024-09-06 10:16 [PATCHv4 0/3] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
2024-09-06 10:16 ` [PATCH 1/3] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
2024-09-06 10:16 ` [PATCH 2/3] nvme-multipath: avoid hang on inaccessible namespaces Hannes Reinecke
2024-09-08  7:08   ` Sagi Grimberg
2024-09-06 10:16 ` [PATCH 3/3] nvme: 'nvme disconnect' hangs after remapping namespaces Hannes Reinecke
2024-09-06 10:35   ` Damien Le Moal
2024-09-06 10:42     ` Hannes Reinecke
2024-09-08  7:21   ` Sagi Grimberg
2024-09-09  6:22     ` Hannes Reinecke
2024-09-09  6:59       ` Hannes Reinecke
  -- strict thread matches above, loose matches on Subject: below --
2024-09-09  7:19 [PATCHv5 0/3] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
2024-09-09  7:19 ` [PATCH 1/3] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
2024-09-11  9:51 [PATCHv6 0/3] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
2024-09-11  9:51 ` [PATCH 1/3] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
2024-09-11 12:46 [PATCH 0/3] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
2024-09-11 12:46 ` [PATCH 1/3] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
2024-09-12  9:37   ` Christoph Hellwig
