From: Sagi Grimberg <sagi@grimberg.me>
To: Caleb Sander <csander@purestorage.com>
Cc: Christoph Hellwig <hch@lst.de>,
paulmck@kernel.org, Keith Busch <kbusch@kernel.org>,
Jens Axboe <axboe@fb.com>,
linux-nvme@lists.infradead.org,
Uday Shankar <ushankar@purestorage.com>
Subject: Re: [PATCH] nvme: fix SRCU protection of nvme_ns_head list
Date: Thu, 24 Nov 2022 16:17:56 +0200 [thread overview]
Message-ID: <c19cbb68-ec39-0cfc-7e9b-772fa9d0dc77@grimberg.me> (raw)
In-Reply-To: <CADUfDZpv43JmNiriEVQ_mFxZVS8QAcmbJaG3rK5KoUBaC9s17g@mail.gmail.com>
On 11/24/22 02:12, Caleb Sander wrote:
> On Tue, Nov 22, 2022 at 7:08 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>>
>>
>>>> 3. removes ns from head sibling list + synchronize rcu
>>>> -> this should fence non-sleeping traversals (like revalidate_paths)
>>>
>>> Well, non-sleeping would only matter if those non-sleeping traversals
>>> are under rcu_read_lock(), but they are not. They are either part of
>>> a longer srcu critical section because other code can sleep, or in
>>> case of revalidate_paths unprotected at all (which this patch fixes).
>>
>> The original patch comment was that rcu_read_lock/unlock would be
>> sufficient and that we don't need to touch nvme_ns_remove().
>>
>>>
>>>> Maybe it is OK to have it also srcu locked and just accept that the
>>>> nshead sibling list is srcu protected. In that case, your patch
>>>> needs to extend the srcu to also cover clearing the current_path pointer.
>>>
>>> I don't see how nvme_mpath_clear_current_path needs (S)RCU protection.
>>> It never dereferences the current_path, it just checks it for pointer
>>> equality and, if they match, clears it to NULL. (I wonder if it should
>>> use cmpxchg though).
>>
>> Agree, it can stay out, because at this point it does not compete with
>> concurrent submissions due to prior synchronizations. The list traversal
>> needs to be under the rcu lock.
>>
>>
>>>
>>>> But looking again at your bug report, you mention that there are
>>>> concurrent scans, one removing the ns and another accessing it.
>>>> That cannot happen due to the scan_lock held around this section afaict.
>>>>
>>>> I guess it can be that, in general, ns removal can compete with a scan
>>>> if some controller behavior transiently failed an identify command
>>>> in a prior scan and a subsequent scan finds the ns again. It is worth
>>>> pinning down exactly what happened in the race you hit, because we may
>>>> have a different issue that manifests in other ways.
>>>
>>> So scanning itself should be single threaded as it only happens from
>>> the workqueue. But nvme_ns_remove can be called from
>>> nvme_remove_namespaces, and in 6.1 and earlier from the passthrough
>>> handler.
>>
>> The original patch report did not include any sequence that removes all
>> namespaces, and given that it came from a RockyLinux 8.6 kernel, it is
>> not 6.1... Hence I think we need to understand how a namespace removal
>> happened at the same time that the namespace is being scanned. Maybe
>> something else is broken.
>
> We can reliably cause the panic by sending a "NS Changed" AEN
> from multiple controllers at the same time, resulting in multiple scan works
> running concurrently for the same namespace heads.
That is fine.
> In our test, we have 4 controllers for the same subsystem, with 9 namespaces
> that are added one at a time, resulting in many AENs for each controller.
> We can see from the dmesg logs that the controllers' scan works overlap:
> [37311.530367] nvme nvme0: queue_size 128 > ctrl sqsize 32, clamping down
> [37311.530398] nvme nvme0: creating 32 I/O queues.
> [37311.819883] nvme nvme0: mapped 32/0/0 default/read/poll queues.
> [37311.828129] nvme nvme0: new ctrl: NQN
> "nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
> 192.168.1.110:4420
> [37311.924908] nvme nvme1: queue_size 128 > ctrl sqsize 32, clamping down
> [37311.924935] nvme nvme1: creating 32 I/O queues.
> [37312.298561] nvme nvme1: mapped 32/0/0 default/read/poll queues.
> [37312.306296] nvme nvme1: new ctrl: NQN
> "nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
> 192.168.3.110:4420
> [37312.400143] nvme nvme2: queue_size 128 > ctrl sqsize 32, clamping down
> [37312.400180] nvme nvme2: creating 32 I/O queues.
> [37312.671861] nvme nvme2: mapped 32/0/0 default/read/poll queues.
> [37312.678318] nvme nvme2: new ctrl: NQN
> "nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
> 192.168.2.110:4420
> [37312.760833] nvme nvme3: queue_size 128 > ctrl sqsize 32, clamping down
> [37312.760860] nvme nvme3: creating 32 I/O queues.
> [37313.123490] nvme nvme3: mapped 32/0/0 default/read/poll queues.
> [37313.130407] nvme nvme3: new ctrl: NQN
> "nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
> 192.168.4.110:4420
> [37313.654120] nvme nvme3: rescanning namespaces.
> [37313.654152] nvme nvme2: rescanning namespaces.
> [37313.654867] nvme0n1: detected capacity change from 0 to 11811160064
> [37313.654876] nvme0n1: detected capacity change from 0 to 11811160064
> [37313.655573] nvme nvme0: rescanning namespaces.
> [37313.655694] nvme nvme1: rescanning namespaces.
> [37313.656405] nvme0n1: detected capacity change from 0 to 11811160064
> [37313.656445] nvme0n1: detected capacity change from 0 to 11811160064
> [37313.897745] nvme nvme3: rescanning namespaces.
> [37313.897748] nvme nvme2: rescanning namespaces.
> [37313.898614] nvme0n2: detected capacity change from 0 to 11811160064
> [37313.907348] nvme nvme0: rescanning namespaces.
> [37313.907409] nvme nvme1: rescanning namespaces.
> [37314.191586] nvme nvme2: rescanning namespaces.
> [37314.191589] nvme nvme3: rescanning namespaces.
> [37314.193241] nvme nvme0: rescanning namespaces.
> [37314.193303] nvme nvme1: rescanning namespaces.
> [37314.205965] nvme0n3: detected capacity change from 0 to 11811160064
> [37314.206026] nvme0n3: detected capacity change from 0 to 11811160064
> [37314.206036] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000050
> [37314.206036] nvme0n3: detected capacity change from 0 to 11811160064
>
> I don't see any mechanism to prevent scan work from running concurrently.
> scan_lock is per-controller, but there are 4 controllers in our test.
> I'm not very familiar with work queues in Linux, but it looks like they can run
> multiple pieces of work concurrently.
The scan is running sequentially per-controller; nothing prevents
different controllers from scanning at the same time.
> The nvme-wq appears to be unbound (not per-CPU) and has a high max concurrency:
> $ cat /sys/bus/workqueue/devices/nvme-wq/per_cpu
> 0
> $ cat /sys/bus/workqueue/devices/nvme-wq/max_active
> 256
>
> At the end of the scan, nvme_remove_invalid_namespaces() is called,
> which I think explains how we end up with namespace removals during the scans.
I see. That makes sense: at least revalidate_paths can run in parallel
with the ns removal you indicated is taking place.
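In kernel-style pseudocode, the shape of the fix under discussion would be
to take the head's SRCU read lock around the sibling-list walk, so it
cannot race with the list_del_rcu() + synchronize_srcu() in
nvme_ns_remove(). Field and helper names follow the thread; the exact
code depends on the tree, so treat this as a sketch, not the patch:

```c
/* Sketch only -- not compilable outside the kernel tree. */
static void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
{
	struct nvme_ns_head *head = ns->head;
	struct nvme_ns *cur;
	int srcu_idx;

	srcu_idx = srcu_read_lock(&head->srcu);
	list_for_each_entry_rcu(cur, &head->list, siblings) {
		/* ... revalidate cur, e.g. compare capacities ... */
	}
	srcu_read_unlock(&head->srcu, srcu_idx);
}
```

Readers of the sibling list then always sit inside an SRCU critical
section, and the remover's synchronize_srcu() fences them.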
Thread overview: 25+ messages
2022-11-18 23:27 [PATCH] nvme: fix SRCU protection of nvme_ns_head list Caleb Sander
2022-11-20 11:24 ` Sagi Grimberg
2022-11-21 7:40 ` Christoph Hellwig
2022-11-21 9:43 ` Sagi Grimberg
2022-11-21 14:57 ` Paul E. McKenney
2022-11-21 17:48 ` Caleb Sander
2022-11-21 17:59 ` Paul E. McKenney
2022-11-21 19:58 ` Caleb Sander
2022-11-22 0:25 ` Paul E. McKenney
2022-11-22 10:06 ` Sagi Grimberg
2022-11-22 12:14 ` Christoph Hellwig
2022-11-22 15:08 ` Sagi Grimberg
2022-11-24 0:12 ` Caleb Sander
2022-11-24 3:08 ` Chao Leng
2022-11-24 14:17 ` Sagi Grimberg [this message]
2022-12-01 21:27 ` Caleb Sander
2022-12-01 23:18 ` Paul E. McKenney
2022-11-24 0:24 ` [PATCH] nvme: fix (S)RCU protection of nvme_ns_head list (alternate) Caleb Sander
2022-11-24 14:19 ` Sagi Grimberg
2022-11-29 8:39 ` Christoph Hellwig
2022-11-30 8:25 ` Sagi Grimberg
2022-11-30 8:35 ` Christoph Hellwig
2022-11-30 8:40 ` Sagi Grimberg
2022-12-01 21:17 ` Caleb Sander
2022-12-02 1:21 ` Chao Leng