* [PATCH 1/1] pNFS: Serialize SCSI PR registration to avoid reservation conflicts
@ 2026-03-02 0:51 Dai Ngo
2026-03-03 15:33 ` Christoph Hellwig
0 siblings, 1 reply; 5+ messages in thread
From: Dai Ngo @ 2026-03-02 0:51 UTC (permalink / raw)
To: trondmy, anna; +Cc: linux-nfs
With SCSI layouts, the NFS client must not submit I/O to the data server
until the Persistent Reservation (PR) registration has completed.
Currently, bl_register_scsi() sets PNFS_BDEV_REGISTERED before performing
the PR operation. If multiple threads concurrently start I/O to the same
SCSI device, the first thread sets the flag and begins registration,
while other threads observe the flag, skip registration, and proceed to
issue I/O. Those I/Os can hit RESERVATION CONFLICT, forcing fall back to
the MDS.
Protect the registration path with a mutex so only one thread performs
PR registration at a time. Other threads wait for registration to finish
and only then re-check PNFS_BDEV_REGISTERED, ensuring no I/O is issued
until PR registration is complete.
This problem can be reproduced by running 'fio' test with this
workload:
[global]
ioengine=libaio # Also try 'io_uring'
direct=1 # Bypass local cache to stress pNFS paths
verify=0
rw=read # 100% Reads
bs=2M # Large blocks to saturate bandwidth
size=10G # Large enough to outsize any cache
runtime=60
time_based
group_reporting
directory=/tmp/mnt/testvm1 # Your pNFS mount point
[parallel-stress]
numjobs=4 # Multiple threads to trigger multiple DS connections
iodepth=8 # Keep the I/O pipeline full
[parallel-stress2]
numjobs=4 # Multiple threads to trigger multiple DS connections
iodepth=8 # Keep the I/O pipeline full
[parallel-stress3]
numjobs=4 # Multiple threads to trigger multiple DS connections
iodepth=8 # Keep the I/O pipeline full
[parallel-stress4]
numjobs=4 # Multiple threads to trigger multiple DS connections
iodepth=8 # Keep the I/O pipeline full
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
---
fs/nfs/blocklayout/blocklayout.h | 1 +
fs/nfs/blocklayout/dev.c | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 6da40ca19570..535db8b0e89c 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -117,6 +117,7 @@ struct pnfs_block_dev {
bool (*map)(struct pnfs_block_dev *dev, u64 offset,
struct pnfs_block_dev_map *map);
+ struct mutex pbd_mutex;
};
/* pnfs_block_dev flag bits */
diff --git a/fs/nfs/blocklayout/dev.c b/fs/nfs/blocklayout/dev.c
index cc6327d97a91..45630781f1a8 100644
--- a/fs/nfs/blocklayout/dev.c
+++ b/fs/nfs/blocklayout/dev.c
@@ -33,10 +33,14 @@ static bool bl_register_scsi(struct pnfs_block_dev *dev)
const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
int status;
- if (test_and_set_bit(PNFS_BDEV_REGISTERED, &dev->flags))
+ mutex_lock(&dev->pbd_mutex);
+ if (test_and_set_bit(PNFS_BDEV_REGISTERED, &dev->flags)) {
+ mutex_unlock(&dev->pbd_mutex);
return true;
+ }
status = ops->pr_register(bdev, 0, dev->pr_key, true);
+ mutex_unlock(&dev->pbd_mutex);
if (status) {
trace_bl_pr_key_reg_err(bdev, dev->pr_key, status);
return false;
@@ -572,6 +576,7 @@ bl_alloc_deviceid_node(struct nfs_server *server, struct pnfs_device *pdev,
top = kzalloc_obj(*top, gfp_mask);
if (!top)
goto out_free_volumes;
+ mutex_init(&top->pbd_mutex);
ret = bl_parse_deviceid(server, top, volumes, nr_volumes - 1, gfp_mask);
--
2.47.3
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH 1/1] pNFS: Serialize SCSI PR registration to avoid reservation conflicts
2026-03-02 0:51 [PATCH 1/1] pNFS: Serialize SCSI PR registration to avoid reservation conflicts Dai Ngo
@ 2026-03-03 15:33 ` Christoph Hellwig
2026-03-03 16:11 ` Chuck Lever
2026-03-03 18:21 ` Dai Ngo
0 siblings, 2 replies; 5+ messages in thread
From: Christoph Hellwig @ 2026-03-03 15:33 UTC (permalink / raw)
To: Dai Ngo; +Cc: trondmy, anna, linux-nfs
On Sun, Mar 01, 2026 at 04:51:23PM -0800, Dai Ngo wrote:
> This problem can be reproduced by running 'fio' test with this
> workload:
I wish we could wire this up somewhere. Not sure what the right
place for these kinds of nfs tests are, though.
> 2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
> index 6da40ca19570..535db8b0e89c 100644
> --- a/fs/nfs/blocklayout/blocklayout.h
> +++ b/fs/nfs/blocklayout/blocklayout.h
> @@ -117,6 +117,7 @@ struct pnfs_block_dev {
>
> bool (*map)(struct pnfs_block_dev *dev, u64 offset,
> struct pnfs_block_dev_map *map);
> + struct mutex pbd_mutex;
Can you keep this up with the non-function pointer fields? I guess
pbd_registration_lock might be a more descriptive name, and comment
explaining what the lock protects also never hurts.
> diff --git a/fs/nfs/blocklayout/dev.c b/fs/nfs/blocklayout/dev.c
> index cc6327d97a91..45630781f1a8 100644
> --- a/fs/nfs/blocklayout/dev.c
> +++ b/fs/nfs/blocklayout/dev.c
> @@ -33,10 +33,14 @@ static bool bl_register_scsi(struct pnfs_block_dev *dev)
> const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
> int status;
>
> - if (test_and_set_bit(PNFS_BDEV_REGISTERED, &dev->flags))
> + mutex_lock(&dev->pbd_mutex);
> + if (test_and_set_bit(PNFS_BDEV_REGISTERED, &dev->flags)) {
> + mutex_unlock(&dev->pbd_mutex);
> return true;
> + }
This seems to only lock the registration side, and not the
unregistration side, which is a bit odd. If you fully protect
register/unregister we also don't need atomic bitops for
PNFS_BDEV_REGISTERED and have a more consistent locking scheme.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 1/1] pNFS: Serialize SCSI PR registration to avoid reservation conflicts
2026-03-03 15:33 ` Christoph Hellwig
@ 2026-03-03 16:11 ` Chuck Lever
2026-03-03 18:21 ` Dai Ngo
1 sibling, 0 replies; 5+ messages in thread
From: Chuck Lever @ 2026-03-03 16:11 UTC (permalink / raw)
To: Christoph Hellwig, Dai Ngo; +Cc: Trond Myklebust, Anna Schumaker, linux-nfs
On Tue, Mar 3, 2026, at 10:33 AM, Christoph Hellwig wrote:
> On Sun, Mar 01, 2026 at 04:51:23PM -0800, Dai Ngo wrote:
>> This problem can be reproduced by running 'fio' test with this
>> workload:
>
> I wish we could wire this up somewhere. Not sure what the right
> place for these kinds of nfs tests are, though.
Not fio specifically, but I spent some time last summer adding a
pNFS with iSCSI testing workflow to kdevops -- tests with fstests,
the git tool's regression suite, Mora's nfstest, and pynfs. Since
then I've been struggling to find resources to add this to the
NFSD test matrix ...
--
Chuck Lever
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 1/1] pNFS: Serialize SCSI PR registration to avoid reservation conflicts
2026-03-03 15:33 ` Christoph Hellwig
2026-03-03 16:11 ` Chuck Lever
@ 2026-03-03 18:21 ` Dai Ngo
2026-03-04 13:11 ` Christoph Hellwig
1 sibling, 1 reply; 5+ messages in thread
From: Dai Ngo @ 2026-03-03 18:21 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: trondmy, anna, linux-nfs
On 3/3/26 7:33 AM, Christoph Hellwig wrote:
> On Sun, Mar 01, 2026 at 04:51:23PM -0800, Dai Ngo wrote:
>> This problem can be reproduced by running 'fio' test with this
>> workload:
> I wish we could wire this up somewhere. Not sure what the right
> place for these kinds of nfs tests are, though.
remove in v2.
>
>> 2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
>> index 6da40ca19570..535db8b0e89c 100644
>> --- a/fs/nfs/blocklayout/blocklayout.h
>> +++ b/fs/nfs/blocklayout/blocklayout.h
>> @@ -117,6 +117,7 @@ struct pnfs_block_dev {
>>
>> bool (*map)(struct pnfs_block_dev *dev, u64 offset,
>> struct pnfs_block_dev_map *map);
>> + struct mutex pbd_mutex;
> Can you keep this up with the non-function pointer fields?
Can you please clarify this, you meant move this mutex up above
the (*map)() declaration?
> I guess
> pbd_registration_lock might be a more descriptive name, and comment
> explaining what the lock protects also never hurts.
Renamed to pbd_registration_mutex and add a description in v2.
>
>> diff --git a/fs/nfs/blocklayout/dev.c b/fs/nfs/blocklayout/dev.c
>> index cc6327d97a91..45630781f1a8 100644
>> --- a/fs/nfs/blocklayout/dev.c
>> +++ b/fs/nfs/blocklayout/dev.c
>> @@ -33,10 +33,14 @@ static bool bl_register_scsi(struct pnfs_block_dev *dev)
>> const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
>> int status;
>>
>> - if (test_and_set_bit(PNFS_BDEV_REGISTERED, &dev->flags))
>> + mutex_lock(&dev->pbd_mutex);
>> + if (test_and_set_bit(PNFS_BDEV_REGISTERED, &dev->flags)) {
>> + mutex_unlock(&dev->pbd_mutex);
>> return true;
>> + }
> This seems to only lock the registration side, and not the
> unregistration side, which is a bit odd.
The reason I did not use the mutex on unregistration is because
unregistration happens when the export is unmounted and I don't
see any race condition can happen at that time. Besides, even if
there is race condition on the unregistration the consequence is
a duplicate SCSI PR unregistration which is harmless.
However, if you think we should also protect the unregistration
then I can add it in. At the very least, it makes the code look
symmetric.
> If you fully protect
> register/unregister we also don't need atomic bitops for
> PNFS_BDEV_REGISTERED and have a more consistent locking scheme.
Even we fully protect register/unregister don't we still need the
PNFS_BDEV_REGISTERED bit so the others thread can check and skip
the register/unregister op?
Thanks,
-Dai
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 1/1] pNFS: Serialize SCSI PR registration to avoid reservation conflicts
2026-03-03 18:21 ` Dai Ngo
@ 2026-03-04 13:11 ` Christoph Hellwig
0 siblings, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2026-03-04 13:11 UTC (permalink / raw)
To: Dai Ngo; +Cc: Christoph Hellwig, trondmy, anna, linux-nfs
On Tue, Mar 03, 2026 at 10:21:07AM -0800, Dai Ngo wrote:
> > > struct pnfs_block_dev_map *map);
> > > + struct mutex pbd_mutex;
> > Can you keep this up with the non-function pointer fields?
>
> Can you please clarify this, you meant move this mutex up above
> the (*map)() declaration?
Yes.
> > > - if (test_and_set_bit(PNFS_BDEV_REGISTERED, &dev->flags))
> > > + mutex_lock(&dev->pbd_mutex);
> > > + if (test_and_set_bit(PNFS_BDEV_REGISTERED, &dev->flags)) {
> > > + mutex_unlock(&dev->pbd_mutex);
> > > return true;
> > > + }
> > This seems to only lock the registration side, and not the
> > unregistration side, which is a bit odd.
>
> The reason I did not use the mutex on unregistration is because
> unregistration happens when the export is unmounted and I don't
> see any race condition can happen at that time. Besides, even if
> there is race condition on the unregistration the consequence is
> a duplicate SCSI PR unregistration which is harmless.
>
> However, if you think we should also protect the unregistration
> then I can add it in. At the very least, it makes the code look
> symmetric.
And it clearly defines what the mutex protects, so please yes.
>
> > If you fully protect
> > register/unregister we also don't need atomic bitops for
> > PNFS_BDEV_REGISTERED and have a more consistent locking scheme.
>
> Even we fully protect register/unregister don't we still need the
> PNFS_BDEV_REGISTERED bit so the others thread can check and skip
> the register/unregister op?
Yes, but it doesn't need to use the atomic bit ops.
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2026-03-04 13:11 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-02 0:51 [PATCH 1/1] pNFS: Serialize SCSI PR registration to avoid reservation conflicts Dai Ngo
2026-03-03 15:33 ` Christoph Hellwig
2026-03-03 16:11 ` Chuck Lever
2026-03-03 18:21 ` Dai Ngo
2026-03-04 13:11 ` Christoph Hellwig
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox