public inbox for linux-nfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/1] pNFS: Serialize SCSI PR registration to avoid reservation conflicts
@ 2026-03-02  0:51 Dai Ngo
  2026-03-03 15:33 ` Christoph Hellwig
  0 siblings, 1 reply; 5+ messages in thread
From: Dai Ngo @ 2026-03-02  0:51 UTC (permalink / raw)
  To: trondmy, anna; +Cc: linux-nfs

With SCSI layouts, the NFS client must not submit I/O to the data server
until the Persistent Reservation (PR) registration has completed.

Currently, bl_register_scsi() sets PNFS_BDEV_REGISTERED before performing
the PR operation. If multiple threads concurrently start I/O to the same
SCSI device, the first thread sets the flag and begins registration,
while other threads observe the flag, skip registration, and proceed to
issue I/O. Those I/Os can hit RESERVATION CONFLICT, forcing fall back to
the MDS.

Protect the registration path with a mutex so only one thread performs
PR registration at a time. Other threads wait for registration to finish
and only then re-check PNFS_BDEV_REGISTERED, ensuring no I/O is issued
until PR registration is complete.

This problem can be reproduced by running 'fio' test with this
workload:

[global]
ioengine=libaio     # Also try 'io_uring'
direct=1            # Bypass local cache to stress pNFS paths
verify=0
rw=read             # 100% Reads
bs=2M               # Large blocks to saturate bandwidth
size=10G            # Large enough to outsize any cache
runtime=60
time_based
group_reporting
directory=/tmp/mnt/testvm1 # Your pNFS mount point

[parallel-stress]
numjobs=4           # Multiple threads to trigger multiple DS connections
iodepth=8           # Keep the I/O pipeline full
[parallel-stress2]
numjobs=4           # Multiple threads to trigger multiple DS connections
iodepth=8           # Keep the I/O pipeline full
[parallel-stress3]
numjobs=4           # Multiple threads to trigger multiple DS connections
iodepth=8           # Keep the I/O pipeline full
[parallel-stress4]
numjobs=4           # Multiple threads to trigger multiple DS connections
iodepth=8           # Keep the I/O pipeline full

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
---
 fs/nfs/blocklayout/blocklayout.h | 1 +
 fs/nfs/blocklayout/dev.c         | 7 ++++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 6da40ca19570..535db8b0e89c 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -117,6 +117,7 @@ struct pnfs_block_dev {
 
 	bool (*map)(struct pnfs_block_dev *dev, u64 offset,
 			struct pnfs_block_dev_map *map);
+	struct mutex			pbd_mutex;
 };
 
 /* pnfs_block_dev flag bits */
diff --git a/fs/nfs/blocklayout/dev.c b/fs/nfs/blocklayout/dev.c
index cc6327d97a91..45630781f1a8 100644
--- a/fs/nfs/blocklayout/dev.c
+++ b/fs/nfs/blocklayout/dev.c
@@ -33,10 +33,14 @@ static bool bl_register_scsi(struct pnfs_block_dev *dev)
 	const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
 	int status;
 
-	if (test_and_set_bit(PNFS_BDEV_REGISTERED, &dev->flags))
+	mutex_lock(&dev->pbd_mutex);
+	if (test_and_set_bit(PNFS_BDEV_REGISTERED, &dev->flags)) {
+		mutex_unlock(&dev->pbd_mutex);
 		return true;
+	}
 
 	status = ops->pr_register(bdev, 0, dev->pr_key, true);
+	mutex_unlock(&dev->pbd_mutex);
 	if (status) {
 		trace_bl_pr_key_reg_err(bdev, dev->pr_key, status);
 		return false;
@@ -572,6 +576,7 @@ bl_alloc_deviceid_node(struct nfs_server *server, struct pnfs_device *pdev,
 	top = kzalloc_obj(*top, gfp_mask);
 	if (!top)
 		goto out_free_volumes;
+	mutex_init(&top->pbd_mutex);
 
 	ret = bl_parse_deviceid(server, top, volumes, nr_volumes - 1, gfp_mask);
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/1] pNFS: Serialize SCSI PR registration to avoid reservation conflicts
  2026-03-02  0:51 [PATCH 1/1] pNFS: Serialize SCSI PR registration to avoid reservation conflicts Dai Ngo
@ 2026-03-03 15:33 ` Christoph Hellwig
  2026-03-03 16:11   ` Chuck Lever
  2026-03-03 18:21   ` Dai Ngo
  0 siblings, 2 replies; 5+ messages in thread
From: Christoph Hellwig @ 2026-03-03 15:33 UTC (permalink / raw)
  To: Dai Ngo; +Cc: trondmy, anna, linux-nfs

On Sun, Mar 01, 2026 at 04:51:23PM -0800, Dai Ngo wrote:
> This problem can be reproduced by running 'fio' test with this
> workload:

I wish we could wire this up somewhere.  Not sure what the right
place for these kinds of nfs tests are, though.

>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
> index 6da40ca19570..535db8b0e89c 100644
> --- a/fs/nfs/blocklayout/blocklayout.h
> +++ b/fs/nfs/blocklayout/blocklayout.h
> @@ -117,6 +117,7 @@ struct pnfs_block_dev {
>  
>  	bool (*map)(struct pnfs_block_dev *dev, u64 offset,
>  			struct pnfs_block_dev_map *map);
> +	struct mutex			pbd_mutex;

Can you keep this up with the non-function pointer fields?  I guess
pbd_registration_lock might be a more descriptive name, and comment
explaining what the lock protects also never hurts.

> diff --git a/fs/nfs/blocklayout/dev.c b/fs/nfs/blocklayout/dev.c
> index cc6327d97a91..45630781f1a8 100644
> --- a/fs/nfs/blocklayout/dev.c
> +++ b/fs/nfs/blocklayout/dev.c
> @@ -33,10 +33,14 @@ static bool bl_register_scsi(struct pnfs_block_dev *dev)
>  	const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
>  	int status;
>  
> -	if (test_and_set_bit(PNFS_BDEV_REGISTERED, &dev->flags))
> +	mutex_lock(&dev->pbd_mutex);
> +	if (test_and_set_bit(PNFS_BDEV_REGISTERED, &dev->flags)) {
> +		mutex_unlock(&dev->pbd_mutex);
>  		return true;
> +	}

This seems to only lock the registration side, and not the
unregistration side, which is a bit odd.  If you fully protect
register/unregister we also don't need atomic bitops for
PNFS_BDEV_REGISTERED and have a more consistent locking scheme.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/1] pNFS: Serialize SCSI PR registration to avoid reservation conflicts
  2026-03-03 15:33 ` Christoph Hellwig
@ 2026-03-03 16:11   ` Chuck Lever
  2026-03-03 18:21   ` Dai Ngo
  1 sibling, 0 replies; 5+ messages in thread
From: Chuck Lever @ 2026-03-03 16:11 UTC (permalink / raw)
  To: Christoph Hellwig, Dai Ngo; +Cc: Trond Myklebust, Anna Schumaker, linux-nfs



On Tue, Mar 3, 2026, at 10:33 AM, Christoph Hellwig wrote:
> On Sun, Mar 01, 2026 at 04:51:23PM -0800, Dai Ngo wrote:
>> This problem can be reproduced by running 'fio' test with this
>> workload:
>
> I wish we could wire this up somewhere.  Not sure what the right
> place for these kinds of nfs tests are, though.

Not fio specifically, but I spent some time last summer adding a
pNFS with iSCSI testing workflow to kdevops -- tests with fstests,
the git tool's regression suite, Mora's nfstest, and pynfs. Since
then I've been struggling to find resources to add this to the
NFSD test matrix ...


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/1] pNFS: Serialize SCSI PR registration to avoid reservation conflicts
  2026-03-03 15:33 ` Christoph Hellwig
  2026-03-03 16:11   ` Chuck Lever
@ 2026-03-03 18:21   ` Dai Ngo
  2026-03-04 13:11     ` Christoph Hellwig
  1 sibling, 1 reply; 5+ messages in thread
From: Dai Ngo @ 2026-03-03 18:21 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: trondmy, anna, linux-nfs


On 3/3/26 7:33 AM, Christoph Hellwig wrote:
> On Sun, Mar 01, 2026 at 04:51:23PM -0800, Dai Ngo wrote:
>> This problem can be reproduced by running 'fio' test with this
>> workload:
> I wish we could wire this up somewhere.  Not sure what the right
> place for these kinds of nfs tests are, though.

remove in v2.

>
>>   2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
>> index 6da40ca19570..535db8b0e89c 100644
>> --- a/fs/nfs/blocklayout/blocklayout.h
>> +++ b/fs/nfs/blocklayout/blocklayout.h
>> @@ -117,6 +117,7 @@ struct pnfs_block_dev {
>>   
>>   	bool (*map)(struct pnfs_block_dev *dev, u64 offset,
>>   			struct pnfs_block_dev_map *map);
>> +	struct mutex			pbd_mutex;
> Can you keep this up with the non-function pointer fields?

Can you please clarify this, you meant move this mutex up above
the (*map)() declaration?

>    I guess
> pbd_registration_lock might be a more descriptive name, and comment
> explaining what the lock protects also never hurts.

Renamed to pbd_registration_mutex and add a description in v2.

>
>> diff --git a/fs/nfs/blocklayout/dev.c b/fs/nfs/blocklayout/dev.c
>> index cc6327d97a91..45630781f1a8 100644
>> --- a/fs/nfs/blocklayout/dev.c
>> +++ b/fs/nfs/blocklayout/dev.c
>> @@ -33,10 +33,14 @@ static bool bl_register_scsi(struct pnfs_block_dev *dev)
>>   	const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
>>   	int status;
>>   
>> -	if (test_and_set_bit(PNFS_BDEV_REGISTERED, &dev->flags))
>> +	mutex_lock(&dev->pbd_mutex);
>> +	if (test_and_set_bit(PNFS_BDEV_REGISTERED, &dev->flags)) {
>> +		mutex_unlock(&dev->pbd_mutex);
>>   		return true;
>> +	}
> This seems to only lock the registration side, and not the
> unregistration side, which is a bit odd.

The reason I did not use the mutex on unregistration is because
unregistration happens when the export is unmounted and I don't
see any race condition can happen at that time. Besides, even if
there is race condition on the unregistration the consequence is
a duplicate SCSI PR unregistration which is harmless.

However, if you think we should also protect the unregistration
then I can add it in. At the very least, it makes the code look
symmetric.
  

>    If you fully protect
> register/unregister we also don't need atomic bitops for
> PNFS_BDEV_REGISTERED and have a more consistent locking scheme.

Even we fully protect register/unregister don't we still need the
PNFS_BDEV_REGISTERED bit so the others thread can check and skip
the register/unregister op?

Thanks,
-Dai


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/1] pNFS: Serialize SCSI PR registration to avoid reservation conflicts
  2026-03-03 18:21   ` Dai Ngo
@ 2026-03-04 13:11     ` Christoph Hellwig
  0 siblings, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2026-03-04 13:11 UTC (permalink / raw)
  To: Dai Ngo; +Cc: Christoph Hellwig, trondmy, anna, linux-nfs

On Tue, Mar 03, 2026 at 10:21:07AM -0800, Dai Ngo wrote:
> > >   			struct pnfs_block_dev_map *map);
> > > +	struct mutex			pbd_mutex;
> > Can you keep this up with the non-function pointer fields?
> 
> Can you please clarify this, you meant move this mutex up above
> the (*map)() declaration?

Yes.

> > > -	if (test_and_set_bit(PNFS_BDEV_REGISTERED, &dev->flags))
> > > +	mutex_lock(&dev->pbd_mutex);
> > > +	if (test_and_set_bit(PNFS_BDEV_REGISTERED, &dev->flags)) {
> > > +		mutex_unlock(&dev->pbd_mutex);
> > >   		return true;
> > > +	}
> > This seems to only lock the registration side, and not the
> > unregistration side, which is a bit odd.
> 
> The reason I did not use the mutex on unregistration is because
> unregistration happens when the export is unmounted and I don't
> see any race condition can happen at that time. Besides, even if
> there is race condition on the unregistration the consequence is
> a duplicate SCSI PR unregistration which is harmless.
> 
> However, if you think we should also protect the unregistration
> then I can add it in. At the very least, it makes the code look
> symmetric.

And it clearly defines what the mutex protects, so please yes.

> 
> >    If you fully protect
> > register/unregister we also don't need atomic bitops for
> > PNFS_BDEV_REGISTERED and have a more consistent locking scheme.
> 
> Even we fully protect register/unregister don't we still need the
> PNFS_BDEV_REGISTERED bit so the others thread can check and skip
> the register/unregister op?

Yes, but it doesn't need to use the atomic bit ops.


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-03-04 13:11 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-02  0:51 [PATCH 1/1] pNFS: Serialize SCSI PR registration to avoid reservation conflicts Dai Ngo
2026-03-03 15:33 ` Christoph Hellwig
2026-03-03 16:11   ` Chuck Lever
2026-03-03 18:21   ` Dai Ngo
2026-03-04 13:11     ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox