[RFC PATCH 0/4] Snapshot of fixes for SCSI PR key registration

public inbox for linux-nfs@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH 0/4] Snapshot of fixes for SCSI PR key registration
@ 2024-06-19 17:39 cel
  2024-06-19 17:39 ` [RFC PATCH 1/4] nfs/blocklayout: SCSI layout trace points for reservation key reg/unreg cel
                   ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: cel @ 2024-06-19 17:39 UTC (permalink / raw)
  To: linux-nfs; +Cc: Christoph Hellwig, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

With "Fix premature PR key unregistration", generic/069 seems happy
now. It's kind of a brute-force fix, though. The race window narrows
significantly when "Use bulk page allocation APIs" is applied, which
suggests this issue might not appear in every environment.

However, I still see:
 - generic/108 throw PR-related block I/O errors
 - instances of double key registration and unregistration

Looking for comments and advice while I proceed with more
troubleshooting.

Chuck Lever (4):
  nfs/blocklayout: SCSI layout trace points for reservation key
    reg/unreg
  nfs/blocklayout: Report only when /no/ device is found
  nfs/blocklayout: Fix premature PR key unregistration
  nfs/blocklayout: Use bulk page allocation APIs

 fs/nfs/blocklayout/blocklayout.c |  9 ++++-
 fs/nfs/blocklayout/blocklayout.h |  1 +
 fs/nfs/blocklayout/dev.c         | 63 +++++++++++++++++++++-----------
 fs/nfs/nfs4trace.c               |  5 +++
 fs/nfs/nfs4trace.h               | 62 +++++++++++++++++++++++++++++++
 fs/nfs/pnfs_dev.c                | 15 +++-----
 6 files changed, 123 insertions(+), 32 deletions(-)

-- 
2.45.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/4] nfs/blocklayout: SCSI layout trace points for reservation key reg/unreg
  2024-06-19 17:39 [RFC PATCH 0/4] Snapshot of fixes for SCSI PR key registration cel
@ 2024-06-19 17:39 ` cel
  2024-06-20  4:50   ` Christoph Hellwig
  2024-06-19 17:39 ` [RFC PATCH 2/4] nfs/blocklayout: Report only when /no/ device is found cel
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 30+ messages in thread
From: cel @ 2024-06-19 17:39 UTC (permalink / raw)
  To: linux-nfs; +Cc: Christoph Hellwig, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

An administrator cannot take action on these messages, but the
reported errors might be helpful for troubleshooting. Transition
them to trace points so these events appear in the trace log and
can be easily lined up with other traced NFS client operations.

Examples:

   append_writer-6147  [000]    80.247393: bl_pr_key_reg:        device=sdb key=0x666dcdabf29514fe
   append_writer-6147  [000]    80.247842: bl_pr_key_unreg:      device=sdb key=0x666dcdabf29514fe

     umount.nfs4-6172  [002]    84.950409: bl_pr_key_unreg_err:  device=sdb key=0x666dcdabf29514fe error=24

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfs/blocklayout/dev.c | 33 +++++++++++----------
 fs/nfs/nfs4trace.c       |  5 ++++
 fs/nfs/nfs4trace.h       | 62 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+), 15 deletions(-)

diff --git a/fs/nfs/blocklayout/dev.c b/fs/nfs/blocklayout/dev.c
index 519c310c745d..b3828e5ee079 100644
--- a/fs/nfs/blocklayout/dev.c
+++ b/fs/nfs/blocklayout/dev.c
@@ -10,6 +10,7 @@
 #include <linux/pr.h>
 
 #include "blocklayout.h"
+#include "../nfs4trace.h"
 
 #define NFSDBG_FACILITY		NFSDBG_PNFS_LD
 
@@ -24,14 +25,17 @@ bl_free_device(struct pnfs_block_dev *dev)
 		kfree(dev->children);
 	} else {
 		if (dev->pr_registered) {
-			const struct pr_ops *ops =
-				file_bdev(dev->bdev_file)->bd_disk->fops->pr_ops;
+			struct block_device *bdev = file_bdev(dev->bdev_file);
+			const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
 			int error;
 
-			error = ops->pr_register(file_bdev(dev->bdev_file),
-				dev->pr_key, 0, false);
+			error = ops->pr_register(bdev, dev->pr_key, 0, false);
 			if (error)
-				pr_err("failed to unregister PR key.\n");
+				trace_bl_pr_key_unreg_err(bdev->bd_disk->disk_name,
+							  dev->pr_key, error);
+			else
+				trace_bl_pr_key_unreg(bdev->bd_disk->disk_name,
+						      dev->pr_key);
 		}
 
 		if (dev->bdev_file)
@@ -327,8 +331,9 @@ bl_parse_scsi(struct nfs_server *server, struct pnfs_block_dev *d,
 		struct pnfs_block_volume *volumes, int idx, gfp_t gfp_mask)
 {
 	struct pnfs_block_volume *v = &volumes[idx];
-	struct file *bdev_file;
+	struct block_device *bdev;
 	const struct pr_ops *ops;
+	struct file *bdev_file;
 	int error;
 
 	if (!bl_validate_designator(v))
@@ -346,8 +351,9 @@ bl_parse_scsi(struct nfs_server *server, struct pnfs_block_dev *d,
 	if (IS_ERR(bdev_file))
 		return PTR_ERR(bdev_file);
 	d->bdev_file = bdev_file;
+	bdev = file_bdev(bdev_file);
 
-	d->len = bdev_nr_bytes(file_bdev(d->bdev_file));
+	d->len = bdev_nr_bytes(bdev);
 	d->map = bl_map_simple;
 	d->pr_key = v->scsi.pr_key;
 
@@ -356,23 +362,20 @@ bl_parse_scsi(struct nfs_server *server, struct pnfs_block_dev *d,
 		goto out_blkdev_put;
 	}
 
-	pr_info("pNFS: using block device %s (reservation key 0x%llx)\n",
-		file_bdev(d->bdev_file)->bd_disk->disk_name, d->pr_key);
-
-	ops = file_bdev(d->bdev_file)->bd_disk->fops->pr_ops;
+	ops = bdev->bd_disk->fops->pr_ops;
 	if (!ops) {
 		pr_err("pNFS: block device %s does not support reservations.",
-				file_bdev(d->bdev_file)->bd_disk->disk_name);
+				bdev->bd_disk->disk_name);
 		error = -EINVAL;
 		goto out_blkdev_put;
 	}
 
-	error = ops->pr_register(file_bdev(d->bdev_file), 0, d->pr_key, true);
+	error = ops->pr_register(bdev, 0, d->pr_key, true);
 	if (error) {
-		pr_err("pNFS: failed to register key for block device %s.",
-				file_bdev(d->bdev_file)->bd_disk->disk_name);
+		trace_bl_pr_key_reg_err(bdev->bd_disk->disk_name, d->pr_key, error);
 		goto out_blkdev_put;
 	}
+	trace_bl_pr_key_reg(bdev->bd_disk->disk_name, d->pr_key);
 
 	d->pr_registered = true;
 	return 0;
diff --git a/fs/nfs/nfs4trace.c b/fs/nfs/nfs4trace.c
index d22c6670f770..74eff35fbc90 100644
--- a/fs/nfs/nfs4trace.c
+++ b/fs/nfs/nfs4trace.c
@@ -29,5 +29,10 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(ff_layout_read_error);
 EXPORT_TRACEPOINT_SYMBOL_GPL(ff_layout_write_error);
 EXPORT_TRACEPOINT_SYMBOL_GPL(ff_layout_commit_error);
 
+EXPORT_TRACEPOINT_SYMBOL_GPL(bl_pr_key_reg);
+EXPORT_TRACEPOINT_SYMBOL_GPL(bl_pr_key_reg_err);
+EXPORT_TRACEPOINT_SYMBOL_GPL(bl_pr_key_unreg);
+EXPORT_TRACEPOINT_SYMBOL_GPL(bl_pr_key_unreg_err);
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(fl_getdevinfo);
 #endif
diff --git a/fs/nfs/nfs4trace.h b/fs/nfs/nfs4trace.h
index 4de8780a7c48..2af75b8e018d 100644
--- a/fs/nfs/nfs4trace.h
+++ b/fs/nfs/nfs4trace.h
@@ -2153,6 +2153,68 @@ TRACE_EVENT(ff_layout_commit_error,
 		)
 );
 
+DECLARE_EVENT_CLASS(pnfs_bl_pr_key_class,
+	TP_PROTO(
+		const char *name,
+		u64 key
+	),
+	TP_ARGS(name, key),
+	TP_STRUCT__entry(
+		__string(name, name)
+		__field(u64, key)
+	),
+	TP_fast_assign(
+		__assign_str(name);
+		__entry->key = key;
+	),
+	TP_printk("device=%s key=0x%llx",
+		__get_str(name), __entry->key
+	)
+);
+
+#define DEFINE_NFS4_BLOCK_PRKEY_EVENT(name) \
+	DEFINE_EVENT(pnfs_bl_pr_key_class, name, \
+		TP_PROTO( \
+			const char *name, \
+			u64 key \
+		), \
+		TP_ARGS(name, key))
+DEFINE_NFS4_BLOCK_PRKEY_EVENT(bl_pr_key_reg);
+DEFINE_NFS4_BLOCK_PRKEY_EVENT(bl_pr_key_unreg);
+
+DECLARE_EVENT_CLASS(pnfs_bl_pr_key_err_class,
+	TP_PROTO(
+		const char *name,
+		u64 key,
+		int error
+	),
+	TP_ARGS(name, key, error),
+	TP_STRUCT__entry(
+		__string(name, name)
+		__field(u64, key)
+		__field(int, error)
+	),
+	TP_fast_assign(
+		__assign_str(name);
+		__entry->key = key;
+		__entry->error = error;
+	),
+	TP_printk("device=%s key=0x%llx error=%d",
+		__get_str(name), __entry->key, __entry->error
+	)
+);
+
+#define DEFINE_NFS4_BLOCK_PRKEY_ERR_EVENT(name) \
+	DEFINE_EVENT(pnfs_bl_pr_key_err_class, name, \
+		TP_PROTO( \
+			const char *name, \
+			u64 key, \
+			int error \
+		), \
+		TP_ARGS(name, key, error))
+DEFINE_NFS4_BLOCK_PRKEY_ERR_EVENT(bl_pr_key_reg_err);
+DEFINE_NFS4_BLOCK_PRKEY_ERR_EVENT(bl_pr_key_unreg_err);
+
 #ifdef CONFIG_NFS_V4_2
 TRACE_DEFINE_ENUM(NFS4_CONTENT_DATA);
 TRACE_DEFINE_ENUM(NFS4_CONTENT_HOLE);
-- 
2.45.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 2/4] nfs/blocklayout: Report only when /no/ device is found
  2024-06-19 17:39 [RFC PATCH 0/4] Snapshot of fixes for SCSI PR key registration cel
  2024-06-19 17:39 ` [RFC PATCH 1/4] nfs/blocklayout: SCSI layout trace points for reservation key reg/unreg cel
@ 2024-06-19 17:39 ` cel
  2024-06-20  4:36   ` Christoph Hellwig
  2024-06-20 12:17   ` Benjamin Coddington
  2024-06-19 17:39 ` [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration cel
  2024-06-19 17:39 ` [RFC PATCH 4/4] nfs/blocklayout: Use bulk page allocation APIs cel
  3 siblings, 2 replies; 30+ messages in thread
From: cel @ 2024-06-19 17:39 UTC (permalink / raw)
  To: linux-nfs; +Cc: Christoph Hellwig, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Since f931d8374cad ("nfs/blocklayout: refactor block device
opening"), an error is reported when no multi-path device is
found. But this isn't a fatal error if the subsequent device open
is successful. On systems without multi-path devices, this message
always appears whether there is a problem or not.

Instead, generate less system journal noise by reporting an error
only when both open attempts fail. The new error message is more
actionable since it indicates that there is a real configuration
issue to be addressed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfs/blocklayout/dev.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/blocklayout/dev.c b/fs/nfs/blocklayout/dev.c
index b3828e5ee079..356bc967fb5d 100644
--- a/fs/nfs/blocklayout/dev.c
+++ b/fs/nfs/blocklayout/dev.c
@@ -318,7 +318,7 @@ bl_open_path(struct pnfs_block_volume *v, const char *prefix)
 	bdev_file = bdev_file_open_by_path(devname, BLK_OPEN_READ | BLK_OPEN_WRITE,
 					NULL, NULL);
 	if (IS_ERR(bdev_file)) {
-		pr_warn("pNFS: failed to open device %s (%ld)\n",
+		dprintk("failed to open device %s (%ld)\n",
 			devname, PTR_ERR(bdev_file));
 	}
 
@@ -348,8 +348,11 @@ bl_parse_scsi(struct nfs_server *server, struct pnfs_block_dev *d,
 	bdev_file = bl_open_path(v, "dm-uuid-mpath-0x");
 	if (IS_ERR(bdev_file))
 		bdev_file = bl_open_path(v, "wwn-0x");
-	if (IS_ERR(bdev_file))
+	if (IS_ERR(bdev_file)) {
+		pr_err("pNFS: no device found for volume %*phN\n",
+			v->scsi.designator_len, v->scsi.designator);
 		return PTR_ERR(bdev_file);
+	}
 	d->bdev_file = bdev_file;
 	bdev = file_bdev(bdev_file);
 
-- 
2.45.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-19 17:39 [RFC PATCH 0/4] Snapshot of fixes for SCSI PR key registration cel
  2024-06-19 17:39 ` [RFC PATCH 1/4] nfs/blocklayout: SCSI layout trace points for reservation key reg/unreg cel
  2024-06-19 17:39 ` [RFC PATCH 2/4] nfs/blocklayout: Report only when /no/ device is found cel
@ 2024-06-19 17:39 ` cel
  2024-06-20  5:06   ` Christoph Hellwig
  2024-06-20 13:51   ` Benjamin Coddington
  2024-06-19 17:39 ` [RFC PATCH 4/4] nfs/blocklayout: Use bulk page allocation APIs cel
  3 siblings, 2 replies; 30+ messages in thread
From: cel @ 2024-06-19 17:39 UTC (permalink / raw)
  To: linux-nfs; +Cc: Christoph Hellwig, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

During generic/069 runs with pNFS SCSI layouts, the NFS client emits
the following in the system journal:

kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
kernel: sd 6:0:0:1: reservation conflict
kernel: sd 6:0:0:1: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
kernel: sd 6:0:0:1: [sdb] tag#16 CDB: Write(10) 2a 00 00 00 00 50 00 00 08 00
kernel: reservation conflict error, dev sdb, sector 80 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
kernel: sd 6:0:0:1: reservation conflict
kernel: sd 6:0:0:1: reservation conflict
kernel: sd 6:0:0:1: [sdb] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
kernel: sd 6:0:0:1: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
kernel: sd 6:0:0:1: [sdb] tag#18 CDB: Write(10) 2a 00 00 00 00 60 00 00 08 00
kernel: sd 6:0:0:1: [sdb] tag#17 CDB: Write(10) 2a 00 00 00 00 58 00 00 08 00
kernel: reservation conflict error, dev sdb, sector 96 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
kernel: reservation conflict error, dev sdb, sector 88 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
systemd[1]: fstests-generic-069.scope: Deactivated successfully.
systemd[1]: fstests-generic-069.scope: Consumed 5.092s CPU time.
systemd[1]: media-test.mount: Deactivated successfully.
systemd[1]: media-scratch.mount: Deactivated successfully.
kernel: sd 6:0:0:1: reservation conflict
kernel: failed to unregister PR key.

This appears to be due to a race. bl_alloc_lseg() calls this:

561 static struct nfs4_deviceid_node *
562 bl_find_get_deviceid(struct nfs_server *server,
563                 const struct nfs4_deviceid *id, const struct cred *cred,
564                 gfp_t gfp_mask)
565 {
566         struct nfs4_deviceid_node *node;
567         unsigned long start, end;
568
569 retry:
570         node = nfs4_find_get_deviceid(server, id, cred, gfp_mask);
571         if (!node)
572                 return ERR_PTR(-ENODEV);

nfs4_find_get_deviceid() does a lookup without the spin lock first.
If it can't find a matching deviceid, it creates a new device_info
(which calls bl_alloc_deviceid_node, and that registers the device's
PR key).

Then it takes the nfs4_deviceid_lock and looks up the deviceid again.
If it finds it this time, bl_find_get_deviceid() frees the spare
(new) device_info, which unregisters the PR key for the same device.

Any subsequent I/O from this client on that device gets EBADE.

The umount later unregisters the device's PR key again.

To prevent this problem, register the PR key after the deviceid_node
lookup.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfs/blocklayout/blocklayout.c |  9 ++++++++-
 fs/nfs/blocklayout/blocklayout.h |  1 +
 fs/nfs/blocklayout/dev.c         | 29 +++++++++++++++++++++--------
 3 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 6be13e0ec170..75cc5e50bd37 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -571,8 +571,14 @@ bl_find_get_deviceid(struct nfs_server *server,
 	if (!node)
 		return ERR_PTR(-ENODEV);
 
-	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0)
+	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0) {
+		struct pnfs_block_dev *d =
+			container_of(node, struct pnfs_block_dev, node);
+		if (d->pr_reg)
+			if (d->pr_reg(d) < 0)
+				goto out_put;
 		return node;
+	}
 
 	end = jiffies;
 	start = end - PNFS_DEVICE_RETRY_TIMEOUT;
@@ -581,6 +587,7 @@ bl_find_get_deviceid(struct nfs_server *server,
 		goto retry;
 	}
 
+out_put:
 	nfs4_put_deviceid_node(node);
 	return ERR_PTR(-ENODEV);
 }
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index f1eeb4914199..8aabaf5218b8 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -116,6 +116,7 @@ struct pnfs_block_dev {
 
 	bool (*map)(struct pnfs_block_dev *dev, u64 offset,
 			struct pnfs_block_dev_map *map);
+	int (*pr_reg)(struct pnfs_block_dev *dev);
 };
 
 /* sector_t fields are all in 512-byte sectors */
diff --git a/fs/nfs/blocklayout/dev.c b/fs/nfs/blocklayout/dev.c
index 356bc967fb5d..3d2401820ef4 100644
--- a/fs/nfs/blocklayout/dev.c
+++ b/fs/nfs/blocklayout/dev.c
@@ -230,6 +230,26 @@ static bool bl_map_stripe(struct pnfs_block_dev *dev, u64 offset,
 	return true;
 }
 
+static int bl_register_scsi(struct pnfs_block_dev *d)
+{
+	struct block_device *bdev = file_bdev(d->bdev_file);
+	const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
+	int error;
+
+	if (d->pr_registered)
+		return 0;
+
+	error = ops->pr_register(bdev, 0, d->pr_key, true);
+	if (error) {
+		trace_bl_pr_key_reg_err(bdev->bd_disk->disk_name, d->pr_key, error);
+		return -error;
+	}
+
+	trace_bl_pr_key_reg(bdev->bd_disk->disk_name, d->pr_key);
+	d->pr_registered = true;
+	return 0;
+}
+
 static int
 bl_parse_deviceid(struct nfs_server *server, struct pnfs_block_dev *d,
 		struct pnfs_block_volume *volumes, int idx, gfp_t gfp_mask);
@@ -373,14 +393,7 @@ bl_parse_scsi(struct nfs_server *server, struct pnfs_block_dev *d,
 		goto out_blkdev_put;
 	}
 
-	error = ops->pr_register(bdev, 0, d->pr_key, true);
-	if (error) {
-		trace_bl_pr_key_reg_err(bdev->bd_disk->disk_name, d->pr_key, error);
-		goto out_blkdev_put;
-	}
-	trace_bl_pr_key_reg(bdev->bd_disk->disk_name, d->pr_key);
-
-	d->pr_registered = true;
+	d->pr_reg = bl_register_scsi;
 	return 0;
 
 out_blkdev_put:
-- 
2.45.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 4/4] nfs/blocklayout: Use bulk page allocation APIs
  2024-06-19 17:39 [RFC PATCH 0/4] Snapshot of fixes for SCSI PR key registration cel
                   ` (2 preceding siblings ...)
  2024-06-19 17:39 ` [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration cel
@ 2024-06-19 17:39 ` cel
  2024-06-20  4:44   ` Christoph Hellwig
  3 siblings, 1 reply; 30+ messages in thread
From: cel @ 2024-06-19 17:39 UTC (permalink / raw)
  To: linux-nfs; +Cc: Christoph Hellwig, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

nfs4_get_device_info() frequently requests more than a few pages
when provisioning a nfs4_deviceid_node object. Make this more
efficient by using alloc_pages_bulk_array(). This API is known to be
several times faster than an open-coded loop around alloc_page().

release_pages() is folio-enabled so it is also more efficient than
repeatedly invoking __free_pages().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfs/pnfs_dev.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/pnfs_dev.c b/fs/nfs/pnfs_dev.c
index 178001c90156..26a78d69acab 100644
--- a/fs/nfs/pnfs_dev.c
+++ b/fs/nfs/pnfs_dev.c
@@ -101,9 +101,8 @@ nfs4_get_device_info(struct nfs_server *server,
 	struct nfs4_deviceid_node *d = NULL;
 	struct pnfs_device *pdev = NULL;
 	struct page **pages = NULL;
+	int rc, i, max_pages;
 	u32 max_resp_sz;
-	int max_pages;
-	int rc, i;
 
 	/*
 	 * Use the session max response size as the basis for setting
@@ -125,11 +124,9 @@ nfs4_get_device_info(struct nfs_server *server,
 	if (!pages)
 		goto out_free_pdev;
 
-	for (i = 0; i < max_pages; i++) {
-		pages[i] = alloc_page(gfp_flags);
-		if (!pages[i])
-			goto out_free_pages;
-	}
+	i = alloc_pages_bulk_array(GFP_KERNEL, max_pages, pages);
+	if (i != max_pages)
+		goto out_free_pages;
 
 	memcpy(&pdev->dev_id, dev_id, sizeof(*dev_id));
 	pdev->layout_type = server->pnfs_curr_ld->id;
@@ -154,8 +151,8 @@ nfs4_get_device_info(struct nfs_server *server,
 		set_bit(NFS_DEVICEID_NOCACHE, &d->flags);
 
 out_free_pages:
-	while (--i >= 0)
-		__free_page(pages[i]);
+	if (i)
+		release_pages(pages, i);
 	kfree(pages);
 out_free_pdev:
 	kfree(pdev);
-- 
2.45.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 2/4] nfs/blocklayout: Report only when /no/ device is found
  2024-06-19 17:39 ` [RFC PATCH 2/4] nfs/blocklayout: Report only when /no/ device is found cel
@ 2024-06-20  4:36   ` Christoph Hellwig
  2024-06-20 14:59     ` Chuck Lever
  2024-06-20 12:17   ` Benjamin Coddington
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2024-06-20  4:36 UTC (permalink / raw)
  To: cel; +Cc: linux-nfs, Christoph Hellwig, Chuck Lever

On Wed, Jun 19, 2024 at 01:39:32PM -0400, cel@kernel.org wrote:
>  	if (IS_ERR(bdev_file)) {
> -		pr_warn("pNFS: failed to open device %s (%ld)\n",
> +		dprintk("failed to open device %s (%ld)\n",
>  			devname, PTR_ERR(bdev_file));

I'd just drop this one entirely.  Otherwise this looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 4/4] nfs/blocklayout: Use bulk page allocation APIs
  2024-06-19 17:39 ` [RFC PATCH 4/4] nfs/blocklayout: Use bulk page allocation APIs cel
@ 2024-06-20  4:44   ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-06-20  4:44 UTC (permalink / raw)
  To: cel; +Cc: linux-nfs, Christoph Hellwig, Chuck Lever

On Wed, Jun 19, 2024 at 01:39:34PM -0400, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> nfs4_get_device_info() frequently requests more than a few pages
> when provisioning a nfs4_deviceid_node object.

Hmm.  Looks like nfs4_get_device_info uses max_resp_sz to size
the buffer.  In theory the max_deviceinfo_size field in
struct pnfs_layoutdriver_type caps it, but that isn't actually set
anywhere at all.

I guess we can't really do anything to size this better, but at least
for blocklayout where we usually get a single device or a handful
of stripes this is quite an overallocation.

And all that is just to pass it to xdr_init_decode_pages, which
is using it basically as a scratchpad.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 1/4] nfs/blocklayout: SCSI layout trace points for reservation key reg/unreg
  2024-06-19 17:39 ` [RFC PATCH 1/4] nfs/blocklayout: SCSI layout trace points for reservation key reg/unreg cel
@ 2024-06-20  4:50   ` Christoph Hellwig
  2024-06-20  4:52     ` Christoph Hellwig
  2024-06-20 14:30     ` Chuck Lever
  0 siblings, 2 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-06-20  4:50 UTC (permalink / raw)
  To: cel; +Cc: linux-nfs, Christoph Hellwig, Chuck Lever

>  #define NFSDBG_FACILITY		NFSDBG_PNFS_LD
>  
> @@ -24,14 +25,17 @@ bl_free_device(struct pnfs_block_dev *dev)
>  		kfree(dev->children);
>  	} else {
>  		if (dev->pr_registered) {
> -			const struct pr_ops *ops =
> -				file_bdev(dev->bdev_file)->bd_disk->fops->pr_ops;

If you touch this it might be worth returnin early before the else
above to reduce the indentation here a bit.

>  			if (error)
> -				pr_err("failed to unregister PR key.\n");
> +				trace_bl_pr_key_unreg_err(bdev->bd_disk->disk_name,
> +							  dev->pr_key, error);
> +			else
> +				trace_bl_pr_key_unreg(bdev->bd_disk->disk_name,
> +						      dev->pr_key);

I'd just pass the bdev to the tracepoint and derefence it there only
when tracing is enabled.  Note that the disk_name isn't really what
we'd want to trace anyway, as it misses the partition information.
The normal way to print the device name is the %pg printk specifier,
but I'm not sure how to correctly use that for tracing which wants
a string in the entry for binary tracing.

> +++ b/fs/nfs/nfs4trace.c
> @@ -29,5 +29,10 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(ff_layout_read_error);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(ff_layout_write_error);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(ff_layout_commit_error);
>  
> +EXPORT_TRACEPOINT_SYMBOL_GPL(bl_pr_key_reg);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(bl_pr_key_reg_err);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(bl_pr_key_unreg);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(bl_pr_key_unreg_err);

This is weird.  The trace points for nfsd really should be in
fs/nfsd/trace.h and not in fs/nfs/ as that would then pull in
the client code into the server.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 1/4] nfs/blocklayout: SCSI layout trace points for reservation key reg/unreg
  2024-06-20  4:50   ` Christoph Hellwig
@ 2024-06-20  4:52     ` Christoph Hellwig
  2024-06-20 14:30     ` Chuck Lever
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-06-20  4:52 UTC (permalink / raw)
  To: cel; +Cc: linux-nfs, Christoph Hellwig, Chuck Lever

On Thu, Jun 20, 2024 at 06:50:46AM +0200, Christoph Hellwig wrote:
> This is weird.  The trace points for nfsd really should be in
> fs/nfsd/trace.h and not in fs/nfs/ as that would then pull in
> the client code into the server.

Sorry.  This is of course client code and I just did not have
enough coffee yet.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-19 17:39 ` [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration cel
@ 2024-06-20  5:06   ` Christoph Hellwig
  2024-06-20 13:52     ` Benjamin Coddington
  2024-06-20 15:39     ` Chuck Lever
  2024-06-20 13:51   ` Benjamin Coddington
  1 sibling, 2 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-06-20  5:06 UTC (permalink / raw)
  To: cel; +Cc: linux-nfs, Christoph Hellwig, Chuck Lever, Benjamin Coddington

On Wed, Jun 19, 2024 at 01:39:33PM -0400, cel@kernel.org wrote:
> -	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0)
> +	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0) {

It might be worth to invert this and keep the unavailable handling in
the branch as that's the exceptional case.   That code is also woefully
under-documented and could have really used a comment.

> +		struct pnfs_block_dev *d =
> +			container_of(node, struct pnfs_block_dev, node);
> +		if (d->pr_reg)
> +			if (d->pr_reg(d) < 0)
> +				goto out_put;

Empty line after variable declarations.  Also is there anything that
synchronizes the lookups here so that we don't do multiple registrations
in parallel?

> +
> +	if (d->pr_registered)
> +		return 0;
> +
> +	error = ops->pr_register(bdev, 0, d->pr_key, true);
> +	if (error) {
> +		trace_bl_pr_key_reg_err(bdev->bd_disk->disk_name, d->pr_key, error);
> +		return -error;

->pr_register returns either negative errnos or positive PR_STS_* values,
simply inverting the error here isn't doing the right thing.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 2/4] nfs/blocklayout: Report only when /no/ device is found
  2024-06-19 17:39 ` [RFC PATCH 2/4] nfs/blocklayout: Report only when /no/ device is found cel
  2024-06-20  4:36   ` Christoph Hellwig
@ 2024-06-20 12:17   ` Benjamin Coddington
  2024-06-20 14:10     ` Christoph Hellwig
  1 sibling, 1 reply; 30+ messages in thread
From: Benjamin Coddington @ 2024-06-20 12:17 UTC (permalink / raw)
  To: cel; +Cc: linux-nfs, Christoph Hellwig, Chuck Lever

On 19 Jun 2024, at 13:39, cel@kernel.org wrote:

> From: Chuck Lever <chuck.lever@oracle.com>
>
> Since f931d8374cad ("nfs/blocklayout: refactor block device
> opening"), an error is reported when no multi-path device is
> found. But this isn't a fatal error if the subsequent device open
> is successful. On systems without multi-path devices, this message
> always appears whether there is a problem or not.
>
> Instead, generate less system journal noise by reporting an error
> only when both open attempts fail. The new error message is more
> actionable since it indicates that there is a real configuration
> issue to be addressed.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfs/blocklayout/dev.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfs/blocklayout/dev.c b/fs/nfs/blocklayout/dev.c
> index b3828e5ee079..356bc967fb5d 100644
> --- a/fs/nfs/blocklayout/dev.c
> +++ b/fs/nfs/blocklayout/dev.c
> @@ -318,7 +318,7 @@ bl_open_path(struct pnfs_block_volume *v, const char *prefix)
>  	bdev_file = bdev_file_open_by_path(devname, BLK_OPEN_READ | BLK_OPEN_WRITE,
>  					NULL, NULL);
>  	if (IS_ERR(bdev_file)) {
> -		pr_warn("pNFS: failed to open device %s (%ld)\n",
> +		dprintk("failed to open device %s (%ld)\n",
>  			devname, PTR_ERR(bdev_file));
>  	}
>
> @@ -348,8 +348,11 @@ bl_parse_scsi(struct nfs_server *server, struct pnfs_block_dev *d,
>  	bdev_file = bl_open_path(v, "dm-uuid-mpath-0x");
>  	if (IS_ERR(bdev_file))
>  		bdev_file = bl_open_path(v, "wwn-0x");
> -	if (IS_ERR(bdev_file))
> +	if (IS_ERR(bdev_file)) {
> +		pr_err("pNFS: no device found for volume %*phN\n",

pr_warn is more appropriate here because there can be clients mounting that
that don't have block devices for the filesystem.  That's not necessarily a
configuration problem.

Ben

> +			v->scsi.designator_len, v->scsi.designator);
>  		return PTR_ERR(bdev_file);
> +	}
>  	d->bdev_file = bdev_file;
>  	bdev = file_bdev(bdev_file);
>
> -- 
> 2.45.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-19 17:39 ` [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration cel
  2024-06-20  5:06   ` Christoph Hellwig
@ 2024-06-20 13:51   ` Benjamin Coddington
  2024-06-20 14:34     ` Chuck Lever
  1 sibling, 1 reply; 30+ messages in thread
From: Benjamin Coddington @ 2024-06-20 13:51 UTC (permalink / raw)
  To: cel; +Cc: linux-nfs, Christoph Hellwig, Chuck Lever

On 19 Jun 2024, at 13:39, cel@kernel.org wrote:

> From: Chuck Lever <chuck.lever@oracle.com>
>
> During generic/069 runs with pNFS SCSI layouts, the NFS client emits
> the following in the system journal:
>
> kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
> kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
> kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
> kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
> kernel: sd 6:0:0:1: reservation conflict
> kernel: sd 6:0:0:1: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> kernel: sd 6:0:0:1: [sdb] tag#16 CDB: Write(10) 2a 00 00 00 00 50 00 00 08 00
> kernel: reservation conflict error, dev sdb, sector 80 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
> kernel: sd 6:0:0:1: reservation conflict
> kernel: sd 6:0:0:1: reservation conflict
> kernel: sd 6:0:0:1: [sdb] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> kernel: sd 6:0:0:1: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> kernel: sd 6:0:0:1: [sdb] tag#18 CDB: Write(10) 2a 00 00 00 00 60 00 00 08 00
> kernel: sd 6:0:0:1: [sdb] tag#17 CDB: Write(10) 2a 00 00 00 00 58 00 00 08 00
> kernel: reservation conflict error, dev sdb, sector 96 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
> kernel: reservation conflict error, dev sdb, sector 88 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
> systemd[1]: fstests-generic-069.scope: Deactivated successfully.
> systemd[1]: fstests-generic-069.scope: Consumed 5.092s CPU time.
> systemd[1]: media-test.mount: Deactivated successfully.
> systemd[1]: media-scratch.mount: Deactivated successfully.
> kernel: sd 6:0:0:1: reservation conflict
> kernel: failed to unregister PR key.
>
> This appears to be due to a race. bl_alloc_lseg() calls this:
>
> 561 static struct nfs4_deviceid_node *
> 562 bl_find_get_deviceid(struct nfs_server *server,
> 563                 const struct nfs4_deviceid *id, const struct cred *cred,
> 564                 gfp_t gfp_mask)
> 565 {
> 566         struct nfs4_deviceid_node *node;
> 567         unsigned long start, end;
> 568
> 569 retry:
> 570         node = nfs4_find_get_deviceid(server, id, cred, gfp_mask);
> 571         if (!node)
> 572                 return ERR_PTR(-ENODEV);
>
> nfs4_find_get_deviceid() does a lookup without the spin lock first.
> If it can't find a matching deviceid, it creates a new device_info
> (which calls bl_alloc_deviceid_node, and that registers the device's
> PR key).
>
> Then it takes the nfs4_deviceid_lock and looks up the deviceid again.
> If it finds it this time, bl_find_get_deviceid() frees the spare
> (new) device_info, which unregisters the PR key for the same device.
>
> Any subsequent I/O from this client on that device gets EBADE.
>
> The umount later unregisters the device's PR key again.
>
> To prevent this problem, register the PR key after the deviceid_node
> lookup.

Hi Chuck - nice catch, but I'm not seeing how we don't have the same problem
after this patch, instead it just seems like it moves the race.  What
prevents another process waiting to take the nfs4_deviceid_lock from
unregistering the same device?  I think we need another way to signal
bl_free_device that we don't want to unregister for the case where the new
device isn't added to nfs4_deviceid_cache.

No good ideas yet - maybe we can use a flag set within the
nfs4_deviceid_lock?

Ben

>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfs/blocklayout/blocklayout.c |  9 ++++++++-
>  fs/nfs/blocklayout/blocklayout.h |  1 +
>  fs/nfs/blocklayout/dev.c         | 29 +++++++++++++++++++++--------
>  3 files changed, 30 insertions(+), 9 deletions(-)
>
> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> index 6be13e0ec170..75cc5e50bd37 100644
> --- a/fs/nfs/blocklayout/blocklayout.c
> +++ b/fs/nfs/blocklayout/blocklayout.c
> @@ -571,8 +571,14 @@ bl_find_get_deviceid(struct nfs_server *server,
>  	if (!node)
>  		return ERR_PTR(-ENODEV);
>
> -	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0)
> +	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0) {
> +		struct pnfs_block_dev *d =
> +			container_of(node, struct pnfs_block_dev, node);
> +		if (d->pr_reg)
> +			if (d->pr_reg(d) < 0)
> +				goto out_put;
>  		return node;
> +	}
>
>  	end = jiffies;
>  	start = end - PNFS_DEVICE_RETRY_TIMEOUT;
> @@ -581,6 +587,7 @@ bl_find_get_deviceid(struct nfs_server *server,
>  		goto retry;
>  	}
>
> +out_put:
>  	nfs4_put_deviceid_node(node);
>  	return ERR_PTR(-ENODEV);
>  }
> diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
> index f1eeb4914199..8aabaf5218b8 100644
> --- a/fs/nfs/blocklayout/blocklayout.h
> +++ b/fs/nfs/blocklayout/blocklayout.h
> @@ -116,6 +116,7 @@ struct pnfs_block_dev {
>
>  	bool (*map)(struct pnfs_block_dev *dev, u64 offset,
>  			struct pnfs_block_dev_map *map);
> +	int (*pr_reg)(struct pnfs_block_dev *dev);
>  };
>
>  /* sector_t fields are all in 512-byte sectors */
> diff --git a/fs/nfs/blocklayout/dev.c b/fs/nfs/blocklayout/dev.c
> index 356bc967fb5d..3d2401820ef4 100644
> --- a/fs/nfs/blocklayout/dev.c
> +++ b/fs/nfs/blocklayout/dev.c
> @@ -230,6 +230,26 @@ static bool bl_map_stripe(struct pnfs_block_dev *dev, u64 offset,
>  	return true;
>  }
>
> +static int bl_register_scsi(struct pnfs_block_dev *d)
> +{
> +	struct block_device *bdev = file_bdev(d->bdev_file);
> +	const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
> +	int error;
> +
> +	if (d->pr_registered)
> +		return 0;
> +
> +	error = ops->pr_register(bdev, 0, d->pr_key, true);
> +	if (error) {
> +		trace_bl_pr_key_reg_err(bdev->bd_disk->disk_name, d->pr_key, error);
> +		return -error;
> +	}
> +
> +	trace_bl_pr_key_reg(bdev->bd_disk->disk_name, d->pr_key);
> +	d->pr_registered = true;
> +	return 0;
> +}
> +
>  static int
>  bl_parse_deviceid(struct nfs_server *server, struct pnfs_block_dev *d,
>  		struct pnfs_block_volume *volumes, int idx, gfp_t gfp_mask);
> @@ -373,14 +393,7 @@ bl_parse_scsi(struct nfs_server *server, struct pnfs_block_dev *d,
>  		goto out_blkdev_put;
>  	}
>
> -	error = ops->pr_register(bdev, 0, d->pr_key, true);
> -	if (error) {
> -		trace_bl_pr_key_reg_err(bdev->bd_disk->disk_name, d->pr_key, error);
> -		goto out_blkdev_put;
> -	}
> -	trace_bl_pr_key_reg(bdev->bd_disk->disk_name, d->pr_key);
> -
> -	d->pr_registered = true;
> +	d->pr_reg = bl_register_scsi;
>  	return 0;
>
>  out_blkdev_put:
> -- 
> 2.45.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-20  5:06   ` Christoph Hellwig
@ 2024-06-20 13:52     ` Benjamin Coddington
  2024-06-20 13:58       ` Chuck Lever
  2024-06-20 14:15       ` Christoph Hellwig
  2024-06-20 15:39     ` Chuck Lever
  1 sibling, 2 replies; 30+ messages in thread
From: Benjamin Coddington @ 2024-06-20 13:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: cel, linux-nfs, Chuck Lever

On 20 Jun 2024, at 1:06, Christoph Hellwig wrote:

> On Wed, Jun 19, 2024 at 01:39:33PM -0400, cel@kernel.org wrote:
>> -	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0)
>> +	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0) {
>
> It might be worth to invert this and keep the unavailable handling in
> the branch as that's the exceptional case.   That code is also woefully
> under-documented and could have really used a comment.

The transient device handling in general, or just this bit of it?

>> +		struct pnfs_block_dev *d =
>> +			container_of(node, struct pnfs_block_dev, node);
>> +		if (d->pr_reg)
>> +			if (d->pr_reg(d) < 0)
>> +				goto out_put;
>
> Empty line after variable declarations.  Also is there anything that
> synchronizes the lookups here so that we don't do multiple registrations
> in parallel?

I don't think there is.  Do we get an error if we register twice?

Ben

>> +
>> +	if (d->pr_registered)
>> +		return 0;
>> +
>> +	error = ops->pr_register(bdev, 0, d->pr_key, true);
>> +	if (error) {
>> +		trace_bl_pr_key_reg_err(bdev->bd_disk->disk_name, d->pr_key, error);
>> +		return -error;
>
> ->pr_register returns either negative errnos or positive PR_STS_* values,
> simply inverting the error here isn't doing the right thing.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-20 13:52     ` Benjamin Coddington
@ 2024-06-20 13:58       ` Chuck Lever
  2024-06-20 14:15       ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2024-06-20 13:58 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: Christoph Hellwig, cel, linux-nfs

On Thu, Jun 20, 2024 at 09:52:59AM -0400, Benjamin Coddington wrote:
> On 20 Jun 2024, at 1:06, Christoph Hellwig wrote:
> 
> > On Wed, Jun 19, 2024 at 01:39:33PM -0400, cel@kernel.org wrote:
> >> -	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0)
> >> +	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0) {
> >
> > It might be worth to invert this and keep the unavailable handling in
> > the branch as that's the exceptional case.   That code is also woefully
> > under-documented and could have really used a comment.
> 
> The transient device handling in general, or just this bit of it?

I read Christoph's comment as referring specifically to the logic
in bl_find_get_deviceid(). 


> >> +		struct pnfs_block_dev *d =
> >> +			container_of(node, struct pnfs_block_dev, node);
> >> +		if (d->pr_reg)
> >> +			if (d->pr_reg(d) < 0)
> >> +				goto out_put;
> >
> > Empty line after variable declarations.  Also is there anything that
> > synchronizes the lookups here so that we don't do multiple registrations
> > in parallel?
> 
> I don't think there is.  Do we get an error if we register twice?

pr_register() does not throw an error in that case, so I didn't
protect against it. However, I could add atomic bit flags to
pnfs_block_dev to ensure registration is done only once, if we
believe that is necessary.


> Ben
> 
> >> +
> >> +	if (d->pr_registered)
> >> +		return 0;
> >> +
> >> +	error = ops->pr_register(bdev, 0, d->pr_key, true);
> >> +	if (error) {
> >> +		trace_bl_pr_key_reg_err(bdev->bd_disk->disk_name, d->pr_key, error);
> >> +		return -error;
> >
> > ->pr_register returns either negative errnos or positive PR_STS_* values,
> > simply inverting the error here isn't doing the right thing.
> 

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 2/4] nfs/blocklayout: Report only when /no/ device is found
  2024-06-20 12:17   ` Benjamin Coddington
@ 2024-06-20 14:10     ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-06-20 14:10 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: cel, linux-nfs, Christoph Hellwig, Chuck Lever

On Thu, Jun 20, 2024 at 08:17:12AM -0400, Benjamin Coddington wrote:
> > -	if (IS_ERR(bdev_file))
> > +	if (IS_ERR(bdev_file)) {
> > +		pr_err("pNFS: no device found for volume %*phN\n",
> 
> pr_warn is more appropriate here because there can be clients mounting that
> that don't have block devices for the filesystem.  That's not necessarily a
> configuration problem.

Yes, agreed.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-20 13:52     ` Benjamin Coddington
  2024-06-20 13:58       ` Chuck Lever
@ 2024-06-20 14:15       ` Christoph Hellwig
  2024-06-20 14:18         ` Chuck Lever III
  2024-06-20 15:45         ` Benjamin Coddington
  1 sibling, 2 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-06-20 14:15 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: Christoph Hellwig, cel, linux-nfs, Chuck Lever

On Thu, Jun 20, 2024 at 09:52:59AM -0400, Benjamin Coddington wrote:
> On 20 Jun 2024, at 1:06, Christoph Hellwig wrote:
> 
> > On Wed, Jun 19, 2024 at 01:39:33PM -0400, cel@kernel.org wrote:
> >> -	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0)
> >> +	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0) {
> >
> > It might be worth to invert this and keep the unavailable handling in
> > the branch as that's the exceptional case.   That code is also woefully
> > under-documented and could have really used a comment.
> 
> The transient device handling in general, or just this bit of it?

Basically the code behind this NFS_DEVICEID_UNAVAILABLE check here.

> >> +		if (d->pr_reg)
> >> +			if (d->pr_reg(d) < 0)
> >> +				goto out_put;
> >
> > Empty line after variable declarations.  Also is there anything that
> > synchronizes the lookups here so that we don't do multiple registrations
> > in parallel?
> 
> I don't think there is.  Do we get an error if we register twice?

Yes.  That's the basically the same condition as the one that made
Chuck create this series.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-20 14:15       ` Christoph Hellwig
@ 2024-06-20 14:18         ` Chuck Lever III
  2024-06-20 15:45         ` Benjamin Coddington
  1 sibling, 0 replies; 30+ messages in thread
From: Chuck Lever III @ 2024-06-20 14:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Benjamin Coddington, Chuck Lever, Linux NFS Mailing List



> On Jun 20, 2024, at 10:15 AM, Christoph Hellwig <hch@lst.de> wrote:
> 
> On Thu, Jun 20, 2024 at 09:52:59AM -0400, Benjamin Coddington wrote:
>> On 20 Jun 2024, at 1:06, Christoph Hellwig wrote:
>> 
>>> On Wed, Jun 19, 2024 at 01:39:33PM -0400, cel@kernel.org wrote:
>>>> - if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0)
>>>> + if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0) {
>>> 
>>> It might be worth to invert this and keep the unavailable handling in
>>> the branch as that's the exceptional case.   That code is also woefully
>>> under-documented and could have really used a comment.
>> 
>> The transient device handling in general, or just this bit of it?
> 
> Basically the code behind this NFS_DEVICEID_UNAVAILABLE check here.
> 
>>>> + if (d->pr_reg)
>>>> + if (d->pr_reg(d) < 0)
>>>> + goto out_put;
>>> 
>>> Empty line after variable declarations.  Also is there anything that
>>> synchronizes the lookups here so that we don't do multiple registrations
>>> in parallel?
>> 
>> I don't think there is.  Do we get an error if we register twice?
> 
> Yes.  That's the basically the same condition as the one that made
> Chuck create this series.

No, the problem I saw was that the device was unregistered early,
and that resulted in I/O errors and falling back to the MDS.

pr_register() doesn't fail if the device is already registered
(at least when the key is the same).

--
Chuck Lever



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 1/4] nfs/blocklayout: SCSI layout trace points for reservation key reg/unreg
  2024-06-20  4:50   ` Christoph Hellwig
  2024-06-20  4:52     ` Christoph Hellwig
@ 2024-06-20 14:30     ` Chuck Lever
  1 sibling, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2024-06-20 14:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: cel, linux-nfs

On Thu, Jun 20, 2024 at 06:50:46AM +0200, Christoph Hellwig wrote:
> >  #define NFSDBG_FACILITY		NFSDBG_PNFS_LD
> >  
> > @@ -24,14 +25,17 @@ bl_free_device(struct pnfs_block_dev *dev)
> >  		kfree(dev->children);
> >  	} else {
> >  		if (dev->pr_registered) {
> > -			const struct pr_ops *ops =
> > -				file_bdev(dev->bdev_file)->bd_disk->fops->pr_ops;
> 
> If you touch this it might be worth returnin early before the else
> above to reduce the indentation here a bit.
> 
> >  			if (error)
> > -				pr_err("failed to unregister PR key.\n");
> > +				trace_bl_pr_key_unreg_err(bdev->bd_disk->disk_name,
> > +							  dev->pr_key, error);
> > +			else
> > +				trace_bl_pr_key_unreg(bdev->bd_disk->disk_name,
> > +						      dev->pr_key);
> 
> I'd just pass the bdev to the tracepoint and derefence it there only
> when tracing is enabled.

I usually take that approach in hot paths, though this didn't seem
that performance critical and I assumed we didn't want to pull the
block device includes into everything that includes nfs4trace.h. But
I will look into that, it might not be that bad.


> Note that the disk_name isn't really what
> we'd want to trace anyway, as it misses the partition information.

Right, that part I was actually not sure about.


> The normal way to print the device name is the %pg printk specifier,
> but I'm not sure how to correctly use that for tracing which wants
> a string in the entry for binary tracing.

I'll see what can be done.


> > +++ b/fs/nfs/nfs4trace.c
> > @@ -29,5 +29,10 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(ff_layout_read_error);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(ff_layout_write_error);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(ff_layout_commit_error);
> >  
> > +EXPORT_TRACEPOINT_SYMBOL_GPL(bl_pr_key_reg);
> > +EXPORT_TRACEPOINT_SYMBOL_GPL(bl_pr_key_reg_err);
> > +EXPORT_TRACEPOINT_SYMBOL_GPL(bl_pr_key_unreg);
> > +EXPORT_TRACEPOINT_SYMBOL_GPL(bl_pr_key_unreg_err);
> 
> This is weird.  The trace points for nfsd really should be in
> fs/nfsd/trace.h and not in fs/nfs/ as that would then pull in
> the client code into the server.
> 

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-20 13:51   ` Benjamin Coddington
@ 2024-06-20 14:34     ` Chuck Lever
  2024-06-20 14:37       ` Christoph Hellwig
  2024-06-20 15:30       ` Benjamin Coddington
  0 siblings, 2 replies; 30+ messages in thread
From: Chuck Lever @ 2024-06-20 14:34 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: cel, linux-nfs, Christoph Hellwig

On Thu, Jun 20, 2024 at 09:51:46AM -0400, Benjamin Coddington wrote:
> On 19 Jun 2024, at 13:39, cel@kernel.org wrote:
> 
> > From: Chuck Lever <chuck.lever@oracle.com>
> >
> > During generic/069 runs with pNFS SCSI layouts, the NFS client emits
> > the following in the system journal:
> >
> > kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
> > kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
> > kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
> > kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
> > kernel: sd 6:0:0:1: reservation conflict
> > kernel: sd 6:0:0:1: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> > kernel: sd 6:0:0:1: [sdb] tag#16 CDB: Write(10) 2a 00 00 00 00 50 00 00 08 00
> > kernel: reservation conflict error, dev sdb, sector 80 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
> > kernel: sd 6:0:0:1: reservation conflict
> > kernel: sd 6:0:0:1: reservation conflict
> > kernel: sd 6:0:0:1: [sdb] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> > kernel: sd 6:0:0:1: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> > kernel: sd 6:0:0:1: [sdb] tag#18 CDB: Write(10) 2a 00 00 00 00 60 00 00 08 00
> > kernel: sd 6:0:0:1: [sdb] tag#17 CDB: Write(10) 2a 00 00 00 00 58 00 00 08 00
> > kernel: reservation conflict error, dev sdb, sector 96 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
> > kernel: reservation conflict error, dev sdb, sector 88 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
> > systemd[1]: fstests-generic-069.scope: Deactivated successfully.
> > systemd[1]: fstests-generic-069.scope: Consumed 5.092s CPU time.
> > systemd[1]: media-test.mount: Deactivated successfully.
> > systemd[1]: media-scratch.mount: Deactivated successfully.
> > kernel: sd 6:0:0:1: reservation conflict
> > kernel: failed to unregister PR key.
> >
> > This appears to be due to a race. bl_alloc_lseg() calls this:
> >
> > 561 static struct nfs4_deviceid_node *
> > 562 bl_find_get_deviceid(struct nfs_server *server,
> > 563                 const struct nfs4_deviceid *id, const struct cred *cred,
> > 564                 gfp_t gfp_mask)
> > 565 {
> > 566         struct nfs4_deviceid_node *node;
> > 567         unsigned long start, end;
> > 568
> > 569 retry:
> > 570         node = nfs4_find_get_deviceid(server, id, cred, gfp_mask);
> > 571         if (!node)
> > 572                 return ERR_PTR(-ENODEV);
> >
> > nfs4_find_get_deviceid() does a lookup without the spin lock first.
> > If it can't find a matching deviceid, it creates a new device_info
> > (which calls bl_alloc_deviceid_node, and that registers the device's
> > PR key).
> >
> > Then it takes the nfs4_deviceid_lock and looks up the deviceid again.
> > If it finds it this time, bl_find_get_deviceid() frees the spare
> > (new) device_info, which unregisters the PR key for the same device.
> >
> > Any subsequent I/O from this client on that device gets EBADE.
> >
> > The umount later unregisters the device's PR key again.
> >
> > To prevent this problem, register the PR key after the deviceid_node
> > lookup.
> 
> Hi Chuck - nice catch, but I'm not seeing how we don't have the same problem
> after this patch, instead it just seems like it moves the race.  What
> prevents another process waiting to take the nfs4_deviceid_lock from
> unregistering the same device?  I think we need another way to signal
> bl_free_device that we don't want to unregister for the case where the new
> device isn't added to nfs4_deviceid_cache.

That's a (related but) somewhat different issue. I haven't seen
that in practice so far.


> No good ideas yet - maybe we can use a flag set within the
> nfs4_deviceid_lock?

Well this smells like a use for a reference count on the block
device, but fs/nfs doesn't control the definition of that data
structure.


> Ben
> 
> >
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> >  fs/nfs/blocklayout/blocklayout.c |  9 ++++++++-
> >  fs/nfs/blocklayout/blocklayout.h |  1 +
> >  fs/nfs/blocklayout/dev.c         | 29 +++++++++++++++++++++--------
> >  3 files changed, 30 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> > index 6be13e0ec170..75cc5e50bd37 100644
> > --- a/fs/nfs/blocklayout/blocklayout.c
> > +++ b/fs/nfs/blocklayout/blocklayout.c
> > @@ -571,8 +571,14 @@ bl_find_get_deviceid(struct nfs_server *server,
> >  	if (!node)
> >  		return ERR_PTR(-ENODEV);
> >
> > -	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0)
> > +	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0) {
> > +		struct pnfs_block_dev *d =
> > +			container_of(node, struct pnfs_block_dev, node);
> > +		if (d->pr_reg)
> > +			if (d->pr_reg(d) < 0)
> > +				goto out_put;
> >  		return node;
> > +	}
> >
> >  	end = jiffies;
> >  	start = end - PNFS_DEVICE_RETRY_TIMEOUT;
> > @@ -581,6 +587,7 @@ bl_find_get_deviceid(struct nfs_server *server,
> >  		goto retry;
> >  	}
> >
> > +out_put:
> >  	nfs4_put_deviceid_node(node);
> >  	return ERR_PTR(-ENODEV);
> >  }
> > diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
> > index f1eeb4914199..8aabaf5218b8 100644
> > --- a/fs/nfs/blocklayout/blocklayout.h
> > +++ b/fs/nfs/blocklayout/blocklayout.h
> > @@ -116,6 +116,7 @@ struct pnfs_block_dev {
> >
> >  	bool (*map)(struct pnfs_block_dev *dev, u64 offset,
> >  			struct pnfs_block_dev_map *map);
> > +	int (*pr_reg)(struct pnfs_block_dev *dev);
> >  };
> >
> >  /* sector_t fields are all in 512-byte sectors */
> > diff --git a/fs/nfs/blocklayout/dev.c b/fs/nfs/blocklayout/dev.c
> > index 356bc967fb5d..3d2401820ef4 100644
> > --- a/fs/nfs/blocklayout/dev.c
> > +++ b/fs/nfs/blocklayout/dev.c
> > @@ -230,6 +230,26 @@ static bool bl_map_stripe(struct pnfs_block_dev *dev, u64 offset,
> >  	return true;
> >  }
> >
> > +static int bl_register_scsi(struct pnfs_block_dev *d)
> > +{
> > +	struct block_device *bdev = file_bdev(d->bdev_file);
> > +	const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
> > +	int error;
> > +
> > +	if (d->pr_registered)
> > +		return 0;
> > +
> > +	error = ops->pr_register(bdev, 0, d->pr_key, true);
> > +	if (error) {
> > +		trace_bl_pr_key_reg_err(bdev->bd_disk->disk_name, d->pr_key, error);
> > +		return -error;
> > +	}
> > +
> > +	trace_bl_pr_key_reg(bdev->bd_disk->disk_name, d->pr_key);
> > +	d->pr_registered = true;
> > +	return 0;
> > +}
> > +
> >  static int
> >  bl_parse_deviceid(struct nfs_server *server, struct pnfs_block_dev *d,
> >  		struct pnfs_block_volume *volumes, int idx, gfp_t gfp_mask);
> > @@ -373,14 +393,7 @@ bl_parse_scsi(struct nfs_server *server, struct pnfs_block_dev *d,
> >  		goto out_blkdev_put;
> >  	}
> >
> > -	error = ops->pr_register(bdev, 0, d->pr_key, true);
> > -	if (error) {
> > -		trace_bl_pr_key_reg_err(bdev->bd_disk->disk_name, d->pr_key, error);
> > -		goto out_blkdev_put;
> > -	}
> > -	trace_bl_pr_key_reg(bdev->bd_disk->disk_name, d->pr_key);
> > -
> > -	d->pr_registered = true;
> > +	d->pr_reg = bl_register_scsi;
> >  	return 0;
> >
> >  out_blkdev_put:
> > -- 
> > 2.45.1
> 

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-20 14:34     ` Chuck Lever
@ 2024-06-20 14:37       ` Christoph Hellwig
  2024-06-20 15:30       ` Benjamin Coddington
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-06-20 14:37 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Benjamin Coddington, cel, linux-nfs, Christoph Hellwig

On Thu, Jun 20, 2024 at 10:34:03AM -0400, Chuck Lever wrote:
> > No good ideas yet - maybe we can use a flag set within the
> > nfs4_deviceid_lock?
> 
> Well this smells like a use for a reference count on the block
> device, but fs/nfs doesn't control the definition of that data
> structure.

The block device actually has multiple refcounts, up to three
depending on how you count.  I'm not sure how that would solve
anything here, though.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 2/4] nfs/blocklayout: Report only when /no/ device is found
  2024-06-20  4:36   ` Christoph Hellwig
@ 2024-06-20 14:59     ` Chuck Lever
  0 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2024-06-20 14:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: cel, linux-nfs

On Thu, Jun 20, 2024 at 06:36:05AM +0200, Christoph Hellwig wrote:
> On Wed, Jun 19, 2024 at 01:39:32PM -0400, cel@kernel.org wrote:
> >  	if (IS_ERR(bdev_file)) {
> > -		pr_warn("pNFS: failed to open device %s (%ld)\n",
> > +		dprintk("failed to open device %s (%ld)\n",
> >  			devname, PTR_ERR(bdev_file));
> 
> I'd just drop this one entirely.

Actually I'd like to keep this dprintk because it displays the whole
path to the device. That has diagnostic value, IMO.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-20 14:34     ` Chuck Lever
  2024-06-20 14:37       ` Christoph Hellwig
@ 2024-06-20 15:30       ` Benjamin Coddington
  2024-06-20 15:46         ` Chuck Lever
  1 sibling, 1 reply; 30+ messages in thread
From: Benjamin Coddington @ 2024-06-20 15:30 UTC (permalink / raw)
  To: Chuck Lever; +Cc: cel, linux-nfs, Christoph Hellwig

On 20 Jun 2024, at 10:34, Chuck Lever wrote:

> On Thu, Jun 20, 2024 at 09:51:46AM -0400, Benjamin Coddington wrote:
>> On 19 Jun 2024, at 13:39, cel@kernel.org wrote:
>>
>>> From: Chuck Lever <chuck.lever@oracle.com>
>>>
>>> During generic/069 runs with pNFS SCSI layouts, the NFS client emits
>>> the following in the system journal:
>>>
>>> kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
>>> kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
>>> kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
>>> kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
>>> kernel: sd 6:0:0:1: reservation conflict
>>> kernel: sd 6:0:0:1: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
>>> kernel: sd 6:0:0:1: [sdb] tag#16 CDB: Write(10) 2a 00 00 00 00 50 00 00 08 00
>>> kernel: reservation conflict error, dev sdb, sector 80 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
>>> kernel: sd 6:0:0:1: reservation conflict
>>> kernel: sd 6:0:0:1: reservation conflict
>>> kernel: sd 6:0:0:1: [sdb] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
>>> kernel: sd 6:0:0:1: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
>>> kernel: sd 6:0:0:1: [sdb] tag#18 CDB: Write(10) 2a 00 00 00 00 60 00 00 08 00
>>> kernel: sd 6:0:0:1: [sdb] tag#17 CDB: Write(10) 2a 00 00 00 00 58 00 00 08 00
>>> kernel: reservation conflict error, dev sdb, sector 96 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
>>> kernel: reservation conflict error, dev sdb, sector 88 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
>>> systemd[1]: fstests-generic-069.scope: Deactivated successfully.
>>> systemd[1]: fstests-generic-069.scope: Consumed 5.092s CPU time.
>>> systemd[1]: media-test.mount: Deactivated successfully.
>>> systemd[1]: media-scratch.mount: Deactivated successfully.
>>> kernel: sd 6:0:0:1: reservation conflict
>>> kernel: failed to unregister PR key.
>>>
>>> This appears to be due to a race. bl_alloc_lseg() calls this:
>>>
>>> 561 static struct nfs4_deviceid_node *
>>> 562 bl_find_get_deviceid(struct nfs_server *server,
>>> 563                 const struct nfs4_deviceid *id, const struct cred *cred,
>>> 564                 gfp_t gfp_mask)
>>> 565 {
>>> 566         struct nfs4_deviceid_node *node;
>>> 567         unsigned long start, end;
>>> 568
>>> 569 retry:
>>> 570         node = nfs4_find_get_deviceid(server, id, cred, gfp_mask);
>>> 571         if (!node)
>>> 572                 return ERR_PTR(-ENODEV);
>>>
>>> nfs4_find_get_deviceid() does a lookup without the spin lock first.
>>> If it can't find a matching deviceid, it creates a new device_info
>>> (which calls bl_alloc_deviceid_node, and that registers the device's
>>> PR key).
>>>
>>> Then it takes the nfs4_deviceid_lock and looks up the deviceid again.
>>> If it finds it this time, bl_find_get_deviceid() frees the spare
>>> (new) device_info, which unregisters the PR key for the same device.
>>>
>>> Any subsequent I/O from this client on that device gets EBADE.
>>>
>>> The umount later unregisters the device's PR key again.
>>>
>>> To prevent this problem, register the PR key after the deviceid_node
>>> lookup.
>>
>> Hi Chuck - nice catch, but I'm not seeing how we don't have the same problem
>> after this patch, instead it just seems like it moves the race.  What
>> prevents another process waiting to take the nfs4_deviceid_lock from
>> unregistering the same device?  I think we need another way to signal
>> bl_free_device that we don't want to unregister for the case where the new
>> device isn't added to nfs4_deviceid_cache.
>
> That's a (related but) somewhat different issue. I haven't seen
> that in practice so far.
>
>
>> No good ideas yet - maybe we can use a flag set within the
>> nfs4_deviceid_lock?
>
> Well this smells like a use for a reference count on the block
> device, but fs/nfs doesn't control the definition of that data
> structure.

I think we need two things to fix this race:
 - a way to determine if the current thread is the one
   that added the device to the to the cache, if so do the register
 - a way to determine if we're freeing because we lost the race to the
   cache, if so don't un-register.

We can get both with a flag that's always set within the nfs4_deviceid_lock,
something like NFS_DEVICEID_INIT.  If set, it signals we need to register in
the case we keep the device, or skip de-registration in the case where we
lost the race and throw it out.  We still need this patch to do the
registration after it lands in the cache.

Ben


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-20  5:06   ` Christoph Hellwig
  2024-06-20 13:52     ` Benjamin Coddington
@ 2024-06-20 15:39     ` Chuck Lever
  1 sibling, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2024-06-20 15:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: cel, linux-nfs, Benjamin Coddington

On Thu, Jun 20, 2024 at 07:06:14AM +0200, Christoph Hellwig wrote:
> On Wed, Jun 19, 2024 at 01:39:33PM -0400, cel@kernel.org wrote:
> > -	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0)
> > +	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0) {
> 
> It might be worth to invert this and keep the unavailable handling in
> the branch as that's the exceptional case.   That code is also woefully
> under-documented and could have really used a comment.
> 
> > +		struct pnfs_block_dev *d =
> > +			container_of(node, struct pnfs_block_dev, node);
> > +		if (d->pr_reg)
> > +			if (d->pr_reg(d) < 0)
> > +				goto out_put;
> 
> Empty line after variable declarations.  Also is there anything that
> synchronizes the lookups here so that we don't do multiple registrations
> in parallel?

Is there something (more than d->pr_registered) that prevents that
today?


> > +
> > +	if (d->pr_registered)
> > +		return 0;
> > +
> > +	error = ops->pr_register(bdev, 0, d->pr_key, true);
> > +	if (error) {
> > +		trace_bl_pr_key_reg_err(bdev->bd_disk->disk_name, d->pr_key, error);
> > +		return -error;
> 
> ->pr_register returns either negative errnos or positive PR_STS_* values,
> simply inverting the error here isn't doing the right thing.

I don't see pr_register() returning a negative errno, but I will
remove the "-" and document the expected return values.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-20 14:15       ` Christoph Hellwig
  2024-06-20 14:18         ` Chuck Lever III
@ 2024-06-20 15:45         ` Benjamin Coddington
  2024-06-20 15:48           ` Chuck Lever
  1 sibling, 1 reply; 30+ messages in thread
From: Benjamin Coddington @ 2024-06-20 15:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: cel, linux-nfs, Chuck Lever

On 20 Jun 2024, at 10:15, Christoph Hellwig wrote:

> On Thu, Jun 20, 2024 at 09:52:59AM -0400, Benjamin Coddington wrote:
>> On 20 Jun 2024, at 1:06, Christoph Hellwig wrote:
>>
>>> On Wed, Jun 19, 2024 at 01:39:33PM -0400, cel@kernel.org wrote:
>>>> -	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0)
>>>> +	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0) {
>>>
>>> It might be worth to invert this and keep the unavailable handling in
>>> the branch as that's the exceptional case.   That code is also woefully
>>> under-documented and could have really used a comment.
>>
>> The transient device handling in general, or just this bit of it?
>
> Basically the code behind this NFS_DEVICEID_UNAVAILABLE check here.

How about..


diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 6be13e0ec170..665c054784b4 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -564,25 +564,30 @@ bl_find_get_deviceid(struct nfs_server *server,
                gfp_t gfp_mask)
 {
        struct nfs4_deviceid_node *node;
-       unsigned long start, end;

 retry:
        node = nfs4_find_get_deviceid(server, id, cred, gfp_mask);
        if (!node)
                return ERR_PTR(-ENODEV);

-       if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0)
-               return node;
+       /* Transient devices are left in the cache with a timeout to minimize
+        * sending GETDEVINFO after every LAYOUTGET */
+       if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags)) {
+               unsigned long start, end;

-       end = jiffies;
-       start = end - PNFS_DEVICE_RETRY_TIMEOUT;
-       if (!time_in_range(node->timestamp_unavailable, start, end)) {
-               nfs4_delete_deviceid(node->ld, node->nfs_client, id);
-               goto retry;
+               end = jiffies;
+               start = end - PNFS_DEVICE_RETRY_TIMEOUT;
+
+               /* Maybe the device is back, let's look for it again */
+               if (!time_in_range(node->timestamp_unavailable, start, end)) {
+                       nfs4_delete_deviceid(node->ld, node->nfs_client, id);
+                       goto retry;
+               }
+               nfs4_put_deviceid_node(node);
+               return ERR_PTR(-ENODEV);
        }

-       nfs4_put_deviceid_node(node);
-       return ERR_PTR(-ENODEV);
+       return node;
 }

 static int


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-20 15:30       ` Benjamin Coddington
@ 2024-06-20 15:46         ` Chuck Lever
  2024-06-20 15:56           ` Benjamin Coddington
  0 siblings, 1 reply; 30+ messages in thread
From: Chuck Lever @ 2024-06-20 15:46 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: cel, linux-nfs, Christoph Hellwig

On Thu, Jun 20, 2024 at 11:30:54AM -0400, Benjamin Coddington wrote:
> On 20 Jun 2024, at 10:34, Chuck Lever wrote:
> 
> > On Thu, Jun 20, 2024 at 09:51:46AM -0400, Benjamin Coddington wrote:
> >> On 19 Jun 2024, at 13:39, cel@kernel.org wrote:
> >>
> >>> From: Chuck Lever <chuck.lever@oracle.com>
> >>>
> >>> During generic/069 runs with pNFS SCSI layouts, the NFS client emits
> >>> the following in the system journal:
> >>>
> >>> kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
> >>> kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
> >>> kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
> >>> kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
> >>> kernel: sd 6:0:0:1: reservation conflict
> >>> kernel: sd 6:0:0:1: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> >>> kernel: sd 6:0:0:1: [sdb] tag#16 CDB: Write(10) 2a 00 00 00 00 50 00 00 08 00
> >>> kernel: reservation conflict error, dev sdb, sector 80 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
> >>> kernel: sd 6:0:0:1: reservation conflict
> >>> kernel: sd 6:0:0:1: reservation conflict
> >>> kernel: sd 6:0:0:1: [sdb] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> >>> kernel: sd 6:0:0:1: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> >>> kernel: sd 6:0:0:1: [sdb] tag#18 CDB: Write(10) 2a 00 00 00 00 60 00 00 08 00
> >>> kernel: sd 6:0:0:1: [sdb] tag#17 CDB: Write(10) 2a 00 00 00 00 58 00 00 08 00
> >>> kernel: reservation conflict error, dev sdb, sector 96 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
> >>> kernel: reservation conflict error, dev sdb, sector 88 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
> >>> systemd[1]: fstests-generic-069.scope: Deactivated successfully.
> >>> systemd[1]: fstests-generic-069.scope: Consumed 5.092s CPU time.
> >>> systemd[1]: media-test.mount: Deactivated successfully.
> >>> systemd[1]: media-scratch.mount: Deactivated successfully.
> >>> kernel: sd 6:0:0:1: reservation conflict
> >>> kernel: failed to unregister PR key.
> >>>
> >>> This appears to be due to a race. bl_alloc_lseg() calls this:
> >>>
> >>> 561 static struct nfs4_deviceid_node *
> >>> 562 bl_find_get_deviceid(struct nfs_server *server,
> >>> 563                 const struct nfs4_deviceid *id, const struct cred *cred,
> >>> 564                 gfp_t gfp_mask)
> >>> 565 {
> >>> 566         struct nfs4_deviceid_node *node;
> >>> 567         unsigned long start, end;
> >>> 568
> >>> 569 retry:
> >>> 570         node = nfs4_find_get_deviceid(server, id, cred, gfp_mask);
> >>> 571         if (!node)
> >>> 572                 return ERR_PTR(-ENODEV);
> >>>
> >>> nfs4_find_get_deviceid() does a lookup without the spin lock first.
> >>> If it can't find a matching deviceid, it creates a new device_info
> >>> (which calls bl_alloc_deviceid_node, and that registers the device's
> >>> PR key).
> >>>
> >>> Then it takes the nfs4_deviceid_lock and looks up the deviceid again.
> >>> If it finds it this time, bl_find_get_deviceid() frees the spare
> >>> (new) device_info, which unregisters the PR key for the same device.
> >>>
> >>> Any subsequent I/O from this client on that device gets EBADE.
> >>>
> >>> The umount later unregisters the device's PR key again.
> >>>
> >>> To prevent this problem, register the PR key after the deviceid_node
> >>> lookup.
> >>
> >> Hi Chuck - nice catch, but I'm not seeing how we don't have the same problem
> >> after this patch, instead it just seems like it moves the race.  What
> >> prevents another process waiting to take the nfs4_deviceid_lock from
> >> unregistering the same device?  I think we need another way to signal
> >> bl_free_device that we don't want to unregister for the case where the new
> >> device isn't added to nfs4_deviceid_cache.
> >
> > That's a (related but) somewhat different issue. I haven't seen
> > that in practice so far.
> >
> >
> >> No good ideas yet - maybe we can use a flag set within the
> >> nfs4_deviceid_lock?
> >
> > Well this smells like a use for a reference count on the block
> > device, but fs/nfs doesn't control the definition of that data
> > structure.
> 
> I think we need two things to fix this race:
>  - a way to determine if the current thread is the one
>    that added the device to the to the cache, if so do the register
>  - a way to determine if we're freeing because we lost the race to the
>    cache, if so don't un-register.

My patch is supposed to deal with all of that already. Can you show
me specifically what is not getting handled by my proposed change?


> We can get both with a flag that's always set within the nfs4_deviceid_lock,
> something like NFS_DEVICEID_INIT.  If set, it signals we need to register in
> the case we keep the device, or skip de-registration in the case where we
> lost the race and throw it out.  We still need this patch to do the
> registration after it lands in the cache.



-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-20 15:45         ` Benjamin Coddington
@ 2024-06-20 15:48           ` Chuck Lever
  2024-06-20 15:58             ` Benjamin Coddington
  0 siblings, 1 reply; 30+ messages in thread
From: Chuck Lever @ 2024-06-20 15:48 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: Christoph Hellwig, cel, linux-nfs

On Thu, Jun 20, 2024 at 11:45:02AM -0400, Benjamin Coddington wrote:
> On 20 Jun 2024, at 10:15, Christoph Hellwig wrote:
> 
> > On Thu, Jun 20, 2024 at 09:52:59AM -0400, Benjamin Coddington wrote:
> >> On 20 Jun 2024, at 1:06, Christoph Hellwig wrote:
> >>
> >>> On Wed, Jun 19, 2024 at 01:39:33PM -0400, cel@kernel.org wrote:
> >>>> -	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0)
> >>>> +	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0) {
> >>>
> >>> It might be worth to invert this and keep the unavailable handling in
> >>> the branch as that's the exceptional case.   That code is also woefully
> >>> under-documented and could have really used a comment.
> >>
> >> The transient device handling in general, or just this bit of it?
> >
> > Basically the code behind this NFS_DEVICEID_UNAVAILABLE check here.
> 
> How about..

Let's leave this as a separate patch. IMO this is dealing with an
entirely orthogonal issue.


> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> index 6be13e0ec170..665c054784b4 100644
> --- a/fs/nfs/blocklayout/blocklayout.c
> +++ b/fs/nfs/blocklayout/blocklayout.c
> @@ -564,25 +564,30 @@ bl_find_get_deviceid(struct nfs_server *server,
>                 gfp_t gfp_mask)
>  {
>         struct nfs4_deviceid_node *node;
> -       unsigned long start, end;
> 
>  retry:
>         node = nfs4_find_get_deviceid(server, id, cred, gfp_mask);
>         if (!node)
>                 return ERR_PTR(-ENODEV);
> 
> -       if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0)
> -               return node;
> +       /* Transient devices are left in the cache with a timeout to minimize
> +        * sending GETDEVINFO after every LAYOUTGET */
> +       if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags)) {
> +               unsigned long start, end;
> 
> -       end = jiffies;
> -       start = end - PNFS_DEVICE_RETRY_TIMEOUT;
> -       if (!time_in_range(node->timestamp_unavailable, start, end)) {
> -               nfs4_delete_deviceid(node->ld, node->nfs_client, id);
> -               goto retry;
> +               end = jiffies;
> +               start = end - PNFS_DEVICE_RETRY_TIMEOUT;
> +
> +               /* Maybe the device is back, let's look for it again */
> +               if (!time_in_range(node->timestamp_unavailable, start, end)) {
> +                       nfs4_delete_deviceid(node->ld, node->nfs_client, id);
> +                       goto retry;
> +               }
> +               nfs4_put_deviceid_node(node);
> +               return ERR_PTR(-ENODEV);
>         }
> 
> -       nfs4_put_deviceid_node(node);
> -       return ERR_PTR(-ENODEV);
> +       return node;
>  }
> 
>  static int
> 

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-20 15:46         ` Chuck Lever
@ 2024-06-20 15:56           ` Benjamin Coddington
  2024-06-20 16:45             ` Benjamin Coddington
  0 siblings, 1 reply; 30+ messages in thread
From: Benjamin Coddington @ 2024-06-20 15:56 UTC (permalink / raw)
  To: Chuck Lever; +Cc: cel, linux-nfs, Christoph Hellwig

On 20 Jun 2024, at 11:46, Chuck Lever wrote:

> On Thu, Jun 20, 2024 at 11:30:54AM -0400, Benjamin Coddington wrote:
>> On 20 Jun 2024, at 10:34, Chuck Lever wrote:
>>
>>> On Thu, Jun 20, 2024 at 09:51:46AM -0400, Benjamin Coddington wrote:
>>>> On 19 Jun 2024, at 13:39, cel@kernel.org wrote:
>>>>
>>>>> From: Chuck Lever <chuck.lever@oracle.com>
>>>>>
>>>>> During generic/069 runs with pNFS SCSI layouts, the NFS client emits
>>>>> the following in the system journal:
>>>>>
>>>>> kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
>>>>> kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
>>>>> kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
>>>>> kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
>>>>> kernel: sd 6:0:0:1: reservation conflict
>>>>> kernel: sd 6:0:0:1: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
>>>>> kernel: sd 6:0:0:1: [sdb] tag#16 CDB: Write(10) 2a 00 00 00 00 50 00 00 08 00
>>>>> kernel: reservation conflict error, dev sdb, sector 80 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
>>>>> kernel: sd 6:0:0:1: reservation conflict
>>>>> kernel: sd 6:0:0:1: reservation conflict
>>>>> kernel: sd 6:0:0:1: [sdb] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
>>>>> kernel: sd 6:0:0:1: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
>>>>> kernel: sd 6:0:0:1: [sdb] tag#18 CDB: Write(10) 2a 00 00 00 00 60 00 00 08 00
>>>>> kernel: sd 6:0:0:1: [sdb] tag#17 CDB: Write(10) 2a 00 00 00 00 58 00 00 08 00
>>>>> kernel: reservation conflict error, dev sdb, sector 96 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
>>>>> kernel: reservation conflict error, dev sdb, sector 88 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
>>>>> systemd[1]: fstests-generic-069.scope: Deactivated successfully.
>>>>> systemd[1]: fstests-generic-069.scope: Consumed 5.092s CPU time.
>>>>> systemd[1]: media-test.mount: Deactivated successfully.
>>>>> systemd[1]: media-scratch.mount: Deactivated successfully.
>>>>> kernel: sd 6:0:0:1: reservation conflict
>>>>> kernel: failed to unregister PR key.
>>>>>
>>>>> This appears to be due to a race. bl_alloc_lseg() calls this:
>>>>>
>>>>> 561 static struct nfs4_deviceid_node *
>>>>> 562 bl_find_get_deviceid(struct nfs_server *server,
>>>>> 563                 const struct nfs4_deviceid *id, const struct cred *cred,
>>>>> 564                 gfp_t gfp_mask)
>>>>> 565 {
>>>>> 566         struct nfs4_deviceid_node *node;
>>>>> 567         unsigned long start, end;
>>>>> 568
>>>>> 569 retry:
>>>>> 570         node = nfs4_find_get_deviceid(server, id, cred, gfp_mask);
>>>>> 571         if (!node)
>>>>> 572                 return ERR_PTR(-ENODEV);
>>>>>
>>>>> nfs4_find_get_deviceid() does a lookup without the spin lock first.
>>>>> If it can't find a matching deviceid, it creates a new device_info
>>>>> (which calls bl_alloc_deviceid_node, and that registers the device's
>>>>> PR key).
>>>>>
>>>>> Then it takes the nfs4_deviceid_lock and looks up the deviceid again.
>>>>> If it finds it this time, bl_find_get_deviceid() frees the spare
>>>>> (new) device_info, which unregisters the PR key for the same device.
>>>>>
>>>>> Any subsequent I/O from this client on that device gets EBADE.
>>>>>
>>>>> The umount later unregisters the device's PR key again.
>>>>>
>>>>> To prevent this problem, register the PR key after the deviceid_node
>>>>> lookup.
>>>>
>>>> Hi Chuck - nice catch, but I'm not seeing how we don't have the same problem
>>>> after this patch, instead it just seems like it moves the race.  What
>>>> prevents another process waiting to take the nfs4_deviceid_lock from
>>>> unregistering the same device?  I think we need another way to signal
>>>> bl_free_device that we don't want to unregister for the case where the new
>>>> device isn't added to nfs4_deviceid_cache.
>>>
>>> That's a (related but) somewhat different issue. I haven't seen
>>> that in practice so far.
>>>
>>>
>>>> No good ideas yet - maybe we can use a flag set within the
>>>> nfs4_deviceid_lock?
>>>
>>> Well this smells like a use for a reference count on the block
>>> device, but fs/nfs doesn't control the definition of that data
>>> structure.
>>
>> I think we need two things to fix this race:
>>  - a way to determine if the current thread is the one
>>    that added the device to the to the cache, if so do the register
>>  - a way to determine if we're freeing because we lost the race to the
>>    cache, if so don't un-register.
>
> My patch is supposed to deal with all of that already. Can you show
> me specifically what is not getting handled by my proposed change?

Well - I may be missing something, but it looks like with this patch you can
still have:

Thread
A                           B

nfs4_find_get_deviceid
new{a} = nfs4_get_device_info
locks nfs4_deviceid_lock
                            nfs4_find_get_deviceid
                            new{b} = nfs4_get_device_info
                            spins on nfs4_deviceid_lock
adds new{a} to the cache
unlocks nfs4_deviceid_lock
pr_register
                            locks nfs4_deviceid_lock
                            finds new{a}
                            pr_UNregisters new{b}

In this case, you end up with an unregistered device.


Also, you can have more than one thread doing the initial pr_register, but I
think as we've already discussed that's no big deal - it should be rare and
I don't think it returns an error.

Ben


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-20 15:48           ` Chuck Lever
@ 2024-06-20 15:58             ` Benjamin Coddington
  0 siblings, 0 replies; 30+ messages in thread
From: Benjamin Coddington @ 2024-06-20 15:58 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Christoph Hellwig, cel, linux-nfs

On 20 Jun 2024, at 11:48, Chuck Lever wrote:

> On Thu, Jun 20, 2024 at 11:45:02AM -0400, Benjamin Coddington wrote:
>> On 20 Jun 2024, at 10:15, Christoph Hellwig wrote:
>>
>>> On Thu, Jun 20, 2024 at 09:52:59AM -0400, Benjamin Coddington wrote:
>>>> On 20 Jun 2024, at 1:06, Christoph Hellwig wrote:
>>>>
>>>>> On Wed, Jun 19, 2024 at 01:39:33PM -0400, cel@kernel.org wrote:
>>>>>> -	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0)
>>>>>> +	if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags) == 0) {
>>>>>
>>>>> It might be worth to invert this and keep the unavailable handling in
>>>>> the branch as that's the exceptional case.   That code is also woefully
>>>>> under-documented and could have really used a comment.
>>>>
>>>> The transient device handling in general, or just this bit of it?
>>>
>>> Basically the code behind this NFS_DEVICEID_UNAVAILABLE check here.
>>
>> How about..
>
> Let's leave this as a separate patch. IMO this is dealing with an
> entirely orthogonal issue.

Agree - just wanted feedback on what's appropriate for comments in here.  I
can send something after your work.

Ben


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-20 15:56           ` Benjamin Coddington
@ 2024-06-20 16:45             ` Benjamin Coddington
  2024-06-20 17:08               ` Chuck Lever
  0 siblings, 1 reply; 30+ messages in thread
From: Benjamin Coddington @ 2024-06-20 16:45 UTC (permalink / raw)
  To: Chuck Lever; +Cc: cel, linux-nfs, Christoph Hellwig

On 20 Jun 2024, at 11:56, Benjamin Coddington wrote:

> On 20 Jun 2024, at 11:46, Chuck Lever wrote:
>
>> On Thu, Jun 20, 2024 at 11:30:54AM -0400, Benjamin Coddington wrote:
>>> On 20 Jun 2024, at 10:34, Chuck Lever wrote:
>>>
>>>> On Thu, Jun 20, 2024 at 09:51:46AM -0400, Benjamin Coddington wrote:
>>>>> On 19 Jun 2024, at 13:39, cel@kernel.org wrote:
>>>>>
>>>>>> From: Chuck Lever <chuck.lever@oracle.com>
>>>>>>
>>>>>> During generic/069 runs with pNFS SCSI layouts, the NFS client emits
>>>>>> the following in the system journal:
>>>>>>
>>>>>> kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
>>>>>> kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
>>>>>> kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
>>>>>> kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
>>>>>> kernel: sd 6:0:0:1: reservation conflict
>>>>>> kernel: sd 6:0:0:1: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
>>>>>> kernel: sd 6:0:0:1: [sdb] tag#16 CDB: Write(10) 2a 00 00 00 00 50 00 00 08 00
>>>>>> kernel: reservation conflict error, dev sdb, sector 80 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
>>>>>> kernel: sd 6:0:0:1: reservation conflict
>>>>>> kernel: sd 6:0:0:1: reservation conflict
>>>>>> kernel: sd 6:0:0:1: [sdb] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
>>>>>> kernel: sd 6:0:0:1: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
>>>>>> kernel: sd 6:0:0:1: [sdb] tag#18 CDB: Write(10) 2a 00 00 00 00 60 00 00 08 00
>>>>>> kernel: sd 6:0:0:1: [sdb] tag#17 CDB: Write(10) 2a 00 00 00 00 58 00 00 08 00
>>>>>> kernel: reservation conflict error, dev sdb, sector 96 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
>>>>>> kernel: reservation conflict error, dev sdb, sector 88 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
>>>>>> systemd[1]: fstests-generic-069.scope: Deactivated successfully.
>>>>>> systemd[1]: fstests-generic-069.scope: Consumed 5.092s CPU time.
>>>>>> systemd[1]: media-test.mount: Deactivated successfully.
>>>>>> systemd[1]: media-scratch.mount: Deactivated successfully.
>>>>>> kernel: sd 6:0:0:1: reservation conflict
>>>>>> kernel: failed to unregister PR key.
>>>>>>
>>>>>> This appears to be due to a race. bl_alloc_lseg() calls this:
>>>>>>
>>>>>> 561 static struct nfs4_deviceid_node *
>>>>>> 562 bl_find_get_deviceid(struct nfs_server *server,
>>>>>> 563                 const struct nfs4_deviceid *id, const struct cred *cred,
>>>>>> 564                 gfp_t gfp_mask)
>>>>>> 565 {
>>>>>> 566         struct nfs4_deviceid_node *node;
>>>>>> 567         unsigned long start, end;
>>>>>> 568
>>>>>> 569 retry:
>>>>>> 570         node = nfs4_find_get_deviceid(server, id, cred, gfp_mask);
>>>>>> 571         if (!node)
>>>>>> 572                 return ERR_PTR(-ENODEV);
>>>>>>
>>>>>> nfs4_find_get_deviceid() does a lookup without the spin lock first.
>>>>>> If it can't find a matching deviceid, it creates a new device_info
>>>>>> (which calls bl_alloc_deviceid_node, and that registers the device's
>>>>>> PR key).
>>>>>>
>>>>>> Then it takes the nfs4_deviceid_lock and looks up the deviceid again.
>>>>>> If it finds it this time, bl_find_get_deviceid() frees the spare
>>>>>> (new) device_info, which unregisters the PR key for the same device.
>>>>>>
>>>>>> Any subsequent I/O from this client on that device gets EBADE.
>>>>>>
>>>>>> The umount later unregisters the device's PR key again.
>>>>>>
>>>>>> To prevent this problem, register the PR key after the deviceid_node
>>>>>> lookup.
>>>>>
>>>>> Hi Chuck - nice catch, but I'm not seeing how we don't have the same problem
>>>>> after this patch, instead it just seems like it moves the race.  What
>>>>> prevents another process waiting to take the nfs4_deviceid_lock from
>>>>> unregistering the same device?  I think we need another way to signal
>>>>> bl_free_device that we don't want to unregister for the case where the new
>>>>> device isn't added to nfs4_deviceid_cache.
>>>>
>>>> That's a (related but) somewhat different issue. I haven't seen
>>>> that in practice so far.
>>>>
>>>>
>>>>> No good ideas yet - maybe we can use a flag set within the
>>>>> nfs4_deviceid_lock?
>>>>
>>>> Well this smells like a use for a reference count on the block
>>>> device, but fs/nfs doesn't control the definition of that data
>>>> structure.
>>>
>>> I think we need two things to fix this race:
>>>  - a way to determine if the current thread is the one
>>>    that added the device to the to the cache, if so do the register
>>>  - a way to determine if we're freeing because we lost the race to the
>>>    cache, if so don't un-register.
>>
>> My patch is supposed to deal with all of that already. Can you show
>> me specifically what is not getting handled by my proposed change?
>
> Well - I may be missing something, but it looks like with this patch you can
> still have:
>
> Thread
> A                           B
>
> nfs4_find_get_deviceid
> new{a} = nfs4_get_device_info
> locks nfs4_deviceid_lock
>                             nfs4_find_get_deviceid
>                             new{b} = nfs4_get_device_info
>                             spins on nfs4_deviceid_lock
> adds new{a} to the cache
> unlocks nfs4_deviceid_lock
> pr_register
>                             locks nfs4_deviceid_lock
>                             finds new{a}
>                             pr_UNregisters new{b}
>
> In this case, you end up with an unregistered device.

Oh jeez Chuck, nevermind - I am missing something, that we check the the
new{b} pnfs_block_dev->pr_registred before unregistering it.

So, actually - I think this patch does solve the problem.  I apologize for
the noise here.

Ben


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration
  2024-06-20 16:45             ` Benjamin Coddington
@ 2024-06-20 17:08               ` Chuck Lever
  0 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2024-06-20 17:08 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: cel, linux-nfs, Christoph Hellwig

On Thu, Jun 20, 2024 at 12:45:56PM -0400, Benjamin Coddington wrote:
> On 20 Jun 2024, at 11:56, Benjamin Coddington wrote:
> 
> > On 20 Jun 2024, at 11:46, Chuck Lever wrote:
> >
> >> On Thu, Jun 20, 2024 at 11:30:54AM -0400, Benjamin Coddington wrote:
> >>> On 20 Jun 2024, at 10:34, Chuck Lever wrote:
> >>>
> >>>> On Thu, Jun 20, 2024 at 09:51:46AM -0400, Benjamin Coddington wrote:
> >>>>> On 19 Jun 2024, at 13:39, cel@kernel.org wrote:
> >>>>>
> >>>>>> From: Chuck Lever <chuck.lever@oracle.com>
> >>>>>>
> >>>>>> During generic/069 runs with pNFS SCSI layouts, the NFS client emits
> >>>>>> the following in the system journal:
> >>>>>>
> >>>>>> kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
> >>>>>> kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
> >>>>>> kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
> >>>>>> kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
> >>>>>> kernel: sd 6:0:0:1: reservation conflict
> >>>>>> kernel: sd 6:0:0:1: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> >>>>>> kernel: sd 6:0:0:1: [sdb] tag#16 CDB: Write(10) 2a 00 00 00 00 50 00 00 08 00
> >>>>>> kernel: reservation conflict error, dev sdb, sector 80 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
> >>>>>> kernel: sd 6:0:0:1: reservation conflict
> >>>>>> kernel: sd 6:0:0:1: reservation conflict
> >>>>>> kernel: sd 6:0:0:1: [sdb] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> >>>>>> kernel: sd 6:0:0:1: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> >>>>>> kernel: sd 6:0:0:1: [sdb] tag#18 CDB: Write(10) 2a 00 00 00 00 60 00 00 08 00
> >>>>>> kernel: sd 6:0:0:1: [sdb] tag#17 CDB: Write(10) 2a 00 00 00 00 58 00 00 08 00
> >>>>>> kernel: reservation conflict error, dev sdb, sector 96 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
> >>>>>> kernel: reservation conflict error, dev sdb, sector 88 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
> >>>>>> systemd[1]: fstests-generic-069.scope: Deactivated successfully.
> >>>>>> systemd[1]: fstests-generic-069.scope: Consumed 5.092s CPU time.
> >>>>>> systemd[1]: media-test.mount: Deactivated successfully.
> >>>>>> systemd[1]: media-scratch.mount: Deactivated successfully.
> >>>>>> kernel: sd 6:0:0:1: reservation conflict
> >>>>>> kernel: failed to unregister PR key.
> >>>>>>
> >>>>>> This appears to be due to a race. bl_alloc_lseg() calls this:
> >>>>>>
> >>>>>> 561 static struct nfs4_deviceid_node *
> >>>>>> 562 bl_find_get_deviceid(struct nfs_server *server,
> >>>>>> 563                 const struct nfs4_deviceid *id, const struct cred *cred,
> >>>>>> 564                 gfp_t gfp_mask)
> >>>>>> 565 {
> >>>>>> 566         struct nfs4_deviceid_node *node;
> >>>>>> 567         unsigned long start, end;
> >>>>>> 568
> >>>>>> 569 retry:
> >>>>>> 570         node = nfs4_find_get_deviceid(server, id, cred, gfp_mask);
> >>>>>> 571         if (!node)
> >>>>>> 572                 return ERR_PTR(-ENODEV);
> >>>>>>
> >>>>>> nfs4_find_get_deviceid() does a lookup without the spin lock first.
> >>>>>> If it can't find a matching deviceid, it creates a new device_info
> >>>>>> (which calls bl_alloc_deviceid_node, and that registers the device's
> >>>>>> PR key).
> >>>>>>
> >>>>>> Then it takes the nfs4_deviceid_lock and looks up the deviceid again.
> >>>>>> If it finds it this time, bl_find_get_deviceid() frees the spare
> >>>>>> (new) device_info, which unregisters the PR key for the same device.
> >>>>>>
> >>>>>> Any subsequent I/O from this client on that device gets EBADE.
> >>>>>>
> >>>>>> The umount later unregisters the device's PR key again.
> >>>>>>
> >>>>>> To prevent this problem, register the PR key after the deviceid_node
> >>>>>> lookup.
> >>>>>
> >>>>> Hi Chuck - nice catch, but I'm not seeing how we don't have the same problem
> >>>>> after this patch, instead it just seems like it moves the race.  What
> >>>>> prevents another process waiting to take the nfs4_deviceid_lock from
> >>>>> unregistering the same device?  I think we need another way to signal
> >>>>> bl_free_device that we don't want to unregister for the case where the new
> >>>>> device isn't added to nfs4_deviceid_cache.
> >>>>
> >>>> That's a (related but) somewhat different issue. I haven't seen
> >>>> that in practice so far.
> >>>>
> >>>>
> >>>>> No good ideas yet - maybe we can use a flag set within the
> >>>>> nfs4_deviceid_lock?
> >>>>
> >>>> Well this smells like a use for a reference count on the block
> >>>> device, but fs/nfs doesn't control the definition of that data
> >>>> structure.
> >>>
> >>> I think we need two things to fix this race:
> >>>  - a way to determine if the current thread is the one
> >>>    that added the device to the to the cache, if so do the register
> >>>  - a way to determine if we're freeing because we lost the race to the
> >>>    cache, if so don't un-register.
> >>
> >> My patch is supposed to deal with all of that already. Can you show
> >> me specifically what is not getting handled by my proposed change?
> >
> > Well - I may be missing something, but it looks like with this patch you can
> > still have:
> >
> > Thread
> > A                           B
> >
> > nfs4_find_get_deviceid
> > new{a} = nfs4_get_device_info
> > locks nfs4_deviceid_lock
> >                             nfs4_find_get_deviceid
> >                             new{b} = nfs4_get_device_info
> >                             spins on nfs4_deviceid_lock
> > adds new{a} to the cache
> > unlocks nfs4_deviceid_lock
> > pr_register
> >                             locks nfs4_deviceid_lock
> >                             finds new{a}
> >                             pr_UNregisters new{b}
> >
> > In this case, you end up with an unregistered device.
> 
> Oh jeez Chuck, nevermind - I am missing something, that we check the the
> new{b} pnfs_block_dev->pr_registred before unregistering it.
> 
> So, actually - I think this patch does solve the problem.  I apologize for
> the noise here.

Thanks! Was wondering, because I thought that was exactly the race
I was trying to fix!

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2024-06-20 17:09 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-19 17:39 [RFC PATCH 0/4] Snapshot of fixes for SCSI PR key registration cel
2024-06-19 17:39 ` [RFC PATCH 1/4] nfs/blocklayout: SCSI layout trace points for reservation key reg/unreg cel
2024-06-20  4:50   ` Christoph Hellwig
2024-06-20  4:52     ` Christoph Hellwig
2024-06-20 14:30     ` Chuck Lever
2024-06-19 17:39 ` [RFC PATCH 2/4] nfs/blocklayout: Report only when /no/ device is found cel
2024-06-20  4:36   ` Christoph Hellwig
2024-06-20 14:59     ` Chuck Lever
2024-06-20 12:17   ` Benjamin Coddington
2024-06-20 14:10     ` Christoph Hellwig
2024-06-19 17:39 ` [RFC PATCH 3/4] nfs/blocklayout: Fix premature PR key unregistration cel
2024-06-20  5:06   ` Christoph Hellwig
2024-06-20 13:52     ` Benjamin Coddington
2024-06-20 13:58       ` Chuck Lever
2024-06-20 14:15       ` Christoph Hellwig
2024-06-20 14:18         ` Chuck Lever III
2024-06-20 15:45         ` Benjamin Coddington
2024-06-20 15:48           ` Chuck Lever
2024-06-20 15:58             ` Benjamin Coddington
2024-06-20 15:39     ` Chuck Lever
2024-06-20 13:51   ` Benjamin Coddington
2024-06-20 14:34     ` Chuck Lever
2024-06-20 14:37       ` Christoph Hellwig
2024-06-20 15:30       ` Benjamin Coddington
2024-06-20 15:46         ` Chuck Lever
2024-06-20 15:56           ` Benjamin Coddington
2024-06-20 16:45             ` Benjamin Coddington
2024-06-20 17:08               ` Chuck Lever
2024-06-19 17:39 ` [RFC PATCH 4/4] nfs/blocklayout: Use bulk page allocation APIs cel
2024-06-20  4:44   ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox