* [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib
@ 2025-03-13 23:32 David Howells
From: David Howells @ 2025-03-13 23:32 UTC
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Hi Viacheslav, Alex,
[!] NOTE: This is a preview of a work in progress. rbd works and ceph
works for plain I/O, but content crypto does not.
[!] NOTE: These patches are based on some other sets of patches not
included in this posting. They are, however, included in the git
branch mentioned below.
These patches do a number of things:
(1) (Mostly) collapse the different I/O types (PAGES, PAGELIST, BVECS,
BIO) down to a single one.
I added a new type, ceph_databuf, to make this easier. The page list
is attached to that as a bio_vec[] with an iov_iter, but could also be
some other type supported by the iov_iter. The iov_iter defines the
data or buffer to be used. I have an additional iov_iter type
implemented that allows use of a straight folio[] or page[] instead of
a bio_vec[] that I can deploy if that proves more useful. (A short
usage sketch of the databuf API follows this list.)
(2) RBD is modified to get rid of the removed page-list types and I think
now fully works.
(3) Ceph is mostly converted to using netfslib. At this point, it can do
plain reads and writes, but content crypto is currently
non-functional.
Multipage folios are enabled and work (all the support for that is
hidden inside of netfslib).
(4) The old Ceph VFS/VM I/O API implementation is removed. With that, as
the code currently stands, the patches overall result in a ~2500 LoC
reduction. That saving may shrink as some more bits still need
transferring from the old code to the new code.
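For illustration, here's a minimal sketch of how the databuf API introduced
in patch 03 gets used (data/data_len are placeholders; error handling mostly
elided):

	struct ceph_databuf *dbuf;

	dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_KERNEL);
	if (!dbuf)
		return -ENOMEM;
	ceph_databuf_encode_32(dbuf, 1);	   /* encode a value into the buffer */
	ceph_databuf_append(dbuf, data, data_len); /* copy arbitrary bytes in */
	/* ... attach to a request; dbuf->iter now describes the payload ... */
	ceph_databuf_release(dbuf);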
The conversion isn't quite complete:
(1) ceph_osd_linger_request::preply_pages needs switching over to a
ceph_databuf, but I haven't yet managed to work out how the pages that
handle_watch_notify() sticks in there come about.
(2) I haven't altered data transmission in net/ceph/messenger*.c yet. The
aim is to reduce it to a single sendmsg() call for each ceph_msg_data
struct, using the iov_iter therein.
(3) The data reception routines in net/ceph/messenger*.c also need
modifying to pass each ceph_msg_data::iter to recvmsg() in turn.
(4) It might be possible to merge struct ceph_databuf into struct
ceph_msg_data and eliminate the former.
(5) fs/ceph/ still needs a bit more work to clean up the use of page
arrays.
(6) I would like to replace the front and middle buffers with a
ceph_databuf, vmapping them when we need to access them.
I added a kmap_ceph_databuf_page() macro that gets a page from a databuf
and maps it with kmap_local_page(); keeping the internal bvec[] hidden this
way should make it easier to replace later.
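For illustration, the usage pattern (as in the fill_fscrypt_truncate()
conversion later in this series) is:

	struct ceph_fscrypt_truncate_size_header *header;

	header = kmap_ceph_databuf_page(dbuf, 0); /* map page 0 of the databuf */
	header->ver = 1;
	header->compat = 1;
	kunmap_local(header);

so switching the underlying representation later only means changing the
macro.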
Anyway, if anyone has any thoughts...
I've pushed the patches here also:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=ceph-iter
David
David Howells (35):
ceph: Fix incorrect flush end position calculation
libceph: Rename alignment to offset
libceph: Add a new data container type, ceph_databuf
ceph: Convert ceph_mds_request::r_pagelist to a databuf
libceph: Add functions to add ceph_databufs to requests
rbd: Use ceph_databuf for rbd_obj_read_sync()
libceph: Change ceph_osdc_call()'s reply to a ceph_databuf
libceph: Unexport osd_req_op_cls_request_data_pages()
libceph: Remove osd_req_op_cls_response_data_pages()
libceph: Convert notify_id_pages to a ceph_databuf
ceph: Use ceph_databuf in DIO
libceph: Bypass the messenger-v1 Tx loop for databuf/iter data blobs
rbd: Switch from using bvec_iter to iov_iter
libceph: Remove bvec and bio data container types
libceph: Make osd_req_op_cls_init() use a ceph_databuf and map it
libceph: Convert req_page of ceph_osdc_call() to ceph_databuf
libceph, rbd: Use ceph_databuf encoding start/stop
libceph, rbd: Convert some page arrays to ceph_databuf
libceph, ceph: Convert users of ceph_pagelist to ceph_databuf
libceph: Remove ceph_pagelist
libceph: Make notify code use ceph_databuf_enc_start/stop
libceph, rbd: Convert ceph_osdc_notify() reply to ceph_databuf
rbd: Use ceph_databuf_enc_start/stop()
ceph: Make ceph_calc_file_object_mapping() return size as size_t
ceph: Wrap POSIX_FADV_WILLNEED to get caps
ceph: Kill ceph_rw_context
netfs: Pass extra write context to write functions
netfs: Adjust group handling
netfs: Allow fs-private data to be handed through to request alloc
netfs: Make netfs_page_mkwrite() use folio_mkwrite_check_truncate()
netfs: Fix netfs_unbuffered_read() to return ssize_t rather than int
netfs: Add some more RMW support for ceph
ceph: Use netfslib [INCOMPLETE]
ceph: Enable multipage folios for ceph files
ceph: Remove old I/O API bits
drivers/block/rbd.c | 904 ++++++--------
fs/9p/vfs_file.c | 2 +-
fs/afs/write.c | 2 +-
fs/ceph/Makefile | 2 +-
fs/ceph/acl.c | 39 +-
fs/ceph/addr.c | 2009 +------------------------------
fs/ceph/cache.h | 5 +
fs/ceph/caps.c | 2 +-
fs/ceph/crypto.c | 56 +-
fs/ceph/file.c | 1810 +++-------------------------
fs/ceph/inode.c | 116 +-
fs/ceph/ioctl.c | 2 +-
fs/ceph/locks.c | 23 +-
fs/ceph/mds_client.c | 134 +--
fs/ceph/mds_client.h | 2 +-
fs/ceph/rdwr.c | 1006 ++++++++++++++++
fs/ceph/super.h | 81 +-
fs/ceph/xattr.c | 69 +-
fs/netfs/buffered_read.c | 11 +-
fs/netfs/buffered_write.c | 48 +-
fs/netfs/direct_read.c | 83 +-
fs/netfs/direct_write.c | 3 +-
fs/netfs/internal.h | 40 +-
fs/netfs/main.c | 5 +-
fs/netfs/objects.c | 4 +
fs/netfs/read_collect.c | 2 +
fs/netfs/read_pgpriv2.c | 2 +-
fs/netfs/read_single.c | 2 +-
fs/netfs/write_issue.c | 55 +-
fs/netfs/write_retry.c | 5 +-
fs/smb/client/file.c | 4 +-
include/linux/ceph/databuf.h | 169 +++
include/linux/ceph/decode.h | 4 +-
include/linux/ceph/libceph.h | 3 +-
include/linux/ceph/messenger.h | 122 +-
include/linux/ceph/osd_client.h | 87 +-
include/linux/ceph/pagelist.h | 60 -
include/linux/ceph/striper.h | 60 +-
include/linux/netfs.h | 89 +-
include/trace/events/netfs.h | 3 +
net/ceph/Makefile | 5 +-
net/ceph/cls_lock_client.c | 200 ++-
net/ceph/databuf.c | 200 +++
net/ceph/messenger.c | 310 +----
net/ceph/messenger_v1.c | 76 +-
net/ceph/mon_client.c | 10 +-
net/ceph/osd_client.c | 510 +++-----
net/ceph/pagelist.c | 133 --
net/ceph/snapshot.c | 20 +-
net/ceph/striper.c | 57 +-
50 files changed, 2996 insertions(+), 5650 deletions(-)
create mode 100644 fs/ceph/rdwr.c
create mode 100644 include/linux/ceph/databuf.h
delete mode 100644 include/linux/ceph/pagelist.h
create mode 100644 net/ceph/databuf.c
delete mode 100644 net/ceph/pagelist.c
* [RFC PATCH 01/35] ceph: Fix incorrect flush end position calculation
From: David Howells @ 2025-03-13 23:32 UTC
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel, Xiubo Li
In ceph, in fill_fscrypt_truncate(), the flush end position is calculated
by:
loff_t lend = orig_pos + CEPH_FSCRYPT_BLOCK_SHIFT - 1;
but that's using the block shift, not the block size. Fix it to use the
block size instead.
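For example, assuming the usual 4096-byte fscrypt block (so a block shift
of 12), the old calculation only covers the first 12 bytes of the block:

	lend = orig_pos + CEPH_FSCRYPT_BLOCK_SHIFT - 1;	/* orig_pos + 11 */
	lend = orig_pos + CEPH_FSCRYPT_BLOCK_SIZE - 1;	/* orig_pos + 4095 */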
Fixes: 5c64737d2536 ("ceph: add truncate size handling support for fscrypt")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
fs/ceph/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index c15970fa240f..b060f765ad20 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2364,7 +2364,7 @@ static int fill_fscrypt_truncate(struct inode *inode,
/* Try to writeback the dirty pagecaches */
if (issued & (CEPH_CAP_FILE_BUFFER)) {
- loff_t lend = orig_pos + CEPH_FSCRYPT_BLOCK_SHIFT - 1;
+ loff_t lend = orig_pos + CEPH_FSCRYPT_BLOCK_SIZE - 1;
ret = filemap_write_and_wait_range(inode->i_mapping,
orig_pos, lend);
* [RFC PATCH 02/35] libceph: Rename alignment to offset
From: David Howells @ 2025-03-13 23:32 UTC
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Rename 'alignment' to 'offset' in a number of places where the value is
actually the offset into the first page of a sequence of pages.
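For example, calc_pages_for() treats this value as the byte offset into the
first page when working out how many pages a run of data spans (assuming
4KiB pages):

	num_pages = calc_pages_for(0x300, 0x2000); /* bytes 0x300-0x22ff: 3 pages */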
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
fs/ceph/addr.c | 4 ++--
include/linux/ceph/messenger.h | 4 ++--
include/linux/ceph/osd_client.h | 10 +++++-----
net/ceph/messenger.c | 10 +++++-----
net/ceph/osd_client.c | 24 ++++++++++++------------
5 files changed, 26 insertions(+), 26 deletions(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 20b6bd8cd004..482a9f41a685 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -254,7 +254,7 @@ static void finish_netfs_read(struct ceph_osd_request *req)
if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES) {
ceph_put_page_vector(osd_data->pages,
- calc_pages_for(osd_data->alignment,
+ calc_pages_for(osd_data->offset,
osd_data->length), false);
}
if (err > 0) {
@@ -918,7 +918,7 @@ static void writepages_finish(struct ceph_osd_request *req)
osd_data = osd_req_op_extent_osd_data(req, i);
BUG_ON(osd_data->type != CEPH_OSD_DATA_TYPE_PAGES);
len += osd_data->length;
- num_pages = calc_pages_for((u64)osd_data->alignment,
+ num_pages = calc_pages_for((u64)osd_data->offset,
(u64)osd_data->length);
total_pages += num_pages;
for (j = 0; j < num_pages; j++) {
diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 1717cc57cdac..db2aba32b8a0 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -221,7 +221,7 @@ struct ceph_msg_data {
struct {
struct page **pages;
size_t length; /* total # bytes */
- unsigned int alignment; /* first page */
+ unsigned int offset; /* first page */
bool own_pages;
};
struct ceph_pagelist *pagelist;
@@ -602,7 +602,7 @@ extern bool ceph_con_keepalive_expired(struct ceph_connection *con,
unsigned long interval);
void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
- size_t length, size_t alignment, bool own_pages);
+ size_t length, size_t offset, bool own_pages);
extern void ceph_msg_data_add_pagelist(struct ceph_msg *msg,
struct ceph_pagelist *pagelist);
#ifdef CONFIG_BLOCK
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index d55b30057a45..8fc84f389aad 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -118,7 +118,7 @@ struct ceph_osd_data {
struct {
struct page **pages;
u64 length;
- u32 alignment;
+ u32 offset;
bool pages_from_pool;
bool own_pages;
};
@@ -469,7 +469,7 @@ struct ceph_osd_req_op *osd_req_op_init(struct ceph_osd_request *osd_req,
extern void osd_req_op_raw_data_in_pages(struct ceph_osd_request *,
unsigned int which,
struct page **pages, u64 length,
- u32 alignment, bool pages_from_pool,
+ u32 offset, bool pages_from_pool,
bool own_pages);
extern void osd_req_op_extent_init(struct ceph_osd_request *osd_req,
@@ -488,7 +488,7 @@ extern struct ceph_osd_data *osd_req_op_extent_osd_data(
extern void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *,
unsigned int which,
struct page **pages, u64 length,
- u32 alignment, bool pages_from_pool,
+ u32 offset, bool pages_from_pool,
bool own_pages);
extern void osd_req_op_extent_osd_data_pagelist(struct ceph_osd_request *,
unsigned int which,
@@ -515,7 +515,7 @@ extern void osd_req_op_cls_request_data_pagelist(struct ceph_osd_request *,
extern void osd_req_op_cls_request_data_pages(struct ceph_osd_request *,
unsigned int which,
struct page **pages, u64 length,
- u32 alignment, bool pages_from_pool,
+ u32 offset, bool pages_from_pool,
bool own_pages);
void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req,
unsigned int which,
@@ -524,7 +524,7 @@ void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req,
extern void osd_req_op_cls_response_data_pages(struct ceph_osd_request *,
unsigned int which,
struct page **pages, u64 length,
- u32 alignment, bool pages_from_pool,
+ u32 offset, bool pages_from_pool,
bool own_pages);
int osd_req_op_cls_init(struct ceph_osd_request *osd_req, unsigned int which,
const char *class, const char *method);
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index d1b5705dc0c6..1df4291cc80b 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -840,8 +840,8 @@ static void ceph_msg_data_pages_cursor_init(struct ceph_msg_data_cursor *cursor,
BUG_ON(!data->length);
cursor->resid = min(length, data->length);
- page_count = calc_pages_for(data->alignment, (u64)data->length);
- cursor->page_offset = data->alignment & ~PAGE_MASK;
+ page_count = calc_pages_for(data->offset, (u64)data->length);
+ cursor->page_offset = data->offset & ~PAGE_MASK;
cursor->page_index = 0;
BUG_ON(page_count > (int)USHRT_MAX);
cursor->page_count = (unsigned short)page_count;
@@ -1873,7 +1873,7 @@ static struct ceph_msg_data *ceph_msg_data_add(struct ceph_msg *msg)
static void ceph_msg_data_destroy(struct ceph_msg_data *data)
{
if (data->type == CEPH_MSG_DATA_PAGES && data->own_pages) {
- int num_pages = calc_pages_for(data->alignment, data->length);
+ int num_pages = calc_pages_for(data->offset, data->length);
ceph_release_page_vector(data->pages, num_pages);
} else if (data->type == CEPH_MSG_DATA_PAGELIST) {
ceph_pagelist_release(data->pagelist);
@@ -1881,7 +1881,7 @@ static void ceph_msg_data_destroy(struct ceph_msg_data *data)
}
void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
- size_t length, size_t alignment, bool own_pages)
+ size_t length, size_t offset, bool own_pages)
{
struct ceph_msg_data *data;
@@ -1892,7 +1892,7 @@ void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
data->type = CEPH_MSG_DATA_PAGES;
data->pages = pages;
data->length = length;
- data->alignment = alignment & ~PAGE_MASK;
+ data->offset = offset & ~PAGE_MASK;
data->own_pages = own_pages;
msg->data_length += length;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index b24afec24138..e359e70ad47e 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -130,13 +130,13 @@ static void ceph_osd_data_init(struct ceph_osd_data *osd_data)
* Consumes @pages if @own_pages is true.
*/
static void ceph_osd_data_pages_init(struct ceph_osd_data *osd_data,
- struct page **pages, u64 length, u32 alignment,
+ struct page **pages, u64 length, u32 offset,
bool pages_from_pool, bool own_pages)
{
osd_data->type = CEPH_OSD_DATA_TYPE_PAGES;
osd_data->pages = pages;
osd_data->length = length;
- osd_data->alignment = alignment;
+ osd_data->offset = offset;
osd_data->pages_from_pool = pages_from_pool;
osd_data->own_pages = own_pages;
}
@@ -196,26 +196,26 @@ EXPORT_SYMBOL(osd_req_op_extent_osd_data);
void osd_req_op_raw_data_in_pages(struct ceph_osd_request *osd_req,
unsigned int which, struct page **pages,
- u64 length, u32 alignment,
+ u64 length, u32 offset,
bool pages_from_pool, bool own_pages)
{
struct ceph_osd_data *osd_data;
osd_data = osd_req_op_raw_data_in(osd_req, which);
- ceph_osd_data_pages_init(osd_data, pages, length, alignment,
+ ceph_osd_data_pages_init(osd_data, pages, length, offset,
pages_from_pool, own_pages);
}
EXPORT_SYMBOL(osd_req_op_raw_data_in_pages);
void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *osd_req,
unsigned int which, struct page **pages,
- u64 length, u32 alignment,
+ u64 length, u32 offset,
bool pages_from_pool, bool own_pages)
{
struct ceph_osd_data *osd_data;
osd_data = osd_req_op_data(osd_req, which, extent, osd_data);
- ceph_osd_data_pages_init(osd_data, pages, length, alignment,
+ ceph_osd_data_pages_init(osd_data, pages, length, offset,
pages_from_pool, own_pages);
}
EXPORT_SYMBOL(osd_req_op_extent_osd_data_pages);
@@ -312,12 +312,12 @@ EXPORT_SYMBOL(osd_req_op_cls_request_data_pagelist);
void osd_req_op_cls_request_data_pages(struct ceph_osd_request *osd_req,
unsigned int which, struct page **pages, u64 length,
- u32 alignment, bool pages_from_pool, bool own_pages)
+ u32 offset, bool pages_from_pool, bool own_pages)
{
struct ceph_osd_data *osd_data;
osd_data = osd_req_op_data(osd_req, which, cls, request_data);
- ceph_osd_data_pages_init(osd_data, pages, length, alignment,
+ ceph_osd_data_pages_init(osd_data, pages, length, offset,
pages_from_pool, own_pages);
osd_req->r_ops[which].cls.indata_len += length;
osd_req->r_ops[which].indata_len += length;
@@ -344,12 +344,12 @@ EXPORT_SYMBOL(osd_req_op_cls_request_data_bvecs);
void osd_req_op_cls_response_data_pages(struct ceph_osd_request *osd_req,
unsigned int which, struct page **pages, u64 length,
- u32 alignment, bool pages_from_pool, bool own_pages)
+ u32 offset, bool pages_from_pool, bool own_pages)
{
struct ceph_osd_data *osd_data;
osd_data = osd_req_op_data(osd_req, which, cls, response_data);
- ceph_osd_data_pages_init(osd_data, pages, length, alignment,
+ ceph_osd_data_pages_init(osd_data, pages, length, offset,
pages_from_pool, own_pages);
}
EXPORT_SYMBOL(osd_req_op_cls_response_data_pages);
@@ -382,7 +382,7 @@ static void ceph_osd_data_release(struct ceph_osd_data *osd_data)
if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES && osd_data->own_pages) {
int num_pages;
- num_pages = calc_pages_for((u64)osd_data->alignment,
+ num_pages = calc_pages_for((u64)osd_data->offset,
(u64)osd_data->length);
ceph_release_page_vector(osd_data->pages, num_pages);
} else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGELIST) {
@@ -969,7 +969,7 @@ static void ceph_osdc_msg_data_add(struct ceph_msg *msg,
BUG_ON(length > (u64) SIZE_MAX);
if (length)
ceph_msg_data_add_pages(msg, osd_data->pages,
- length, osd_data->alignment, false);
+ length, osd_data->offset, false);
} else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGELIST) {
BUG_ON(!length);
ceph_msg_data_add_pagelist(msg, osd_data->pagelist);
* [RFC PATCH 03/35] libceph: Add a new data container type, ceph_databuf
From: David Howells @ 2025-03-13 23:32 UTC
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Add a new ceph data container type, ceph_databuf, that can carry a list of
pages in a bvec[] and uses an iov_iter to describe the data to the next
layer down. The iterator can also be used to refer to other types, such as
ITER_FOLIOQ.
There are two ways of loading the bvec. One way is to allocate a buffer
with space in it and then add data, expanding the space as needed; the
other is to splice in pages, expanding the bvec[] as needed.
This is intended to replace all other types.
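Minimal sketches of the two loading styles (error handling elided; 'page'
stands in for a page the caller already owns):

	/* (a) Preallocate buffer space, then copy/encode data into it. */
	dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_KERNEL);
	ceph_databuf_encode_string(dbuf, name, name_len);

	/* (b) Splice in pages the caller already owns. */
	dbuf = ceph_databuf_req_alloc(4, 0, GFP_KERNEL);
	ceph_databuf_append_page(dbuf, page, 0, PAGE_SIZE);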
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
include/linux/ceph/databuf.h | 131 +++++++++++++++++++++
include/linux/ceph/messenger.h | 6 +-
include/linux/ceph/osd_client.h | 3 +
net/ceph/Makefile | 3 +-
net/ceph/databuf.c | 200 ++++++++++++++++++++++++++++++++
net/ceph/messenger.c | 20 +++-
net/ceph/osd_client.c | 11 +-
7 files changed, 369 insertions(+), 5 deletions(-)
create mode 100644 include/linux/ceph/databuf.h
create mode 100644 net/ceph/databuf.c
diff --git a/include/linux/ceph/databuf.h b/include/linux/ceph/databuf.h
new file mode 100644
index 000000000000..14c7a6449467
--- /dev/null
+++ b/include/linux/ceph/databuf.h
@@ -0,0 +1,131 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __FS_CEPH_DATABUF_H
+#define __FS_CEPH_DATABUF_H
+
+#include <asm/byteorder.h>
+#include <linux/refcount.h>
+#include <linux/blk_types.h>
+
+struct ceph_databuf {
+ struct bio_vec *bvec; /* List of pages */
+ struct bio_vec inline_bvec[1]; /* Inline bvecs for small buffers */
+ struct iov_iter iter; /* Iterator defining occupied data */
+ size_t limit; /* Maximum length before expansion required */
+ size_t nr_bvec; /* Number of bvec[] that have pages */
+ size_t max_bvec; /* Size of bvec[] */
+ refcount_t refcnt;
+ bool put_pages; /* T if pages in bvec[] need to be put*/
+};
+
+struct ceph_databuf *ceph_databuf_alloc(size_t min_bvec, size_t space,
+ unsigned int data_source, gfp_t gfp);
+struct ceph_databuf *ceph_databuf_get(struct ceph_databuf *dbuf);
+void ceph_databuf_release(struct ceph_databuf *dbuf);
+int ceph_databuf_append(struct ceph_databuf *dbuf, const void *d, size_t l);
+int ceph_databuf_reserve(struct ceph_databuf *dbuf, size_t space, gfp_t gfp);
+int ceph_databuf_insert_frag(struct ceph_databuf *dbuf, unsigned int ix,
+ size_t len, gfp_t gfp);
+
+static inline
+struct ceph_databuf *ceph_databuf_req_alloc(size_t min_bvec, size_t space, gfp_t gfp)
+{
+ return ceph_databuf_alloc(min_bvec, space, ITER_SOURCE, gfp);
+}
+
+static inline
+struct ceph_databuf *ceph_databuf_reply_alloc(size_t min_bvec, size_t space, gfp_t gfp)
+{
+ struct ceph_databuf *dbuf;
+
+ dbuf = ceph_databuf_alloc(min_bvec, space, ITER_DEST, gfp);
+ if (dbuf)
+ iov_iter_reexpand(&dbuf->iter, space);
+ return dbuf;
+}
+
+static inline struct page *ceph_databuf_page(struct ceph_databuf *dbuf,
+ unsigned int ix)
+{
+ return dbuf->bvec[ix].bv_page;
+}
+
+#define kmap_ceph_databuf_page(dbuf, ix) \
+	kmap_local_page(ceph_databuf_page(dbuf, ix))
+
+static inline int ceph_databuf_encode_64(struct ceph_databuf *dbuf, u64 v)
+{
+ __le64 ev = cpu_to_le64(v);
+ return ceph_databuf_append(dbuf, &ev, sizeof(ev));
+}
+static inline int ceph_databuf_encode_32(struct ceph_databuf *dbuf, u32 v)
+{
+ __le32 ev = cpu_to_le32(v);
+ return ceph_databuf_append(dbuf, &ev, sizeof(ev));
+}
+static inline int ceph_databuf_encode_16(struct ceph_databuf *dbuf, u16 v)
+{
+ __le16 ev = cpu_to_le16(v);
+ return ceph_databuf_append(dbuf, &ev, sizeof(ev));
+}
+static inline int ceph_databuf_encode_8(struct ceph_databuf *dbuf, u8 v)
+{
+ return ceph_databuf_append(dbuf, &v, 1);
+}
+static inline int ceph_databuf_encode_string(struct ceph_databuf *dbuf,
+ const char *s, u32 len)
+{
+ int ret = ceph_databuf_encode_32(dbuf, len);
+ if (ret)
+ return ret;
+ if (len)
+ return ceph_databuf_append(dbuf, s, len);
+ return 0;
+}
+
+static inline size_t ceph_databuf_len(struct ceph_databuf *dbuf)
+{
+ return dbuf->iter.count;
+}
+
+static inline void ceph_databuf_added_data(struct ceph_databuf *dbuf,
+ size_t len)
+{
+ dbuf->iter.count += len;
+}
+
+static inline void ceph_databuf_reply_ready(struct ceph_databuf *reply,
+ size_t len)
+{
+ reply->iter.data_source = ITER_SOURCE;
+ iov_iter_truncate(&reply->iter, len);
+}
+
+static inline void ceph_databuf_reset_reply(struct ceph_databuf *reply)
+{
+ iov_iter_bvec(&reply->iter, ITER_DEST,
+ reply->bvec, reply->nr_bvec, reply->limit);
+}
+
+static inline void ceph_databuf_append_page(struct ceph_databuf *dbuf,
+ struct page *page,
+ unsigned int offset,
+ unsigned int len)
+{
+ BUG_ON(dbuf->nr_bvec >= dbuf->max_bvec);
+ bvec_set_page(&dbuf->bvec[dbuf->nr_bvec++], page, len, offset);
+ dbuf->iter.count += len;
+ dbuf->iter.nr_segs++;
+}
+
+static inline void *ceph_databuf_enc_start(struct ceph_databuf *dbuf)
+{
+ return page_address(ceph_databuf_page(dbuf, 0)) + dbuf->iter.count;
+}
+
+static inline void ceph_databuf_enc_stop(struct ceph_databuf *dbuf, void *p)
+{
+ dbuf->iter.count = p - page_address(ceph_databuf_page(dbuf, 0));
+ BUG_ON(dbuf->iter.count > dbuf->limit);
+}
+
+#endif /* __FS_CEPH_DATABUF_H */
diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index db2aba32b8a0..864aad369c91 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -117,6 +117,7 @@ struct ceph_messenger {
enum ceph_msg_data_type {
CEPH_MSG_DATA_NONE, /* message contains no data payload */
+ CEPH_MSG_DATA_DATABUF, /* data source/destination is a data buffer */
CEPH_MSG_DATA_PAGES, /* data source/destination is a page array */
CEPH_MSG_DATA_PAGELIST, /* data source/destination is a pagelist */
#ifdef CONFIG_BLOCK
@@ -210,7 +211,10 @@ struct ceph_bvec_iter {
struct ceph_msg_data {
enum ceph_msg_data_type type;
+ struct iov_iter iter;
+ bool release_dbuf;
union {
+ struct ceph_databuf *dbuf;
#ifdef CONFIG_BLOCK
struct {
struct ceph_bio_iter bio_pos;
@@ -225,7 +229,6 @@ struct ceph_msg_data {
bool own_pages;
};
struct ceph_pagelist *pagelist;
- struct iov_iter iter;
};
};
@@ -601,6 +604,7 @@ extern void ceph_con_keepalive(struct ceph_connection *con);
extern bool ceph_con_keepalive_expired(struct ceph_connection *con,
unsigned long interval);
+void ceph_msg_data_add_databuf(struct ceph_msg *msg, struct ceph_databuf *dbuf);
void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
size_t length, size_t offset, bool own_pages);
extern void ceph_msg_data_add_pagelist(struct ceph_msg *msg,
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 8fc84f389aad..b8fb5a71dd57 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -16,6 +16,7 @@
#include <linux/ceph/msgpool.h>
#include <linux/ceph/auth.h>
#include <linux/ceph/pagelist.h>
+#include <linux/ceph/databuf.h>
struct ceph_msg;
struct ceph_snap_context;
@@ -103,6 +104,7 @@ struct ceph_osd {
enum ceph_osd_data_type {
CEPH_OSD_DATA_TYPE_NONE = 0,
+ CEPH_OSD_DATA_TYPE_DATABUF,
CEPH_OSD_DATA_TYPE_PAGES,
CEPH_OSD_DATA_TYPE_PAGELIST,
#ifdef CONFIG_BLOCK
@@ -115,6 +117,7 @@ enum ceph_osd_data_type {
struct ceph_osd_data {
enum ceph_osd_data_type type;
union {
+ struct ceph_databuf *dbuf;
struct {
struct page **pages;
u64 length;
diff --git a/net/ceph/Makefile b/net/ceph/Makefile
index 8802a0c0155d..4b2e0b654e45 100644
--- a/net/ceph/Makefile
+++ b/net/ceph/Makefile
@@ -15,4 +15,5 @@ libceph-y := ceph_common.o messenger.o msgpool.o buffer.o pagelist.o \
auth_x.o \
ceph_strings.o ceph_hash.o \
pagevec.o snapshot.o string_table.o \
- messenger_v1.o messenger_v2.o
+ messenger_v1.o messenger_v2.o \
+ databuf.o
diff --git a/net/ceph/databuf.c b/net/ceph/databuf.c
new file mode 100644
index 000000000000..9d108fff5a4f
--- /dev/null
+++ b/net/ceph/databuf.c
@@ -0,0 +1,200 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Data container
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/export.h>
+#include <linux/gfp.h>
+#include <linux/slab.h>
+#include <linux/uio.h>
+#include <linux/pagemap.h>
+#include <linux/highmem.h>
+#include <linux/ceph/databuf.h>
+
+struct ceph_databuf *ceph_databuf_alloc(size_t min_bvec, size_t space,
+ unsigned int data_source, gfp_t gfp)
+{
+ struct ceph_databuf *dbuf;
+ size_t inl = ARRAY_SIZE(dbuf->inline_bvec);
+
+ dbuf = kzalloc(sizeof(*dbuf), gfp);
+ if (!dbuf)
+ return NULL;
+
+ refcount_set(&dbuf->refcnt, 1);
+
+ if (min_bvec == 0 && space == 0) {
+ /* Do nothing */
+ } else if (min_bvec <= inl && space <= inl * PAGE_SIZE) {
+ dbuf->bvec = dbuf->inline_bvec;
+ dbuf->max_bvec = inl;
+ dbuf->limit = space;
+ } else if (min_bvec) {
+ min_bvec = umax(min_bvec, 16);
+
+ dbuf->bvec = kcalloc(min_bvec, sizeof(struct bio_vec), gfp);
+ if (!dbuf->bvec) {
+ kfree(dbuf);
+ return NULL;
+ }
+
+ dbuf->max_bvec = min_bvec;
+ }
+
+ iov_iter_bvec(&dbuf->iter, data_source, dbuf->bvec, 0, 0);
+
+ if (space) {
+ if (ceph_databuf_reserve(dbuf, space, gfp) < 0) {
+ ceph_databuf_release(dbuf);
+ return NULL;
+ }
+ }
+ return dbuf;
+}
+EXPORT_SYMBOL(ceph_databuf_alloc);
+
+struct ceph_databuf *ceph_databuf_get(struct ceph_databuf *dbuf)
+{
+ if (!dbuf)
+ return NULL;
+ refcount_inc(&dbuf->refcnt);
+ return dbuf;
+}
+EXPORT_SYMBOL(ceph_databuf_get);
+
+void ceph_databuf_release(struct ceph_databuf *dbuf)
+{
+ size_t i;
+
+ if (!dbuf || !refcount_dec_and_test(&dbuf->refcnt))
+ return;
+
+ if (dbuf->put_pages)
+ for (i = 0; i < dbuf->nr_bvec; i++)
+ put_page(dbuf->bvec[i].bv_page);
+ if (dbuf->bvec != dbuf->inline_bvec)
+ kfree(dbuf->bvec);
+ kfree(dbuf);
+}
+EXPORT_SYMBOL(ceph_databuf_release);
+
+/*
+ * Expand the bvec[] in the dbuf.
+ */
+static int ceph_databuf_expand(struct ceph_databuf *dbuf, size_t req_bvec,
+ gfp_t gfp)
+{
+ struct bio_vec *bvec = dbuf->bvec, *old = bvec;
+ size_t size, max_bvec, off = dbuf->iter.bvec - old;
+ size_t inl = ARRAY_SIZE(dbuf->inline_bvec);
+
+ if (req_bvec <= inl) {
+ dbuf->bvec = dbuf->inline_bvec;
+ dbuf->max_bvec = inl;
+ dbuf->iter.bvec = dbuf->inline_bvec + off;
+ return 0;
+ }
+
+ max_bvec = roundup_pow_of_two(req_bvec);
+ size = array_size(max_bvec, sizeof(struct bio_vec));
+
+ if (old == dbuf->inline_bvec) {
+ bvec = kmalloc_array(max_bvec, sizeof(struct bio_vec), gfp);
+ if (!bvec)
+ return -ENOMEM;
+		memcpy(bvec, old, inl * sizeof(struct bio_vec));
+ } else {
+ bvec = krealloc(old, size, gfp);
+ if (!bvec)
+ return -ENOMEM;
+ }
+ dbuf->bvec = bvec;
+ dbuf->max_bvec = max_bvec;
+ dbuf->iter.bvec = bvec + off;
+ return 0;
+}
+
+/* Allocate enough pages for a dbuf to be able to have the given amount of
+ * data appended to it without further allocation.
+ * Returns: 0 on success, -ENOMEM on error.
+ */
+int ceph_databuf_reserve(struct ceph_databuf *dbuf, size_t add_space,
+ gfp_t gfp)
+{
+ struct bio_vec *bvec;
+ size_t i, req_bvec = DIV_ROUND_UP(dbuf->iter.count + add_space, PAGE_SIZE);
+ int ret;
+
+ dbuf->put_pages = true;
+ if (req_bvec > dbuf->max_bvec) {
+ ret = ceph_databuf_expand(dbuf, req_bvec, gfp);
+ if (ret < 0)
+ return ret;
+ }
+
+ bvec = dbuf->bvec;
+ while (dbuf->nr_bvec < req_bvec) {
+ struct page *pages[16];
+		size_t want = min(req_bvec - dbuf->nr_bvec, ARRAY_SIZE(pages)), got;
+
+ memset(pages, 0, sizeof(pages));
+ got = alloc_pages_bulk(gfp, want, pages);
+ if (!got)
+ return -ENOMEM;
+ for (i = 0; i < got; i++)
+ bvec_set_page(&bvec[dbuf->nr_bvec + i], pages[i],
+ PAGE_SIZE, 0);
+ dbuf->iter.nr_segs += got;
+ dbuf->nr_bvec += got;
+ dbuf->limit = dbuf->nr_bvec * PAGE_SIZE;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(ceph_databuf_reserve);
+
+int ceph_databuf_append(struct ceph_databuf *dbuf, const void *buf, size_t len)
+{
+ struct iov_iter temp_iter;
+
+ if (!len)
+ return 0;
+	if (dbuf->limit - dbuf->iter.count < len &&
+ ceph_databuf_reserve(dbuf, len, GFP_NOIO) < 0)
+ return -ENOMEM;
+
+ iov_iter_bvec(&temp_iter, ITER_DEST,
+ dbuf->bvec, dbuf->nr_bvec, dbuf->limit);
+ iov_iter_advance(&temp_iter, dbuf->iter.count);
+
+ if (copy_to_iter(buf, len, &temp_iter) != len)
+ return -EFAULT;
+ dbuf->iter.count += len;
+ return 0;
+}
+EXPORT_SYMBOL(ceph_databuf_append);
+
+/*
+ * Allocate a fragment and insert it into the buffer at the specified index.
+ */
+int ceph_databuf_insert_frag(struct ceph_databuf *dbuf, unsigned int ix,
+ size_t len, gfp_t gfp)
+{
+ struct page *page;
+
+ page = alloc_page(gfp);
+ if (!page)
+ return -ENOMEM;
+
+ bvec_set_page(&dbuf->bvec[ix], page, len, 0);
+
+ if (dbuf->nr_bvec == ix) {
+ dbuf->iter.nr_segs = ix + 1;
+ dbuf->nr_bvec = ix + 1;
+ dbuf->iter.count += len;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(ceph_databuf_insert_frag);
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 1df4291cc80b..802f0b222131 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1872,7 +1872,9 @@ static struct ceph_msg_data *ceph_msg_data_add(struct ceph_msg *msg)
static void ceph_msg_data_destroy(struct ceph_msg_data *data)
{
- if (data->type == CEPH_MSG_DATA_PAGES && data->own_pages) {
+ if (data->type == CEPH_MSG_DATA_DATABUF) {
+ ceph_databuf_release(data->dbuf);
+ } else if (data->type == CEPH_MSG_DATA_PAGES && data->own_pages) {
int num_pages = calc_pages_for(data->offset, data->length);
ceph_release_page_vector(data->pages, num_pages);
} else if (data->type == CEPH_MSG_DATA_PAGELIST) {
@@ -1880,6 +1882,22 @@ static void ceph_msg_data_destroy(struct ceph_msg_data *data)
}
}
+void ceph_msg_data_add_databuf(struct ceph_msg *msg, struct ceph_databuf *dbuf)
+{
+ struct ceph_msg_data *data;
+
+ BUG_ON(!dbuf);
+ BUG_ON(!ceph_databuf_len(dbuf));
+
+ data = ceph_msg_data_add(msg);
+ data->type = CEPH_MSG_DATA_DATABUF;
+ data->dbuf = ceph_databuf_get(dbuf);
+ data->iter = dbuf->iter;
+
+ msg->data_length += ceph_databuf_len(dbuf);
+}
+EXPORT_SYMBOL(ceph_msg_data_add_databuf);
+
void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
size_t length, size_t offset, bool own_pages)
{
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index e359e70ad47e..c84634264377 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -359,6 +359,8 @@ static u64 ceph_osd_data_length(struct ceph_osd_data *osd_data)
switch (osd_data->type) {
case CEPH_OSD_DATA_TYPE_NONE:
return 0;
+ case CEPH_OSD_DATA_TYPE_DATABUF:
+ return ceph_databuf_len(osd_data->dbuf);
case CEPH_OSD_DATA_TYPE_PAGES:
return osd_data->length;
case CEPH_OSD_DATA_TYPE_PAGELIST:
@@ -379,7 +381,9 @@ static u64 ceph_osd_data_length(struct ceph_osd_data *osd_data)
static void ceph_osd_data_release(struct ceph_osd_data *osd_data)
{
- if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES && osd_data->own_pages) {
+ if (osd_data->type == CEPH_OSD_DATA_TYPE_DATABUF) {
+ ceph_databuf_release(osd_data->dbuf);
+ } else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES && osd_data->own_pages) {
int num_pages;
num_pages = calc_pages_for((u64)osd_data->offset,
@@ -965,7 +969,10 @@ static void ceph_osdc_msg_data_add(struct ceph_msg *msg,
{
u64 length = ceph_osd_data_length(osd_data);
- if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES) {
+ if (osd_data->type == CEPH_OSD_DATA_TYPE_DATABUF) {
+ BUG_ON(!length);
+ ceph_msg_data_add_databuf(msg, osd_data->dbuf);
+ } else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES) {
BUG_ON(length > (u64) SIZE_MAX);
if (length)
ceph_msg_data_add_pages(msg, osd_data->pages,
* [RFC PATCH 04/35] ceph: Convert ceph_mds_request::r_pagelist to a databuf
From: David Howells @ 2025-03-13 23:32 UTC
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Convert ceph_mds_request::r_pagelist to a databuf, along with the code
that uses it, such as the setxattr ops.
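The shape of the conversion, taking the setxattr path as an example (error
handling elided):

	/* Before: */
	pagelist = ceph_pagelist_alloc(GFP_NOFS);
	ceph_pagelist_append(pagelist, value, size);
	req->r_pagelist = pagelist;

	/* After: */
	dbuf = ceph_databuf_req_alloc(1, size, GFP_NOFS);
	ceph_databuf_append(dbuf, value, size);
	req->r_dbuf = dbuf;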
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
fs/ceph/acl.c | 39 ++++++++++----------
fs/ceph/file.c | 12 ++++---
fs/ceph/inode.c | 85 +++++++++++++++++++-------------------------
fs/ceph/mds_client.c | 11 +++---
fs/ceph/mds_client.h | 2 +-
fs/ceph/super.h | 2 +-
fs/ceph/xattr.c | 68 +++++++++++++++--------------------
7 files changed, 96 insertions(+), 123 deletions(-)
diff --git a/fs/ceph/acl.c b/fs/ceph/acl.c
index 1564eacc253d..d6da650db83e 100644
--- a/fs/ceph/acl.c
+++ b/fs/ceph/acl.c
@@ -171,7 +171,7 @@ int ceph_pre_init_acls(struct inode *dir, umode_t *mode,
{
struct posix_acl *acl, *default_acl;
size_t val_size1 = 0, val_size2 = 0;
- struct ceph_pagelist *pagelist = NULL;
+ struct ceph_databuf *dbuf = NULL;
void *tmp_buf = NULL;
int err;
@@ -201,58 +201,55 @@ int ceph_pre_init_acls(struct inode *dir, umode_t *mode,
tmp_buf = kmalloc(max(val_size1, val_size2), GFP_KERNEL);
if (!tmp_buf)
goto out_err;
- pagelist = ceph_pagelist_alloc(GFP_KERNEL);
- if (!pagelist)
+ dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_KERNEL);
+ if (!dbuf)
goto out_err;
- err = ceph_pagelist_reserve(pagelist, PAGE_SIZE);
- if (err)
- goto out_err;
-
- ceph_pagelist_encode_32(pagelist, acl && default_acl ? 2 : 1);
+ ceph_databuf_encode_32(dbuf, acl && default_acl ? 2 : 1);
if (acl) {
size_t len = strlen(XATTR_NAME_POSIX_ACL_ACCESS);
- err = ceph_pagelist_reserve(pagelist, len + val_size1 + 8);
+ err = ceph_databuf_reserve(dbuf, len + val_size1 + 8,
+ GFP_KERNEL);
if (err)
goto out_err;
- ceph_pagelist_encode_string(pagelist, XATTR_NAME_POSIX_ACL_ACCESS,
- len);
+ ceph_databuf_encode_string(dbuf, XATTR_NAME_POSIX_ACL_ACCESS,
+ len);
err = posix_acl_to_xattr(&init_user_ns, acl,
tmp_buf, val_size1);
if (err < 0)
goto out_err;
- ceph_pagelist_encode_32(pagelist, val_size1);
- ceph_pagelist_append(pagelist, tmp_buf, val_size1);
+ ceph_databuf_encode_32(dbuf, val_size1);
+ ceph_databuf_append(dbuf, tmp_buf, val_size1);
}
if (default_acl) {
size_t len = strlen(XATTR_NAME_POSIX_ACL_DEFAULT);
- err = ceph_pagelist_reserve(pagelist, len + val_size2 + 8);
+ err = ceph_databuf_reserve(dbuf, len + val_size2 + 8,
+ GFP_KERNEL);
if (err)
goto out_err;
- ceph_pagelist_encode_string(pagelist,
- XATTR_NAME_POSIX_ACL_DEFAULT, len);
+ ceph_databuf_encode_string(dbuf,
+ XATTR_NAME_POSIX_ACL_DEFAULT, len);
err = posix_acl_to_xattr(&init_user_ns, default_acl,
tmp_buf, val_size2);
if (err < 0)
goto out_err;
- ceph_pagelist_encode_32(pagelist, val_size2);
- ceph_pagelist_append(pagelist, tmp_buf, val_size2);
+ ceph_databuf_encode_32(dbuf, val_size2);
+ ceph_databuf_append(dbuf, tmp_buf, val_size2);
}
kfree(tmp_buf);
as_ctx->acl = acl;
as_ctx->default_acl = default_acl;
- as_ctx->pagelist = pagelist;
+ as_ctx->dbuf = dbuf;
return 0;
out_err:
posix_acl_release(acl);
posix_acl_release(default_acl);
kfree(tmp_buf);
- if (pagelist)
- ceph_pagelist_release(pagelist);
+ ceph_databuf_release(dbuf);
return err;
}
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 851d70200c6b..9de2960748b9 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -679,9 +679,9 @@ static int ceph_finish_async_create(struct inode *dir, struct inode *inode,
iinfo.change_attr = 1;
ceph_encode_timespec64(&iinfo.btime, &now);
- if (req->r_pagelist) {
- iinfo.xattr_len = req->r_pagelist->length;
- iinfo.xattr_data = req->r_pagelist->mapped_tail;
+ if (req->r_dbuf) {
+ iinfo.xattr_len = ceph_databuf_len(req->r_dbuf);
+ iinfo.xattr_data = kmap_ceph_databuf_page(req->r_dbuf, 0);
} else {
/* fake it */
iinfo.xattr_len = ARRAY_SIZE(xattr_buf);
@@ -731,6 +731,8 @@ static int ceph_finish_async_create(struct inode *dir, struct inode *inode,
ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
req->r_fmode, NULL);
up_read(&mdsc->snap_rwsem);
+ if (req->r_dbuf)
+ kunmap_local(iinfo.xattr_data);
if (ret) {
doutc(cl, "failed to fill inode: %d\n", ret);
ceph_dir_clear_complete(dir);
@@ -849,8 +851,8 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
goto out_ctx;
}
/* Async create can't handle more than a page of xattrs */
- if (as_ctx.pagelist &&
- !list_is_singular(&as_ctx.pagelist->head))
+ if (as_ctx.dbuf &&
+ as_ctx.dbuf->nr_bvec > 1)
try_async = false;
} else if (!d_in_lookup(dentry)) {
/* If it's not being looked up, it's negative */
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index b060f765ad20..ec9b80fec7be 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -112,9 +112,9 @@ struct inode *ceph_new_inode(struct inode *dir, struct dentry *dentry,
void ceph_as_ctx_to_req(struct ceph_mds_request *req,
struct ceph_acl_sec_ctx *as_ctx)
{
- if (as_ctx->pagelist) {
- req->r_pagelist = as_ctx->pagelist;
- as_ctx->pagelist = NULL;
+ if (as_ctx->dbuf) {
+ req->r_dbuf = as_ctx->dbuf;
+ as_ctx->dbuf = NULL;
}
ceph_fscrypt_as_ctx_to_req(req, as_ctx);
}
@@ -2341,11 +2341,10 @@ static int fill_fscrypt_truncate(struct inode *inode,
loff_t pos, orig_pos = round_down(attr->ia_size,
CEPH_FSCRYPT_BLOCK_SIZE);
u64 block = orig_pos >> CEPH_FSCRYPT_BLOCK_SHIFT;
- struct ceph_pagelist *pagelist = NULL;
- struct kvec iov = {0};
+ struct ceph_databuf *dbuf = NULL;
struct iov_iter iter;
- struct page *page = NULL;
- struct ceph_fscrypt_truncate_size_header header;
+ struct ceph_fscrypt_truncate_size_header *header;
+ void *p;
int retry_op = 0;
int len = CEPH_FSCRYPT_BLOCK_SIZE;
loff_t i_size = i_size_read(inode);
@@ -2372,37 +2371,35 @@ static int fill_fscrypt_truncate(struct inode *inode,
goto out;
}
- page = __page_cache_alloc(GFP_KERNEL);
- if (page == NULL) {
- ret = -ENOMEM;
+ ret = -ENOMEM;
+ dbuf = ceph_databuf_req_alloc(2, 0, GFP_KERNEL);
+ if (!dbuf)
goto out;
- }
- pagelist = ceph_pagelist_alloc(GFP_KERNEL);
- if (!pagelist) {
- ret = -ENOMEM;
+ if (ceph_databuf_insert_frag(dbuf, 0, sizeof(*header), GFP_KERNEL) < 0)
+ goto out;
+ if (ceph_databuf_insert_frag(dbuf, 1, PAGE_SIZE, GFP_KERNEL) < 0)
goto out;
- }
- iov.iov_base = kmap_local_page(page);
- iov.iov_len = len;
- iov_iter_kvec(&iter, READ, &iov, 1, len);
+ iov_iter_bvec(&iter, ITER_DEST, &dbuf->bvec[1], 1, len);
pos = orig_pos;
ret = __ceph_sync_read(inode, &pos, &iter, &retry_op, &objver);
if (ret < 0)
goto out;
+ header = kmap_ceph_databuf_page(dbuf, 0);
+
/* Insert the header first */
- header.ver = 1;
- header.compat = 1;
- header.change_attr = cpu_to_le64(inode_peek_iversion_raw(inode));
+ header->ver = 1;
+ header->compat = 1;
+ header->change_attr = cpu_to_le64(inode_peek_iversion_raw(inode));
/*
* Always set the block_size to CEPH_FSCRYPT_BLOCK_SIZE,
* because in MDS it may need this to do the truncate.
*/
- header.block_size = cpu_to_le32(CEPH_FSCRYPT_BLOCK_SIZE);
+ header->block_size = cpu_to_le32(CEPH_FSCRYPT_BLOCK_SIZE);
/*
* If we hit a hole here, we should just skip filling
@@ -2417,51 +2414,41 @@ static int fill_fscrypt_truncate(struct inode *inode,
if (!objver) {
doutc(cl, "hit hole, ppos %lld < size %lld\n", pos, i_size);
- header.data_len = cpu_to_le32(8 + 8 + 4);
- header.file_offset = 0;
+ header->data_len = cpu_to_le32(8 + 8 + 4);
+ header->file_offset = 0;
ret = 0;
} else {
- header.data_len = cpu_to_le32(8 + 8 + 4 + CEPH_FSCRYPT_BLOCK_SIZE);
- header.file_offset = cpu_to_le64(orig_pos);
+ header->data_len = cpu_to_le32(8 + 8 + 4 + CEPH_FSCRYPT_BLOCK_SIZE);
+ header->file_offset = cpu_to_le64(orig_pos);
doutc(cl, "encrypt block boff/bsize %d/%lu\n", boff,
CEPH_FSCRYPT_BLOCK_SIZE);
/* truncate and zero out the extra contents for the last block */
- memset(iov.iov_base + boff, 0, PAGE_SIZE - boff);
+ p = kmap_ceph_databuf_page(dbuf, 1);
+ memset(p + boff, 0, PAGE_SIZE - boff);
+ kunmap_local(p);
/* encrypt the last block */
- ret = ceph_fscrypt_encrypt_block_inplace(inode, page,
- CEPH_FSCRYPT_BLOCK_SIZE,
- 0, block,
- GFP_KERNEL);
+ ret = ceph_fscrypt_encrypt_block_inplace(
+ inode, ceph_databuf_page(dbuf, 1),
+ CEPH_FSCRYPT_BLOCK_SIZE, 0, block, GFP_KERNEL);
if (ret)
goto out;
}
- /* Insert the header */
- ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
- if (ret)
- goto out;
+ ceph_databuf_added_data(dbuf, sizeof(*header));
+ if (header->block_size)
+ ceph_databuf_added_data(dbuf, CEPH_FSCRYPT_BLOCK_SIZE);
- if (header.block_size) {
- /* Append the last block contents to pagelist */
- ret = ceph_pagelist_append(pagelist, iov.iov_base,
- CEPH_FSCRYPT_BLOCK_SIZE);
- if (ret)
- goto out;
- }
- req->r_pagelist = pagelist;
+ req->r_dbuf = dbuf;
out:
doutc(cl, "%p %llx.%llx size dropping cap refs on %s\n", inode,
ceph_vinop(inode), ceph_cap_string(got));
ceph_put_cap_refs(ci, got);
- if (iov.iov_base)
- kunmap_local(iov.iov_base);
- if (page)
- __free_pages(page, 0);
- if (ret && pagelist)
- ceph_pagelist_release(pagelist);
+ kunmap_local(header);
+ if (ret)
+ ceph_databuf_release(dbuf);
return ret;
}
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 230e0c3f341f..09661a34f287 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1125,8 +1125,7 @@ void ceph_mdsc_release_request(struct kref *kref)
put_cred(req->r_cred);
if (req->r_mnt_idmap)
mnt_idmap_put(req->r_mnt_idmap);
- if (req->r_pagelist)
- ceph_pagelist_release(req->r_pagelist);
+ ceph_databuf_release(req->r_dbuf);
kfree(req->r_fscrypt_auth);
kfree(req->r_altname);
put_request_session(req);
@@ -3207,10 +3206,10 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
msg->front.iov_len = p - msg->front.iov_base;
msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);
- if (req->r_pagelist) {
- struct ceph_pagelist *pagelist = req->r_pagelist;
- ceph_msg_data_add_pagelist(msg, pagelist);
- msg->hdr.data_len = cpu_to_le32(pagelist->length);
+ if (req->r_dbuf) {
+ struct ceph_databuf *dbuf = req->r_dbuf;
+ ceph_msg_data_add_databuf(msg, dbuf);
+ msg->hdr.data_len = cpu_to_le32(ceph_databuf_len(dbuf));
} else {
msg->hdr.data_len = 0;
}
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 3e2a6fa7c19a..a7ee8da07ce7 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -333,7 +333,7 @@ struct ceph_mds_request {
u32 r_direct_hash; /* choose dir frag based on this dentry hash */
/* data payload is used for xattr ops */
- struct ceph_pagelist *r_pagelist;
+ struct ceph_databuf *r_dbuf;
/* what caps shall we drop? */
int r_inode_drop, r_inode_unless;
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index bb0db0cc8003..984a6d2a5378 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1137,7 +1137,7 @@ struct ceph_acl_sec_ctx {
#ifdef CONFIG_FS_ENCRYPTION
struct ceph_fscrypt_auth *fscrypt_auth;
#endif
- struct ceph_pagelist *pagelist;
+ struct ceph_databuf *dbuf;
};
#ifdef CONFIG_SECURITY
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 537165db4519..b083cd3b3974 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -1114,17 +1114,17 @@ static int ceph_sync_setxattr(struct inode *inode, const char *name,
struct ceph_mds_request *req;
struct ceph_mds_client *mdsc = fsc->mdsc;
struct ceph_osd_client *osdc = &fsc->client->osdc;
- struct ceph_pagelist *pagelist = NULL;
+ struct ceph_databuf *dbuf = NULL;
int op = CEPH_MDS_OP_SETXATTR;
int err;
if (size > 0) {
- /* copy value into pagelist */
- pagelist = ceph_pagelist_alloc(GFP_NOFS);
- if (!pagelist)
+ /* copy value into dbuf */
+ dbuf = ceph_databuf_req_alloc(1, size, GFP_NOFS);
+ if (!dbuf)
return -ENOMEM;
- err = ceph_pagelist_append(pagelist, value, size);
+ err = ceph_databuf_append(dbuf, value, size);
if (err)
goto out;
} else if (!value) {
@@ -1154,8 +1154,8 @@ static int ceph_sync_setxattr(struct inode *inode, const char *name,
req->r_args.setxattr.flags = cpu_to_le32(flags);
req->r_args.setxattr.osdmap_epoch =
cpu_to_le32(osdc->osdmap->epoch);
- req->r_pagelist = pagelist;
- pagelist = NULL;
+ req->r_dbuf = dbuf;
+ dbuf = NULL;
}
req->r_inode = inode;
@@ -1169,8 +1169,7 @@ static int ceph_sync_setxattr(struct inode *inode, const char *name,
doutc(cl, "xattr.ver (after): %lld\n", ci->i_xattrs.version);
out:
- if (pagelist)
- ceph_pagelist_release(pagelist);
+ ceph_databuf_release(dbuf);
return err;
}
@@ -1377,7 +1376,7 @@ bool ceph_security_xattr_deadlock(struct inode *in)
int ceph_security_init_secctx(struct dentry *dentry, umode_t mode,
struct ceph_acl_sec_ctx *as_ctx)
{
- struct ceph_pagelist *pagelist = as_ctx->pagelist;
+ struct ceph_databuf *dbuf = as_ctx->dbuf;
const char *name;
size_t name_len;
int err;
@@ -1391,14 +1390,11 @@ int ceph_security_init_secctx(struct dentry *dentry, umode_t mode,
}
err = -ENOMEM;
- if (!pagelist) {
- pagelist = ceph_pagelist_alloc(GFP_KERNEL);
- if (!pagelist)
+ if (!dbuf) {
+ dbuf = ceph_databuf_req_alloc(0, PAGE_SIZE, GFP_KERNEL);
+ if (!dbuf)
goto out;
- err = ceph_pagelist_reserve(pagelist, PAGE_SIZE);
- if (err)
- goto out;
- ceph_pagelist_encode_32(pagelist, 1);
+ ceph_databuf_encode_32(dbuf, 1);
}
/*
@@ -1407,38 +1403,31 @@ int ceph_security_init_secctx(struct dentry *dentry, umode_t mode,
* dentry_init_security hook.
*/
name_len = strlen(name);
- err = ceph_pagelist_reserve(pagelist,
- 4 * 2 + name_len + as_ctx->lsmctx.len);
+ err = ceph_databuf_reserve(dbuf, 4 * 2 + name_len + as_ctx->lsmctx.len,
+ GFP_KERNEL);
if (err)
goto out;
- if (as_ctx->pagelist) {
+ if (as_ctx->dbuf) {
/* update count of KV pairs */
- BUG_ON(pagelist->length <= sizeof(__le32));
- if (list_is_singular(&pagelist->head)) {
- le32_add_cpu((__le32*)pagelist->mapped_tail, 1);
- } else {
- struct page *page = list_first_entry(&pagelist->head,
- struct page, lru);
- void *addr = kmap_atomic(page);
- le32_add_cpu((__le32*)addr, 1);
- kunmap_atomic(addr);
- }
+ BUG_ON(ceph_databuf_len(dbuf) <= sizeof(__le32));
+ __le32 *addr = kmap_ceph_databuf_page(dbuf, 0);
+ le32_add_cpu(addr, 1);
+ kunmap_local(addr);
} else {
- as_ctx->pagelist = pagelist;
+ as_ctx->dbuf = dbuf;
}
- ceph_pagelist_encode_32(pagelist, name_len);
- ceph_pagelist_append(pagelist, name, name_len);
+ ceph_databuf_encode_32(dbuf, name_len);
+ ceph_databuf_append(dbuf, name, name_len);
- ceph_pagelist_encode_32(pagelist, as_ctx->lsmctx.len);
- ceph_pagelist_append(pagelist, as_ctx->lsmctx.context,
- as_ctx->lsmctx.len);
+ ceph_databuf_encode_32(dbuf, as_ctx->lsmctx.len);
+ ceph_databuf_append(dbuf, as_ctx->lsmctx.context, as_ctx->lsmctx.len);
err = 0;
out:
- if (pagelist && !as_ctx->pagelist)
- ceph_pagelist_release(pagelist);
+ if (dbuf && !as_ctx->dbuf)
+ ceph_databuf_release(dbuf);
return err;
}
#endif /* CONFIG_CEPH_FS_SECURITY_LABEL */
@@ -1456,8 +1445,7 @@ void ceph_release_acl_sec_ctx(struct ceph_acl_sec_ctx *as_ctx)
#ifdef CONFIG_FS_ENCRYPTION
kfree(as_ctx->fscrypt_auth);
#endif
- if (as_ctx->pagelist)
- ceph_pagelist_release(as_ctx->pagelist);
+ ceph_databuf_release(as_ctx->dbuf);
}
/*
* [RFC PATCH 05/35] libceph: Add functions to add ceph_databufs to requests
From: David Howells @ 2025-03-13 23:32 UTC
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Add some helper functions to add ceph_databufs to ceph_osd_data structs
attached to ceph_osd_request structs.
The osd_data->iter is moved out of the union so that it can be set at the
same time as osd_data->dbuf. Eventually, the I/O routines will only look
at ->iter; ->dbuf will be used as a pin that gets released at the end of
the I/O.
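As a minimal sketch, attaching a reply databuf to a read extent op looks
like this (note that the request consumes the caller's ref on the databuf
here, per ceph_osd_databuf_init()):

	dbuf = ceph_databuf_reply_alloc(1, len, GFP_KERNEL);
	if (!dbuf)
		return -ENOMEM;
	osd_req_op_extent_osd_databuf(req, 0, dbuf);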
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
include/linux/ceph/osd_client.h | 11 +++++++-
net/ceph/messenger.c | 3 ++
net/ceph/osd_client.c | 50 +++++++++++++++++++++++++++++++++
3 files changed, 63 insertions(+), 1 deletion(-)
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index b8fb5a71dd57..172ee515a0f3 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -116,6 +116,7 @@ enum ceph_osd_data_type {
struct ceph_osd_data {
enum ceph_osd_data_type type;
+ struct iov_iter iter;
union {
struct ceph_databuf *dbuf;
struct {
@@ -136,7 +137,6 @@ struct ceph_osd_data {
struct ceph_bvec_iter bvec_pos;
u32 num_bvecs;
};
- struct iov_iter iter;
};
};
@@ -488,6 +488,9 @@ extern struct ceph_osd_data *osd_req_op_extent_osd_data(
struct ceph_osd_request *osd_req,
unsigned int which);
+void osd_req_op_extent_osd_databuf(struct ceph_osd_request *req,
+ unsigned int which,
+ struct ceph_databuf *dbuf);
extern void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *,
unsigned int which,
struct page **pages, u64 length,
@@ -512,6 +515,9 @@ void osd_req_op_extent_osd_data_bvec_pos(struct ceph_osd_request *osd_req,
void osd_req_op_extent_osd_iter(struct ceph_osd_request *osd_req,
unsigned int which, struct iov_iter *iter);
+void osd_req_op_cls_request_databuf(struct ceph_osd_request *req,
+ unsigned int which,
+ struct ceph_databuf *dbuf);
extern void osd_req_op_cls_request_data_pagelist(struct ceph_osd_request *,
unsigned int which,
struct ceph_pagelist *pagelist);
@@ -524,6 +530,9 @@ void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req,
unsigned int which,
struct bio_vec *bvecs, u32 num_bvecs,
u32 bytes);
+void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req,
+ unsigned int which,
+ struct ceph_databuf *dbuf);
extern void osd_req_op_cls_response_data_pages(struct ceph_osd_request *,
unsigned int which,
struct page **pages, u64 length,
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 802f0b222131..02439b38ec94 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1052,6 +1052,7 @@ static void __ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor)
case CEPH_MSG_DATA_BVECS:
ceph_msg_data_bvecs_cursor_init(cursor, length);
break;
+ case CEPH_MSG_DATA_DATABUF:
case CEPH_MSG_DATA_ITER:
ceph_msg_data_iter_cursor_init(cursor, length);
break;
@@ -1102,6 +1103,7 @@ struct page *ceph_msg_data_next(struct ceph_msg_data_cursor *cursor,
case CEPH_MSG_DATA_BVECS:
page = ceph_msg_data_bvecs_next(cursor, page_offset, length);
break;
+ case CEPH_MSG_DATA_DATABUF:
case CEPH_MSG_DATA_ITER:
page = ceph_msg_data_iter_next(cursor, page_offset, length);
break;
@@ -1143,6 +1145,7 @@ void ceph_msg_data_advance(struct ceph_msg_data_cursor *cursor, size_t bytes)
case CEPH_MSG_DATA_BVECS:
new_piece = ceph_msg_data_bvecs_advance(cursor, bytes);
break;
+ case CEPH_MSG_DATA_DATABUF:
case CEPH_MSG_DATA_ITER:
new_piece = ceph_msg_data_iter_advance(cursor, bytes);
break;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index c84634264377..720d8a605fc4 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -178,6 +178,17 @@ static void ceph_osd_iter_init(struct ceph_osd_data *osd_data,
osd_data->iter = *iter;
}
+/*
+ * Consumes a ref on @dbuf.
+ */
+static void ceph_osd_databuf_init(struct ceph_osd_data *osd_data,
+ struct ceph_databuf *dbuf)
+{
+ osd_data->type = CEPH_OSD_DATA_TYPE_DATABUF;
+ osd_data->dbuf = dbuf;
+ osd_data->iter = dbuf->iter;
+}
+
static struct ceph_osd_data *
osd_req_op_raw_data_in(struct ceph_osd_request *osd_req, unsigned int which)
{
@@ -207,6 +218,17 @@ void osd_req_op_raw_data_in_pages(struct ceph_osd_request *osd_req,
}
EXPORT_SYMBOL(osd_req_op_raw_data_in_pages);
+void osd_req_op_extent_osd_databuf(struct ceph_osd_request *osd_req,
+ unsigned int which,
+ struct ceph_databuf *dbuf)
+{
+ struct ceph_osd_data *osd_data;
+
+ osd_data = osd_req_op_data(osd_req, which, extent, osd_data);
+ ceph_osd_databuf_init(osd_data, dbuf);
+}
+EXPORT_SYMBOL(osd_req_op_extent_osd_databuf);
+
void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *osd_req,
unsigned int which, struct page **pages,
u64 length, u32 offset,
@@ -297,6 +319,21 @@ static void osd_req_op_cls_request_info_pagelist(
ceph_osd_data_pagelist_init(osd_data, pagelist);
}
+void osd_req_op_cls_request_databuf(struct ceph_osd_request *osd_req,
+ unsigned int which,
+ struct ceph_databuf *dbuf)
+{
+ struct ceph_osd_data *osd_data;
+
+ BUG_ON(!ceph_databuf_len(dbuf));
+
+ osd_data = osd_req_op_data(osd_req, which, cls, request_data);
+ ceph_osd_databuf_init(osd_data, dbuf);
+ osd_req->r_ops[which].cls.indata_len += ceph_databuf_len(dbuf);
+ osd_req->r_ops[which].indata_len += ceph_databuf_len(dbuf);
+}
+EXPORT_SYMBOL(osd_req_op_cls_request_databuf);
+
void osd_req_op_cls_request_data_pagelist(
struct ceph_osd_request *osd_req,
unsigned int which, struct ceph_pagelist *pagelist)
@@ -342,6 +379,19 @@ void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req,
}
EXPORT_SYMBOL(osd_req_op_cls_request_data_bvecs);
+void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req,
+ unsigned int which,
+ struct ceph_databuf *dbuf)
+{
+ struct ceph_osd_data *osd_data;
+
+ BUG_ON(!ceph_databuf_len(dbuf));
+
+ osd_data = osd_req_op_data(osd_req, which, cls, response_data);
+ ceph_osd_databuf_init(osd_data, ceph_databuf_get(dbuf));
+}
+EXPORT_SYMBOL(osd_req_op_cls_response_databuf);
+
void osd_req_op_cls_response_data_pages(struct ceph_osd_request *osd_req,
unsigned int which, struct page **pages, u64 length,
u32 offset, bool pages_from_pool, bool own_pages)
* [RFC PATCH 06/35] rbd: Use ceph_databuf for rbd_obj_read_sync()
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (4 preceding siblings ...)
2025-03-13 23:32 ` [RFC PATCH 05/35] libceph: Add functions to add ceph_databufs to requests David Howells
@ 2025-03-13 23:32 ` David Howells
2025-03-17 19:08 ` Viacheslav Dubeyko
2025-04-11 13:48 ` David Howells
2025-03-13 23:32 ` [RFC PATCH 07/35] libceph: Change ceph_osdc_call()'s reply to a ceph_databuf David Howells
` (28 subsequent siblings)
34 siblings, 2 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:32 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Make rbd_obj_read_sync() allocate and use a ceph_databuf object to convey
the data into the operation. The buffer has some space preallocated; that
space is allocated with alloc_pages() and accessed with kmap_local rather
than being kmalloc'd, which allows MSG_SPLICE_PAGES to be used.
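As a rough caller-side sketch (using the helpers this series introduces; the
exact signatures are as used in the diff below), the buffer lifecycle this
conversion relies on is:

	struct rbd_image_header_ondisk *ondisk;
	struct ceph_databuf *dbuf;

	/* One page-backed fragment, sizeof(*ondisk) bytes preallocated. */
	dbuf = ceph_databuf_req_alloc(1, sizeof(*ondisk), GFP_KERNEL);
	if (!dbuf)
		return -ENOMEM;

	ondisk = kmap_ceph_databuf_page(dbuf, 0); /* kmap_local_page() inside */
	/* ... read into dbuf, decode through ondisk ... */
	kunmap_local(ondisk);
	ceph_databuf_release(dbuf);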
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
drivers/block/rbd.c | 45 ++++++++++++++++++++-------------------------
1 file changed, 20 insertions(+), 25 deletions(-)
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index faafd7ff43d6..bb953634c7cb 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -4822,13 +4822,10 @@ static void rbd_free_disk(struct rbd_device *rbd_dev)
static int rbd_obj_read_sync(struct rbd_device *rbd_dev,
struct ceph_object_id *oid,
struct ceph_object_locator *oloc,
- void *buf, int buf_len)
-
+ struct ceph_databuf *dbuf, int len)
{
struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
struct ceph_osd_request *req;
- struct page **pages;
- int num_pages = calc_pages_for(0, buf_len);
int ret;
req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_KERNEL);
@@ -4839,15 +4836,8 @@ static int rbd_obj_read_sync(struct rbd_device *rbd_dev,
ceph_oloc_copy(&req->r_base_oloc, oloc);
req->r_flags = CEPH_OSD_FLAG_READ;
- pages = ceph_alloc_page_vector(num_pages, GFP_KERNEL);
- if (IS_ERR(pages)) {
- ret = PTR_ERR(pages);
- goto out_req;
- }
-
- osd_req_op_extent_init(req, 0, CEPH_OSD_OP_READ, 0, buf_len, 0, 0);
- osd_req_op_extent_osd_data_pages(req, 0, pages, buf_len, 0, false,
- true);
+ osd_req_op_extent_init(req, 0, CEPH_OSD_OP_READ, 0, len, 0, 0);
+ osd_req_op_extent_osd_databuf(req, 0, dbuf);
ret = ceph_osdc_alloc_messages(req, GFP_KERNEL);
if (ret)
@@ -4855,9 +4845,6 @@ static int rbd_obj_read_sync(struct rbd_device *rbd_dev,
ceph_osdc_start_request(osdc, req);
ret = ceph_osdc_wait_request(osdc, req);
- if (ret >= 0)
- ceph_copy_from_page_vector(pages, buf, 0, ret);
-
out_req:
ceph_osdc_put_request(req);
return ret;
@@ -4872,12 +4859,18 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev,
struct rbd_image_header *header,
bool first_time)
{
- struct rbd_image_header_ondisk *ondisk = NULL;
+ struct rbd_image_header_ondisk *ondisk;
+ struct ceph_databuf *dbuf = NULL;
u32 snap_count = 0;
u64 names_size = 0;
u32 want_count;
int ret;
+ dbuf = ceph_databuf_req_alloc(1, sizeof(*ondisk), GFP_KERNEL);
+ if (!dbuf)
+ return -ENOMEM;
+ ondisk = kmap_ceph_databuf_page(dbuf, 0);
+
/*
* The complete header will include an array of its 64-bit
* snapshot ids, followed by the names of those snapshots as
@@ -4888,17 +4881,18 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev,
do {
size_t size;
- kfree(ondisk);
-
size = sizeof (*ondisk);
size += snap_count * sizeof (struct rbd_image_snap_ondisk);
size += names_size;
- ondisk = kmalloc(size, GFP_KERNEL);
- if (!ondisk)
- return -ENOMEM;
+
+ ret = -ENOMEM;
+ if (size > dbuf->limit &&
+ ceph_databuf_reserve(dbuf, size - dbuf->limit,
+ GFP_KERNEL) < 0)
+ goto out;
ret = rbd_obj_read_sync(rbd_dev, &rbd_dev->header_oid,
- &rbd_dev->header_oloc, ondisk, size);
+ &rbd_dev->header_oloc, dbuf, size);
if (ret < 0)
goto out;
if ((size_t)ret < size) {
@@ -4907,6 +4901,7 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev,
size, ret);
goto out;
}
+
if (!rbd_dev_ondisk_valid(ondisk)) {
ret = -ENXIO;
rbd_warn(rbd_dev, "invalid header");
@@ -4920,8 +4915,8 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev,
ret = rbd_header_from_disk(header, ondisk, first_time);
out:
- kfree(ondisk);
-
+ kunmap_local(ondisk);
+ ceph_databuf_release(dbuf);
return ret;
}
* [RFC PATCH 07/35] libceph: Change ceph_osdc_call()'s reply to a ceph_databuf
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (5 preceding siblings ...)
2025-03-13 23:32 ` [RFC PATCH 06/35] rbd: Use ceph_databuf for rbd_obj_read_sync() David Howells
@ 2025-03-13 23:32 ` David Howells
2025-03-17 19:41 ` Viacheslav Dubeyko
2025-03-17 22:12 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 08/35] libceph: Unexport osd_req_op_cls_request_data_pages() David Howells
` (27 subsequent siblings)
34 siblings, 2 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:32 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Change the type of ceph_osdc_call()'s reply buffer to a ceph_databuf struct
rather than a list of pages, and access it with kmap_local rather than
page_address().
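A condensed sketch of the new calling convention, mirroring the converted
call sites in the diff below (the caller sizes the reply buffer up front;
the length actually received is read back with ceph_databuf_len()):

	struct ceph_databuf *reply;
	void *p, *end;

	reply = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_NOIO);
	if (!reply)
		return -ENOMEM;

	ret = ceph_osdc_call(osdc, oid, oloc, "lock", "get_info",
			     CEPH_OSD_FLAG_READ, get_info_op_page,
			     get_info_op_buf_size, reply);
	if (ret >= 0) {
		p = kmap_ceph_databuf_page(reply, 0);
		end = p + ceph_databuf_len(reply);
		/* ... decode [p, end) ... */
		kunmap_local(p);
	}
	ceph_databuf_release(reply);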
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
drivers/block/rbd.c | 135 ++++++++++++++++++--------------
include/linux/ceph/osd_client.h | 2 +-
net/ceph/cls_lock_client.c | 41 +++++-----
net/ceph/osd_client.c | 16 ++--
4 files changed, 109 insertions(+), 85 deletions(-)
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index bb953634c7cb..073e80d2d966 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1826,9 +1826,8 @@ static int __rbd_object_map_load(struct rbd_device *rbd_dev)
{
struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
CEPH_DEFINE_OID_ONSTACK(oid);
- struct page **pages;
- void *p, *end;
- size_t reply_len;
+ struct ceph_databuf *reply;
+ void *p, *q, *end;
u64 num_objects;
u64 object_map_bytes;
u64 object_map_size;
@@ -1842,48 +1841,57 @@ static int __rbd_object_map_load(struct rbd_device *rbd_dev)
object_map_bytes = DIV_ROUND_UP_ULL(num_objects * BITS_PER_OBJ,
BITS_PER_BYTE);
num_pages = calc_pages_for(0, object_map_bytes) + 1;
- pages = ceph_alloc_page_vector(num_pages, GFP_KERNEL);
- if (IS_ERR(pages))
- return PTR_ERR(pages);
- reply_len = num_pages * PAGE_SIZE;
+ reply = ceph_databuf_reply_alloc(num_pages, num_pages * PAGE_SIZE,
+ GFP_KERNEL);
+ if (!reply)
+ return -ENOMEM;
+
rbd_object_map_name(rbd_dev, rbd_dev->spec->snap_id, &oid);
ret = ceph_osdc_call(osdc, &oid, &rbd_dev->header_oloc,
"rbd", "object_map_load", CEPH_OSD_FLAG_READ,
- NULL, 0, pages, &reply_len);
+ NULL, 0, reply);
if (ret)
goto out;
- p = page_address(pages[0]);
- end = p + min(reply_len, (size_t)PAGE_SIZE);
- ret = decode_object_map_header(&p, end, &object_map_size);
+ p = kmap_ceph_databuf_page(reply, 0);
+ end = p + min(ceph_databuf_len(reply), (size_t)PAGE_SIZE);
+ q = p;
+ ret = decode_object_map_header(&q, end, &object_map_size);
if (ret)
- goto out;
+ goto out_unmap;
if (object_map_size != num_objects) {
rbd_warn(rbd_dev, "object map size mismatch: %llu vs %llu",
object_map_size, num_objects);
ret = -EINVAL;
- goto out;
+ goto out_unmap;
}
+ iov_iter_advance(&reply->iter, q - p);
- if (offset_in_page(p) + object_map_bytes > reply_len) {
+ if (object_map_bytes > ceph_databuf_len(reply)) {
ret = -EINVAL;
- goto out;
+ goto out_unmap;
}
rbd_dev->object_map = kvmalloc(object_map_bytes, GFP_KERNEL);
if (!rbd_dev->object_map) {
ret = -ENOMEM;
- goto out;
+ goto out_unmap;
}
rbd_dev->object_map_size = object_map_size;
- ceph_copy_from_page_vector(pages, rbd_dev->object_map,
- offset_in_page(p), object_map_bytes);
+ ret = -EIO;
+ if (copy_from_iter(rbd_dev->object_map, object_map_bytes,
+ &reply->iter) != object_map_bytes)
+ goto out_unmap;
+
+ ret = 0;
+out_unmap:
+ kunmap_local(p);
out:
- ceph_release_page_vector(pages, num_pages);
+ ceph_databuf_release(reply);
return ret;
}
@@ -1952,6 +1960,7 @@ static int rbd_object_map_update_finish(struct rbd_obj_request *obj_req,
{
struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev;
struct ceph_osd_data *osd_data;
+ struct ceph_databuf *dbuf;
u64 objno;
u8 state, new_state, current_state;
bool has_current_state;
@@ -1971,9 +1980,10 @@ static int rbd_object_map_update_finish(struct rbd_obj_request *obj_req,
*/
rbd_assert(osd_req->r_num_ops == 2);
osd_data = osd_req_op_data(osd_req, 1, cls, request_data);
- rbd_assert(osd_data->type == CEPH_OSD_DATA_TYPE_PAGES);
+ rbd_assert(osd_data->type == CEPH_OSD_DATA_TYPE_DATABUF);
+ dbuf = osd_data->dbuf;
- p = page_address(osd_data->pages[0]);
+ p = kmap_ceph_databuf_page(dbuf, 0);
objno = ceph_decode_64(&p);
rbd_assert(objno == obj_req->ex.oe_objno);
rbd_assert(ceph_decode_64(&p) == objno + 1);
@@ -1981,6 +1991,7 @@ static int rbd_object_map_update_finish(struct rbd_obj_request *obj_req,
has_current_state = ceph_decode_8(&p);
if (has_current_state)
current_state = ceph_decode_8(&p);
+ kunmap_local(p);
spin_lock(&rbd_dev->object_map_lock);
state = __rbd_object_map_get(rbd_dev, objno);
@@ -2020,7 +2031,7 @@ static int rbd_cls_object_map_update(struct ceph_osd_request *req,
int which, u64 objno, u8 new_state,
const u8 *current_state)
{
- struct page **pages;
+ struct ceph_databuf *dbuf;
void *p, *start;
int ret;
@@ -2028,11 +2039,11 @@ static int rbd_cls_object_map_update(struct ceph_osd_request *req,
if (ret)
return ret;
- pages = ceph_alloc_page_vector(1, GFP_NOIO);
- if (IS_ERR(pages))
- return PTR_ERR(pages);
+ dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO);
+ if (!dbuf)
+ return -ENOMEM;
- p = start = page_address(pages[0]);
+ p = start = kmap_ceph_databuf_page(dbuf, 0);
ceph_encode_64(&p, objno);
ceph_encode_64(&p, objno + 1);
ceph_encode_8(&p, new_state);
@@ -2042,9 +2053,10 @@ static int rbd_cls_object_map_update(struct ceph_osd_request *req,
} else {
ceph_encode_8(&p, 0);
}
+ kunmap_local(p);
+ ceph_databuf_added_data(dbuf, p - start);
- osd_req_op_cls_request_data_pages(req, which, pages, p - start, 0,
- false, true);
+ osd_req_op_cls_request_databuf(req, which, dbuf);
return 0;
}
@@ -4673,8 +4685,8 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev,
size_t inbound_size)
{
struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
+ struct ceph_databuf *reply;
struct page *req_page = NULL;
- struct page *reply_page;
int ret;
/*
@@ -4695,8 +4707,8 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev,
memcpy(page_address(req_page), outbound, outbound_size);
}
- reply_page = alloc_page(GFP_KERNEL);
- if (!reply_page) {
+ reply = ceph_databuf_reply_alloc(1, inbound_size, GFP_KERNEL);
+ if (!reply) {
if (req_page)
__free_page(req_page);
return -ENOMEM;
@@ -4704,15 +4716,16 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev,
ret = ceph_osdc_call(osdc, oid, oloc, RBD_DRV_NAME, method_name,
CEPH_OSD_FLAG_READ, req_page, outbound_size,
- &reply_page, &inbound_size);
+ reply);
if (!ret) {
- memcpy(inbound, page_address(reply_page), inbound_size);
- ret = inbound_size;
+ ret = ceph_databuf_len(reply);
+ if (copy_from_iter(inbound, ret, &reply->iter) != ret)
+ ret = -EIO;
}
if (req_page)
__free_page(req_page);
- __free_page(reply_page);
+ ceph_databuf_release(reply);
return ret;
}
@@ -5633,7 +5646,7 @@ static int decode_parent_image_spec(void **p, void *end,
static int __get_parent_info(struct rbd_device *rbd_dev,
struct page *req_page,
- struct page *reply_page,
+ struct ceph_databuf *reply,
struct parent_image_info *pii)
{
struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
@@ -5643,27 +5656,31 @@ static int __get_parent_info(struct rbd_device *rbd_dev,
ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc,
"rbd", "parent_get", CEPH_OSD_FLAG_READ,
- req_page, sizeof(u64), &reply_page, &reply_len);
+ req_page, sizeof(u64), reply);
if (ret)
return ret == -EOPNOTSUPP ? 1 : ret;
- p = page_address(reply_page);
+ p = kmap_ceph_databuf_page(reply, 0);
end = p + reply_len;
ret = decode_parent_image_spec(&p, end, pii);
+ kunmap_local(p);
if (ret)
return ret;
+ ceph_databuf_reset_reply(reply);
+
ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc,
"rbd", "parent_overlap_get", CEPH_OSD_FLAG_READ,
- req_page, sizeof(u64), &reply_page, &reply_len);
+ req_page, sizeof(u64), reply);
if (ret)
return ret;
- p = page_address(reply_page);
+ p = kmap_ceph_databuf_page(reply, 0);
end = p + reply_len;
ceph_decode_8_safe(&p, end, pii->has_overlap, e_inval);
if (pii->has_overlap)
ceph_decode_64_safe(&p, end, pii->overlap, e_inval);
+ kunmap_local(p);
dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n",
__func__, pii->pool_id, pii->pool_ns, pii->image_id, pii->snap_id,
@@ -5679,25 +5696,25 @@ static int __get_parent_info(struct rbd_device *rbd_dev,
*/
static int __get_parent_info_legacy(struct rbd_device *rbd_dev,
struct page *req_page,
- struct page *reply_page,
+ struct ceph_databuf *reply,
struct parent_image_info *pii)
{
struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
- size_t reply_len = PAGE_SIZE;
void *p, *end;
int ret;
ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc,
"rbd", "get_parent", CEPH_OSD_FLAG_READ,
- req_page, sizeof(u64), &reply_page, &reply_len);
+ req_page, sizeof(u64), reply);
if (ret)
return ret;
- p = page_address(reply_page);
- end = p + reply_len;
+ p = kmap_ceph_databuf_page(reply, 0);
+ end = p + ceph_databuf_len(reply);
ceph_decode_64_safe(&p, end, pii->pool_id, e_inval);
pii->image_id = ceph_extract_encoded_string(&p, end, NULL, GFP_KERNEL);
if (IS_ERR(pii->image_id)) {
+ kunmap_local(p);
ret = PTR_ERR(pii->image_id);
pii->image_id = NULL;
return ret;
@@ -5705,6 +5722,7 @@ static int __get_parent_info_legacy(struct rbd_device *rbd_dev,
ceph_decode_64_safe(&p, end, pii->snap_id, e_inval);
pii->has_overlap = true;
ceph_decode_64_safe(&p, end, pii->overlap, e_inval);
+ kunmap_local(p);
dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n",
__func__, pii->pool_id, pii->pool_ns, pii->image_id, pii->snap_id,
@@ -5718,29 +5736,30 @@ static int __get_parent_info_legacy(struct rbd_device *rbd_dev,
static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev,
struct parent_image_info *pii)
{
- struct page *req_page, *reply_page;
+ struct ceph_databuf *reply;
+ struct page *req_page;
void *p;
- int ret;
+ int ret = -ENOMEM;
req_page = alloc_page(GFP_KERNEL);
if (!req_page)
- return -ENOMEM;
+ goto out;
- reply_page = alloc_page(GFP_KERNEL);
- if (!reply_page) {
- __free_page(req_page);
- return -ENOMEM;
- }
+ reply = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_KERNEL);
+ if (!reply)
+ goto out_free;
- p = page_address(req_page);
+ p = kmap_local_page(req_page);
ceph_encode_64(&p, rbd_dev->spec->snap_id);
- ret = __get_parent_info(rbd_dev, req_page, reply_page, pii);
+ kunmap_local(p);
+ ret = __get_parent_info(rbd_dev, req_page, reply, pii);
if (ret > 0)
- ret = __get_parent_info_legacy(rbd_dev, req_page, reply_page,
- pii);
+ ret = __get_parent_info_legacy(rbd_dev, req_page, reply, pii);
+ ceph_databuf_release(reply);
+out_free:
__free_page(req_page);
- __free_page(reply_page);
+out:
return ret;
}
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 172ee515a0f3..57b8aff53f28 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -610,7 +610,7 @@ int ceph_osdc_call(struct ceph_osd_client *osdc,
const char *class, const char *method,
unsigned int flags,
struct page *req_page, size_t req_len,
- struct page **resp_pages, size_t *resp_len);
+ struct ceph_databuf *response);
/* watch/notify */
struct ceph_osd_linger_request *
diff --git a/net/ceph/cls_lock_client.c b/net/ceph/cls_lock_client.c
index 66136a4c1ce7..37bb8708e8bb 100644
--- a/net/ceph/cls_lock_client.c
+++ b/net/ceph/cls_lock_client.c
@@ -74,7 +74,7 @@ int ceph_cls_lock(struct ceph_osd_client *osdc,
__func__, lock_name, type, cookie, tag, desc, flags);
ret = ceph_osdc_call(osdc, oid, oloc, "lock", "lock",
CEPH_OSD_FLAG_WRITE, lock_op_page,
- lock_op_buf_size, NULL, NULL);
+ lock_op_buf_size, NULL);
dout("%s: status %d\n", __func__, ret);
__free_page(lock_op_page);
@@ -124,7 +124,7 @@ int ceph_cls_unlock(struct ceph_osd_client *osdc,
dout("%s lock_name %s cookie %s\n", __func__, lock_name, cookie);
ret = ceph_osdc_call(osdc, oid, oloc, "lock", "unlock",
CEPH_OSD_FLAG_WRITE, unlock_op_page,
- unlock_op_buf_size, NULL, NULL);
+ unlock_op_buf_size, NULL);
dout("%s: status %d\n", __func__, ret);
__free_page(unlock_op_page);
@@ -179,7 +179,7 @@ int ceph_cls_break_lock(struct ceph_osd_client *osdc,
cookie, ENTITY_NAME(*locker));
ret = ceph_osdc_call(osdc, oid, oloc, "lock", "break_lock",
CEPH_OSD_FLAG_WRITE, break_op_page,
- break_op_buf_size, NULL, NULL);
+ break_op_buf_size, NULL);
dout("%s: status %d\n", __func__, ret);
__free_page(break_op_page);
@@ -230,7 +230,7 @@ int ceph_cls_set_cookie(struct ceph_osd_client *osdc,
__func__, lock_name, type, old_cookie, tag, new_cookie);
ret = ceph_osdc_call(osdc, oid, oloc, "lock", "set_cookie",
CEPH_OSD_FLAG_WRITE, cookie_op_page,
- cookie_op_buf_size, NULL, NULL);
+ cookie_op_buf_size, NULL);
dout("%s: status %d\n", __func__, ret);
__free_page(cookie_op_page);
@@ -337,10 +337,10 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc,
char *lock_name, u8 *type, char **tag,
struct ceph_locker **lockers, u32 *num_lockers)
{
+ struct ceph_databuf *reply;
int get_info_op_buf_size;
int name_len = strlen(lock_name);
- struct page *get_info_op_page, *reply_page;
- size_t reply_len = PAGE_SIZE;
+ struct page *get_info_op_page;
void *p, *end;
int ret;
@@ -353,8 +353,8 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc,
if (!get_info_op_page)
return -ENOMEM;
- reply_page = alloc_page(GFP_NOIO);
- if (!reply_page) {
+ reply = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_NOIO);
+ if (!reply) {
__free_page(get_info_op_page);
return -ENOMEM;
}
@@ -370,18 +370,19 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc,
dout("%s lock_name %s\n", __func__, lock_name);
ret = ceph_osdc_call(osdc, oid, oloc, "lock", "get_info",
CEPH_OSD_FLAG_READ, get_info_op_page,
- get_info_op_buf_size, &reply_page, &reply_len);
+ get_info_op_buf_size, reply);
dout("%s: status %d\n", __func__, ret);
if (ret >= 0) {
- p = page_address(reply_page);
- end = p + reply_len;
+ p = kmap_ceph_databuf_page(reply, 0);
+ end = p + ceph_databuf_len(reply);
ret = decode_lockers(&p, end, type, tag, lockers, num_lockers);
+ kunmap_local(p);
}
__free_page(get_info_op_page);
- __free_page(reply_page);
+ ceph_databuf_release(reply);
return ret;
}
EXPORT_SYMBOL(ceph_cls_lock_info);
@@ -389,11 +390,11 @@ EXPORT_SYMBOL(ceph_cls_lock_info);
int ceph_cls_assert_locked(struct ceph_osd_request *req, int which,
char *lock_name, u8 type, char *cookie, char *tag)
{
+ struct ceph_databuf *dbuf;
int assert_op_buf_size;
int name_len = strlen(lock_name);
int cookie_len = strlen(cookie);
int tag_len = strlen(tag);
- struct page **pages;
void *p, *end;
int ret;
@@ -408,11 +409,11 @@ int ceph_cls_assert_locked(struct ceph_osd_request *req, int which,
if (ret)
return ret;
- pages = ceph_alloc_page_vector(1, GFP_NOIO);
- if (IS_ERR(pages))
- return PTR_ERR(pages);
+ dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO);
+ if (!dbuf)
+ return -ENOMEM;
- p = page_address(pages[0]);
+ p = kmap_ceph_databuf_page(dbuf, 0);
end = p + assert_op_buf_size;
/* encode cls_lock_assert_op struct */
@@ -422,10 +423,12 @@ int ceph_cls_assert_locked(struct ceph_osd_request *req, int which,
ceph_encode_8(&p, type);
ceph_encode_string(&p, end, cookie, cookie_len);
ceph_encode_string(&p, end, tag, tag_len);
+	kunmap_local(p);
WARN_ON(p != end);
+ ceph_databuf_added_data(dbuf, assert_op_buf_size);
- osd_req_op_cls_request_data_pages(req, which, pages, assert_op_buf_size,
- 0, false, true);
+ osd_req_op_cls_request_databuf(req, which, dbuf);
return 0;
}
EXPORT_SYMBOL(ceph_cls_assert_locked);
+
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 720d8a605fc4..b6cf875d3de4 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -5195,7 +5195,10 @@ EXPORT_SYMBOL(ceph_osdc_maybe_request_map);
* Execute an OSD class method on an object.
*
* @flags: CEPH_OSD_FLAG_*
- * @resp_len: in/out param for reply length
+ * @response: Pointer to the storage descriptor for the reply or NULL.
+ *
+ * The size of the response buffer is set by the caller in @response->limit and
+ * the size of the response obtained is set in @response->iter.
*/
int ceph_osdc_call(struct ceph_osd_client *osdc,
struct ceph_object_id *oid,
@@ -5203,7 +5206,7 @@ int ceph_osdc_call(struct ceph_osd_client *osdc,
const char *class, const char *method,
unsigned int flags,
struct page *req_page, size_t req_len,
- struct page **resp_pages, size_t *resp_len)
+ struct ceph_databuf *response)
{
struct ceph_osd_request *req;
int ret;
@@ -5226,9 +5229,8 @@ int ceph_osdc_call(struct ceph_osd_client *osdc,
if (req_page)
osd_req_op_cls_request_data_pages(req, 0, &req_page, req_len,
0, false, false);
- if (resp_pages)
- osd_req_op_cls_response_data_pages(req, 0, resp_pages,
- *resp_len, 0, false, false);
+ if (response)
+ osd_req_op_cls_response_databuf(req, 0, response);
ret = ceph_osdc_alloc_messages(req, GFP_NOIO);
if (ret)
@@ -5238,8 +5240,8 @@ int ceph_osdc_call(struct ceph_osd_client *osdc,
ret = ceph_osdc_wait_request(osdc, req);
if (ret >= 0) {
ret = req->r_ops[0].rval;
- if (resp_pages)
- *resp_len = req->r_ops[0].outdata_len;
+ if (response)
+ ceph_databuf_reply_ready(response, req->r_ops[0].outdata_len);
}
out_put_req:
* [RFC PATCH 08/35] libceph: Unexport osd_req_op_cls_request_data_pages()
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (6 preceding siblings ...)
2025-03-13 23:32 ` [RFC PATCH 07/35] libceph: Change ceph_osdc_call()'s reply to a ceph_databuf David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 09/35] libceph: Remove osd_req_op_cls_response_data_pages() David Howells
` (26 subsequent siblings)
34 siblings, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Unexport osd_req_op_cls_request_data_pages() as it's not used outside of
the file in which it is defined, and it will be replaced later in the series.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
include/linux/ceph/osd_client.h | 5 -----
net/ceph/osd_client.c | 3 +--
2 files changed, 1 insertion(+), 7 deletions(-)
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 57b8aff53f28..60f28fc0238b 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -521,11 +521,6 @@ void osd_req_op_cls_request_databuf(struct ceph_osd_request *req,
extern void osd_req_op_cls_request_data_pagelist(struct ceph_osd_request *,
unsigned int which,
struct ceph_pagelist *pagelist);
-extern void osd_req_op_cls_request_data_pages(struct ceph_osd_request *,
- unsigned int which,
- struct page **pages, u64 length,
- u32 offset, bool pages_from_pool,
- bool own_pages);
void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req,
unsigned int which,
struct bio_vec *bvecs, u32 num_bvecs,
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index b6cf875d3de4..10827b1227e4 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -347,7 +347,7 @@ void osd_req_op_cls_request_data_pagelist(
}
EXPORT_SYMBOL(osd_req_op_cls_request_data_pagelist);
-void osd_req_op_cls_request_data_pages(struct ceph_osd_request *osd_req,
+static void osd_req_op_cls_request_data_pages(struct ceph_osd_request *osd_req,
unsigned int which, struct page **pages, u64 length,
u32 offset, bool pages_from_pool, bool own_pages)
{
@@ -359,7 +359,6 @@ void osd_req_op_cls_request_data_pages(struct ceph_osd_request *osd_req,
osd_req->r_ops[which].cls.indata_len += length;
osd_req->r_ops[which].indata_len += length;
}
-EXPORT_SYMBOL(osd_req_op_cls_request_data_pages);
void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req,
unsigned int which,
* [RFC PATCH 09/35] libceph: Remove osd_req_op_cls_response_data_pages()
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (7 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 08/35] libceph: Unexport osd_req_op_cls_request_data_pages() David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 10/35] libceph: Convert notify_id_pages to a ceph_databuf David Howells
` (25 subsequent siblings)
34 siblings, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Remove osd_req_op_cls_response_data_pages() as it's no longer used.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
include/linux/ceph/osd_client.h | 5 -----
net/ceph/osd_client.c | 12 ------------
2 files changed, 17 deletions(-)
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 60f28fc0238b..fe51c6ed23af 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -528,11 +528,6 @@ void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req,
void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req,
unsigned int which,
struct ceph_databuf *dbuf);
-extern void osd_req_op_cls_response_data_pages(struct ceph_osd_request *,
- unsigned int which,
- struct page **pages, u64 length,
- u32 offset, bool pages_from_pool,
- bool own_pages);
int osd_req_op_cls_init(struct ceph_osd_request *osd_req, unsigned int which,
const char *class, const char *method);
extern int osd_req_op_xattr_init(struct ceph_osd_request *osd_req, unsigned int which,
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 10827b1227e4..e1dbde4bf2b9 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -391,18 +391,6 @@ void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req,
}
EXPORT_SYMBOL(osd_req_op_cls_response_databuf);
-void osd_req_op_cls_response_data_pages(struct ceph_osd_request *osd_req,
- unsigned int which, struct page **pages, u64 length,
- u32 offset, bool pages_from_pool, bool own_pages)
-{
- struct ceph_osd_data *osd_data;
-
- osd_data = osd_req_op_data(osd_req, which, cls, response_data);
- ceph_osd_data_pages_init(osd_data, pages, length, offset,
- pages_from_pool, own_pages);
-}
-EXPORT_SYMBOL(osd_req_op_cls_response_data_pages);
-
static u64 ceph_osd_data_length(struct ceph_osd_data *osd_data)
{
switch (osd_data->type) {
* [RFC PATCH 10/35] libceph: Convert notify_id_pages to a ceph_databuf
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (8 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 09/35] libceph: Remove osd_req_op_cls_response_data_pages() David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 11/35] ceph: Use ceph_databuf in DIO David Howells
` (24 subsequent siblings)
34 siblings, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Convert linger->notify_id_pages to a ceph_databuf.
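With the notify_id held in a databuf, picking it out of the reply in
linger_commit_cb() becomes a map/decode/unmap of page 0 (a sketch of the
change below; the ceph_decode_64() step is the existing decode, unchanged
by this patch):

	p = kmap_ceph_databuf_page(lreq->notify_id_buf, 0);
	if (req->r_ops[0].outdata_len >= sizeof(u64))
		lreq->notify_id = ceph_decode_64(&p);
	kunmap_local(p);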
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
include/linux/ceph/osd_client.h | 2 +-
net/ceph/osd_client.c | 24 +++++++++++++-----------
2 files changed, 14 insertions(+), 12 deletions(-)
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index fe51c6ed23af..5ac4c0b4dfcd 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -349,7 +349,7 @@ struct ceph_osd_linger_request {
void *data;
struct ceph_pagelist *request_pl;
- struct page **notify_id_pages;
+ struct ceph_databuf *notify_id_buf;
struct page ***preply_pages;
size_t *preply_len;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index e1dbde4bf2b9..fc5c136e793d 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2841,9 +2841,7 @@ static void linger_release(struct kref *kref)
if (lreq->request_pl)
ceph_pagelist_release(lreq->request_pl);
- if (lreq->notify_id_pages)
- ceph_release_page_vector(lreq->notify_id_pages, 1);
-
+ ceph_databuf_release(lreq->notify_id_buf);
ceph_osdc_put_request(lreq->reg_req);
ceph_osdc_put_request(lreq->ping_req);
target_destroy(&lreq->t);
@@ -3128,10 +3126,13 @@ static void linger_commit_cb(struct ceph_osd_request *req)
if (!lreq->is_watch) {
struct ceph_osd_data *osd_data =
osd_req_op_data(req, 0, notify, response_data);
- void *p = page_address(osd_data->pages[0]);
+ struct ceph_databuf *notify_id_buf = lreq->notify_id_buf;
+ void *p;
WARN_ON(req->r_ops[0].op != CEPH_OSD_OP_NOTIFY ||
- osd_data->type != CEPH_OSD_DATA_TYPE_PAGES);
+ osd_data->type != CEPH_OSD_DATA_TYPE_DATABUF);
+
+ p = kmap_ceph_databuf_page(notify_id_buf, 0);
/* make note of the notify_id */
if (req->r_ops[0].outdata_len >= sizeof(u64)) {
@@ -3141,6 +3142,8 @@ static void linger_commit_cb(struct ceph_osd_request *req)
} else {
dout("lreq %p no notify_id\n", lreq);
}
+
+ kunmap_local(p);
}
out:
@@ -3224,9 +3227,9 @@ static void send_linger(struct ceph_osd_linger_request *lreq)
refcount_inc(&lreq->request_pl->refcnt);
osd_req_op_notify_init(req, 0, lreq->linger_id,
lreq->request_pl);
- ceph_osd_data_pages_init(
+ ceph_osd_databuf_init(
osd_req_op_data(req, 0, notify, response_data),
- lreq->notify_id_pages, PAGE_SIZE, 0, false, false);
+ ceph_databuf_get(lreq->notify_id_buf));
}
dout("lreq %p register\n", lreq);
req->r_callback = linger_commit_cb;
@@ -5016,10 +5019,9 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc,
}
/* for notify_id */
- lreq->notify_id_pages = ceph_alloc_page_vector(1, GFP_NOIO);
- if (IS_ERR(lreq->notify_id_pages)) {
- ret = PTR_ERR(lreq->notify_id_pages);
- lreq->notify_id_pages = NULL;
+ lreq->notify_id_buf = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_NOIO);
+ if (!lreq->notify_id_buf) {
+ ret = -ENOMEM;
goto out_put_lreq;
}
* [RFC PATCH 11/35] ceph: Use ceph_databuf in DIO
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (9 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 10/35] libceph: Convert notify_id_pages to a ceph_databuf David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-17 20:03 ` Viacheslav Dubeyko
2025-03-17 22:26 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 12/35] libceph: Bypass the messenger-v1 Tx loop for databuf/iter data blobs David Howells
` (23 subsequent siblings)
34 siblings, 2 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Stash the list of pages to be read into/written from during a ceph fs
direct read/write in a ceph_databuf struct rather than using a bvec array.
Eventually this will be replaced with just an iterator supplied by
netfslib.
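One visible simplification in the completion path (a sketch of the change
below): the tail of a short read can now be zeroed straight through the
databuf's embedded iterator instead of first assembling a temporary
ITER_BVEC iterator over the bvec array:

	/* was: iov_iter_bvec(&i, ITER_DEST, bvecs, num_pages, len);
	 *      iov_iter_advance(&i, rc);
	 *      iov_iter_zero(zlen, &i);
	 */
	iov_iter_advance(&osd_data->iter, rc);
	iov_iter_zero(zlen, &osd_data->iter);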
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
fs/ceph/file.c | 110 +++++++++++++++++++++----------------------------
1 file changed, 47 insertions(+), 63 deletions(-)
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 9de2960748b9..fb4024bc8274 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -82,11 +82,10 @@ static __le32 ceph_flags_sys2wire(struct ceph_mds_client *mdsc, u32 flags)
*/
#define ITER_GET_BVECS_PAGES 64
-static ssize_t __iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
- struct bio_vec *bvecs)
+static int __iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
+ struct ceph_databuf *dbuf)
{
size_t size = 0;
- int bvec_idx = 0;
if (maxsize > iov_iter_count(iter))
maxsize = iov_iter_count(iter);
@@ -98,22 +97,24 @@ static ssize_t __iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
int idx = 0;
bytes = iov_iter_get_pages2(iter, pages, maxsize - size,
- ITER_GET_BVECS_PAGES, &start);
- if (bytes < 0)
- return size ?: bytes;
-
- size += bytes;
+ ITER_GET_BVECS_PAGES, &start);
+ if (bytes < 0) {
+ if (size == 0)
+ return bytes;
+ break;
+ }
- for ( ; bytes; idx++, bvec_idx++) {
+ while (bytes) {
int len = min_t(int, bytes, PAGE_SIZE - start);
- bvec_set_page(&bvecs[bvec_idx], pages[idx], len, start);
+ ceph_databuf_append_page(dbuf, pages[idx++], start, len);
bytes -= len;
+ size += len;
start = 0;
}
}
- return size;
+ return 0;
}
/*
@@ -124,52 +125,44 @@ static ssize_t __iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
* Attempt to get up to @maxsize bytes worth of pages from @iter.
* Return the number of bytes in the created bio_vec array, or an error.
*/
-static ssize_t iter_get_bvecs_alloc(struct iov_iter *iter, size_t maxsize,
- struct bio_vec **bvecs, int *num_bvecs)
+static struct ceph_databuf *iter_get_bvecs_alloc(struct iov_iter *iter,
+ size_t maxsize, bool write)
{
- struct bio_vec *bv;
+ struct ceph_databuf *dbuf;
size_t orig_count = iov_iter_count(iter);
- ssize_t bytes;
- int npages;
+ int npages, ret;
iov_iter_truncate(iter, maxsize);
npages = iov_iter_npages(iter, INT_MAX);
iov_iter_reexpand(iter, orig_count);
- /*
- * __iter_get_bvecs() may populate only part of the array -- zero it
- * out.
- */
- bv = kvmalloc_array(npages, sizeof(*bv), GFP_KERNEL | __GFP_ZERO);
- if (!bv)
- return -ENOMEM;
+ if (write)
+ dbuf = ceph_databuf_req_alloc(npages, 0, GFP_KERNEL);
+ else
+ dbuf = ceph_databuf_reply_alloc(npages, 0, GFP_KERNEL);
+ if (!dbuf)
+ return ERR_PTR(-ENOMEM);
- bytes = __iter_get_bvecs(iter, maxsize, bv);
- if (bytes < 0) {
+ ret = __iter_get_bvecs(iter, maxsize, dbuf);
+ if (ret < 0) {
/*
* No pages were pinned -- just free the array.
*/
- kvfree(bv);
- return bytes;
+ ceph_databuf_release(dbuf);
+ return ERR_PTR(ret);
}
- *bvecs = bv;
- *num_bvecs = npages;
- return bytes;
+ return dbuf;
}
-static void put_bvecs(struct bio_vec *bvecs, int num_bvecs, bool should_dirty)
+static void ceph_dirty_pages(struct ceph_databuf *dbuf)
{
+ struct bio_vec *bvec = dbuf->bvec;
int i;
- for (i = 0; i < num_bvecs; i++) {
- if (bvecs[i].bv_page) {
- if (should_dirty)
- set_page_dirty_lock(bvecs[i].bv_page);
- put_page(bvecs[i].bv_page);
- }
- }
- kvfree(bvecs);
+ for (i = 0; i < dbuf->nr_bvec; i++)
+ if (bvec[i].bv_page)
+ set_page_dirty_lock(bvec[i].bv_page);
}
/*
@@ -1338,14 +1331,11 @@ static void ceph_aio_complete_req(struct ceph_osd_request *req)
struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0);
struct ceph_osd_req_op *op = &req->r_ops[0];
struct ceph_client_metric *metric = &ceph_sb_to_mdsc(inode->i_sb)->metric;
- unsigned int len = osd_data->bvec_pos.iter.bi_size;
+ size_t len = osd_data->iter.count;
bool sparse = (op->op == CEPH_OSD_OP_SPARSE_READ);
struct ceph_client *cl = ceph_inode_to_client(inode);
- BUG_ON(osd_data->type != CEPH_OSD_DATA_TYPE_BVECS);
- BUG_ON(!osd_data->num_bvecs);
-
- doutc(cl, "req %p inode %p %llx.%llx, rc %d bytes %u\n", req,
+ doutc(cl, "req %p inode %p %llx.%llx, rc %d bytes %zu\n", req,
inode, ceph_vinop(inode), rc, len);
if (rc == -EOLDSNAPC) {
@@ -1367,7 +1357,6 @@ static void ceph_aio_complete_req(struct ceph_osd_request *req)
if (rc == -ENOENT)
rc = 0;
if (rc >= 0 && len > rc) {
- struct iov_iter i;
int zlen = len - rc;
/*
@@ -1384,10 +1373,8 @@ static void ceph_aio_complete_req(struct ceph_osd_request *req)
aio_req->total_len = rc + zlen;
}
- iov_iter_bvec(&i, ITER_DEST, osd_data->bvec_pos.bvecs,
- osd_data->num_bvecs, len);
- iov_iter_advance(&i, rc);
- iov_iter_zero(zlen, &i);
+ iov_iter_advance(&osd_data->iter, rc);
+ iov_iter_zero(zlen, &osd_data->iter);
}
}
@@ -1401,8 +1388,8 @@ static void ceph_aio_complete_req(struct ceph_osd_request *req)
req->r_end_latency, len, rc);
}
- put_bvecs(osd_data->bvec_pos.bvecs, osd_data->num_bvecs,
- aio_req->should_dirty);
+ if (aio_req->should_dirty)
+ ceph_dirty_pages(osd_data->dbuf);
ceph_osdc_put_request(req);
if (rc < 0)
@@ -1491,9 +1478,8 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
struct ceph_client_metric *metric = &fsc->mdsc->metric;
struct ceph_vino vino;
struct ceph_osd_request *req;
- struct bio_vec *bvecs;
struct ceph_aio_request *aio_req = NULL;
- int num_pages = 0;
+ struct ceph_databuf *dbuf = NULL;
int flags;
int ret = 0;
struct timespec64 mtime = current_time(inode);
@@ -1529,8 +1515,8 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
while (iov_iter_count(iter) > 0) {
u64 size = iov_iter_count(iter);
- ssize_t len;
struct ceph_osd_req_op *op;
+ size_t len;
int readop = sparse ? CEPH_OSD_OP_SPARSE_READ : CEPH_OSD_OP_READ;
int extent_cnt;
@@ -1563,16 +1549,17 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
}
}
- len = iter_get_bvecs_alloc(iter, size, &bvecs, &num_pages);
- if (len < 0) {
+ dbuf = iter_get_bvecs_alloc(iter, size, write);
+ if (IS_ERR(dbuf)) {
ceph_osdc_put_request(req);
- ret = len;
+ ret = PTR_ERR(dbuf);
break;
}
+ len = ceph_databuf_len(dbuf);
if (len != size)
osd_req_op_extent_update(req, 0, len);
- osd_req_op_extent_osd_data_bvecs(req, 0, bvecs, num_pages, len);
+ osd_req_op_extent_osd_databuf(req, 0, dbuf);
/*
* To simplify error handling, allow AIO when IO within i_size
@@ -1637,20 +1624,17 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
ret = 0;
if (ret >= 0 && ret < len && pos + ret < size) {
- struct iov_iter i;
int zlen = min_t(size_t, len - ret,
size - pos - ret);
- iov_iter_bvec(&i, ITER_DEST, bvecs, num_pages, len);
- iov_iter_advance(&i, ret);
- iov_iter_zero(zlen, &i);
+ iov_iter_advance(&dbuf->iter, ret);
+ iov_iter_zero(zlen, &dbuf->iter);
ret += zlen;
}
if (ret >= 0)
len = ret;
}
- put_bvecs(bvecs, num_pages, should_dirty);
ceph_osdc_put_request(req);
if (ret < 0)
break;
* [RFC PATCH 12/35] libceph: Bypass the messenger-v1 Tx loop for databuf/iter data blobs
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (10 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 11/35] ceph: Use ceph_databuf in DIO David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 13/35] rbd: Switch from using bvec_iter to iov_iter David Howells
` (22 subsequent siblings)
34 siblings, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Don't use the messenger-v1 Tx loop, which sends page fragments
individually, for databuf/iter data blobs, but rather pass the entire
iterator to the socket in one go. This uses the loop inside tcp_sendmsg()
to do the work and allows TCP to make better choices.
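Since the data is no longer pushed out a page fragment at a time, the CRC
can no longer be folded into the send loop; instead a second iterator
(crc_iter), advanced in lockstep with however many bytes sock_sendmsg()
actually accepted, feeds the checksum (a sketch assembled from the hunks
below):

	ret = ceph_tcp_sock_sendmsg(con->sock, &cursor->iov_iter, MSG_MORE);
	if (ret <= 0)
		return ret;
	if (do_datacrc && cursor->need_crc)
		ceph_calc_crc(&cursor->crc_iter, ret, &crc); /* crc32c over the sent bytes */
	ceph_msg_data_advance(cursor, ret);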
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
include/linux/ceph/messenger.h | 1 +
net/ceph/messenger.c | 1 +
net/ceph/messenger_v1.c | 76 ++++++++++++++++++++++++++++------
3 files changed, 65 insertions(+), 13 deletions(-)
diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 864aad369c91..1b646d0dff39 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -255,6 +255,7 @@ struct ceph_msg_data_cursor {
};
struct {
struct iov_iter iov_iter;
+ struct iov_iter crc_iter;
unsigned int lastlen;
};
};
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 02439b38ec94..dc8082575e4f 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -975,6 +975,7 @@ static void ceph_msg_data_iter_cursor_init(struct ceph_msg_data_cursor *cursor,
struct ceph_msg_data *data = cursor->data;
cursor->iov_iter = data->iter;
+ cursor->crc_iter = data->iter;
cursor->lastlen = 0;
iov_iter_truncate(&cursor->iov_iter, length);
cursor->resid = iov_iter_count(&cursor->iov_iter);
diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
index 0cb61c76b9b8..d6464ac62b09 100644
--- a/net/ceph/messenger_v1.c
+++ b/net/ceph/messenger_v1.c
@@ -3,6 +3,7 @@
#include <linux/bvec.h>
#include <linux/crc32c.h>
+#include <linux/iov_iter.h>
#include <linux/net.h>
#include <linux/socket.h>
#include <net/sock.h>
@@ -74,6 +75,21 @@ static int ceph_tcp_sendmsg(struct socket *sock, struct kvec *iov,
return r;
}
+static int ceph_tcp_sock_sendmsg(struct socket *sock, struct iov_iter *iter,
+ unsigned int flags)
+{
+ struct msghdr msg = {
+ .msg_iter = *iter,
+ .msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL | flags,
+ };
+ int r;
+
+ r = sock_sendmsg(sock, &msg);
+ if (r == -EAGAIN)
+ r = 0;
+ return r;
+}
+
/*
* @more: MSG_MORE or 0.
*/
@@ -455,6 +471,24 @@ static int write_partial_kvec(struct ceph_connection *con)
return ret; /* done! */
}
+static size_t ceph_crc_from_iter(void *iter_from, size_t progress,
+ size_t len, void *priv, void *priv2)
+{
+ u32 *crc = priv;
+
+ *crc = crc32c(*crc, iter_from, len);
+ return 0;
+}
+
+static void ceph_calc_crc(struct iov_iter *iter, size_t count, u32 *crc)
+{
+ size_t done;
+
+ done = iterate_and_advance_kernel(iter, count, crc, NULL,
+ ceph_crc_from_iter);
+ WARN_ON(done != count);
+}
+
/*
* Write as much message data payload as we can. If we finish, queue
* up the footer.
@@ -467,7 +501,7 @@ static int write_partial_message_data(struct ceph_connection *con)
struct ceph_msg *msg = con->out_msg;
struct ceph_msg_data_cursor *cursor = &msg->cursor;
bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
- u32 crc;
+ u32 crc = 0;
dout("%s %p msg %p\n", __func__, con, msg);
@@ -484,9 +518,6 @@ static int write_partial_message_data(struct ceph_connection *con)
*/
crc = do_datacrc ? le32_to_cpu(msg->footer.data_crc) : 0;
while (cursor->total_resid) {
- struct page *page;
- size_t page_offset;
- size_t length;
int ret;
if (!cursor->resid) {
@@ -494,17 +525,36 @@ static int write_partial_message_data(struct ceph_connection *con)
continue;
}
- page = ceph_msg_data_next(cursor, &page_offset, &length);
- ret = ceph_tcp_sendpage(con->sock, page, page_offset, length,
- MSG_MORE);
- if (ret <= 0) {
- if (do_datacrc)
- msg->footer.data_crc = cpu_to_le32(crc);
+ if (cursor->data->type == CEPH_MSG_DATA_DATABUF ||
+ cursor->data->type == CEPH_MSG_DATA_ITER) {
+ ret = ceph_tcp_sock_sendmsg(con->sock, &cursor->iov_iter,
+ MSG_MORE);
+ if (ret <= 0) {
+ if (do_datacrc)
+ msg->footer.data_crc = cpu_to_le32(crc);
- return ret;
+ return ret;
+ }
+ if (do_datacrc && cursor->need_crc)
+ ceph_calc_crc(&cursor->crc_iter, ret, &crc);
+ } else {
+ struct page *page;
+ size_t page_offset;
+ size_t length;
+
+ page = ceph_msg_data_next(cursor, &page_offset, &length);
+ ret = ceph_tcp_sendpage(con->sock, page, page_offset,
+ length, MSG_MORE);
+ if (ret <= 0) {
+ if (do_datacrc)
+ msg->footer.data_crc = cpu_to_le32(crc);
+
+ return ret;
+ }
+ if (do_datacrc && cursor->need_crc)
+ crc = ceph_crc32c_page(crc, page, page_offset,
+ length);
}
- if (do_datacrc && cursor->need_crc)
- crc = ceph_crc32c_page(crc, page, page_offset, length);
ceph_msg_data_advance(cursor, (size_t)ret);
}
* [RFC PATCH 13/35] rbd: Switch from using bvec_iter to iov_iter
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (11 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 12/35] libceph: Bypass the messenger-v1 Tx loop for databuf/iter data blobs David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-18 19:38 ` Viacheslav Dubeyko
2025-03-18 22:13 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 14/35] libceph: Remove bvec and bio data container types David Howells
` (21 subsequent siblings)
34 siblings, 2 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel, Xiubo Li
Switch from using a ceph_bio_iter/ceph_bvec_iter for iterating over the
bio_vecs attached to the request to using a ceph_databuf with the bio_vecs
transcribed from the bio list. This allows the entire bio bvec[] set to
be passed down to the socket (if unencrypted).
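The fill becomes a two-pass walk over the bio chain (a sketch of
rbd_img_fill_from_bio() in the diff below): first count how many bio_vec
slots each object request needs (the sum may exceed the bio's own bvec
count because fragments can straddle stripe-unit boundaries), then allocate
suitably sized databufs and transcribe the page fragments into them:

	rbd_start_bio_iteration(&iter, bio);
	ret = ceph_file_to_extents(&rbd_dev->layout, off, len,
				   &img_req->object_extents,
				   alloc_object_extent, img_req,
				   count_bio_bvecs, &iter); /* pass 1: count */
	if (ret)
		return ret;

	ret = rbd_img_alloc_databufs(img_req); /* size each databuf */
	if (ret)
		return ret;

	rbd_start_bio_iteration(&iter, bio);
	ret = ceph_iterate_extents(&rbd_dev->layout, off, len,
				   &img_req->object_extents,
				   copy_bio_bvecs, &iter); /* pass 2: transcribe */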
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: linux-fsdevel@vger.kernel.org
---
drivers/block/rbd.c | 642 ++++++++++++++---------------------
include/linux/ceph/databuf.h | 22 ++
include/linux/ceph/striper.h | 58 +++-
net/ceph/striper.c | 53 ---
4 files changed, 331 insertions(+), 444 deletions(-)
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 073e80d2d966..dd22cea7ae89 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -46,6 +46,7 @@
#include <linux/slab.h>
#include <linux/idr.h>
#include <linux/workqueue.h>
+#include <linux/iov_iter.h>
#include "rbd_types.h"
@@ -214,13 +215,6 @@ struct pending_result {
struct rbd_img_request;
-enum obj_request_type {
- OBJ_REQUEST_NODATA = 1,
- OBJ_REQUEST_BIO, /* pointer into provided bio (list) */
- OBJ_REQUEST_BVECS, /* pointer into provided bio_vec array */
- OBJ_REQUEST_OWN_BVECS, /* private bio_vec array, doesn't own pages */
-};
-
enum obj_operation_type {
OBJ_OP_READ = 1,
OBJ_OP_WRITE,
@@ -295,18 +289,12 @@ struct rbd_obj_request {
struct ceph_file_extent *img_extents;
u32 num_img_extents;
- union {
- struct ceph_bio_iter bio_pos;
- struct {
- struct ceph_bvec_iter bvec_pos;
- u32 bvec_count;
- u32 bvec_idx;
- };
- };
+ unsigned int bvec_count;
+ struct iov_iter iter;
+ struct ceph_databuf *dbuf;
enum rbd_obj_copyup_state copyup_state;
- struct bio_vec *copyup_bvecs;
- u32 copyup_bvec_count;
+ struct ceph_databuf *copyup_buf;
struct list_head osd_reqs; /* w/ r_private_item */
@@ -330,7 +318,6 @@ enum rbd_img_state {
struct rbd_img_request {
struct rbd_device *rbd_dev;
enum obj_operation_type op_type;
- enum obj_request_type data_type;
unsigned long flags;
enum rbd_img_state state;
union {
@@ -1221,26 +1208,6 @@ static void rbd_dev_mapping_clear(struct rbd_device *rbd_dev)
rbd_dev->mapping.size = 0;
}
-static void zero_bios(struct ceph_bio_iter *bio_pos, u32 off, u32 bytes)
-{
- struct ceph_bio_iter it = *bio_pos;
-
- ceph_bio_iter_advance(&it, off);
- ceph_bio_iter_advance_step(&it, bytes, ({
- memzero_bvec(&bv);
- }));
-}
-
-static void zero_bvecs(struct ceph_bvec_iter *bvec_pos, u32 off, u32 bytes)
-{
- struct ceph_bvec_iter it = *bvec_pos;
-
- ceph_bvec_iter_advance(&it, off);
- ceph_bvec_iter_advance_step(&it, bytes, ({
- memzero_bvec(&bv);
- }));
-}
-
/*
* Zero a range in @obj_req data buffer defined by a bio (list) or
* (private) bio_vec array.
@@ -1252,17 +1219,9 @@ static void rbd_obj_zero_range(struct rbd_obj_request *obj_req, u32 off,
{
dout("%s %p data buf %u~%u\n", __func__, obj_req, off, bytes);
- switch (obj_req->img_request->data_type) {
- case OBJ_REQUEST_BIO:
- zero_bios(&obj_req->bio_pos, off, bytes);
- break;
- case OBJ_REQUEST_BVECS:
- case OBJ_REQUEST_OWN_BVECS:
- zero_bvecs(&obj_req->bvec_pos, off, bytes);
- break;
- default:
- BUG();
- }
+ iov_iter_advance(&obj_req->dbuf->iter, off);
+ iov_iter_zero(bytes, &obj_req->dbuf->iter);
+ iov_iter_revert(&obj_req->dbuf->iter, off);
}
static void rbd_obj_request_destroy(struct kref *kref);
@@ -1487,7 +1446,6 @@ static void rbd_obj_request_destroy(struct kref *kref)
{
struct rbd_obj_request *obj_request;
struct ceph_osd_request *osd_req;
- u32 i;
obj_request = container_of(kref, struct rbd_obj_request, kref);
@@ -1500,27 +1458,8 @@ static void rbd_obj_request_destroy(struct kref *kref)
ceph_osdc_put_request(osd_req);
}
- switch (obj_request->img_request->data_type) {
- case OBJ_REQUEST_NODATA:
- case OBJ_REQUEST_BIO:
- case OBJ_REQUEST_BVECS:
- break; /* Nothing to do */
- case OBJ_REQUEST_OWN_BVECS:
- kfree(obj_request->bvec_pos.bvecs);
- break;
- default:
- BUG();
- }
-
kfree(obj_request->img_extents);
- if (obj_request->copyup_bvecs) {
- for (i = 0; i < obj_request->copyup_bvec_count; i++) {
- if (obj_request->copyup_bvecs[i].bv_page)
- __free_page(obj_request->copyup_bvecs[i].bv_page);
- }
- kfree(obj_request->copyup_bvecs);
- }
-
+ ceph_databuf_release(obj_request->copyup_buf);
kmem_cache_free(rbd_obj_request_cache, obj_request);
}
@@ -1855,7 +1794,7 @@ static int __rbd_object_map_load(struct rbd_device *rbd_dev)
goto out;
p = kmap_ceph_databuf_page(reply, 0);
- end = p + min(ceph_databuf_len(reply), (size_t)PAGE_SIZE);
+ end = p + umin(ceph_databuf_len(reply), PAGE_SIZE);
q = p;
ret = decode_object_map_header(&q, end, &object_map_size);
if (ret)
@@ -2167,29 +2106,6 @@ static int rbd_obj_calc_img_extents(struct rbd_obj_request *obj_req,
return 0;
}
-static void rbd_osd_setup_data(struct ceph_osd_request *osd_req, int which)
-{
- struct rbd_obj_request *obj_req = osd_req->r_priv;
-
- switch (obj_req->img_request->data_type) {
- case OBJ_REQUEST_BIO:
- osd_req_op_extent_osd_data_bio(osd_req, which,
- &obj_req->bio_pos,
- obj_req->ex.oe_len);
- break;
- case OBJ_REQUEST_BVECS:
- case OBJ_REQUEST_OWN_BVECS:
- rbd_assert(obj_req->bvec_pos.iter.bi_size ==
- obj_req->ex.oe_len);
- rbd_assert(obj_req->bvec_idx == obj_req->bvec_count);
- osd_req_op_extent_osd_data_bvec_pos(osd_req, which,
- &obj_req->bvec_pos);
- break;
- default:
- BUG();
- }
-}
-
static int rbd_osd_setup_stat(struct ceph_osd_request *osd_req, int which)
{
struct page **pages;
@@ -2223,8 +2139,7 @@ static int rbd_osd_setup_copyup(struct ceph_osd_request *osd_req, int which,
if (ret)
return ret;
- osd_req_op_cls_request_data_bvecs(osd_req, which, obj_req->copyup_bvecs,
- obj_req->copyup_bvec_count, bytes);
+ osd_req_op_cls_request_databuf(osd_req, which, obj_req->copyup_buf);
return 0;
}
@@ -2256,7 +2171,7 @@ static void __rbd_osd_setup_write_ops(struct ceph_osd_request *osd_req,
osd_req_op_extent_init(osd_req, which, opcode,
obj_req->ex.oe_off, obj_req->ex.oe_len, 0, 0);
- rbd_osd_setup_data(osd_req, which);
+ osd_req_op_extent_osd_databuf(osd_req, which, obj_req->dbuf);
}
static int rbd_obj_init_write(struct rbd_obj_request *obj_req)
@@ -2427,6 +2342,19 @@ static void rbd_osd_setup_write_ops(struct ceph_osd_request *osd_req,
}
}
+static struct ceph_object_extent *alloc_object_extent(void *arg)
+{
+ struct rbd_img_request *img_req = arg;
+ struct rbd_obj_request *obj_req;
+
+ obj_req = rbd_obj_request_create();
+ if (!obj_req)
+ return NULL;
+
+ rbd_img_obj_request_add(img_req, obj_req);
+ return &obj_req->ex;
+}
+
/*
* Prune the list of object requests (adjust offset and/or length, drop
* redundant requests). Prepare object request state machines and image
@@ -2466,104 +2394,232 @@ static int __rbd_img_fill_request(struct rbd_img_request *img_req)
return 0;
}
-union rbd_img_fill_iter {
- struct ceph_bio_iter bio_iter;
- struct ceph_bvec_iter bvec_iter;
-};
+/*
+ * Handle ranged, but dataless ops such as DISCARD and ZEROOUT.
+ */
+static int rbd_img_fill_nodata(struct rbd_img_request *img_req,
+ u64 off, u64 len)
+{
+ int ret;
+
+ ret = ceph_file_to_extents(&img_req->rbd_dev->layout, off, len,
+ &img_req->object_extents,
+ alloc_object_extent, img_req,
+ NULL, NULL);
+ if (ret)
+ return ret;
-struct rbd_img_fill_ctx {
- enum obj_request_type pos_type;
- union rbd_img_fill_iter *pos;
- union rbd_img_fill_iter iter;
- ceph_object_extent_fn_t set_pos_fn;
- ceph_object_extent_fn_t count_fn;
- ceph_object_extent_fn_t copy_fn;
+ return __rbd_img_fill_request(img_req);
+}
+
+struct rbd_bio_iter {
+ const struct bio *first_bio;
+ const struct bio *bio;
+ size_t skip;
+ unsigned int bvix;
};
-static struct ceph_object_extent *alloc_object_extent(void *arg)
+static void rbd_start_bio_iteration(struct rbd_bio_iter *iter, struct bio *bio)
{
- struct rbd_img_request *img_req = arg;
- struct rbd_obj_request *obj_req;
+ iter->bio = bio;
+ iter->bvix = 0;
+ iter->skip = 0;
+}
- obj_req = rbd_obj_request_create();
- if (!obj_req)
- return NULL;
+static void count_bio_bvecs(struct ceph_object_extent *ex, u32 bytes, void *arg)
+{
+ struct rbd_obj_request *obj_req = container_of(ex, struct rbd_obj_request, ex);
+ struct rbd_bio_iter *iter = arg;
+ const struct bio *bio;
+ unsigned int need_bv = obj_req->bvec_count, i = 0;
+ size_t skip;
+
+ /* Count the number of bvecs we need. */
+ skip = iter->skip;
+ bio = iter->bio;
+ while (bio) {
+ for (i = iter->bvix; i < bio->bi_vcnt; i++, skip = 0) {
+ const struct bio_vec *bv = bio->bi_io_vec + i;
+ size_t part = umin(bytes, bv->bv_len - skip);
+
+ if (!part)
+ continue;
- rbd_img_obj_request_add(img_req, obj_req);
- return &obj_req->ex;
+ need_bv++;
+ skip += part;
+ bytes -= part;
+ if (!bytes)
+ goto done;
+ }
+
+ bio = bio->bi_next;
+ iter->bvix = 0;
+ iter->skip = 0;
+ }
+
+done:
+ iter->bio = bio;
+ iter->bvix = i;
+ iter->skip = skip;
+ obj_req->bvec_count += need_bv;
}
-/*
- * While su != os && sc == 1 is technically not fancy (it's the same
- * layout as su == os && sc == 1), we can't use the nocopy path for it
- * because ->set_pos_fn() should be called only once per object.
- * ceph_file_to_extents() invokes action_fn once per stripe unit, so
- * treat su != os && sc == 1 as fancy.
- */
-static bool rbd_layout_is_fancy(struct ceph_file_layout *l)
+static void copy_bio_bvecs(struct ceph_object_extent *ex, u32 bytes, void *arg)
+{
+ struct rbd_obj_request *obj_req = container_of(ex, struct rbd_obj_request, ex);
+ struct rbd_bio_iter *iter = arg;
+ struct ceph_databuf *dbuf = obj_req->dbuf;
+ const struct bio *bio;
+ unsigned int i;
+ size_t skip = iter->skip;
+
+ /* Transcribe the pages to the databuf. */
+ for (bio = iter->bio; bio; bio = bio->bi_next) {
+ for (i = iter->bvix; i < bio->bi_vcnt; i++, skip = 0) {
+ const struct bio_vec *bv = bio->bi_io_vec + i;
+ size_t part = umin(bytes, bv->bv_len - skip);
+
+ if (!part)
+ continue;
+
+ ceph_databuf_append_page(dbuf, bv->bv_page,
+ bv->bv_offset + skip, part);
+ skip += part;
+ bytes -= part;
+ if (!bytes)
+ goto done;
+ }
+
+ iter->bvix = 0;
+ iter->skip = 0;
+ }
+
+done:
+ iter->bio = bio;
+ iter->bvix = i;
+ iter->skip = skip;
+}
+
+static int rbd_img_alloc_databufs(struct rbd_img_request *img_req)
{
- return l->stripe_unit != l->object_size;
+ struct rbd_obj_request *obj_req;
+
+ for_each_obj_request(img_req, obj_req) {
+ if (img_req->op_type == OBJ_OP_READ)
+ obj_req->dbuf = ceph_databuf_reply_alloc(obj_req->bvec_count, 0,
+ GFP_NOIO);
+ else
+ obj_req->dbuf = ceph_databuf_req_alloc(obj_req->bvec_count, 0,
+ GFP_NOIO);
+ if (!obj_req->dbuf)
+ return -ENOMEM;
+ }
+
+ return 0;
}
-static int rbd_img_fill_request_nocopy(struct rbd_img_request *img_req,
- struct ceph_file_extent *img_extents,
- u32 num_img_extents,
- struct rbd_img_fill_ctx *fctx)
+/*
+ * Map an image extent that is backed by a bio chain to a list of object
+ * extents, create the corresponding object requests (normally each to a
+ * different object, but not always) and add them to @img_req. For each object
+ * request, set up its data descriptor to point to a distilled list of page
+ * fragments.
+ *
+ * Because ceph_file_to_extents() will merge adjacent object extents together,
+ * each object request's data descriptor may point to multiple different chunks
+ * of the data buffer.
+ *
+ * The data buffer is assumed to be large enough.
+ */
+static int rbd_img_fill_from_bio(struct rbd_img_request *img_req,
+ u64 off, u64 len, struct bio *bio)
{
- u32 i;
+ struct rbd_bio_iter iter;
+ struct rbd_device *rbd_dev = img_req->rbd_dev;
int ret;
- img_req->data_type = fctx->pos_type;
+ /*
+ * Create object requests and determine ->bvec_count for each object
+ * request. Note that ->bvec_count sum over all object requests may
+ * be greater than the number of bio_vecs in the provided bio (list)
+ * or bio_vec array because when mapped, those bio_vecs can straddle
+ * stripe unit boundaries.
+ */
+ rbd_start_bio_iteration(&iter, bio);
+ ret = ceph_file_to_extents(&rbd_dev->layout, off, len,
+ &img_req->object_extents,
+ alloc_object_extent, img_req,
+ count_bio_bvecs, &iter);
+ if (ret)
+ return ret;
+
+ ret = rbd_img_alloc_databufs(img_req);
+ if (ret)
+ return ret;
/*
- * Create object requests and set each object request's starting
- * position in the provided bio (list) or bio_vec array.
+ * Fill in each object request's databuf, splitting and rearranging the
+ * provided bio_vecs in stripe unit chunks as needed.
*/
- fctx->iter = *fctx->pos;
- for (i = 0; i < num_img_extents; i++) {
- ret = ceph_file_to_extents(&img_req->rbd_dev->layout,
- img_extents[i].fe_off,
- img_extents[i].fe_len,
- &img_req->object_extents,
- alloc_object_extent, img_req,
- fctx->set_pos_fn, &fctx->iter);
- if (ret)
- return ret;
- }
+ rbd_start_bio_iteration(&iter, bio);
+ ret = ceph_iterate_extents(&rbd_dev->layout, off, len,
+ &img_req->object_extents,
+ copy_bio_bvecs, &iter);
+ if (ret)
+ return ret;
return __rbd_img_fill_request(img_req);
}
+static void rbd_count_iter(struct ceph_object_extent *ex, u32 bytes, void *arg)
+{
+ struct rbd_obj_request *obj_req = container_of(ex, struct rbd_obj_request, ex);
+ struct iov_iter *iter = arg;
+
+ obj_req->bvec_count += iov_iter_npages_cap(iter, INT_MAX, bytes);
+ iov_iter_advance(iter, bytes);
+}
+
+static size_t rbd_copy_iter_step(void *iter_base, size_t progress, size_t len,
+ void *priv, void *priv2)
+{
+ struct ceph_databuf *dbuf = priv;
+ struct page *page = virt_to_page(iter_base);
+
+ ceph_databuf_append_page(dbuf, page, (unsigned long)iter_base & ~PAGE_MASK, len);
+ return 0;
+}
+
+static void rbd_copy_iter(struct ceph_object_extent *ex, u32 bytes, void *arg)
+{
+ struct rbd_obj_request *obj_req = container_of(ex, struct rbd_obj_request, ex);
+ struct iov_iter *iter = arg;
+
+ iterate_bvec(iter, bytes, obj_req->dbuf, NULL, rbd_copy_iter_step);
+}
+
/*
- * Map a list of image extents to a list of object extents, create the
- * corresponding object requests (normally each to a different object,
- * but not always) and add them to @img_req. For each object request,
- * set up its data descriptor to point to the corresponding chunk(s) of
- * @fctx->pos data buffer.
+ * Map a list of image extents to a list of object extents, creating the
+ * corresponding object requests (normally each to a different object, but not
+ * always) and add them to @img_req. For each object request, set up its data
+ * descriptor to point to the corresponding chunk(s) of the @dbuf data buffer.
*
* Because ceph_file_to_extents() will merge adjacent object extents
* together, each object request's data descriptor may point to multiple
- * different chunks of @fctx->pos data buffer.
+ * different chunks of the data buffer.
*
- * @fctx->pos data buffer is assumed to be large enough.
+ * The data buffer is assumed to be large enough.
*/
-static int rbd_img_fill_request(struct rbd_img_request *img_req,
- struct ceph_file_extent *img_extents,
- u32 num_img_extents,
- struct rbd_img_fill_ctx *fctx)
+static int rbd_img_fill_from_dbuf(struct rbd_img_request *img_req,
+ const struct ceph_file_extent *img_extents,
+ u32 num_img_extents,
+ const struct ceph_databuf *dbuf)
{
struct rbd_device *rbd_dev = img_req->rbd_dev;
- struct rbd_obj_request *obj_req;
- u32 i;
+ struct iov_iter iter;
+ unsigned int i;
int ret;
- if (fctx->pos_type == OBJ_REQUEST_NODATA ||
- !rbd_layout_is_fancy(&rbd_dev->layout))
- return rbd_img_fill_request_nocopy(img_req, img_extents,
- num_img_extents, fctx);
-
- img_req->data_type = OBJ_REQUEST_OWN_BVECS;
-
/*
* Create object requests and determine ->bvec_count for each object
* request. Note that ->bvec_count sum over all object requests may
@@ -2571,37 +2627,33 @@ static int rbd_img_fill_request(struct rbd_img_request *img_req,
* or bio_vec array because when mapped, those bio_vecs can straddle
* stripe unit boundaries.
*/
- fctx->iter = *fctx->pos;
+ iter = dbuf->iter;
for (i = 0; i < num_img_extents; i++) {
ret = ceph_file_to_extents(&rbd_dev->layout,
img_extents[i].fe_off,
img_extents[i].fe_len,
&img_req->object_extents,
alloc_object_extent, img_req,
- fctx->count_fn, &fctx->iter);
+ rbd_count_iter, &iter);
if (ret)
return ret;
}
- for_each_obj_request(img_req, obj_req) {
- obj_req->bvec_pos.bvecs = kmalloc_array(obj_req->bvec_count,
- sizeof(*obj_req->bvec_pos.bvecs),
- GFP_NOIO);
- if (!obj_req->bvec_pos.bvecs)
- return -ENOMEM;
- }
+ ret = rbd_img_alloc_databufs(img_req);
+ if (ret)
+ return ret;
/*
- * Fill in each object request's private bio_vec array, splitting and
- * rearranging the provided bio_vecs in stripe unit chunks as needed.
+ * Fill in each object request's databuf, splitting and rearranging the
+ * provided bio_vecs in stripe unit chunks as needed.
*/
- fctx->iter = *fctx->pos;
+ iter = dbuf->iter;
for (i = 0; i < num_img_extents; i++) {
ret = ceph_iterate_extents(&rbd_dev->layout,
img_extents[i].fe_off,
img_extents[i].fe_len,
&img_req->object_extents,
- fctx->copy_fn, &fctx->iter);
+ rbd_copy_iter, &iter);
if (ret)
return ret;
}
@@ -2609,148 +2661,6 @@ static int rbd_img_fill_request(struct rbd_img_request *img_req,
return __rbd_img_fill_request(img_req);
}
-static int rbd_img_fill_nodata(struct rbd_img_request *img_req,
- u64 off, u64 len)
-{
- struct ceph_file_extent ex = { off, len };
- union rbd_img_fill_iter dummy = {};
- struct rbd_img_fill_ctx fctx = {
- .pos_type = OBJ_REQUEST_NODATA,
- .pos = &dummy,
- };
-
- return rbd_img_fill_request(img_req, &ex, 1, &fctx);
-}
-
-static void set_bio_pos(struct ceph_object_extent *ex, u32 bytes, void *arg)
-{
- struct rbd_obj_request *obj_req =
- container_of(ex, struct rbd_obj_request, ex);
- struct ceph_bio_iter *it = arg;
-
- dout("%s objno %llu bytes %u\n", __func__, ex->oe_objno, bytes);
- obj_req->bio_pos = *it;
- ceph_bio_iter_advance(it, bytes);
-}
-
-static void count_bio_bvecs(struct ceph_object_extent *ex, u32 bytes, void *arg)
-{
- struct rbd_obj_request *obj_req =
- container_of(ex, struct rbd_obj_request, ex);
- struct ceph_bio_iter *it = arg;
-
- dout("%s objno %llu bytes %u\n", __func__, ex->oe_objno, bytes);
- ceph_bio_iter_advance_step(it, bytes, ({
- obj_req->bvec_count++;
- }));
-
-}
-
-static void copy_bio_bvecs(struct ceph_object_extent *ex, u32 bytes, void *arg)
-{
- struct rbd_obj_request *obj_req =
- container_of(ex, struct rbd_obj_request, ex);
- struct ceph_bio_iter *it = arg;
-
- dout("%s objno %llu bytes %u\n", __func__, ex->oe_objno, bytes);
- ceph_bio_iter_advance_step(it, bytes, ({
- obj_req->bvec_pos.bvecs[obj_req->bvec_idx++] = bv;
- obj_req->bvec_pos.iter.bi_size += bv.bv_len;
- }));
-}
-
-static int __rbd_img_fill_from_bio(struct rbd_img_request *img_req,
- struct ceph_file_extent *img_extents,
- u32 num_img_extents,
- struct ceph_bio_iter *bio_pos)
-{
- struct rbd_img_fill_ctx fctx = {
- .pos_type = OBJ_REQUEST_BIO,
- .pos = (union rbd_img_fill_iter *)bio_pos,
- .set_pos_fn = set_bio_pos,
- .count_fn = count_bio_bvecs,
- .copy_fn = copy_bio_bvecs,
- };
-
- return rbd_img_fill_request(img_req, img_extents, num_img_extents,
- &fctx);
-}
-
-static int rbd_img_fill_from_bio(struct rbd_img_request *img_req,
- u64 off, u64 len, struct bio *bio)
-{
- struct ceph_file_extent ex = { off, len };
- struct ceph_bio_iter it = { .bio = bio, .iter = bio->bi_iter };
-
- return __rbd_img_fill_from_bio(img_req, &ex, 1, &it);
-}
-
-static void set_bvec_pos(struct ceph_object_extent *ex, u32 bytes, void *arg)
-{
- struct rbd_obj_request *obj_req =
- container_of(ex, struct rbd_obj_request, ex);
- struct ceph_bvec_iter *it = arg;
-
- obj_req->bvec_pos = *it;
- ceph_bvec_iter_shorten(&obj_req->bvec_pos, bytes);
- ceph_bvec_iter_advance(it, bytes);
-}
-
-static void count_bvecs(struct ceph_object_extent *ex, u32 bytes, void *arg)
-{
- struct rbd_obj_request *obj_req =
- container_of(ex, struct rbd_obj_request, ex);
- struct ceph_bvec_iter *it = arg;
-
- ceph_bvec_iter_advance_step(it, bytes, ({
- obj_req->bvec_count++;
- }));
-}
-
-static void copy_bvecs(struct ceph_object_extent *ex, u32 bytes, void *arg)
-{
- struct rbd_obj_request *obj_req =
- container_of(ex, struct rbd_obj_request, ex);
- struct ceph_bvec_iter *it = arg;
-
- ceph_bvec_iter_advance_step(it, bytes, ({
- obj_req->bvec_pos.bvecs[obj_req->bvec_idx++] = bv;
- obj_req->bvec_pos.iter.bi_size += bv.bv_len;
- }));
-}
-
-static int __rbd_img_fill_from_bvecs(struct rbd_img_request *img_req,
- struct ceph_file_extent *img_extents,
- u32 num_img_extents,
- struct ceph_bvec_iter *bvec_pos)
-{
- struct rbd_img_fill_ctx fctx = {
- .pos_type = OBJ_REQUEST_BVECS,
- .pos = (union rbd_img_fill_iter *)bvec_pos,
- .set_pos_fn = set_bvec_pos,
- .count_fn = count_bvecs,
- .copy_fn = copy_bvecs,
- };
-
- return rbd_img_fill_request(img_req, img_extents, num_img_extents,
- &fctx);
-}
-
-static int rbd_img_fill_from_bvecs(struct rbd_img_request *img_req,
- struct ceph_file_extent *img_extents,
- u32 num_img_extents,
- struct bio_vec *bvecs)
-{
- struct ceph_bvec_iter it = {
- .bvecs = bvecs,
- .iter = { .bi_size = ceph_file_extents_bytes(img_extents,
- num_img_extents) },
- };
-
- return __rbd_img_fill_from_bvecs(img_req, img_extents, num_img_extents,
- &it);
-}
-
static void rbd_img_handle_request_work(struct work_struct *work)
{
struct rbd_img_request *img_req =
@@ -2791,7 +2701,7 @@ static int rbd_obj_read_object(struct rbd_obj_request *obj_req)
osd_req_op_extent_init(osd_req, 0, CEPH_OSD_OP_READ,
obj_req->ex.oe_off, obj_req->ex.oe_len, 0, 0);
- rbd_osd_setup_data(osd_req, 0);
+ osd_req_op_extent_osd_databuf(osd_req, 0, obj_req->dbuf);
rbd_osd_format_read(osd_req);
ret = ceph_osdc_alloc_messages(osd_req, GFP_NOIO);
@@ -2802,7 +2712,13 @@ static int rbd_obj_read_object(struct rbd_obj_request *obj_req)
return 0;
}
-static int rbd_obj_read_from_parent(struct rbd_obj_request *obj_req)
+/*
+ * Redirect an I/O request to the parent device. Note that by the time we get
+ * here, the page list from the original bio chain has been decanted into a
+ * databuf struct that we can just take slices from.
+ */
+static int rbd_obj_read_from_parent(struct rbd_obj_request *obj_req,
+ struct ceph_databuf *dbuf)
{
struct rbd_img_request *img_req = obj_req->img_request;
struct rbd_device *parent = img_req->rbd_dev->parent;
@@ -2824,30 +2740,10 @@ static int rbd_obj_read_from_parent(struct rbd_obj_request *obj_req)
dout("%s child_img_req %p for obj_req %p\n", __func__, child_img_req,
obj_req);
- if (!rbd_img_is_write(img_req)) {
- switch (img_req->data_type) {
- case OBJ_REQUEST_BIO:
- ret = __rbd_img_fill_from_bio(child_img_req,
- obj_req->img_extents,
- obj_req->num_img_extents,
- &obj_req->bio_pos);
- break;
- case OBJ_REQUEST_BVECS:
- case OBJ_REQUEST_OWN_BVECS:
- ret = __rbd_img_fill_from_bvecs(child_img_req,
- obj_req->img_extents,
- obj_req->num_img_extents,
- &obj_req->bvec_pos);
- break;
- default:
- BUG();
- }
- } else {
- ret = rbd_img_fill_from_bvecs(child_img_req,
- obj_req->img_extents,
- obj_req->num_img_extents,
- obj_req->copyup_bvecs);
- }
+ ret = rbd_img_fill_from_dbuf(child_img_req,
+ obj_req->img_extents,
+ obj_req->num_img_extents,
+ dbuf);
if (ret) {
rbd_img_request_destroy(child_img_req);
return ret;
@@ -2890,7 +2786,8 @@ static bool rbd_obj_advance_read(struct rbd_obj_request *obj_req, int *result)
return true;
}
if (obj_req->num_img_extents) {
- ret = rbd_obj_read_from_parent(obj_req);
+ ret = rbd_obj_read_from_parent(obj_req,
+ obj_req->dbuf);
if (ret) {
*result = ret;
return true;
@@ -3004,23 +2901,6 @@ static int rbd_obj_write_object(struct rbd_obj_request *obj_req)
return 0;
}
-/*
- * copyup_bvecs pages are never highmem pages
- */
-static bool is_zero_bvecs(struct bio_vec *bvecs, u32 bytes)
-{
- struct ceph_bvec_iter it = {
- .bvecs = bvecs,
- .iter = { .bi_size = bytes },
- };
-
- ceph_bvec_iter_advance_step(&it, bytes, ({
- if (memchr_inv(bvec_virt(&bv), 0, bv.bv_len))
- return false;
- }));
- return true;
-}
-
#define MODS_ONLY U32_MAX
static int rbd_obj_copyup_empty_snapc(struct rbd_obj_request *obj_req,
@@ -3084,30 +2964,18 @@ static int rbd_obj_copyup_current_snapc(struct rbd_obj_request *obj_req,
return 0;
}
-static int setup_copyup_bvecs(struct rbd_obj_request *obj_req, u64 obj_overlap)
+static int setup_copyup_buf(struct rbd_obj_request *obj_req, u64 obj_overlap)
{
- u32 i;
-
- rbd_assert(!obj_req->copyup_bvecs);
- obj_req->copyup_bvec_count = calc_pages_for(0, obj_overlap);
- obj_req->copyup_bvecs = kcalloc(obj_req->copyup_bvec_count,
- sizeof(*obj_req->copyup_bvecs),
- GFP_NOIO);
- if (!obj_req->copyup_bvecs)
- return -ENOMEM;
-
- for (i = 0; i < obj_req->copyup_bvec_count; i++) {
- unsigned int len = min(obj_overlap, (u64)PAGE_SIZE);
- struct page *page = alloc_page(GFP_NOIO);
+ struct ceph_databuf *dbuf;
- if (!page)
- return -ENOMEM;
+ rbd_assert(!obj_req->copyup_buf);
- bvec_set_page(&obj_req->copyup_bvecs[i], page, len, 0);
- obj_overlap -= len;
- }
+ dbuf = ceph_databuf_req_alloc(calc_pages_for(0, obj_overlap),
+ obj_overlap, GFP_NOIO);
+ if (!dbuf)
+ return -ENOMEM;
- rbd_assert(!obj_overlap);
+ obj_req->copyup_buf = dbuf;
return 0;
}
@@ -3134,11 +3002,11 @@ static int rbd_obj_copyup_read_parent(struct rbd_obj_request *obj_req)
return rbd_obj_copyup_current_snapc(obj_req, MODS_ONLY);
}
- ret = setup_copyup_bvecs(obj_req, rbd_obj_img_extents_bytes(obj_req));
+ ret = setup_copyup_buf(obj_req, rbd_obj_img_extents_bytes(obj_req));
if (ret)
return ret;
- return rbd_obj_read_from_parent(obj_req);
+ return rbd_obj_read_from_parent(obj_req, obj_req->copyup_buf);
}
static void rbd_obj_copyup_object_maps(struct rbd_obj_request *obj_req)
@@ -3241,8 +3109,8 @@ static bool rbd_obj_advance_copyup(struct rbd_obj_request *obj_req, int *result)
if (*result)
return true;
- if (is_zero_bvecs(obj_req->copyup_bvecs,
- rbd_obj_img_extents_bytes(obj_req))) {
+ if (ceph_databuf_is_all_zero(obj_req->copyup_buf,
+ rbd_obj_img_extents_bytes(obj_req))) {
dout("%s %p detected zeros\n", __func__, obj_req);
obj_req->flags |= RBD_OBJ_FLAG_COPYUP_ZEROS;
}
diff --git a/include/linux/ceph/databuf.h b/include/linux/ceph/databuf.h
index 14c7a6449467..54b76d0c91a0 100644
--- a/include/linux/ceph/databuf.h
+++ b/include/linux/ceph/databuf.h
@@ -5,6 +5,7 @@
#include <asm/byteorder.h>
#include <linux/refcount.h>
#include <linux/blk_types.h>
+#include <linux/iov_iter.h>
struct ceph_databuf {
struct bio_vec *bvec; /* List of pages */
@@ -128,4 +129,25 @@ static inline void ceph_databuf_enc_stop(struct ceph_databuf *dbuf, void *p)
BUG_ON(dbuf->iter.count > dbuf->limit);
}
+static __always_inline
+size_t ceph_databuf_scan_for_nonzero(void *iter_from, size_t progress,
+ size_t len, void *priv, void *priv2)
+{
+ void *p;
+
+ p = memchr_inv(iter_from, 0, len);
+ return p ? len - (p - iter_from) : 0;
+}
+
+/*
+ * Scan a buffer to see if it contains only zeros.
+ */
+static inline bool ceph_databuf_is_all_zero(struct ceph_databuf *dbuf, size_t count)
+{
+ struct iov_iter iter_copy = dbuf->iter;
+
+ return iterate_bvec(&iter_copy, count, NULL, NULL,
+ ceph_databuf_scan_for_nonzero) == count;
+}
+
#endif /* __FS_CEPH_DATABUF_H */
diff --git a/include/linux/ceph/striper.h b/include/linux/ceph/striper.h
index 3486636c0e6e..50bc1b88c5c4 100644
--- a/include/linux/ceph/striper.h
+++ b/include/linux/ceph/striper.h
@@ -4,6 +4,7 @@
#include <linux/list.h>
#include <linux/types.h>
+#include <linux/bug.h>
struct ceph_file_layout;
@@ -39,10 +40,6 @@ int ceph_file_to_extents(struct ceph_file_layout *l, u64 off, u64 len,
void *alloc_arg,
ceph_object_extent_fn_t action_fn,
void *action_arg);
-int ceph_iterate_extents(struct ceph_file_layout *l, u64 off, u64 len,
- struct list_head *object_extents,
- ceph_object_extent_fn_t action_fn,
- void *action_arg);
struct ceph_file_extent {
u64 fe_off;
@@ -68,4 +65,57 @@ int ceph_extent_to_file(struct ceph_file_layout *l,
u64 ceph_get_num_objects(struct ceph_file_layout *l, u64 size);
+static __always_inline
+struct ceph_object_extent *ceph_lookup_containing(struct list_head *object_extents,
+ u64 objno, u64 objoff, u32 xlen)
+{
+ struct ceph_object_extent *ex;
+
+ list_for_each_entry(ex, object_extents, oe_item) {
+ if (ex->oe_objno == objno &&
+ ex->oe_off <= objoff &&
+ ex->oe_off + ex->oe_len >= objoff + xlen) /* paranoia */
+ return ex;
+
+ if (ex->oe_objno > objno)
+ break;
+ }
+
+ return NULL;
+}
+
+/*
+ * A stripped down, non-allocating version of ceph_file_to_extents(),
+ * for when @object_extents is already populated.
+ */
+static __always_inline
+int ceph_iterate_extents(struct ceph_file_layout *l, u64 off, u64 len,
+ struct list_head *object_extents,
+ ceph_object_extent_fn_t action_fn,
+ void *action_arg)
+{
+ while (len) {
+ struct ceph_object_extent *ex;
+ u64 objno, objoff;
+ u32 xlen;
+
+ ceph_calc_file_object_mapping(l, off, len, &objno, &objoff,
+ &xlen);
+
+ ex = ceph_lookup_containing(object_extents, objno, objoff, xlen);
+ if (!ex) {
+ WARN(1, "%s: objno %llu %llu~%u not found!\n",
+ __func__, objno, objoff, xlen);
+ return -EINVAL;
+ }
+
+ action_fn(ex, xlen, action_arg);
+
+ off += xlen;
+ len -= xlen;
+ }
+
+ return 0;
+}
+
#endif
diff --git a/net/ceph/striper.c b/net/ceph/striper.c
index 3b3fa75d1189..3dedbf018fa6 100644
--- a/net/ceph/striper.c
+++ b/net/ceph/striper.c
@@ -70,25 +70,6 @@ lookup_last(struct list_head *object_extents, u64 objno,
return NULL;
}
-static struct ceph_object_extent *
-lookup_containing(struct list_head *object_extents, u64 objno,
- u64 objoff, u32 xlen)
-{
- struct ceph_object_extent *ex;
-
- list_for_each_entry(ex, object_extents, oe_item) {
- if (ex->oe_objno == objno &&
- ex->oe_off <= objoff &&
- ex->oe_off + ex->oe_len >= objoff + xlen) /* paranoia */
- return ex;
-
- if (ex->oe_objno > objno)
- break;
- }
-
- return NULL;
-}
-
/*
* Map a file extent to a sorted list of object extents.
*
@@ -167,40 +148,6 @@ int ceph_file_to_extents(struct ceph_file_layout *l, u64 off, u64 len,
}
EXPORT_SYMBOL(ceph_file_to_extents);
-/*
- * A stripped down, non-allocating version of ceph_file_to_extents(),
- * for when @object_extents is already populated.
- */
-int ceph_iterate_extents(struct ceph_file_layout *l, u64 off, u64 len,
- struct list_head *object_extents,
- ceph_object_extent_fn_t action_fn,
- void *action_arg)
-{
- while (len) {
- struct ceph_object_extent *ex;
- u64 objno, objoff;
- u32 xlen;
-
- ceph_calc_file_object_mapping(l, off, len, &objno, &objoff,
- &xlen);
-
- ex = lookup_containing(object_extents, objno, objoff, xlen);
- if (!ex) {
- WARN(1, "%s: objno %llu %llu~%u not found!\n",
- __func__, objno, objoff, xlen);
- return -EINVAL;
- }
-
- action_fn(ex, xlen, action_arg);
-
- off += xlen;
- len -= xlen;
- }
-
- return 0;
-}
-EXPORT_SYMBOL(ceph_iterate_extents);
-
/*
* Reverse map an object extent to a sorted list of file extents.
*
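A note on the step-function convention used by ceph_databuf_is_all_zero()
above: iterate_bvec() from <linux/iov_iter.h> hands each mapped segment to
the step in turn, and the step returns the number of bytes it did *not*
consume, so a non-zero return terminates the walk early and the total
returned by iterate_bvec() falls short of the requested count. As a purely
illustrative sketch (this helper is not part of the series), another step
in the same style might look like:

	/* Count occurrences of @byte across a databuf; illustration only. */
	static size_t count_byte_step(void *iter_base, size_t progress,
				      size_t len, void *priv, void *priv2)
	{
		size_t *count = priv;
		const u8 *p = iter_base;
		u8 byte = *(const u8 *)priv2;
		size_t i;

		for (i = 0; i < len; i++)
			if (p[i] == byte)
				(*count)++;
		return 0;	/* segment fully consumed; keep walking */
	}

	static size_t ceph_databuf_count_byte(struct ceph_databuf *dbuf,
					      u8 byte, size_t limit)
	{
		struct iov_iter iter_copy = dbuf->iter;
		size_t count = 0;

		iterate_bvec(&iter_copy, limit, &count, &byte,
			     count_byte_step);
		return count;
	}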
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [RFC PATCH 14/35] libceph: Remove bvec and bio data container types
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (12 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 13/35] rbd: Switch from using bvec_iter to iov_iter David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 15/35] libceph: Make osd_req_op_cls_init() use a ceph_databuf and map it David Howells
` (20 subsequent siblings)
34 siblings, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
The CEPH_MSG_DATA_BIO and CEPH_MSG_DATA_BVECS data types are now unused,
so remove them.
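With these gone, a caller that has a bio_vec array in hand is expected to
wrap it in an iov_iter instead. Roughly (a sketch using the standard
iov_iter_bvec() helper; not code from this patch):

	struct iov_iter iter;

	/* bvecs/nr_bvecs/bytes describe the data to attach to @msg. */
	iov_iter_bvec(&iter, ITER_SOURCE, bvecs, nr_bvecs, bytes);
	ceph_msg_data_add_iter(msg, &iter);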
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
include/linux/ceph/messenger.h | 103 --------------------
include/linux/ceph/osd_client.h | 31 ------
net/ceph/messenger.c | 166 --------------------------------
net/ceph/osd_client.c | 94 ------------------
4 files changed, 394 deletions(-)
diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 1b646d0dff39..ff0aea6d2d31 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -120,108 +120,15 @@ enum ceph_msg_data_type {
CEPH_MSG_DATA_DATABUF, /* data source/destination is a data buffer */
CEPH_MSG_DATA_PAGES, /* data source/destination is a page array */
CEPH_MSG_DATA_PAGELIST, /* data source/destination is a pagelist */
-#ifdef CONFIG_BLOCK
- CEPH_MSG_DATA_BIO, /* data source/destination is a bio list */
-#endif /* CONFIG_BLOCK */
- CEPH_MSG_DATA_BVECS, /* data source/destination is a bio_vec array */
CEPH_MSG_DATA_ITER, /* data source/destination is an iov_iter */
};
-#ifdef CONFIG_BLOCK
-
-struct ceph_bio_iter {
- struct bio *bio;
- struct bvec_iter iter;
-};
-
-#define __ceph_bio_iter_advance_step(it, n, STEP) do { \
- unsigned int __n = (n), __cur_n; \
- \
- while (__n) { \
- BUG_ON(!(it)->iter.bi_size); \
- __cur_n = min((it)->iter.bi_size, __n); \
- (void)(STEP); \
- bio_advance_iter((it)->bio, &(it)->iter, __cur_n); \
- if (!(it)->iter.bi_size && (it)->bio->bi_next) { \
- dout("__ceph_bio_iter_advance_step next bio\n"); \
- (it)->bio = (it)->bio->bi_next; \
- (it)->iter = (it)->bio->bi_iter; \
- } \
- __n -= __cur_n; \
- } \
-} while (0)
-
-/*
- * Advance @it by @n bytes.
- */
-#define ceph_bio_iter_advance(it, n) \
- __ceph_bio_iter_advance_step(it, n, 0)
-
-/*
- * Advance @it by @n bytes, executing BVEC_STEP for each bio_vec.
- */
-#define ceph_bio_iter_advance_step(it, n, BVEC_STEP) \
- __ceph_bio_iter_advance_step(it, n, ({ \
- struct bio_vec bv; \
- struct bvec_iter __cur_iter; \
- \
- __cur_iter = (it)->iter; \
- __cur_iter.bi_size = __cur_n; \
- __bio_for_each_segment(bv, (it)->bio, __cur_iter, __cur_iter) \
- (void)(BVEC_STEP); \
- }))
-
-#endif /* CONFIG_BLOCK */
-
-struct ceph_bvec_iter {
- struct bio_vec *bvecs;
- struct bvec_iter iter;
-};
-
-#define __ceph_bvec_iter_advance_step(it, n, STEP) do { \
- BUG_ON((n) > (it)->iter.bi_size); \
- (void)(STEP); \
- bvec_iter_advance((it)->bvecs, &(it)->iter, (n)); \
-} while (0)
-
-/*
- * Advance @it by @n bytes.
- */
-#define ceph_bvec_iter_advance(it, n) \
- __ceph_bvec_iter_advance_step(it, n, 0)
-
-/*
- * Advance @it by @n bytes, executing BVEC_STEP for each bio_vec.
- */
-#define ceph_bvec_iter_advance_step(it, n, BVEC_STEP) \
- __ceph_bvec_iter_advance_step(it, n, ({ \
- struct bio_vec bv; \
- struct bvec_iter __cur_iter; \
- \
- __cur_iter = (it)->iter; \
- __cur_iter.bi_size = (n); \
- for_each_bvec(bv, (it)->bvecs, __cur_iter, __cur_iter) \
- (void)(BVEC_STEP); \
- }))
-
-#define ceph_bvec_iter_shorten(it, n) do { \
- BUG_ON((n) > (it)->iter.bi_size); \
- (it)->iter.bi_size = (n); \
-} while (0)
-
struct ceph_msg_data {
enum ceph_msg_data_type type;
struct iov_iter iter;
bool release_dbuf;
union {
struct ceph_databuf *dbuf;
-#ifdef CONFIG_BLOCK
- struct {
- struct ceph_bio_iter bio_pos;
- u32 bio_length;
- };
-#endif /* CONFIG_BLOCK */
- struct ceph_bvec_iter bvec_pos;
struct {
struct page **pages;
size_t length; /* total # bytes */
@@ -240,10 +147,6 @@ struct ceph_msg_data_cursor {
int sr_resid; /* residual sparse_read len */
bool need_crc; /* crc update needed */
union {
-#ifdef CONFIG_BLOCK
- struct ceph_bio_iter bio_iter;
-#endif /* CONFIG_BLOCK */
- struct bvec_iter bvec_iter;
struct { /* pages */
unsigned int page_offset; /* offset in page */
unsigned short page_index; /* index in array */
@@ -610,12 +513,6 @@ void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
size_t length, size_t offset, bool own_pages);
extern void ceph_msg_data_add_pagelist(struct ceph_msg *msg,
struct ceph_pagelist *pagelist);
-#ifdef CONFIG_BLOCK
-void ceph_msg_data_add_bio(struct ceph_msg *msg, struct ceph_bio_iter *bio_pos,
- u32 length);
-#endif /* CONFIG_BLOCK */
-void ceph_msg_data_add_bvecs(struct ceph_msg *msg,
- struct ceph_bvec_iter *bvec_pos);
void ceph_msg_data_add_iter(struct ceph_msg *msg,
struct iov_iter *iter);
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 5ac4c0b4dfcd..9182aa5075b2 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -107,10 +107,6 @@ enum ceph_osd_data_type {
CEPH_OSD_DATA_TYPE_DATABUF,
CEPH_OSD_DATA_TYPE_PAGES,
CEPH_OSD_DATA_TYPE_PAGELIST,
-#ifdef CONFIG_BLOCK
- CEPH_OSD_DATA_TYPE_BIO,
-#endif /* CONFIG_BLOCK */
- CEPH_OSD_DATA_TYPE_BVECS,
CEPH_OSD_DATA_TYPE_ITER,
};
@@ -127,16 +123,6 @@ struct ceph_osd_data {
bool own_pages;
};
struct ceph_pagelist *pagelist;
-#ifdef CONFIG_BLOCK
- struct {
- struct ceph_bio_iter bio_pos;
- u32 bio_length;
- };
-#endif /* CONFIG_BLOCK */
- struct {
- struct ceph_bvec_iter bvec_pos;
- u32 num_bvecs;
- };
};
};
@@ -499,19 +485,6 @@ extern void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *,
extern void osd_req_op_extent_osd_data_pagelist(struct ceph_osd_request *,
unsigned int which,
struct ceph_pagelist *pagelist);
-#ifdef CONFIG_BLOCK
-void osd_req_op_extent_osd_data_bio(struct ceph_osd_request *osd_req,
- unsigned int which,
- struct ceph_bio_iter *bio_pos,
- u32 bio_length);
-#endif /* CONFIG_BLOCK */
-void osd_req_op_extent_osd_data_bvecs(struct ceph_osd_request *osd_req,
- unsigned int which,
- struct bio_vec *bvecs, u32 num_bvecs,
- u32 bytes);
-void osd_req_op_extent_osd_data_bvec_pos(struct ceph_osd_request *osd_req,
- unsigned int which,
- struct ceph_bvec_iter *bvec_pos);
void osd_req_op_extent_osd_iter(struct ceph_osd_request *osd_req,
unsigned int which, struct iov_iter *iter);
@@ -521,10 +494,6 @@ void osd_req_op_cls_request_databuf(struct ceph_osd_request *req,
extern void osd_req_op_cls_request_data_pagelist(struct ceph_osd_request *,
unsigned int which,
struct ceph_pagelist *pagelist);
-void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req,
- unsigned int which,
- struct bio_vec *bvecs, u32 num_bvecs,
- u32 bytes);
void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req,
unsigned int which,
struct ceph_databuf *dbuf);
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index dc8082575e4f..cb66a768bd7c 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -12,9 +12,6 @@
#include <linux/slab.h>
#include <linux/socket.h>
#include <linux/string.h>
-#ifdef CONFIG_BLOCK
-#include <linux/bio.h>
-#endif /* CONFIG_BLOCK */
#include <linux/dns_resolver.h>
#include <net/tcp.h>
#include <trace/events/sock.h>
@@ -714,116 +711,6 @@ void ceph_con_discard_requeued(struct ceph_connection *con, u64 reconnect_seq)
}
}
-#ifdef CONFIG_BLOCK
-
-/*
- * For a bio data item, a piece is whatever remains of the next
- * entry in the current bio iovec, or the first entry in the next
- * bio in the list.
- */
-static void ceph_msg_data_bio_cursor_init(struct ceph_msg_data_cursor *cursor,
- size_t length)
-{
- struct ceph_msg_data *data = cursor->data;
- struct ceph_bio_iter *it = &cursor->bio_iter;
-
- cursor->resid = min_t(size_t, length, data->bio_length);
- *it = data->bio_pos;
- if (cursor->resid < it->iter.bi_size)
- it->iter.bi_size = cursor->resid;
-
- BUG_ON(cursor->resid < bio_iter_len(it->bio, it->iter));
-}
-
-static struct page *ceph_msg_data_bio_next(struct ceph_msg_data_cursor *cursor,
- size_t *page_offset,
- size_t *length)
-{
- struct bio_vec bv = bio_iter_iovec(cursor->bio_iter.bio,
- cursor->bio_iter.iter);
-
- *page_offset = bv.bv_offset;
- *length = bv.bv_len;
- return bv.bv_page;
-}
-
-static bool ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,
- size_t bytes)
-{
- struct ceph_bio_iter *it = &cursor->bio_iter;
- struct page *page = bio_iter_page(it->bio, it->iter);
-
- BUG_ON(bytes > cursor->resid);
- BUG_ON(bytes > bio_iter_len(it->bio, it->iter));
- cursor->resid -= bytes;
- bio_advance_iter(it->bio, &it->iter, bytes);
-
- if (!cursor->resid)
- return false; /* no more data */
-
- if (!bytes || (it->iter.bi_size && it->iter.bi_bvec_done &&
- page == bio_iter_page(it->bio, it->iter)))
- return false; /* more bytes to process in this segment */
-
- if (!it->iter.bi_size) {
- it->bio = it->bio->bi_next;
- it->iter = it->bio->bi_iter;
- if (cursor->resid < it->iter.bi_size)
- it->iter.bi_size = cursor->resid;
- }
-
- BUG_ON(cursor->resid < bio_iter_len(it->bio, it->iter));
- return true;
-}
-#endif /* CONFIG_BLOCK */
-
-static void ceph_msg_data_bvecs_cursor_init(struct ceph_msg_data_cursor *cursor,
- size_t length)
-{
- struct ceph_msg_data *data = cursor->data;
- struct bio_vec *bvecs = data->bvec_pos.bvecs;
-
- cursor->resid = min_t(size_t, length, data->bvec_pos.iter.bi_size);
- cursor->bvec_iter = data->bvec_pos.iter;
- cursor->bvec_iter.bi_size = cursor->resid;
-
- BUG_ON(cursor->resid < bvec_iter_len(bvecs, cursor->bvec_iter));
-}
-
-static struct page *ceph_msg_data_bvecs_next(struct ceph_msg_data_cursor *cursor,
- size_t *page_offset,
- size_t *length)
-{
- struct bio_vec bv = bvec_iter_bvec(cursor->data->bvec_pos.bvecs,
- cursor->bvec_iter);
-
- *page_offset = bv.bv_offset;
- *length = bv.bv_len;
- return bv.bv_page;
-}
-
-static bool ceph_msg_data_bvecs_advance(struct ceph_msg_data_cursor *cursor,
- size_t bytes)
-{
- struct bio_vec *bvecs = cursor->data->bvec_pos.bvecs;
- struct page *page = bvec_iter_page(bvecs, cursor->bvec_iter);
-
- BUG_ON(bytes > cursor->resid);
- BUG_ON(bytes > bvec_iter_len(bvecs, cursor->bvec_iter));
- cursor->resid -= bytes;
- bvec_iter_advance(bvecs, &cursor->bvec_iter, bytes);
-
- if (!cursor->resid)
- return false; /* no more data */
-
- if (!bytes || (cursor->bvec_iter.bi_bvec_done &&
- page == bvec_iter_page(bvecs, cursor->bvec_iter)))
- return false; /* more bytes to process in this segment */
-
- BUG_ON(cursor->resid < bvec_iter_len(bvecs, cursor->bvec_iter));
- return true;
-}
-
/*
* For a page array, a piece comes from the first page in the array
* that has not already been fully consumed.
@@ -1045,14 +932,6 @@ static void __ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor)
case CEPH_MSG_DATA_PAGES:
ceph_msg_data_pages_cursor_init(cursor, length);
break;
-#ifdef CONFIG_BLOCK
- case CEPH_MSG_DATA_BIO:
- ceph_msg_data_bio_cursor_init(cursor, length);
- break;
-#endif /* CONFIG_BLOCK */
- case CEPH_MSG_DATA_BVECS:
- ceph_msg_data_bvecs_cursor_init(cursor, length);
- break;
case CEPH_MSG_DATA_DATABUF:
case CEPH_MSG_DATA_ITER:
ceph_msg_data_iter_cursor_init(cursor, length);
@@ -1096,14 +975,6 @@ struct page *ceph_msg_data_next(struct ceph_msg_data_cursor *cursor,
case CEPH_MSG_DATA_PAGES:
page = ceph_msg_data_pages_next(cursor, page_offset, length);
break;
-#ifdef CONFIG_BLOCK
- case CEPH_MSG_DATA_BIO:
- page = ceph_msg_data_bio_next(cursor, page_offset, length);
- break;
-#endif /* CONFIG_BLOCK */
- case CEPH_MSG_DATA_BVECS:
- page = ceph_msg_data_bvecs_next(cursor, page_offset, length);
- break;
case CEPH_MSG_DATA_DATABUF:
case CEPH_MSG_DATA_ITER:
page = ceph_msg_data_iter_next(cursor, page_offset, length);
@@ -1138,14 +1009,6 @@ void ceph_msg_data_advance(struct ceph_msg_data_cursor *cursor, size_t bytes)
case CEPH_MSG_DATA_PAGES:
new_piece = ceph_msg_data_pages_advance(cursor, bytes);
break;
-#ifdef CONFIG_BLOCK
- case CEPH_MSG_DATA_BIO:
- new_piece = ceph_msg_data_bio_advance(cursor, bytes);
- break;
-#endif /* CONFIG_BLOCK */
- case CEPH_MSG_DATA_BVECS:
- new_piece = ceph_msg_data_bvecs_advance(cursor, bytes);
- break;
case CEPH_MSG_DATA_DATABUF:
case CEPH_MSG_DATA_ITER:
new_piece = ceph_msg_data_iter_advance(cursor, bytes);
@@ -1938,35 +1801,6 @@ void ceph_msg_data_add_pagelist(struct ceph_msg *msg,
}
EXPORT_SYMBOL(ceph_msg_data_add_pagelist);
-#ifdef CONFIG_BLOCK
-void ceph_msg_data_add_bio(struct ceph_msg *msg, struct ceph_bio_iter *bio_pos,
- u32 length)
-{
- struct ceph_msg_data *data;
-
- data = ceph_msg_data_add(msg);
- data->type = CEPH_MSG_DATA_BIO;
- data->bio_pos = *bio_pos;
- data->bio_length = length;
-
- msg->data_length += length;
-}
-EXPORT_SYMBOL(ceph_msg_data_add_bio);
-#endif /* CONFIG_BLOCK */
-
-void ceph_msg_data_add_bvecs(struct ceph_msg *msg,
- struct ceph_bvec_iter *bvec_pos)
-{
- struct ceph_msg_data *data;
-
- data = ceph_msg_data_add(msg);
- data->type = CEPH_MSG_DATA_BVECS;
- data->bvec_pos = *bvec_pos;
-
- msg->data_length += bvec_pos->iter.bi_size;
-}
-EXPORT_SYMBOL(ceph_msg_data_add_bvecs);
-
void ceph_msg_data_add_iter(struct ceph_msg *msg,
struct iov_iter *iter)
{
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index fc5c136e793d..10f65e9b1906 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -9,9 +9,6 @@
#include <linux/pagemap.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
-#ifdef CONFIG_BLOCK
-#include <linux/bio.h>
-#endif
#include <linux/ceph/ceph_features.h>
#include <linux/ceph/libceph.h>
@@ -151,26 +148,6 @@ static void ceph_osd_data_pagelist_init(struct ceph_osd_data *osd_data,
osd_data->pagelist = pagelist;
}
-#ifdef CONFIG_BLOCK
-static void ceph_osd_data_bio_init(struct ceph_osd_data *osd_data,
- struct ceph_bio_iter *bio_pos,
- u32 bio_length)
-{
- osd_data->type = CEPH_OSD_DATA_TYPE_BIO;
- osd_data->bio_pos = *bio_pos;
- osd_data->bio_length = bio_length;
-}
-#endif /* CONFIG_BLOCK */
-
-static void ceph_osd_data_bvecs_init(struct ceph_osd_data *osd_data,
- struct ceph_bvec_iter *bvec_pos,
- u32 num_bvecs)
-{
- osd_data->type = CEPH_OSD_DATA_TYPE_BVECS;
- osd_data->bvec_pos = *bvec_pos;
- osd_data->num_bvecs = num_bvecs;
-}
-
static void ceph_osd_iter_init(struct ceph_osd_data *osd_data,
struct iov_iter *iter)
{
@@ -252,47 +229,6 @@ void osd_req_op_extent_osd_data_pagelist(struct ceph_osd_request *osd_req,
}
EXPORT_SYMBOL(osd_req_op_extent_osd_data_pagelist);
-#ifdef CONFIG_BLOCK
-void osd_req_op_extent_osd_data_bio(struct ceph_osd_request *osd_req,
- unsigned int which,
- struct ceph_bio_iter *bio_pos,
- u32 bio_length)
-{
- struct ceph_osd_data *osd_data;
-
- osd_data = osd_req_op_data(osd_req, which, extent, osd_data);
- ceph_osd_data_bio_init(osd_data, bio_pos, bio_length);
-}
-EXPORT_SYMBOL(osd_req_op_extent_osd_data_bio);
-#endif /* CONFIG_BLOCK */
-
-void osd_req_op_extent_osd_data_bvecs(struct ceph_osd_request *osd_req,
- unsigned int which,
- struct bio_vec *bvecs, u32 num_bvecs,
- u32 bytes)
-{
- struct ceph_osd_data *osd_data;
- struct ceph_bvec_iter it = {
- .bvecs = bvecs,
- .iter = { .bi_size = bytes },
- };
-
- osd_data = osd_req_op_data(osd_req, which, extent, osd_data);
- ceph_osd_data_bvecs_init(osd_data, &it, num_bvecs);
-}
-EXPORT_SYMBOL(osd_req_op_extent_osd_data_bvecs);
-
-void osd_req_op_extent_osd_data_bvec_pos(struct ceph_osd_request *osd_req,
- unsigned int which,
- struct ceph_bvec_iter *bvec_pos)
-{
- struct ceph_osd_data *osd_data;
-
- osd_data = osd_req_op_data(osd_req, which, extent, osd_data);
- ceph_osd_data_bvecs_init(osd_data, bvec_pos, 0);
-}
-EXPORT_SYMBOL(osd_req_op_extent_osd_data_bvec_pos);
-
/**
* osd_req_op_extent_osd_iter - Set up an operation with an iterator buffer
* @osd_req: The request to set up
@@ -360,24 +296,6 @@ static void osd_req_op_cls_request_data_pages(struct ceph_osd_request *osd_req,
osd_req->r_ops[which].indata_len += length;
}
-void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req,
- unsigned int which,
- struct bio_vec *bvecs, u32 num_bvecs,
- u32 bytes)
-{
- struct ceph_osd_data *osd_data;
- struct ceph_bvec_iter it = {
- .bvecs = bvecs,
- .iter = { .bi_size = bytes },
- };
-
- osd_data = osd_req_op_data(osd_req, which, cls, request_data);
- ceph_osd_data_bvecs_init(osd_data, &it, num_bvecs);
- osd_req->r_ops[which].cls.indata_len += bytes;
- osd_req->r_ops[which].indata_len += bytes;
-}
-EXPORT_SYMBOL(osd_req_op_cls_request_data_bvecs);
-
void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req,
unsigned int which,
struct ceph_databuf *dbuf)
@@ -402,12 +320,6 @@ static u64 ceph_osd_data_length(struct ceph_osd_data *osd_data)
return osd_data->length;
case CEPH_OSD_DATA_TYPE_PAGELIST:
return (u64)osd_data->pagelist->length;
-#ifdef CONFIG_BLOCK
- case CEPH_OSD_DATA_TYPE_BIO:
- return (u64)osd_data->bio_length;
-#endif /* CONFIG_BLOCK */
- case CEPH_OSD_DATA_TYPE_BVECS:
- return osd_data->bvec_pos.iter.bi_size;
case CEPH_OSD_DATA_TYPE_ITER:
return iov_iter_count(&osd_data->iter);
default:
@@ -1017,12 +929,6 @@ static void ceph_osdc_msg_data_add(struct ceph_msg *msg,
} else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGELIST) {
BUG_ON(!length);
ceph_msg_data_add_pagelist(msg, osd_data->pagelist);
-#ifdef CONFIG_BLOCK
- } else if (osd_data->type == CEPH_OSD_DATA_TYPE_BIO) {
- ceph_msg_data_add_bio(msg, &osd_data->bio_pos, length);
-#endif
- } else if (osd_data->type == CEPH_OSD_DATA_TYPE_BVECS) {
- ceph_msg_data_add_bvecs(msg, &osd_data->bvec_pos);
} else if (osd_data->type == CEPH_OSD_DATA_TYPE_ITER) {
ceph_msg_data_add_iter(msg, &osd_data->iter);
} else {
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [RFC PATCH 15/35] libceph: Make osd_req_op_cls_init() use a ceph_databuf and map it
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (13 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 14/35] libceph: Remove bvec and bio data container types David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 16/35] libceph: Convert req_page of ceph_osdc_call() to ceph_databuf David Howells
` (19 subsequent siblings)
34 siblings, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Make osd_req_op_cls_init() use a ceph_databuf to hold the request_info
data, mapping the buffer and writing the class and method names directly
into it rather than appending them through a ceph_pagelist.
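The resulting pattern is the bracketed map-and-write sequence sketched
below (condensed from the change to osd_req_op_cls_init() that follows;
the semantics of the ceph_databuf_enc_*() helpers are assumed from the
earlier databuf patches in this series):

	dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOFS);
	if (!dbuf)
		return -ENOMEM;

	p = ceph_databuf_enc_start(dbuf);	/* map, get write pointer */
	ceph_encode_copy(&p, class, csize);	/* ordinary ceph_encode_*() */
	ceph_encode_copy(&p, method, msize);
	ceph_databuf_enc_stop(dbuf, p);		/* unmap, record the length */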
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
net/ceph/osd_client.c | 55 +++++++++++++++++--------------------------
1 file changed, 22 insertions(+), 33 deletions(-)
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 10f65e9b1906..405ccf7e7a91 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -245,14 +245,14 @@ void osd_req_op_extent_osd_iter(struct ceph_osd_request *osd_req,
}
EXPORT_SYMBOL(osd_req_op_extent_osd_iter);
-static void osd_req_op_cls_request_info_pagelist(
- struct ceph_osd_request *osd_req,
- unsigned int which, struct ceph_pagelist *pagelist)
+static void osd_req_op_cls_request_info_databuf(struct ceph_osd_request *osd_req,
+ unsigned int which,
+ struct ceph_databuf *dbuf)
{
struct ceph_osd_data *osd_data;
osd_data = osd_req_op_data(osd_req, which, cls, request_info);
- ceph_osd_data_pagelist_init(osd_data, pagelist);
+ ceph_osd_databuf_init(osd_data, dbuf);
}
void osd_req_op_cls_request_databuf(struct ceph_osd_request *osd_req,
@@ -778,42 +778,31 @@ int osd_req_op_cls_init(struct ceph_osd_request *osd_req, unsigned int which,
const char *class, const char *method)
{
struct ceph_osd_req_op *op;
- struct ceph_pagelist *pagelist;
- size_t payload_len = 0;
- size_t size;
- int ret;
+ struct ceph_databuf *dbuf;
+ size_t csize = strlen(class), msize = strlen(method);
+ void *p;
+
+ BUG_ON(csize > (size_t) U8_MAX);
+ BUG_ON(msize > (size_t) U8_MAX);
op = osd_req_op_init(osd_req, which, CEPH_OSD_OP_CALL, 0);
+ op->cls.class_name = class;
+ op->cls.class_len = csize;
+ op->cls.method_name = method;
+ op->cls.method_len = msize;
- pagelist = ceph_pagelist_alloc(GFP_NOFS);
- if (!pagelist)
+ dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOFS);
+ if (!dbuf)
return -ENOMEM;
- op->cls.class_name = class;
- size = strlen(class);
- BUG_ON(size > (size_t) U8_MAX);
- op->cls.class_len = size;
- ret = ceph_pagelist_append(pagelist, class, size);
- if (ret)
- goto err_pagelist_free;
- payload_len += size;
-
- op->cls.method_name = method;
- size = strlen(method);
- BUG_ON(size > (size_t) U8_MAX);
- op->cls.method_len = size;
- ret = ceph_pagelist_append(pagelist, method, size);
- if (ret)
- goto err_pagelist_free;
- payload_len += size;
+ p = ceph_databuf_enc_start(dbuf);
+ ceph_encode_copy(&p, class, csize);
+ ceph_encode_copy(&p, method, msize);
+ ceph_databuf_enc_stop(dbuf, p);
- osd_req_op_cls_request_info_pagelist(osd_req, which, pagelist);
- op->indata_len = payload_len;
+ osd_req_op_cls_request_info_databuf(osd_req, which, dbuf);
+ op->indata_len = ceph_databuf_len(dbuf);
return 0;
-
-err_pagelist_free:
- ceph_pagelist_release(pagelist);
- return ret;
}
EXPORT_SYMBOL(osd_req_op_cls_init);
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [RFC PATCH 16/35] libceph: Convert req_page of ceph_osdc_call() to ceph_databuf
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (14 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 15/35] libceph: Make osd_req_op_cls_init() use a ceph_databuf and map it David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 17/35] libceph, rbd: Use ceph_databuf encoding start/stop David Howells
` (18 subsequent siblings)
34 siblings, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Convert the request data (req_page) of ceph_osdc_call() to a ceph_databuf.
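After conversion, a typical call site builds a small request databuf and
hands it straight to ceph_osdc_call(); condensed from the
rbd_dev_v2_parent_info() change below, with error handling trimmed:

	request = ceph_databuf_req_alloc(0, sizeof(__le64), GFP_KERNEL);
	if (!request)
		return -ENOMEM;

	p = kmap_ceph_databuf_page(request, 0);
	ceph_encode_64(&p, rbd_dev->spec->snap_id);
	kunmap_local(p);
	ceph_databuf_added_data(request, sizeof(__le64));

	ret = ceph_osdc_call(osdc, oid, oloc, "rbd", "parent_get",
			     CEPH_OSD_FLAG_READ, request, reply);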
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
drivers/block/rbd.c | 53 +++++++++++-----------
include/linux/ceph/osd_client.h | 2 +-
net/ceph/cls_lock_client.c | 78 ++++++++++++++++++---------------
net/ceph/osd_client.c | 25 ++---------
4 files changed, 74 insertions(+), 84 deletions(-)
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index dd22cea7ae89..ec09d578b0b0 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1789,7 +1789,7 @@ static int __rbd_object_map_load(struct rbd_device *rbd_dev)
rbd_object_map_name(rbd_dev, rbd_dev->spec->snap_id, &oid);
ret = ceph_osdc_call(osdc, &oid, &rbd_dev->header_oloc,
"rbd", "object_map_load", CEPH_OSD_FLAG_READ,
- NULL, 0, reply);
+ NULL, reply);
if (ret)
goto out;
@@ -4553,8 +4553,8 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev,
size_t inbound_size)
{
struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
- struct ceph_databuf *reply;
- struct page *req_page = NULL;
+ struct ceph_databuf *request = NULL, *reply;
+ void *p;
int ret;
/*
@@ -4568,32 +4568,33 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev,
if (outbound_size > PAGE_SIZE)
return -E2BIG;
- req_page = alloc_page(GFP_KERNEL);
- if (!req_page)
+ request = ceph_databuf_req_alloc(0, outbound_size, GFP_KERNEL);
+ if (!request)
return -ENOMEM;
- memcpy(page_address(req_page), outbound, outbound_size);
+ p = kmap_ceph_databuf_page(request, 0);
+ memcpy(p, outbound, outbound_size);
+ kunmap_local(p);
+ ceph_databuf_added_data(request, outbound_size);
}
reply = ceph_databuf_reply_alloc(1, inbound_size, GFP_KERNEL);
if (!reply) {
- if (req_page)
- __free_page(req_page);
- return -ENOMEM;
+ ret = -ENOMEM;
+ goto out;
}
ret = ceph_osdc_call(osdc, oid, oloc, RBD_DRV_NAME, method_name,
- CEPH_OSD_FLAG_READ, req_page, outbound_size,
- reply);
+ CEPH_OSD_FLAG_READ, request, reply);
if (!ret) {
ret = ceph_databuf_len(reply);
if (copy_from_iter(inbound, ret, &reply->iter) != ret)
ret = -EIO;
}
- if (req_page)
- __free_page(req_page);
ceph_databuf_release(reply);
+out:
+ ceph_databuf_release(request);
return ret;
}
@@ -5513,7 +5514,7 @@ static int decode_parent_image_spec(void **p, void *end,
}
static int __get_parent_info(struct rbd_device *rbd_dev,
- struct page *req_page,
+ struct ceph_databuf *request,
struct ceph_databuf *reply,
struct parent_image_info *pii)
{
@@ -5524,7 +5525,7 @@ static int __get_parent_info(struct rbd_device *rbd_dev,
ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc,
"rbd", "parent_get", CEPH_OSD_FLAG_READ,
- req_page, sizeof(u64), reply);
+ request, reply);
if (ret)
return ret == -EOPNOTSUPP ? 1 : ret;
@@ -5539,7 +5540,7 @@ static int __get_parent_info(struct rbd_device *rbd_dev,
ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc,
"rbd", "parent_overlap_get", CEPH_OSD_FLAG_READ,
- req_page, sizeof(u64), reply);
+ request, reply);
if (ret)
return ret;
@@ -5563,7 +5564,7 @@ static int __get_parent_info(struct rbd_device *rbd_dev,
* The caller is responsible for @pii.
*/
static int __get_parent_info_legacy(struct rbd_device *rbd_dev,
- struct page *req_page,
+ struct ceph_databuf *request,
struct ceph_databuf *reply,
struct parent_image_info *pii)
{
@@ -5573,7 +5574,7 @@ static int __get_parent_info_legacy(struct rbd_device *rbd_dev,
ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc,
"rbd", "get_parent", CEPH_OSD_FLAG_READ,
- req_page, sizeof(u64), reply);
+ request, reply);
if (ret)
return ret;
@@ -5604,29 +5605,29 @@ static int __get_parent_info_legacy(struct rbd_device *rbd_dev,
static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev,
struct parent_image_info *pii)
{
- struct ceph_databuf *reply;
- struct page *req_page;
+ struct ceph_databuf *request, *reply;
void *p;
int ret = -ENOMEM;
- req_page = alloc_page(GFP_KERNEL);
- if (!req_page)
+ request = ceph_databuf_req_alloc(0, sizeof(__le64), GFP_KERNEL);
+ if (!request)
goto out;
reply = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_KERNEL);
if (!reply)
goto out_free;
- p = kmap_local_page(req_page);
+ p = kmap_ceph_databuf_page(request, 0);
ceph_encode_64(&p, rbd_dev->spec->snap_id);
kunmap_local(p);
- ret = __get_parent_info(rbd_dev, req_page, reply, pii);
+ ceph_databuf_added_data(request, sizeof(__le64));
+ ret = __get_parent_info(rbd_dev, request, reply, pii);
if (ret > 0)
- ret = __get_parent_info_legacy(rbd_dev, req_page, reply, pii);
+ ret = __get_parent_info_legacy(rbd_dev, request, reply, pii);
ceph_databuf_release(reply);
out_free:
- __free_page(req_page);
+ ceph_databuf_release(request);
out:
return ret;
}
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 9182aa5075b2..d31e59bd128c 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -568,7 +568,7 @@ int ceph_osdc_call(struct ceph_osd_client *osdc,
struct ceph_object_locator *oloc,
const char *class, const char *method,
unsigned int flags,
- struct page *req_page, size_t req_len,
+ struct ceph_databuf *request,
struct ceph_databuf *response);
/* watch/notify */
diff --git a/net/ceph/cls_lock_client.c b/net/ceph/cls_lock_client.c
index 37bb8708e8bb..6c8608aabe5f 100644
--- a/net/ceph/cls_lock_client.c
+++ b/net/ceph/cls_lock_client.c
@@ -34,7 +34,7 @@ int ceph_cls_lock(struct ceph_osd_client *osdc,
int tag_len = strlen(tag);
int desc_len = strlen(desc);
void *p, *end;
- struct page *lock_op_page;
+ struct ceph_databuf *lock_op_req;
struct timespec64 mtime;
int ret;
@@ -49,11 +49,11 @@ int ceph_cls_lock(struct ceph_osd_client *osdc,
if (lock_op_buf_size > PAGE_SIZE)
return -E2BIG;
- lock_op_page = alloc_page(GFP_NOIO);
- if (!lock_op_page)
+ lock_op_req = ceph_databuf_req_alloc(0, lock_op_buf_size, GFP_NOIO);
+ if (!lock_op_req)
return -ENOMEM;
- p = page_address(lock_op_page);
+ p = kmap_ceph_databuf_page(lock_op_req, 0);
end = p + lock_op_buf_size;
/* encode cls_lock_lock_op struct */
@@ -69,15 +69,16 @@ int ceph_cls_lock(struct ceph_osd_client *osdc,
ceph_encode_timespec64(p, &mtime);
p += sizeof(struct ceph_timespec);
ceph_encode_8(&p, flags);
+ kunmap_local(p);
+ ceph_databuf_added_data(lock_op_req, lock_op_buf_size);
dout("%s lock_name %s type %d cookie %s tag %s desc %s flags 0x%x\n",
__func__, lock_name, type, cookie, tag, desc, flags);
ret = ceph_osdc_call(osdc, oid, oloc, "lock", "lock",
- CEPH_OSD_FLAG_WRITE, lock_op_page,
- lock_op_buf_size, NULL);
+ CEPH_OSD_FLAG_WRITE, lock_op_req, NULL);
dout("%s: status %d\n", __func__, ret);
- __free_page(lock_op_page);
+ ceph_databuf_release(lock_op_req);
return ret;
}
EXPORT_SYMBOL(ceph_cls_lock);
@@ -99,7 +100,7 @@ int ceph_cls_unlock(struct ceph_osd_client *osdc,
int name_len = strlen(lock_name);
int cookie_len = strlen(cookie);
void *p, *end;
- struct page *unlock_op_page;
+ struct ceph_databuf *unlock_op_req;
int ret;
unlock_op_buf_size = name_len + sizeof(__le32) +
@@ -108,11 +109,11 @@ int ceph_cls_unlock(struct ceph_osd_client *osdc,
if (unlock_op_buf_size > PAGE_SIZE)
return -E2BIG;
- unlock_op_page = alloc_page(GFP_NOIO);
- if (!unlock_op_page)
+ unlock_op_req = ceph_databuf_req_alloc(0, unlock_op_buf_size, GFP_NOIO);
+ if (!unlock_op_req)
return -ENOMEM;
- p = page_address(unlock_op_page);
+ p = kmap_ceph_databuf_page(unlock_op_req, 0);
end = p + unlock_op_buf_size;
/* encode cls_lock_unlock_op struct */
@@ -120,14 +121,15 @@ int ceph_cls_unlock(struct ceph_osd_client *osdc,
unlock_op_buf_size - CEPH_ENCODING_START_BLK_LEN);
ceph_encode_string(&p, end, lock_name, name_len);
ceph_encode_string(&p, end, cookie, cookie_len);
+ kunmap_local(p);
+ ceph_databuf_added_data(unlock_op_req, unlock_op_buf_size);
dout("%s lock_name %s cookie %s\n", __func__, lock_name, cookie);
ret = ceph_osdc_call(osdc, oid, oloc, "lock", "unlock",
- CEPH_OSD_FLAG_WRITE, unlock_op_page,
- unlock_op_buf_size, NULL);
+ CEPH_OSD_FLAG_WRITE, unlock_op_req, NULL);
dout("%s: status %d\n", __func__, ret);
- __free_page(unlock_op_page);
+ ceph_databuf_release(unlock_op_req);
return ret;
}
EXPORT_SYMBOL(ceph_cls_unlock);
@@ -150,7 +152,7 @@ int ceph_cls_break_lock(struct ceph_osd_client *osdc,
int break_op_buf_size;
int name_len = strlen(lock_name);
int cookie_len = strlen(cookie);
- struct page *break_op_page;
+ struct ceph_databuf *break_op_req;
void *p, *end;
int ret;
@@ -161,11 +163,11 @@ int ceph_cls_break_lock(struct ceph_osd_client *osdc,
if (break_op_buf_size > PAGE_SIZE)
return -E2BIG;
- break_op_page = alloc_page(GFP_NOIO);
- if (!break_op_page)
+ break_op_req = ceph_databuf_req_alloc(0, break_op_buf_size, GFP_NOIO);
+ if (!break_op_req)
return -ENOMEM;
- p = page_address(break_op_page);
+ p = kmap_ceph_databuf_page(break_op_req, 0);
end = p + break_op_buf_size;
/* encode cls_lock_break_op struct */
@@ -174,15 +176,16 @@ int ceph_cls_break_lock(struct ceph_osd_client *osdc,
ceph_encode_string(&p, end, lock_name, name_len);
ceph_encode_copy(&p, locker, sizeof(*locker));
ceph_encode_string(&p, end, cookie, cookie_len);
+ kunmap_local(p);
+ ceph_databuf_added_data(break_op_req, break_op_buf_size);
dout("%s lock_name %s cookie %s locker %s%llu\n", __func__, lock_name,
cookie, ENTITY_NAME(*locker));
ret = ceph_osdc_call(osdc, oid, oloc, "lock", "break_lock",
- CEPH_OSD_FLAG_WRITE, break_op_page,
- break_op_buf_size, NULL);
+ CEPH_OSD_FLAG_WRITE, break_op_req, NULL);
dout("%s: status %d\n", __func__, ret);
- __free_page(break_op_page);
+ ceph_databuf_release(break_op_req);
return ret;
}
EXPORT_SYMBOL(ceph_cls_break_lock);
@@ -199,7 +202,7 @@ int ceph_cls_set_cookie(struct ceph_osd_client *osdc,
int tag_len = strlen(tag);
int new_cookie_len = strlen(new_cookie);
void *p, *end;
- struct page *cookie_op_page;
+ struct ceph_databuf *cookie_op_req;
int ret;
cookie_op_buf_size = name_len + sizeof(__le32) +
@@ -210,11 +213,11 @@ int ceph_cls_set_cookie(struct ceph_osd_client *osdc,
if (cookie_op_buf_size > PAGE_SIZE)
return -E2BIG;
- cookie_op_page = alloc_page(GFP_NOIO);
- if (!cookie_op_page)
+ cookie_op_req = ceph_databuf_req_alloc(0, cookie_op_buf_size, GFP_NOIO);
+ if (!cookie_op_req)
return -ENOMEM;
- p = page_address(cookie_op_page);
+ p = kmap_ceph_databuf_page(cookie_op_req, 0);
end = p + cookie_op_buf_size;
/* encode cls_lock_set_cookie_op struct */
@@ -225,15 +228,16 @@ int ceph_cls_set_cookie(struct ceph_osd_client *osdc,
ceph_encode_string(&p, end, old_cookie, old_cookie_len);
ceph_encode_string(&p, end, tag, tag_len);
ceph_encode_string(&p, end, new_cookie, new_cookie_len);
+ kunmap_local(p);
+ ceph_databuf_added_data(cookie_op_req, cookie_op_buf_size);
dout("%s lock_name %s type %d old_cookie %s tag %s new_cookie %s\n",
__func__, lock_name, type, old_cookie, tag, new_cookie);
ret = ceph_osdc_call(osdc, oid, oloc, "lock", "set_cookie",
- CEPH_OSD_FLAG_WRITE, cookie_op_page,
- cookie_op_buf_size, NULL);
+ CEPH_OSD_FLAG_WRITE, cookie_op_req, NULL);
dout("%s: status %d\n", __func__, ret);
- __free_page(cookie_op_page);
+ ceph_databuf_release(cookie_op_req);
return ret;
}
EXPORT_SYMBOL(ceph_cls_set_cookie);
@@ -340,7 +344,7 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc,
struct ceph_databuf *reply;
int get_info_op_buf_size;
int name_len = strlen(lock_name);
- struct page *get_info_op_page;
+ struct ceph_databuf *get_info_op_req;
void *p, *end;
int ret;
@@ -349,28 +353,30 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc,
if (get_info_op_buf_size > PAGE_SIZE)
return -E2BIG;
- get_info_op_page = alloc_page(GFP_NOIO);
- if (!get_info_op_page)
+ get_info_op_req = ceph_databuf_req_alloc(0, get_info_op_buf_size,
+ GFP_NOIO);
+ if (!get_info_op_req)
return -ENOMEM;
reply = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_NOIO);
if (!reply) {
- __free_page(get_info_op_page);
+ ceph_databuf_release(get_info_op_req);
return -ENOMEM;
}
- p = page_address(get_info_op_page);
+ p = kmap_ceph_databuf_page(get_info_op_req, 0);
end = p + get_info_op_buf_size;
/* encode cls_lock_get_info_op struct */
ceph_start_encoding(&p, 1, 1,
get_info_op_buf_size - CEPH_ENCODING_START_BLK_LEN);
ceph_encode_string(&p, end, lock_name, name_len);
+ kunmap_local(p);
+ ceph_databuf_added_data(get_info_op_req, get_info_op_buf_size);
dout("%s lock_name %s\n", __func__, lock_name);
ret = ceph_osdc_call(osdc, oid, oloc, "lock", "get_info",
- CEPH_OSD_FLAG_READ, get_info_op_page,
- get_info_op_buf_size, reply);
+ CEPH_OSD_FLAG_READ, get_info_op_req, reply);
dout("%s: status %d\n", __func__, ret);
if (ret >= 0) {
@@ -381,8 +387,8 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc,
kunmap_local(p);
}
- __free_page(get_info_op_page);
ceph_databuf_release(reply);
+ ceph_databuf_release(get_info_op_req);
return ret;
}
EXPORT_SYMBOL(ceph_cls_lock_info);
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 405ccf7e7a91..c4525feb8e26 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -264,7 +264,7 @@ void osd_req_op_cls_request_databuf(struct ceph_osd_request *osd_req,
BUG_ON(!ceph_databuf_len(dbuf));
osd_data = osd_req_op_data(osd_req, which, cls, request_data);
- ceph_osd_databuf_init(osd_data, dbuf);
+ ceph_osd_databuf_init(osd_data, ceph_databuf_get(dbuf));
osd_req->r_ops[which].cls.indata_len += ceph_databuf_len(dbuf);
osd_req->r_ops[which].indata_len += ceph_databuf_len(dbuf);
}
@@ -283,19 +283,6 @@ void osd_req_op_cls_request_data_pagelist(
}
EXPORT_SYMBOL(osd_req_op_cls_request_data_pagelist);
-static void osd_req_op_cls_request_data_pages(struct ceph_osd_request *osd_req,
- unsigned int which, struct page **pages, u64 length,
- u32 offset, bool pages_from_pool, bool own_pages)
-{
- struct ceph_osd_data *osd_data;
-
- osd_data = osd_req_op_data(osd_req, which, cls, request_data);
- ceph_osd_data_pages_init(osd_data, pages, length, offset,
- pages_from_pool, own_pages);
- osd_req->r_ops[which].cls.indata_len += length;
- osd_req->r_ops[which].indata_len += length;
-}
-
void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req,
unsigned int which,
struct ceph_databuf *dbuf)
@@ -5089,15 +5076,12 @@ int ceph_osdc_call(struct ceph_osd_client *osdc,
struct ceph_object_locator *oloc,
const char *class, const char *method,
unsigned int flags,
- struct page *req_page, size_t req_len,
+ struct ceph_databuf *request,
struct ceph_databuf *response)
{
struct ceph_osd_request *req;
int ret;
- if (req_len > PAGE_SIZE)
- return -E2BIG;
-
req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_NOIO);
if (!req)
return -ENOMEM;
@@ -5110,9 +5094,8 @@ int ceph_osdc_call(struct ceph_osd_client *osdc,
if (ret)
goto out_put_req;
- if (req_page)
- osd_req_op_cls_request_data_pages(req, 0, &req_page, req_len,
- 0, false, false);
+ if (request)
+ osd_req_op_cls_request_databuf(req, 0, request);
if (response)
osd_req_op_cls_response_databuf(req, 0, response);
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [RFC PATCH 17/35] libceph, rbd: Use ceph_databuf encoding start/stop
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (15 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 16/35] libceph: Convert req_page of ceph_osdc_call() to ceph_databuf David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-18 19:59 ` Viacheslav Dubeyko
2025-03-18 22:19 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 18/35] libceph, rbd: Convert some page arrays to ceph_databuf David Howells
` (17 subsequent siblings)
34 siblings, 2 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Use ceph_databuf_enc_start() and ceph_databuf_enc_stop() to encode RPC
parameter data where possible. The start function maps the buffer and
returns a pointer at which to start writing; the stop function unmaps the
buffer and updates the buffer length.
The code is also made more consistent by using size_t for length variables
and 'request' for the pointer to the request buffer.
The end pointer is dropped from ceph_encode_string(): the string length is
already included in the precalculated buffer size, so encoding shouldn't be
able to overrun, and the final pointer is checked by ceph_databuf_enc_stop().
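To illustrate the intended calling pattern, here is a condensed sketch of
the converted ceph_cls_unlock() from the hunks below; buf_size, lock_name
and name_len stand in for the per-call-site values and error handling on
the osdc call is elided:
    struct ceph_databuf *request;
    void *p;
    int ret;
    request = ceph_databuf_req_alloc(1, buf_size, GFP_NOIO);
    if (!request)
        return -ENOMEM;
    p = ceph_databuf_enc_start(request);    /* map, get write pointer */
    ceph_start_encoding(&p, 1, 1, buf_size - CEPH_ENCODING_START_BLK_LEN);
    ceph_encode_string(&p, lock_name, name_len); /* no end pointer now */
    ceph_databuf_enc_stop(request, p);      /* unmap, record data length */
    ret = ceph_osdc_call(osdc, oid, oloc, "lock", "unlock",
                         CEPH_OSD_FLAG_WRITE, request, NULL);
    ceph_databuf_release(request);
    return ret;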
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
drivers/block/rbd.c | 3 +-
include/linux/ceph/decode.h | 4 +-
net/ceph/cls_lock_client.c | 195 +++++++++++++++++-------------------
net/ceph/mon_client.c | 10 +-
net/ceph/osd_client.c | 26 +++--
5 files changed, 112 insertions(+), 126 deletions(-)
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index ec09d578b0b0..078bb1e3e1da 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5762,8 +5762,7 @@ static char *rbd_dev_image_name(struct rbd_device *rbd_dev)
return NULL;
p = image_id;
- end = image_id + image_id_size;
- ceph_encode_string(&p, end, rbd_dev->spec->image_id, (u32)len);
+ ceph_encode_string(&p, rbd_dev->spec->image_id, len);
size = sizeof (__le32) + RBD_IMAGE_NAME_LEN_MAX;
reply_buf = kmalloc(size, GFP_KERNEL);
diff --git a/include/linux/ceph/decode.h b/include/linux/ceph/decode.h
index 8fc1aed64113..e2726c3152db 100644
--- a/include/linux/ceph/decode.h
+++ b/include/linux/ceph/decode.h
@@ -292,10 +292,8 @@ static inline void ceph_encode_filepath(void **p, void *end,
*p += len;
}
-static inline void ceph_encode_string(void **p, void *end,
- const char *s, u32 len)
+static inline void ceph_encode_string(void **p, const char *s, u32 len)
{
- BUG_ON(*p + sizeof(len) + len > end);
ceph_encode_32(p, len);
if (len)
memcpy(*p, s, len);
diff --git a/net/ceph/cls_lock_client.c b/net/ceph/cls_lock_client.c
index 6c8608aabe5f..c91259ff8557 100644
--- a/net/ceph/cls_lock_client.c
+++ b/net/ceph/cls_lock_client.c
@@ -28,14 +28,14 @@ int ceph_cls_lock(struct ceph_osd_client *osdc,
char *lock_name, u8 type, char *cookie,
char *tag, char *desc, u8 flags)
{
- int lock_op_buf_size;
- int name_len = strlen(lock_name);
- int cookie_len = strlen(cookie);
- int tag_len = strlen(tag);
- int desc_len = strlen(desc);
- void *p, *end;
- struct ceph_databuf *lock_op_req;
+ struct ceph_databuf *request;
struct timespec64 mtime;
+ size_t lock_op_buf_size;
+ size_t name_len = strlen(lock_name);
+ size_t cookie_len = strlen(cookie);
+ size_t tag_len = strlen(tag);
+ size_t desc_len = strlen(desc);
+ void *p;
int ret;
lock_op_buf_size = name_len + sizeof(__le32) +
@@ -49,36 +49,34 @@ int ceph_cls_lock(struct ceph_osd_client *osdc,
if (lock_op_buf_size > PAGE_SIZE)
return -E2BIG;
- lock_op_req = ceph_databuf_req_alloc(0, lock_op_buf_size, GFP_NOIO);
- if (!lock_op_req)
+ request = ceph_databuf_req_alloc(1, lock_op_buf_size, GFP_NOIO);
+ if (!request)
return -ENOMEM;
- p = kmap_ceph_databuf_page(lock_op_req, 0);
- end = p + lock_op_buf_size;
+ p = ceph_databuf_enc_start(request);
/* encode cls_lock_lock_op struct */
ceph_start_encoding(&p, 1, 1,
lock_op_buf_size - CEPH_ENCODING_START_BLK_LEN);
- ceph_encode_string(&p, end, lock_name, name_len);
+ ceph_encode_string(&p, lock_name, name_len);
ceph_encode_8(&p, type);
- ceph_encode_string(&p, end, cookie, cookie_len);
- ceph_encode_string(&p, end, tag, tag_len);
- ceph_encode_string(&p, end, desc, desc_len);
+ ceph_encode_string(&p, cookie, cookie_len);
+ ceph_encode_string(&p, tag, tag_len);
+ ceph_encode_string(&p, desc, desc_len);
/* only support infinite duration */
memset(&mtime, 0, sizeof(mtime));
ceph_encode_timespec64(p, &mtime);
p += sizeof(struct ceph_timespec);
ceph_encode_8(&p, flags);
- kunmap_local(p);
- ceph_databuf_added_data(lock_op_req, lock_op_buf_size);
+ ceph_databuf_enc_stop(request, p);
dout("%s lock_name %s type %d cookie %s tag %s desc %s flags 0x%x\n",
__func__, lock_name, type, cookie, tag, desc, flags);
ret = ceph_osdc_call(osdc, oid, oloc, "lock", "lock",
- CEPH_OSD_FLAG_WRITE, lock_op_req, NULL);
+ CEPH_OSD_FLAG_WRITE, request, NULL);
dout("%s: status %d\n", __func__, ret);
- ceph_databuf_release(lock_op_req);
+ ceph_databuf_release(request);
return ret;
}
EXPORT_SYMBOL(ceph_cls_lock);
@@ -96,11 +94,11 @@ int ceph_cls_unlock(struct ceph_osd_client *osdc,
struct ceph_object_locator *oloc,
char *lock_name, char *cookie)
{
- int unlock_op_buf_size;
- int name_len = strlen(lock_name);
- int cookie_len = strlen(cookie);
- void *p, *end;
- struct ceph_databuf *unlock_op_req;
+ struct ceph_databuf *request;
+ size_t unlock_op_buf_size;
+ size_t name_len = strlen(lock_name);
+ size_t cookie_len = strlen(cookie);
+ void *p;
int ret;
unlock_op_buf_size = name_len + sizeof(__le32) +
@@ -109,27 +107,25 @@ int ceph_cls_unlock(struct ceph_osd_client *osdc,
if (unlock_op_buf_size > PAGE_SIZE)
return -E2BIG;
- unlock_op_req = ceph_databuf_req_alloc(0, unlock_op_buf_size, GFP_NOIO);
- if (!unlock_op_req)
+ request = ceph_databuf_req_alloc(1, unlock_op_buf_size, GFP_NOIO);
+ if (!request)
return -ENOMEM;
- p = kmap_ceph_databuf_page(unlock_op_req, 0);
- end = p + unlock_op_buf_size;
+ p = ceph_databuf_enc_start(request);
/* encode cls_lock_unlock_op struct */
ceph_start_encoding(&p, 1, 1,
unlock_op_buf_size - CEPH_ENCODING_START_BLK_LEN);
- ceph_encode_string(&p, end, lock_name, name_len);
- ceph_encode_string(&p, end, cookie, cookie_len);
- kunmap_local(p);
- ceph_databuf_added_data(unlock_op_req, unlock_op_buf_size);
+ ceph_encode_string(&p, lock_name, name_len);
+ ceph_encode_string(&p, cookie, cookie_len);
+ ceph_databuf_enc_stop(request, p);
dout("%s lock_name %s cookie %s\n", __func__, lock_name, cookie);
ret = ceph_osdc_call(osdc, oid, oloc, "lock", "unlock",
- CEPH_OSD_FLAG_WRITE, unlock_op_req, NULL);
+ CEPH_OSD_FLAG_WRITE, request, NULL);
dout("%s: status %d\n", __func__, ret);
- ceph_databuf_release(unlock_op_req);
+ ceph_databuf_release(request);
return ret;
}
EXPORT_SYMBOL(ceph_cls_unlock);
@@ -149,11 +145,11 @@ int ceph_cls_break_lock(struct ceph_osd_client *osdc,
char *lock_name, char *cookie,
struct ceph_entity_name *locker)
{
- int break_op_buf_size;
- int name_len = strlen(lock_name);
- int cookie_len = strlen(cookie);
- struct ceph_databuf *break_op_req;
- void *p, *end;
+ struct ceph_databuf *request;
+ size_t break_op_buf_size;
+ size_t name_len = strlen(lock_name);
+ size_t cookie_len = strlen(cookie);
+ void *p;
int ret;
break_op_buf_size = name_len + sizeof(__le32) +
@@ -163,29 +159,27 @@ int ceph_cls_break_lock(struct ceph_osd_client *osdc,
if (break_op_buf_size > PAGE_SIZE)
return -E2BIG;
- break_op_req = ceph_databuf_req_alloc(0, break_op_buf_size, GFP_NOIO);
- if (!break_op_req)
+ request = ceph_databuf_req_alloc(1, break_op_buf_size, GFP_NOIO);
+ if (!request)
return -ENOMEM;
- p = kmap_ceph_databuf_page(break_op_req, 0);
- end = p + break_op_buf_size;
+ p = ceph_databuf_enc_start(request);
/* encode cls_lock_break_op struct */
ceph_start_encoding(&p, 1, 1,
break_op_buf_size - CEPH_ENCODING_START_BLK_LEN);
- ceph_encode_string(&p, end, lock_name, name_len);
+ ceph_encode_string(&p, lock_name, name_len);
ceph_encode_copy(&p, locker, sizeof(*locker));
- ceph_encode_string(&p, end, cookie, cookie_len);
- kunmap_local(p);
- ceph_databuf_added_data(break_op_req, break_op_buf_size);
+ ceph_encode_string(&p, cookie, cookie_len);
+ ceph_databuf_enc_stop(request, p);
dout("%s lock_name %s cookie %s locker %s%llu\n", __func__, lock_name,
cookie, ENTITY_NAME(*locker));
ret = ceph_osdc_call(osdc, oid, oloc, "lock", "break_lock",
- CEPH_OSD_FLAG_WRITE, break_op_req, NULL);
+ CEPH_OSD_FLAG_WRITE, request, NULL);
dout("%s: status %d\n", __func__, ret);
- ceph_databuf_release(break_op_req);
+ ceph_databuf_release(request);
return ret;
}
EXPORT_SYMBOL(ceph_cls_break_lock);
@@ -196,13 +190,13 @@ int ceph_cls_set_cookie(struct ceph_osd_client *osdc,
char *lock_name, u8 type, char *old_cookie,
char *tag, char *new_cookie)
{
- int cookie_op_buf_size;
- int name_len = strlen(lock_name);
- int old_cookie_len = strlen(old_cookie);
- int tag_len = strlen(tag);
- int new_cookie_len = strlen(new_cookie);
- void *p, *end;
- struct ceph_databuf *cookie_op_req;
+ struct ceph_databuf *request;
+ size_t cookie_op_buf_size;
+ size_t name_len = strlen(lock_name);
+ size_t old_cookie_len = strlen(old_cookie);
+ size_t tag_len = strlen(tag);
+ size_t new_cookie_len = strlen(new_cookie);
+ void *p;
int ret;
cookie_op_buf_size = name_len + sizeof(__le32) +
@@ -213,31 +207,29 @@ int ceph_cls_set_cookie(struct ceph_osd_client *osdc,
if (cookie_op_buf_size > PAGE_SIZE)
return -E2BIG;
- cookie_op_req = ceph_databuf_req_alloc(0, cookie_op_buf_size, GFP_NOIO);
- if (!cookie_op_req)
+ request = ceph_databuf_req_alloc(1, cookie_op_buf_size, GFP_NOIO);
+ if (!request)
return -ENOMEM;
- p = kmap_ceph_databuf_page(cookie_op_req, 0);
- end = p + cookie_op_buf_size;
+ p = ceph_databuf_enc_start(request);
/* encode cls_lock_set_cookie_op struct */
ceph_start_encoding(&p, 1, 1,
cookie_op_buf_size - CEPH_ENCODING_START_BLK_LEN);
- ceph_encode_string(&p, end, lock_name, name_len);
+ ceph_encode_string(&p, lock_name, name_len);
ceph_encode_8(&p, type);
- ceph_encode_string(&p, end, old_cookie, old_cookie_len);
- ceph_encode_string(&p, end, tag, tag_len);
- ceph_encode_string(&p, end, new_cookie, new_cookie_len);
- kunmap_local(p);
- ceph_databuf_added_data(cookie_op_req, cookie_op_buf_size);
+ ceph_encode_string(&p, old_cookie, old_cookie_len);
+ ceph_encode_string(&p, tag, tag_len);
+ ceph_encode_string(&p, new_cookie, new_cookie_len);
+ ceph_databuf_enc_stop(request, p);
dout("%s lock_name %s type %d old_cookie %s tag %s new_cookie %s\n",
__func__, lock_name, type, old_cookie, tag, new_cookie);
ret = ceph_osdc_call(osdc, oid, oloc, "lock", "set_cookie",
- CEPH_OSD_FLAG_WRITE, cookie_op_req, NULL);
+ CEPH_OSD_FLAG_WRITE, request, NULL);
dout("%s: status %d\n", __func__, ret);
- ceph_databuf_release(cookie_op_req);
+ ceph_databuf_release(request);
return ret;
}
EXPORT_SYMBOL(ceph_cls_set_cookie);
@@ -289,9 +281,10 @@ static int decode_locker(void **p, void *end, struct ceph_locker *locker)
return 0;
}
-static int decode_lockers(void **p, void *end, u8 *type, char **tag,
+static int decode_lockers(void **p, size_t size, u8 *type, char **tag,
struct ceph_locker **lockers, u32 *num_lockers)
{
+ void *end = *p + size;
u8 struct_v;
u32 struct_len;
char *s;
@@ -341,11 +334,10 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc,
char *lock_name, u8 *type, char **tag,
struct ceph_locker **lockers, u32 *num_lockers)
{
- struct ceph_databuf *reply;
- int get_info_op_buf_size;
- int name_len = strlen(lock_name);
- struct ceph_databuf *get_info_op_req;
- void *p, *end;
+ struct ceph_databuf *request, *reply;
+ size_t get_info_op_buf_size;
+ size_t name_len = strlen(lock_name);
+ void *p;
int ret;
get_info_op_buf_size = name_len + sizeof(__le32) +
@@ -353,42 +345,39 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc,
if (get_info_op_buf_size > PAGE_SIZE)
return -E2BIG;
- get_info_op_req = ceph_databuf_req_alloc(0, get_info_op_buf_size,
- GFP_NOIO);
- if (!get_info_op_req)
+ request = ceph_databuf_req_alloc(1, get_info_op_buf_size, GFP_NOIO);
+ if (!request)
return -ENOMEM;
reply = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_NOIO);
if (!reply) {
- ceph_databuf_release(get_info_op_req);
+ ceph_databuf_release(request);
return -ENOMEM;
}
- p = kmap_ceph_databuf_page(get_info_op_req, 0);
- end = p + get_info_op_buf_size;
+ p = ceph_databuf_enc_start(request);
/* encode cls_lock_get_info_op struct */
ceph_start_encoding(&p, 1, 1,
get_info_op_buf_size - CEPH_ENCODING_START_BLK_LEN);
- ceph_encode_string(&p, end, lock_name, name_len);
- kunmap_local(p);
- ceph_databuf_added_data(get_info_op_req, get_info_op_buf_size);
+ ceph_encode_string(&p, lock_name, name_len);
+ ceph_databuf_enc_stop(request, p);
dout("%s lock_name %s\n", __func__, lock_name);
ret = ceph_osdc_call(osdc, oid, oloc, "lock", "get_info",
- CEPH_OSD_FLAG_READ, get_info_op_req, reply);
+ CEPH_OSD_FLAG_READ, request, reply);
dout("%s: status %d\n", __func__, ret);
if (ret >= 0) {
p = kmap_ceph_databuf_page(reply, 0);
- end = p + ceph_databuf_len(reply);
- ret = decode_lockers(&p, end, type, tag, lockers, num_lockers);
+ ret = decode_lockers(&p, ceph_databuf_len(reply),
+ type, tag, lockers, num_lockers);
kunmap_local(p);
}
ceph_databuf_release(reply);
- ceph_databuf_release(get_info_op_req);
+ ceph_databuf_release(request);
return ret;
}
EXPORT_SYMBOL(ceph_cls_lock_info);
@@ -396,12 +385,12 @@ EXPORT_SYMBOL(ceph_cls_lock_info);
int ceph_cls_assert_locked(struct ceph_osd_request *req, int which,
char *lock_name, u8 type, char *cookie, char *tag)
{
- struct ceph_databuf *dbuf;
- int assert_op_buf_size;
- int name_len = strlen(lock_name);
- int cookie_len = strlen(cookie);
- int tag_len = strlen(tag);
- void *p, *end;
+ struct ceph_databuf *request;
+ size_t assert_op_buf_size;
+ size_t name_len = strlen(lock_name);
+ size_t cookie_len = strlen(cookie);
+ size_t tag_len = strlen(tag);
+ void *p;
int ret;
assert_op_buf_size = name_len + sizeof(__le32) +
@@ -415,25 +404,23 @@ int ceph_cls_assert_locked(struct ceph_osd_request *req, int which,
if (ret)
return ret;
- dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO);
- if (!dbuf)
+ request = ceph_databuf_req_alloc(1, assert_op_buf_size, GFP_NOIO);
+ if (!request)
return -ENOMEM;
- p = kmap_ceph_databuf_page(dbuf, 0);
- end = p + assert_op_buf_size;
+ p = ceph_databuf_enc_start(request);
/* encode cls_lock_assert_op struct */
ceph_start_encoding(&p, 1, 1,
assert_op_buf_size - CEPH_ENCODING_START_BLK_LEN);
- ceph_encode_string(&p, end, lock_name, name_len);
+ ceph_encode_string(&p, lock_name, name_len);
ceph_encode_8(&p, type);
- ceph_encode_string(&p, end, cookie, cookie_len);
- ceph_encode_string(&p, end, tag, tag_len);
- kunmap(p);
- WARN_ON(p != end);
- ceph_databuf_added_data(dbuf, assert_op_buf_size);
+ ceph_encode_string(&p, cookie, cookie_len);
+ ceph_encode_string(&p, tag, tag_len);
+ ceph_databuf_enc_stop(request, p);
+ WARN_ON(ceph_databuf_len(request) != assert_op_buf_size);
- osd_req_op_cls_request_databuf(req, which, dbuf);
+ osd_req_op_cls_request_databuf(req, which, request);
return 0;
}
EXPORT_SYMBOL(ceph_cls_assert_locked);
diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
index ab66b599ac47..39103e4bb07d 100644
--- a/net/ceph/mon_client.c
+++ b/net/ceph/mon_client.c
@@ -367,7 +367,8 @@ static void __send_subscribe(struct ceph_mon_client *monc)
dout("%s %s start %llu flags 0x%x\n", __func__, buf,
le64_to_cpu(monc->subs[i].item.start),
monc->subs[i].item.flags);
- ceph_encode_string(&p, end, buf, len);
+ BUG_ON(p + sizeof(__le32) + len > end);
+ ceph_encode_string(&p, buf, len);
memcpy(p, &monc->subs[i].item, sizeof(monc->subs[i].item));
p += sizeof(monc->subs[i].item);
}
@@ -854,13 +855,14 @@ __ceph_monc_get_version(struct ceph_mon_client *monc, const char *what,
ceph_monc_callback_t cb, u64 private_data)
{
struct ceph_mon_generic_request *req;
+ size_t wsize = strlen(what);
req = alloc_generic_request(monc, GFP_NOIO);
if (!req)
goto err_put_req;
req->request = ceph_msg_new(CEPH_MSG_MON_GET_VERSION,
- sizeof(u64) + sizeof(u32) + strlen(what),
+ sizeof(u64) + sizeof(u32) + wsize,
GFP_NOIO, true);
if (!req->request)
goto err_put_req;
@@ -873,6 +875,8 @@ __ceph_monc_get_version(struct ceph_mon_client *monc, const char *what,
req->complete_cb = cb;
req->private_data = private_data;
+ BUG_ON(sizeof(__le64) + sizeof(__le32) + wsize > req->request->front_alloc_len);
+
mutex_lock(&monc->mutex);
register_generic_request(req);
{
@@ -880,7 +884,7 @@ __ceph_monc_get_version(struct ceph_mon_client *monc, const char *what,
void *const end = p + req->request->front_alloc_len;
ceph_encode_64(&p, req->tid); /* handle */
- ceph_encode_string(&p, end, what, strlen(what));
+ ceph_encode_string(&p, what, wsize);
WARN_ON(p != end);
}
send_generic_request(monc, req);
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index c4525feb8e26..b4adb299f9cd 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1831,15 +1831,15 @@ static int hoid_encoding_size(const struct ceph_hobject_id *hoid)
4 + hoid->key_len + 4 + hoid->oid_len + 4 + hoid->nspace_len;
}
-static void encode_hoid(void **p, void *end, const struct ceph_hobject_id *hoid)
+static void encode_hoid(void **p, const struct ceph_hobject_id *hoid)
{
ceph_start_encoding(p, 4, 3, hoid_encoding_size(hoid));
- ceph_encode_string(p, end, hoid->key, hoid->key_len);
- ceph_encode_string(p, end, hoid->oid, hoid->oid_len);
+ ceph_encode_string(p, hoid->key, hoid->key_len);
+ ceph_encode_string(p, hoid->oid, hoid->oid_len);
ceph_encode_64(p, hoid->snapid);
ceph_encode_32(p, hoid->hash);
ceph_encode_8(p, hoid->is_max);
- ceph_encode_string(p, end, hoid->nspace, hoid->nspace_len);
+ ceph_encode_string(p, hoid->nspace, hoid->nspace_len);
ceph_encode_64(p, hoid->pool);
}
@@ -2072,16 +2072,14 @@ static void encode_spgid(void **p, const struct ceph_spg *spgid)
ceph_encode_8(p, spgid->shard);
}
-static void encode_oloc(void **p, void *end,
- const struct ceph_object_locator *oloc)
+static void encode_oloc(void **p, const struct ceph_object_locator *oloc)
{
ceph_start_encoding(p, 5, 4, ceph_oloc_encoding_size(oloc));
ceph_encode_64(p, oloc->pool);
ceph_encode_32(p, -1); /* preferred */
ceph_encode_32(p, 0); /* key len */
if (oloc->pool_ns)
- ceph_encode_string(p, end, oloc->pool_ns->str,
- oloc->pool_ns->len);
+ ceph_encode_string(p, oloc->pool_ns->str, oloc->pool_ns->len);
else
ceph_encode_32(p, 0);
}
@@ -2122,8 +2120,8 @@ static void encode_request_partial(struct ceph_osd_request *req,
ceph_encode_timespec64(p, &req->r_mtime);
p += sizeof(struct ceph_timespec);
- encode_oloc(&p, end, &req->r_t.target_oloc);
- ceph_encode_string(&p, end, req->r_t.target_oid.name,
+ encode_oloc(&p, &req->r_t.target_oloc);
+ ceph_encode_string(&p, req->r_t.target_oid.name,
req->r_t.target_oid.name_len);
/* ops, can imply data */
@@ -4329,8 +4327,8 @@ static struct ceph_msg *create_backoff_message(
ceph_encode_32(&p, map_epoch);
ceph_encode_8(&p, CEPH_OSD_BACKOFF_OP_ACK_BLOCK);
ceph_encode_64(&p, backoff->id);
- encode_hoid(&p, end, backoff->begin);
- encode_hoid(&p, end, backoff->end);
+ encode_hoid(&p, backoff->begin);
+ encode_hoid(&p, backoff->end);
BUG_ON(p != end);
msg->front.iov_len = p - msg->front.iov_base;
@@ -5264,8 +5262,8 @@ int osd_req_op_copy_from_init(struct ceph_osd_request *req,
p = page_address(pages[0]);
end = p + PAGE_SIZE;
- ceph_encode_string(&p, end, src_oid->name, src_oid->name_len);
- encode_oloc(&p, end, src_oloc);
+ ceph_encode_string(&p, src_oid->name, src_oid->name_len);
+ encode_oloc(&p, src_oloc);
ceph_encode_32(&p, truncate_seq);
ceph_encode_64(&p, truncate_size);
op->indata_len = PAGE_SIZE - (end - p);
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [RFC PATCH 18/35] libceph, rbd: Convert some page arrays to ceph_databuf
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (16 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 17/35] libceph, rbd: Use ceph_databuf encoding start/stop David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-18 20:02 ` Viacheslav Dubeyko
2025-03-18 22:25 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 19/35] libceph, ceph: Convert users of ceph_pagelist " David Howells
` (16 subsequent siblings)
34 siblings, 2 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Convert some miscellaneous page arrays to ceph_databuf containers.
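The shape of each conversion is roughly as follows (a sketch abstracted
from the rbd_osd_setup_stat() hunk below; 'len' stands for the per-site
buffer size):
    /* before: ERR_PTR-returning page vector, freed via own_pages */
    pages = ceph_alloc_page_vector(1, GFP_NOIO);
    if (IS_ERR(pages))
        return PTR_ERR(pages);
    osd_req_op_raw_data_in_pages(osd_req, which, pages, len, 0, false, true);
    /* after: NULL-on-failure databuf that carries its own length */
    dbuf = ceph_databuf_reply_alloc(1, len, GFP_NOIO);
    if (!dbuf)
        return -ENOMEM;
    osd_req_op_raw_data_in_databuf(osd_req, which, dbuf);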
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
drivers/block/rbd.c | 12 ++++-----
include/linux/ceph/osd_client.h | 3 +++
net/ceph/osd_client.c | 43 +++++++++++++++++++++------------
3 files changed, 36 insertions(+), 22 deletions(-)
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 078bb1e3e1da..eea12c7ab2a0 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2108,7 +2108,7 @@ static int rbd_obj_calc_img_extents(struct rbd_obj_request *obj_req,
static int rbd_osd_setup_stat(struct ceph_osd_request *osd_req, int which)
{
- struct page **pages;
+ struct ceph_databuf *dbuf;
/*
* The response data for a STAT call consists of:
@@ -2118,14 +2118,12 @@ static int rbd_osd_setup_stat(struct ceph_osd_request *osd_req, int which)
* le32 tv_nsec;
* } mtime;
*/
- pages = ceph_alloc_page_vector(1, GFP_NOIO);
- if (IS_ERR(pages))
- return PTR_ERR(pages);
+ dbuf = ceph_databuf_reply_alloc(1, 8 + sizeof(struct ceph_timespec), GFP_NOIO);
+ if (!dbuf)
+ return -ENOMEM;
osd_req_op_init(osd_req, which, CEPH_OSD_OP_STAT, 0);
- osd_req_op_raw_data_in_pages(osd_req, which, pages,
- 8 + sizeof(struct ceph_timespec),
- 0, false, true);
+ osd_req_op_raw_data_in_databuf(osd_req, which, dbuf);
return 0;
}
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index d31e59bd128c..6e126e212271 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -482,6 +482,9 @@ extern void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *,
struct page **pages, u64 length,
u32 offset, bool pages_from_pool,
bool own_pages);
+void osd_req_op_raw_data_in_databuf(struct ceph_osd_request *osd_req,
+ unsigned int which,
+ struct ceph_databuf *databuf);
extern void osd_req_op_extent_osd_data_pagelist(struct ceph_osd_request *,
unsigned int which,
struct ceph_pagelist *pagelist);
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index b4adb299f9cd..64a06267e7b3 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -182,6 +182,17 @@ osd_req_op_extent_osd_data(struct ceph_osd_request *osd_req,
}
EXPORT_SYMBOL(osd_req_op_extent_osd_data);
+void osd_req_op_raw_data_in_databuf(struct ceph_osd_request *osd_req,
+ unsigned int which,
+ struct ceph_databuf *dbuf)
+{
+ struct ceph_osd_data *osd_data;
+
+ osd_data = osd_req_op_raw_data_in(osd_req, which);
+ ceph_osd_databuf_init(osd_data, dbuf);
+}
+EXPORT_SYMBOL(osd_req_op_raw_data_in_databuf);
+
void osd_req_op_raw_data_in_pages(struct ceph_osd_request *osd_req,
unsigned int which, struct page **pages,
u64 length, u32 offset,
@@ -5000,7 +5011,7 @@ int ceph_osdc_list_watchers(struct ceph_osd_client *osdc,
u32 *num_watchers)
{
struct ceph_osd_request *req;
- struct page **pages;
+ struct ceph_databuf *dbuf;
int ret;
req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_NOIO);
@@ -5011,16 +5022,16 @@ int ceph_osdc_list_watchers(struct ceph_osd_client *osdc,
ceph_oloc_copy(&req->r_base_oloc, oloc);
req->r_flags = CEPH_OSD_FLAG_READ;
- pages = ceph_alloc_page_vector(1, GFP_NOIO);
- if (IS_ERR(pages)) {
- ret = PTR_ERR(pages);
+ dbuf = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_NOIO);
+ if (!dbuf) {
+ ret = -ENOMEM;
goto out_put_req;
}
osd_req_op_init(req, 0, CEPH_OSD_OP_LIST_WATCHERS, 0);
- ceph_osd_data_pages_init(osd_req_op_data(req, 0, list_watchers,
- response_data),
- pages, PAGE_SIZE, 0, false, true);
+ ceph_osd_databuf_init(osd_req_op_data(req, 0, list_watchers,
+ response_data),
+ dbuf);
ret = ceph_osdc_alloc_messages(req, GFP_NOIO);
if (ret)
@@ -5029,10 +5040,11 @@ int ceph_osdc_list_watchers(struct ceph_osd_client *osdc,
ceph_osdc_start_request(osdc, req);
ret = ceph_osdc_wait_request(osdc, req);
if (ret >= 0) {
- void *p = page_address(pages[0]);
+ void *p = kmap_ceph_databuf_page(dbuf, 0);
void *const end = p + req->r_ops[0].outdata_len;
ret = decode_watchers(&p, end, watchers, num_watchers);
+ kunmap_local(p);
}
out_put_req:
@@ -5246,12 +5258,12 @@ int osd_req_op_copy_from_init(struct ceph_osd_request *req,
u8 copy_from_flags)
{
struct ceph_osd_req_op *op;
- struct page **pages;
+ struct ceph_databuf *dbuf;
void *p, *end;
- pages = ceph_alloc_page_vector(1, GFP_KERNEL);
- if (IS_ERR(pages))
- return PTR_ERR(pages);
+ dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_KERNEL);
+ if (!dbuf)
+ return -ENOMEM;
op = osd_req_op_init(req, 0, CEPH_OSD_OP_COPY_FROM2,
dst_fadvise_flags);
@@ -5260,16 +5272,17 @@ int osd_req_op_copy_from_init(struct ceph_osd_request *req,
op->copy_from.flags = copy_from_flags;
op->copy_from.src_fadvise_flags = src_fadvise_flags;
- p = page_address(pages[0]);
+ p = kmap_ceph_databuf_page(dbuf, 0);
end = p + PAGE_SIZE;
ceph_encode_string(&p, src_oid->name, src_oid->name_len);
encode_oloc(&p, src_oloc);
ceph_encode_32(&p, truncate_seq);
ceph_encode_64(&p, truncate_size);
op->indata_len = PAGE_SIZE - (end - p);
+ ceph_databuf_added_data(dbuf, op->indata_len);
+ kunmap_local(p);
- ceph_osd_data_pages_init(&op->copy_from.osd_data, pages,
- op->indata_len, 0, false, true);
+ ceph_osd_databuf_init(&op->copy_from.osd_data, dbuf);
return 0;
}
EXPORT_SYMBOL(osd_req_op_copy_from_init);
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [RFC PATCH 19/35] libceph, ceph: Convert users of ceph_pagelist to ceph_databuf
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (17 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 18/35] libceph, rbd: Convert some page arrays to ceph_databuf David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-18 20:09 ` Viacheslav Dubeyko
2025-03-18 22:27 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 20/35] libceph: Remove ceph_pagelist David Howells
` (15 subsequent siblings)
34 siblings, 2 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Convert users of ceph_pagelist to use ceph_databuf instead. ceph_pagelist
is then unused and can be removed.
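The mapping is mostly one-to-one; as a rough guide drawn from the hunks
below (allocation sizes and gfp flags vary by call site):
    ceph_pagelist_alloc(gfp)          ->  ceph_databuf_req_alloc(1, PAGE_SIZE, gfp)
    ceph_pagelist_encode_32(pl, v)    ->  ceph_databuf_encode_32(dbuf, v)
    ceph_pagelist_append(pl, d, l)    ->  ceph_databuf_append(dbuf, d, l)
    ceph_pagelist_reserve(pl, n)      ->  ceph_databuf_reserve(dbuf, n, gfp)
    pl->length                        ->  ceph_databuf_len(dbuf)
    ceph_pagelist_release(pl)         ->  ceph_databuf_release(dbuf)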
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
fs/ceph/locks.c | 22 +++---
fs/ceph/mds_client.c | 122 +++++++++++++++-----------------
fs/ceph/super.h | 6 +-
include/linux/ceph/osd_client.h | 2 +-
net/ceph/osd_client.c | 61 ++++++++--------
5 files changed, 104 insertions(+), 109 deletions(-)
diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index ebf4ac0055dd..32c7b0f0d61f 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -371,8 +371,8 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
}
/*
- * Fills in the passed counter variables, so you can prepare pagelist metadata
- * before calling ceph_encode_locks.
+ * Fills in the passed counter variables, so you can prepare metadata before
+ * calling ceph_encode_locks.
*/
void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count)
{
@@ -483,38 +483,38 @@ int ceph_encode_locks_to_buffer(struct inode *inode,
}
/*
- * Copy the encoded flock and fcntl locks into the pagelist.
+ * Copy the encoded flock and fcntl locks into the data buffer.
* Format is: #fcntl locks, sequential fcntl locks, #flock locks,
* sequential flock locks.
* Returns zero on success.
*/
-int ceph_locks_to_pagelist(struct ceph_filelock *flocks,
- struct ceph_pagelist *pagelist,
+int ceph_locks_to_databuf(struct ceph_filelock *flocks,
+ struct ceph_databuf *dbuf,
int num_fcntl_locks, int num_flock_locks)
{
int err = 0;
__le32 nlocks;
nlocks = cpu_to_le32(num_fcntl_locks);
- err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks));
+ err = ceph_databuf_append(dbuf, &nlocks, sizeof(nlocks));
if (err)
goto out_fail;
if (num_fcntl_locks > 0) {
- err = ceph_pagelist_append(pagelist, flocks,
- num_fcntl_locks * sizeof(*flocks));
+ err = ceph_databuf_append(dbuf, flocks,
+ num_fcntl_locks * sizeof(*flocks));
if (err)
goto out_fail;
}
nlocks = cpu_to_le32(num_flock_locks);
- err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks));
+ err = ceph_databuf_append(dbuf, &nlocks, sizeof(nlocks));
if (err)
goto out_fail;
if (num_flock_locks > 0) {
- err = ceph_pagelist_append(pagelist, &flocks[num_fcntl_locks],
- num_flock_locks * sizeof(*flocks));
+ err = ceph_databuf_append(dbuf, &flocks[num_fcntl_locks],
+ num_flock_locks * sizeof(*flocks));
}
out_fail:
return err;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 09661a34f287..f1c6d0ebf548 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -55,7 +55,7 @@
struct ceph_reconnect_state {
struct ceph_mds_session *session;
int nr_caps, nr_realms;
- struct ceph_pagelist *pagelist;
+ struct ceph_databuf *dbuf;
unsigned msg_version;
bool allow_multi;
};
@@ -4456,8 +4456,7 @@ static void replay_unsafe_requests(struct ceph_mds_client *mdsc,
static int send_reconnect_partial(struct ceph_reconnect_state *recon_state)
{
struct ceph_msg *reply;
- struct ceph_pagelist *_pagelist;
- struct page *page;
+ struct ceph_databuf *_dbuf;
__le32 *addr;
int err = -ENOMEM;
@@ -4467,9 +4466,9 @@ static int send_reconnect_partial(struct ceph_reconnect_state *recon_state)
/* can't handle message that contains both caps and realm */
BUG_ON(!recon_state->nr_caps == !recon_state->nr_realms);
- /* pre-allocate new pagelist */
- _pagelist = ceph_pagelist_alloc(GFP_NOFS);
- if (!_pagelist)
+ /* pre-allocate new databuf */
+ _dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOFS);
+ if (!_dbuf)
return -ENOMEM;
reply = ceph_msg_new2(CEPH_MSG_CLIENT_RECONNECT, 0, 1, GFP_NOFS, false);
@@ -4477,28 +4476,27 @@ static int send_reconnect_partial(struct ceph_reconnect_state *recon_state)
goto fail_msg;
/* placeholder for nr_caps */
- err = ceph_pagelist_encode_32(_pagelist, 0);
+ err = ceph_databuf_encode_32(_dbuf, 0);
if (err < 0)
goto fail;
if (recon_state->nr_caps) {
/* currently encoding caps */
- err = ceph_pagelist_encode_32(recon_state->pagelist, 0);
+ err = ceph_databuf_encode_32(recon_state->dbuf, 0);
if (err)
goto fail;
} else {
/* placeholder for nr_realms (currently encoding relams) */
- err = ceph_pagelist_encode_32(_pagelist, 0);
+ err = ceph_databuf_encode_32(_dbuf, 0);
if (err < 0)
goto fail;
}
- err = ceph_pagelist_encode_8(recon_state->pagelist, 1);
+ err = ceph_databuf_encode_8(recon_state->dbuf, 1);
if (err)
goto fail;
- page = list_first_entry(&recon_state->pagelist->head, struct page, lru);
- addr = kmap_atomic(page);
+ addr = kmap_ceph_databuf_page(recon_state->dbuf, 0);
if (recon_state->nr_caps) {
/* currently encoding caps */
*addr = cpu_to_le32(recon_state->nr_caps);
@@ -4506,18 +4504,18 @@ static int send_reconnect_partial(struct ceph_reconnect_state *recon_state)
/* currently encoding relams */
*(addr + 1) = cpu_to_le32(recon_state->nr_realms);
}
- kunmap_atomic(addr);
+ kunmap_local(addr);
reply->hdr.version = cpu_to_le16(5);
reply->hdr.compat_version = cpu_to_le16(4);
- reply->hdr.data_len = cpu_to_le32(recon_state->pagelist->length);
- ceph_msg_data_add_pagelist(reply, recon_state->pagelist);
+ reply->hdr.data_len = cpu_to_le32(ceph_databuf_len(recon_state->dbuf));
+ ceph_msg_data_add_databuf(reply, recon_state->dbuf);
ceph_con_send(&recon_state->session->s_con, reply);
- ceph_pagelist_release(recon_state->pagelist);
+ ceph_databuf_release(recon_state->dbuf);
- recon_state->pagelist = _pagelist;
+ recon_state->dbuf = _dbuf;
recon_state->nr_caps = 0;
recon_state->nr_realms = 0;
recon_state->msg_version = 5;
@@ -4525,7 +4523,7 @@ static int send_reconnect_partial(struct ceph_reconnect_state *recon_state)
fail:
ceph_msg_put(reply);
fail_msg:
- ceph_pagelist_release(_pagelist);
+ ceph_databuf_release(_dbuf);
return err;
}
@@ -4575,7 +4573,7 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
} rec;
struct ceph_inode_info *ci = ceph_inode(inode);
struct ceph_reconnect_state *recon_state = arg;
- struct ceph_pagelist *pagelist = recon_state->pagelist;
+ struct ceph_databuf *dbuf = recon_state->dbuf;
struct dentry *dentry;
struct ceph_cap *cap;
char *path;
@@ -4698,7 +4696,7 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
struct_v = 2;
}
/*
- * number of encoded locks is stable, so copy to pagelist
+ * number of encoded locks is stable, so copy to databuf
*/
struct_len = 2 * sizeof(u32) +
(num_fcntl_locks + num_flock_locks) *
@@ -4712,41 +4710,42 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
total_len += struct_len;
- if (pagelist->length + total_len > RECONNECT_MAX_SIZE) {
+ if (ceph_databuf_len(dbuf) + total_len > RECONNECT_MAX_SIZE) {
err = send_reconnect_partial(recon_state);
if (err)
goto out_freeflocks;
- pagelist = recon_state->pagelist;
+ dbuf = recon_state->dbuf;
}
- err = ceph_pagelist_reserve(pagelist, total_len);
+ err = ceph_databuf_reserve(dbuf, total_len, GFP_NOFS);
if (err)
goto out_freeflocks;
- ceph_pagelist_encode_64(pagelist, ceph_ino(inode));
+ ceph_databuf_encode_64(dbuf, ceph_ino(inode));
if (recon_state->msg_version >= 3) {
- ceph_pagelist_encode_8(pagelist, struct_v);
- ceph_pagelist_encode_8(pagelist, 1);
- ceph_pagelist_encode_32(pagelist, struct_len);
+ ceph_databuf_encode_8(dbuf, struct_v);
+ ceph_databuf_encode_8(dbuf, 1);
+ ceph_databuf_encode_32(dbuf, struct_len);
}
- ceph_pagelist_encode_string(pagelist, path, pathlen);
- ceph_pagelist_append(pagelist, &rec, sizeof(rec.v2));
- ceph_locks_to_pagelist(flocks, pagelist,
- num_fcntl_locks, num_flock_locks);
+ ceph_databuf_encode_string(dbuf, path, pathlen);
+ ceph_databuf_append(dbuf, &rec, sizeof(rec.v2));
+ ceph_locks_to_databuf(flocks, dbuf,
+ num_fcntl_locks, num_flock_locks);
if (struct_v >= 2)
- ceph_pagelist_encode_64(pagelist, snap_follows);
+ ceph_databuf_encode_64(dbuf, snap_follows);
out_freeflocks:
kfree(flocks);
} else {
- err = ceph_pagelist_reserve(pagelist,
- sizeof(u64) + sizeof(u32) +
- pathlen + sizeof(rec.v1));
+ err = ceph_databuf_reserve(dbuf,
+ sizeof(u64) + sizeof(u32) +
+ pathlen + sizeof(rec.v1),
+ GFP_NOFS);
if (err)
goto out_err;
- ceph_pagelist_encode_64(pagelist, ceph_ino(inode));
- ceph_pagelist_encode_string(pagelist, path, pathlen);
- ceph_pagelist_append(pagelist, &rec, sizeof(rec.v1));
+ ceph_databuf_encode_64(dbuf, ceph_ino(inode));
+ ceph_databuf_encode_string(dbuf, path, pathlen);
+ ceph_databuf_append(dbuf, &rec, sizeof(rec.v1));
}
out_err:
@@ -4760,12 +4759,12 @@ static int encode_snap_realms(struct ceph_mds_client *mdsc,
struct ceph_reconnect_state *recon_state)
{
struct rb_node *p;
- struct ceph_pagelist *pagelist = recon_state->pagelist;
struct ceph_client *cl = mdsc->fsc->client;
+ struct ceph_databuf *dbuf = recon_state->dbuf;
int err = 0;
if (recon_state->msg_version >= 4) {
- err = ceph_pagelist_encode_32(pagelist, mdsc->num_snap_realms);
+ err = ceph_databuf_encode_32(dbuf, mdsc->num_snap_realms);
if (err < 0)
goto fail;
}
@@ -4784,20 +4783,20 @@ static int encode_snap_realms(struct ceph_mds_client *mdsc,
size_t need = sizeof(u8) * 2 + sizeof(u32) +
sizeof(sr_rec);
- if (pagelist->length + need > RECONNECT_MAX_SIZE) {
+ if (ceph_databuf_len(dbuf) + need > RECONNECT_MAX_SIZE) {
err = send_reconnect_partial(recon_state);
if (err)
goto fail;
- pagelist = recon_state->pagelist;
+ dbuf = recon_state->dbuf;
}
- err = ceph_pagelist_reserve(pagelist, need);
+ err = ceph_databuf_reserve(dbuf, need, GFP_NOFS);
if (err)
goto fail;
- ceph_pagelist_encode_8(pagelist, 1);
- ceph_pagelist_encode_8(pagelist, 1);
- ceph_pagelist_encode_32(pagelist, sizeof(sr_rec));
+ ceph_databuf_encode_8(dbuf, 1);
+ ceph_databuf_encode_8(dbuf, 1);
+ ceph_databuf_encode_32(dbuf, sizeof(sr_rec));
}
doutc(cl, " adding snap realm %llx seq %lld parent %llx\n",
@@ -4806,7 +4805,7 @@ static int encode_snap_realms(struct ceph_mds_client *mdsc,
sr_rec.seq = cpu_to_le64(realm->seq);
sr_rec.parent = cpu_to_le64(realm->parent_ino);
- err = ceph_pagelist_append(pagelist, &sr_rec, sizeof(sr_rec));
+ err = ceph_databuf_append(dbuf, &sr_rec, sizeof(sr_rec));
if (err)
goto fail;
@@ -4841,9 +4840,9 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
pr_info_client(cl, "mds%d reconnect start\n", mds);
- recon_state.pagelist = ceph_pagelist_alloc(GFP_NOFS);
- if (!recon_state.pagelist)
- goto fail_nopagelist;
+ recon_state.dbuf = ceph_databuf_req_alloc(1, 0, GFP_NOFS);
+ if (!recon_state.dbuf)
+ goto fail_nodatabuf;
reply = ceph_msg_new2(CEPH_MSG_CLIENT_RECONNECT, 0, 1, GFP_NOFS, false);
if (!reply)
@@ -4891,7 +4890,7 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
down_read(&mdsc->snap_rwsem);
/* placeholder for nr_caps */
- err = ceph_pagelist_encode_32(recon_state.pagelist, 0);
+ err = ceph_databuf_encode_32(recon_state.dbuf, 0);
if (err)
goto fail;
@@ -4916,7 +4915,7 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
/* check if all realms can be encoded into current message */
if (mdsc->num_snap_realms) {
size_t total_len =
- recon_state.pagelist->length +
+ ceph_databuf_len(recon_state.dbuf) +
mdsc->num_snap_realms *
sizeof(struct ceph_mds_snaprealm_reconnect);
if (recon_state.msg_version >= 4) {
@@ -4945,31 +4944,28 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
goto fail;
if (recon_state.msg_version >= 5) {
- err = ceph_pagelist_encode_8(recon_state.pagelist, 0);
+ err = ceph_databuf_encode_8(recon_state.dbuf, 0);
if (err < 0)
goto fail;
}
if (recon_state.nr_caps || recon_state.nr_realms) {
- struct page *page =
- list_first_entry(&recon_state.pagelist->head,
- struct page, lru);
- __le32 *addr = kmap_atomic(page);
+ __le32 *addr = kmap_ceph_databuf_page(recon_state.dbuf, 0);
if (recon_state.nr_caps) {
WARN_ON(recon_state.nr_realms != mdsc->num_snap_realms);
*addr = cpu_to_le32(recon_state.nr_caps);
} else if (recon_state.msg_version >= 4) {
*(addr + 1) = cpu_to_le32(recon_state.nr_realms);
}
- kunmap_atomic(addr);
+ kunmap_local(addr);
}
reply->hdr.version = cpu_to_le16(recon_state.msg_version);
if (recon_state.msg_version >= 4)
reply->hdr.compat_version = cpu_to_le16(4);
- reply->hdr.data_len = cpu_to_le32(recon_state.pagelist->length);
- ceph_msg_data_add_pagelist(reply, recon_state.pagelist);
+ reply->hdr.data_len = cpu_to_le32(ceph_databuf_len(recon_state.dbuf));
+ ceph_msg_data_add_databuf(reply, recon_state.dbuf);
ceph_con_send(&session->s_con, reply);
@@ -4980,7 +4976,7 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
mutex_unlock(&mdsc->mutex);
up_read(&mdsc->snap_rwsem);
- ceph_pagelist_release(recon_state.pagelist);
+ ceph_databuf_release(recon_state.dbuf);
return;
fail:
@@ -4988,8 +4984,8 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
up_read(&mdsc->snap_rwsem);
mutex_unlock(&session->s_mutex);
fail_nomsg:
- ceph_pagelist_release(recon_state.pagelist);
-fail_nopagelist:
+ ceph_databuf_release(recon_state.dbuf);
+fail_nodatabuf:
pr_err_client(cl, "error %d preparing reconnect for mds%d\n",
err, mds);
return;
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 984a6d2a5378..b072572e2cf4 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1351,9 +1351,9 @@ extern int ceph_encode_locks_to_buffer(struct inode *inode,
struct ceph_filelock *flocks,
int num_fcntl_locks,
int num_flock_locks);
-extern int ceph_locks_to_pagelist(struct ceph_filelock *flocks,
- struct ceph_pagelist *pagelist,
- int num_fcntl_locks, int num_flock_locks);
+extern int ceph_locks_to_databuf(struct ceph_filelock *flocks,
+ struct ceph_databuf *dbuf,
+ int num_fcntl_locks, int num_flock_locks);
/* debugfs.c */
extern void ceph_fs_debugfs_init(struct ceph_fs_client *client);
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 6e126e212271..ce04205b8143 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -334,7 +334,7 @@ struct ceph_osd_linger_request {
rados_watcherrcb_t errcb;
void *data;
- struct ceph_pagelist *request_pl;
+ struct ceph_databuf *request_pl;
struct ceph_databuf *notify_id_buf;
struct page ***preply_pages;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 64a06267e7b3..a967309d01a7 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -810,37 +810,37 @@ int osd_req_op_xattr_init(struct ceph_osd_request *osd_req, unsigned int which,
{
struct ceph_osd_req_op *op = osd_req_op_init(osd_req, which,
opcode, 0);
- struct ceph_pagelist *pagelist;
+ struct ceph_databuf *dbuf;
size_t payload_len;
int ret;
BUG_ON(opcode != CEPH_OSD_OP_SETXATTR && opcode != CEPH_OSD_OP_CMPXATTR);
- pagelist = ceph_pagelist_alloc(GFP_NOFS);
- if (!pagelist)
+ dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOFS);
+ if (!dbuf)
return -ENOMEM;
payload_len = strlen(name);
op->xattr.name_len = payload_len;
- ret = ceph_pagelist_append(pagelist, name, payload_len);
+ ret = ceph_databuf_append(dbuf, name, payload_len);
if (ret)
- goto err_pagelist_free;
+ goto err_databuf_free;
op->xattr.value_len = size;
- ret = ceph_pagelist_append(pagelist, value, size);
+ ret = ceph_databuf_append(dbuf, value, size);
if (ret)
- goto err_pagelist_free;
+ goto err_databuf_free;
payload_len += size;
op->xattr.cmp_op = cmp_op;
op->xattr.cmp_mode = cmp_mode;
- ceph_osd_data_pagelist_init(&op->xattr.osd_data, pagelist);
+ ceph_osd_databuf_init(&op->xattr.osd_data, dbuf);
op->indata_len = payload_len;
return 0;
-err_pagelist_free:
- ceph_pagelist_release(pagelist);
+err_databuf_free:
+ ceph_databuf_release(dbuf);
return ret;
}
EXPORT_SYMBOL(osd_req_op_xattr_init);
@@ -864,15 +864,15 @@ static void osd_req_op_watch_init(struct ceph_osd_request *req, int which,
* encoded in @request_pl
*/
static void osd_req_op_notify_init(struct ceph_osd_request *req, int which,
- u64 cookie, struct ceph_pagelist *request_pl)
+ u64 cookie, struct ceph_databuf *request_pl)
{
struct ceph_osd_req_op *op;
op = osd_req_op_init(req, which, CEPH_OSD_OP_NOTIFY, 0);
op->notify.cookie = cookie;
- ceph_osd_data_pagelist_init(&op->notify.request_data, request_pl);
- op->indata_len = request_pl->length;
+ ceph_osd_databuf_init(&op->notify.request_data, request_pl);
+ op->indata_len = ceph_databuf_len(request_pl);
}
/*
@@ -2730,8 +2730,7 @@ static void linger_release(struct kref *kref)
WARN_ON(!list_empty(&lreq->pending_lworks));
WARN_ON(lreq->osd);
- if (lreq->request_pl)
- ceph_pagelist_release(lreq->request_pl);
+ ceph_databuf_release(lreq->request_pl);
ceph_databuf_release(lreq->notify_id_buf);
ceph_osdc_put_request(lreq->reg_req);
ceph_osdc_put_request(lreq->ping_req);
@@ -4800,30 +4799,30 @@ static int osd_req_op_notify_ack_init(struct ceph_osd_request *req, int which,
u32 payload_len)
{
struct ceph_osd_req_op *op;
- struct ceph_pagelist *pl;
+ struct ceph_databuf *dbuf;
int ret;
op = osd_req_op_init(req, which, CEPH_OSD_OP_NOTIFY_ACK, 0);
- pl = ceph_pagelist_alloc(GFP_NOIO);
- if (!pl)
+ dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO);
+ if (!dbuf)
return -ENOMEM;
- ret = ceph_pagelist_encode_64(pl, notify_id);
- ret |= ceph_pagelist_encode_64(pl, cookie);
+ ret = ceph_databuf_encode_64(dbuf, notify_id);
+ ret |= ceph_databuf_encode_64(dbuf, cookie);
if (payload) {
- ret |= ceph_pagelist_encode_32(pl, payload_len);
- ret |= ceph_pagelist_append(pl, payload, payload_len);
+ ret |= ceph_databuf_encode_32(dbuf, payload_len);
+ ret |= ceph_databuf_append(dbuf, payload, payload_len);
} else {
- ret |= ceph_pagelist_encode_32(pl, 0);
+ ret |= ceph_databuf_encode_32(dbuf, 0);
}
if (ret) {
- ceph_pagelist_release(pl);
+ ceph_databuf_release(dbuf);
return -ENOMEM;
}
- ceph_osd_data_pagelist_init(&op->notify_ack.request_data, pl);
- op->indata_len = pl->length;
+ ceph_osd_databuf_init(&op->notify_ack.request_data, dbuf);
+ op->indata_len = ceph_databuf_len(dbuf);
return 0;
}
@@ -4894,16 +4893,16 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc,
if (!lreq)
return -ENOMEM;
- lreq->request_pl = ceph_pagelist_alloc(GFP_NOIO);
+ lreq->request_pl = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO);
if (!lreq->request_pl) {
ret = -ENOMEM;
goto out_put_lreq;
}
- ret = ceph_pagelist_encode_32(lreq->request_pl, 1); /* prot_ver */
- ret |= ceph_pagelist_encode_32(lreq->request_pl, timeout);
- ret |= ceph_pagelist_encode_32(lreq->request_pl, payload_len);
- ret |= ceph_pagelist_append(lreq->request_pl, payload, payload_len);
+ ret = ceph_databuf_encode_32(lreq->request_pl, 1); /* prot_ver */
+ ret |= ceph_databuf_encode_32(lreq->request_pl, timeout);
+ ret |= ceph_databuf_encode_32(lreq->request_pl, payload_len);
+ ret |= ceph_databuf_append(lreq->request_pl, payload, payload_len);
if (ret) {
ret = -ENOMEM;
goto out_put_lreq;
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [RFC PATCH 20/35] libceph: Remove ceph_pagelist
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (18 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 19/35] libceph, ceph: Convert users of ceph_pagelist " David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 21/35] libceph: Make notify code use ceph_databuf_enc_start/stop David Howells
` (14 subsequent siblings)
34 siblings, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Remove ceph_pagelist and its helpers.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
fs/ceph/locks.c | 1 -
fs/ceph/mds_client.c | 1 -
fs/ceph/xattr.c | 1 -
include/linux/ceph/messenger.h | 8 --
include/linux/ceph/osd_client.h | 9 ---
include/linux/ceph/pagelist.h | 60 --------------
net/ceph/Makefile | 2 +-
net/ceph/messenger.c | 110 --------------------------
net/ceph/osd_client.c | 41 ----------
net/ceph/pagelist.c | 133 --------------------------------
10 files changed, 1 insertion(+), 365 deletions(-)
delete mode 100644 include/linux/ceph/pagelist.h
delete mode 100644 net/ceph/pagelist.c
diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index 32c7b0f0d61f..451b92d99cf1 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -8,7 +8,6 @@
#include "super.h"
#include "mds_client.h"
#include <linux/filelock.h>
-#include <linux/ceph/pagelist.h>
static u64 lock_secret;
static int ceph_lock_wait_for_completion(struct ceph_mds_client *mdsc,
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index f1c6d0ebf548..26fa39d07ef0 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -21,7 +21,6 @@
#include <linux/ceph/ceph_features.h>
#include <linux/ceph/messenger.h>
#include <linux/ceph/decode.h>
-#include <linux/ceph/pagelist.h>
#include <linux/ceph/auth.h>
#include <linux/ceph/debugfs.h>
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index b083cd3b3974..de7b1c364bec 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -1,6 +1,5 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/ceph/ceph_debug.h>
-#include <linux/ceph/pagelist.h>
#include "super.h"
#include "mds_client.h"
diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index ff0aea6d2d31..36896a71291c 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -119,7 +119,6 @@ enum ceph_msg_data_type {
CEPH_MSG_DATA_NONE, /* message contains no data payload */
CEPH_MSG_DATA_DATABUF, /* data source/destination is a data buffer */
CEPH_MSG_DATA_PAGES, /* data source/destination is a page array */
- CEPH_MSG_DATA_PAGELIST, /* data source/destination is a pagelist */
CEPH_MSG_DATA_ITER, /* data source/destination is an iov_iter */
};
@@ -135,7 +134,6 @@ struct ceph_msg_data {
unsigned int offset; /* first page */
bool own_pages;
};
- struct ceph_pagelist *pagelist;
};
};
@@ -152,10 +150,6 @@ struct ceph_msg_data_cursor {
unsigned short page_index; /* index in array */
unsigned short page_count; /* pages in array */
};
- struct { /* pagelist */
- struct page *page; /* page from list */
- size_t offset; /* bytes from list */
- };
struct {
struct iov_iter iov_iter;
struct iov_iter crc_iter;
@@ -511,8 +505,6 @@ extern bool ceph_con_keepalive_expired(struct ceph_connection *con,
void ceph_msg_data_add_databuf(struct ceph_msg *msg, struct ceph_databuf *dbuf);
void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
size_t length, size_t offset, bool own_pages);
-extern void ceph_msg_data_add_pagelist(struct ceph_msg *msg,
- struct ceph_pagelist *pagelist);
void ceph_msg_data_add_iter(struct ceph_msg *msg,
struct iov_iter *iter);
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index ce04205b8143..5a1ee66ca216 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -15,7 +15,6 @@
#include <linux/ceph/messenger.h>
#include <linux/ceph/msgpool.h>
#include <linux/ceph/auth.h>
-#include <linux/ceph/pagelist.h>
#include <linux/ceph/databuf.h>
struct ceph_msg;
@@ -106,7 +105,6 @@ enum ceph_osd_data_type {
CEPH_OSD_DATA_TYPE_NONE = 0,
CEPH_OSD_DATA_TYPE_DATABUF,
CEPH_OSD_DATA_TYPE_PAGES,
- CEPH_OSD_DATA_TYPE_PAGELIST,
CEPH_OSD_DATA_TYPE_ITER,
};
@@ -122,7 +120,6 @@ struct ceph_osd_data {
bool pages_from_pool;
bool own_pages;
};
- struct ceph_pagelist *pagelist;
};
};
@@ -485,18 +482,12 @@ extern void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *,
void osd_req_op_raw_data_in_databuf(struct ceph_osd_request *osd_req,
unsigned int which,
struct ceph_databuf *databuf);
-extern void osd_req_op_extent_osd_data_pagelist(struct ceph_osd_request *,
- unsigned int which,
- struct ceph_pagelist *pagelist);
void osd_req_op_extent_osd_iter(struct ceph_osd_request *osd_req,
unsigned int which, struct iov_iter *iter);
void osd_req_op_cls_request_databuf(struct ceph_osd_request *req,
unsigned int which,
struct ceph_databuf *dbuf);
-extern void osd_req_op_cls_request_data_pagelist(struct ceph_osd_request *,
- unsigned int which,
- struct ceph_pagelist *pagelist);
void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req,
unsigned int which,
struct ceph_databuf *dbuf);
diff --git a/include/linux/ceph/pagelist.h b/include/linux/ceph/pagelist.h
deleted file mode 100644
index 879bec0863aa..000000000000
--- a/include/linux/ceph/pagelist.h
+++ /dev/null
@@ -1,60 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __FS_CEPH_PAGELIST_H
-#define __FS_CEPH_PAGELIST_H
-
-#include <asm/byteorder.h>
-#include <linux/refcount.h>
-#include <linux/list.h>
-#include <linux/types.h>
-
-struct ceph_pagelist {
- struct list_head head;
- void *mapped_tail;
- size_t length;
- size_t room;
- struct list_head free_list;
- size_t num_pages_free;
- refcount_t refcnt;
-};
-
-struct ceph_pagelist *ceph_pagelist_alloc(gfp_t gfp_flags);
-
-extern void ceph_pagelist_release(struct ceph_pagelist *pl);
-
-extern int ceph_pagelist_append(struct ceph_pagelist *pl, const void *d, size_t l);
-
-extern int ceph_pagelist_reserve(struct ceph_pagelist *pl, size_t space);
-
-extern int ceph_pagelist_free_reserve(struct ceph_pagelist *pl);
-
-static inline int ceph_pagelist_encode_64(struct ceph_pagelist *pl, u64 v)
-{
- __le64 ev = cpu_to_le64(v);
- return ceph_pagelist_append(pl, &ev, sizeof(ev));
-}
-static inline int ceph_pagelist_encode_32(struct ceph_pagelist *pl, u32 v)
-{
- __le32 ev = cpu_to_le32(v);
- return ceph_pagelist_append(pl, &ev, sizeof(ev));
-}
-static inline int ceph_pagelist_encode_16(struct ceph_pagelist *pl, u16 v)
-{
- __le16 ev = cpu_to_le16(v);
- return ceph_pagelist_append(pl, &ev, sizeof(ev));
-}
-static inline int ceph_pagelist_encode_8(struct ceph_pagelist *pl, u8 v)
-{
- return ceph_pagelist_append(pl, &v, 1);
-}
-static inline int ceph_pagelist_encode_string(struct ceph_pagelist *pl,
- char *s, u32 len)
-{
- int ret = ceph_pagelist_encode_32(pl, len);
- if (ret)
- return ret;
- if (len)
- return ceph_pagelist_append(pl, s, len);
- return 0;
-}
-
-#endif
diff --git a/net/ceph/Makefile b/net/ceph/Makefile
index 4b2e0b654e45..0c8787e2e733 100644
--- a/net/ceph/Makefile
+++ b/net/ceph/Makefile
@@ -4,7 +4,7 @@
#
obj-$(CONFIG_CEPH_LIB) += libceph.o
-libceph-y := ceph_common.o messenger.o msgpool.o buffer.o pagelist.o \
+libceph-y := ceph_common.o messenger.o msgpool.o buffer.o \
mon_client.o decode.o \
cls_lock_client.o \
osd_client.o osdmap.o crush/crush.o crush/mapper.o crush/hash.o \
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index cb66a768bd7c..4b20df1ab8e4 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -20,7 +20,6 @@
#include <linux/ceph/libceph.h>
#include <linux/ceph/messenger.h>
#include <linux/ceph/decode.h>
-#include <linux/ceph/pagelist.h>
#include <linux/export.h>
/*
@@ -775,87 +774,6 @@ static bool ceph_msg_data_pages_advance(struct ceph_msg_data_cursor *cursor,
return true;
}
-/*
- * For a pagelist, a piece is whatever remains to be consumed in the
- * first page in the list, or the front of the next page.
- */
-static void
-ceph_msg_data_pagelist_cursor_init(struct ceph_msg_data_cursor *cursor,
- size_t length)
-{
- struct ceph_msg_data *data = cursor->data;
- struct ceph_pagelist *pagelist;
- struct page *page;
-
- BUG_ON(data->type != CEPH_MSG_DATA_PAGELIST);
-
- pagelist = data->pagelist;
- BUG_ON(!pagelist);
-
- if (!length)
- return; /* pagelist can be assigned but empty */
-
- BUG_ON(list_empty(&pagelist->head));
- page = list_first_entry(&pagelist->head, struct page, lru);
-
- cursor->resid = min(length, pagelist->length);
- cursor->page = page;
- cursor->offset = 0;
-}
-
-static struct page *
-ceph_msg_data_pagelist_next(struct ceph_msg_data_cursor *cursor,
- size_t *page_offset, size_t *length)
-{
- struct ceph_msg_data *data = cursor->data;
- struct ceph_pagelist *pagelist;
-
- BUG_ON(data->type != CEPH_MSG_DATA_PAGELIST);
-
- pagelist = data->pagelist;
- BUG_ON(!pagelist);
-
- BUG_ON(!cursor->page);
- BUG_ON(cursor->offset + cursor->resid != pagelist->length);
-
- /* offset of first page in pagelist is always 0 */
- *page_offset = cursor->offset & ~PAGE_MASK;
- *length = min_t(size_t, cursor->resid, PAGE_SIZE - *page_offset);
- return cursor->page;
-}
-
-static bool ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor,
- size_t bytes)
-{
- struct ceph_msg_data *data = cursor->data;
- struct ceph_pagelist *pagelist;
-
- BUG_ON(data->type != CEPH_MSG_DATA_PAGELIST);
-
- pagelist = data->pagelist;
- BUG_ON(!pagelist);
-
- BUG_ON(cursor->offset + cursor->resid != pagelist->length);
- BUG_ON((cursor->offset & ~PAGE_MASK) + bytes > PAGE_SIZE);
-
- /* Advance the cursor offset */
-
- cursor->resid -= bytes;
- cursor->offset += bytes;
- /* offset of first page in pagelist is always 0 */
- if (!bytes || cursor->offset & ~PAGE_MASK)
- return false; /* more bytes to process in the current page */
-
- if (!cursor->resid)
- return false; /* no more data */
-
- /* Move on to the next page */
-
- BUG_ON(list_is_last(&cursor->page->lru, &pagelist->head));
- cursor->page = list_next_entry(cursor->page, lru);
- return true;
-}
-
static void ceph_msg_data_iter_cursor_init(struct ceph_msg_data_cursor *cursor,
size_t length)
{
@@ -926,9 +844,6 @@ static void __ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor)
size_t length = cursor->total_resid;
switch (cursor->data->type) {
- case CEPH_MSG_DATA_PAGELIST:
- ceph_msg_data_pagelist_cursor_init(cursor, length);
- break;
case CEPH_MSG_DATA_PAGES:
ceph_msg_data_pages_cursor_init(cursor, length);
break;
@@ -969,9 +884,6 @@ struct page *ceph_msg_data_next(struct ceph_msg_data_cursor *cursor,
struct page *page;
switch (cursor->data->type) {
- case CEPH_MSG_DATA_PAGELIST:
- page = ceph_msg_data_pagelist_next(cursor, page_offset, length);
- break;
case CEPH_MSG_DATA_PAGES:
page = ceph_msg_data_pages_next(cursor, page_offset, length);
break;
@@ -1003,9 +915,6 @@ void ceph_msg_data_advance(struct ceph_msg_data_cursor *cursor, size_t bytes)
BUG_ON(bytes > cursor->resid);
switch (cursor->data->type) {
- case CEPH_MSG_DATA_PAGELIST:
- new_piece = ceph_msg_data_pagelist_advance(cursor, bytes);
- break;
case CEPH_MSG_DATA_PAGES:
new_piece = ceph_msg_data_pages_advance(cursor, bytes);
break;
@@ -1744,8 +1653,6 @@ static void ceph_msg_data_destroy(struct ceph_msg_data *data)
} else if (data->type == CEPH_MSG_DATA_PAGES && data->own_pages) {
int num_pages = calc_pages_for(data->offset, data->length);
ceph_release_page_vector(data->pages, num_pages);
- } else if (data->type == CEPH_MSG_DATA_PAGELIST) {
- ceph_pagelist_release(data->pagelist);
}
}
@@ -1784,23 +1691,6 @@ void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
}
EXPORT_SYMBOL(ceph_msg_data_add_pages);
-void ceph_msg_data_add_pagelist(struct ceph_msg *msg,
- struct ceph_pagelist *pagelist)
-{
- struct ceph_msg_data *data;
-
- BUG_ON(!pagelist);
- BUG_ON(!pagelist->length);
-
- data = ceph_msg_data_add(msg);
- data->type = CEPH_MSG_DATA_PAGELIST;
- refcount_inc(&pagelist->refcnt);
- data->pagelist = pagelist;
-
- msg->data_length += pagelist->length;
-}
-EXPORT_SYMBOL(ceph_msg_data_add_pagelist);
-
void ceph_msg_data_add_iter(struct ceph_msg *msg,
struct iov_iter *iter)
{
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index a967309d01a7..0ac439e7e730 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -16,7 +16,6 @@
#include <linux/ceph/messenger.h>
#include <linux/ceph/decode.h>
#include <linux/ceph/auth.h>
-#include <linux/ceph/pagelist.h>
#include <linux/ceph/striper.h>
#define OSD_OPREPLY_FRONT_LEN 512
@@ -138,16 +137,6 @@ static void ceph_osd_data_pages_init(struct ceph_osd_data *osd_data,
osd_data->own_pages = own_pages;
}
-/*
- * Consumes a ref on @pagelist.
- */
-static void ceph_osd_data_pagelist_init(struct ceph_osd_data *osd_data,
- struct ceph_pagelist *pagelist)
-{
- osd_data->type = CEPH_OSD_DATA_TYPE_PAGELIST;
- osd_data->pagelist = pagelist;
-}
-
static void ceph_osd_iter_init(struct ceph_osd_data *osd_data,
struct iov_iter *iter)
{
@@ -230,16 +219,6 @@ void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *osd_req,
}
EXPORT_SYMBOL(osd_req_op_extent_osd_data_pages);
-void osd_req_op_extent_osd_data_pagelist(struct ceph_osd_request *osd_req,
- unsigned int which, struct ceph_pagelist *pagelist)
-{
- struct ceph_osd_data *osd_data;
-
- osd_data = osd_req_op_data(osd_req, which, extent, osd_data);
- ceph_osd_data_pagelist_init(osd_data, pagelist);
-}
-EXPORT_SYMBOL(osd_req_op_extent_osd_data_pagelist);
-
/**
* osd_req_op_extent_osd_iter - Set up an operation with an iterator buffer
* @osd_req: The request to set up
@@ -281,19 +260,6 @@ void osd_req_op_cls_request_databuf(struct ceph_osd_request *osd_req,
}
EXPORT_SYMBOL(osd_req_op_cls_request_databuf);
-void osd_req_op_cls_request_data_pagelist(
- struct ceph_osd_request *osd_req,
- unsigned int which, struct ceph_pagelist *pagelist)
-{
- struct ceph_osd_data *osd_data;
-
- osd_data = osd_req_op_data(osd_req, which, cls, request_data);
- ceph_osd_data_pagelist_init(osd_data, pagelist);
- osd_req->r_ops[which].cls.indata_len += pagelist->length;
- osd_req->r_ops[which].indata_len += pagelist->length;
-}
-EXPORT_SYMBOL(osd_req_op_cls_request_data_pagelist);
-
void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req,
unsigned int which,
struct ceph_databuf *dbuf)
@@ -316,8 +282,6 @@ static u64 ceph_osd_data_length(struct ceph_osd_data *osd_data)
return ceph_databuf_len(osd_data->dbuf);
case CEPH_OSD_DATA_TYPE_PAGES:
return osd_data->length;
- case CEPH_OSD_DATA_TYPE_PAGELIST:
- return (u64)osd_data->pagelist->length;
case CEPH_OSD_DATA_TYPE_ITER:
return iov_iter_count(&osd_data->iter);
default:
@@ -336,8 +300,6 @@ static void ceph_osd_data_release(struct ceph_osd_data *osd_data)
num_pages = calc_pages_for((u64)osd_data->offset,
(u64)osd_data->length);
ceph_release_page_vector(osd_data->pages, num_pages);
- } else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGELIST) {
- ceph_pagelist_release(osd_data->pagelist);
}
ceph_osd_data_init(osd_data);
}
@@ -913,9 +875,6 @@ static void ceph_osdc_msg_data_add(struct ceph_msg *msg,
if (length)
ceph_msg_data_add_pages(msg, osd_data->pages,
length, osd_data->offset, false);
- } else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGELIST) {
- BUG_ON(!length);
- ceph_msg_data_add_pagelist(msg, osd_data->pagelist);
} else if (osd_data->type == CEPH_OSD_DATA_TYPE_ITER) {
ceph_msg_data_add_iter(msg, &osd_data->iter);
} else {
diff --git a/net/ceph/pagelist.c b/net/ceph/pagelist.c
deleted file mode 100644
index 5a9c4be5f222..000000000000
--- a/net/ceph/pagelist.c
+++ /dev/null
@@ -1,133 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-#include <linux/module.h>
-#include <linux/gfp.h>
-#include <linux/slab.h>
-#include <linux/pagemap.h>
-#include <linux/highmem.h>
-#include <linux/ceph/pagelist.h>
-
-struct ceph_pagelist *ceph_pagelist_alloc(gfp_t gfp_flags)
-{
- struct ceph_pagelist *pl;
-
- pl = kmalloc(sizeof(*pl), gfp_flags);
- if (!pl)
- return NULL;
-
- INIT_LIST_HEAD(&pl->head);
- pl->mapped_tail = NULL;
- pl->length = 0;
- pl->room = 0;
- INIT_LIST_HEAD(&pl->free_list);
- pl->num_pages_free = 0;
- refcount_set(&pl->refcnt, 1);
-
- return pl;
-}
-EXPORT_SYMBOL(ceph_pagelist_alloc);
-
-static void ceph_pagelist_unmap_tail(struct ceph_pagelist *pl)
-{
- if (pl->mapped_tail) {
- struct page *page = list_entry(pl->head.prev, struct page, lru);
- kunmap(page);
- pl->mapped_tail = NULL;
- }
-}
-
-void ceph_pagelist_release(struct ceph_pagelist *pl)
-{
- if (!refcount_dec_and_test(&pl->refcnt))
- return;
- ceph_pagelist_unmap_tail(pl);
- while (!list_empty(&pl->head)) {
- struct page *page = list_first_entry(&pl->head, struct page,
- lru);
- list_del(&page->lru);
- __free_page(page);
- }
- ceph_pagelist_free_reserve(pl);
- kfree(pl);
-}
-EXPORT_SYMBOL(ceph_pagelist_release);
-
-static int ceph_pagelist_addpage(struct ceph_pagelist *pl)
-{
- struct page *page;
-
- if (!pl->num_pages_free) {
- page = __page_cache_alloc(GFP_NOFS);
- } else {
- page = list_first_entry(&pl->free_list, struct page, lru);
- list_del(&page->lru);
- --pl->num_pages_free;
- }
- if (!page)
- return -ENOMEM;
- pl->room += PAGE_SIZE;
- ceph_pagelist_unmap_tail(pl);
- list_add_tail(&page->lru, &pl->head);
- pl->mapped_tail = kmap(page);
- return 0;
-}
-
-int ceph_pagelist_append(struct ceph_pagelist *pl, const void *buf, size_t len)
-{
- while (pl->room < len) {
- size_t bit = pl->room;
- int ret;
-
- memcpy(pl->mapped_tail + (pl->length & ~PAGE_MASK),
- buf, bit);
- pl->length += bit;
- pl->room -= bit;
- buf += bit;
- len -= bit;
- ret = ceph_pagelist_addpage(pl);
- if (ret)
- return ret;
- }
-
- memcpy(pl->mapped_tail + (pl->length & ~PAGE_MASK), buf, len);
- pl->length += len;
- pl->room -= len;
- return 0;
-}
-EXPORT_SYMBOL(ceph_pagelist_append);
-
-/* Allocate enough pages for a pagelist to append the given amount
- * of data without allocating.
- * Returns: 0 on success, -ENOMEM on error.
- */
-int ceph_pagelist_reserve(struct ceph_pagelist *pl, size_t space)
-{
- if (space <= pl->room)
- return 0;
- space -= pl->room;
- space = (space + PAGE_SIZE - 1) >> PAGE_SHIFT; /* conv to num pages */
-
- while (space > pl->num_pages_free) {
- struct page *page = __page_cache_alloc(GFP_NOFS);
- if (!page)
- return -ENOMEM;
- list_add_tail(&page->lru, &pl->free_list);
- ++pl->num_pages_free;
- }
- return 0;
-}
-EXPORT_SYMBOL(ceph_pagelist_reserve);
-
-/* Free any pages that have been preallocated. */
-int ceph_pagelist_free_reserve(struct ceph_pagelist *pl)
-{
- while (!list_empty(&pl->free_list)) {
- struct page *page = list_first_entry(&pl->free_list,
- struct page, lru);
- list_del(&page->lru);
- __free_page(page);
- --pl->num_pages_free;
- }
- BUG_ON(pl->num_pages_free);
- return 0;
-}
-EXPORT_SYMBOL(ceph_pagelist_free_reserve);
* [RFC PATCH 21/35] libceph: Make notify code use ceph_databuf_enc_start/stop
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (19 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 20/35] libceph: Remove ceph_pagelist David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-18 20:12 ` Viacheslav Dubeyko
2025-03-18 22:36 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 22/35] libceph, rbd: Convert ceph_osdc_notify() reply to ceph_databuf David Howells
` (13 subsequent siblings)
34 siblings, 2 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Make the ceph_osdc_notify*() functions use ceph_databuf_enc_start() and
ceph_databuf_enc_stop() when filling out the request data. Also use
ceph_encode_*() rather than ceph_databuf_encode_*(), as the latter does an
iterator copy to deal with page crossing and misalignment (misalignment
being something that the CPU will handle anyway on some arches).
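To illustrate, the converted code follows this encode pattern (a sketch of
the notify_ack case from the diff below; allocation and error handling
elided):

	void *p;

	p = ceph_databuf_enc_start(dbuf);
	ceph_encode_64(&p, notify_id);
	ceph_encode_64(&p, cookie);
	ceph_encode_32(&p, payload_len);
	if (payload)
		ceph_encode_copy(&p, payload, payload_len);
	ceph_databuf_enc_stop(dbuf, p);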
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
net/ceph/osd_client.c | 55 +++++++++++++++++++++----------------------
1 file changed, 27 insertions(+), 28 deletions(-)
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 0ac439e7e730..1a0cb2cdcc52 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -4759,7 +4759,10 @@ static int osd_req_op_notify_ack_init(struct ceph_osd_request *req, int which,
{
struct ceph_osd_req_op *op;
struct ceph_databuf *dbuf;
- int ret;
+ void *p;
+
+ if (!payload)
+ payload_len = 0;
op = osd_req_op_init(req, which, CEPH_OSD_OP_NOTIFY_ACK, 0);
@@ -4767,18 +4770,13 @@ static int osd_req_op_notify_ack_init(struct ceph_osd_request *req, int which,
if (!dbuf)
return -ENOMEM;
- ret = ceph_databuf_encode_64(dbuf, notify_id);
- ret |= ceph_databuf_encode_64(dbuf, cookie);
- if (payload) {
- ret |= ceph_databuf_encode_32(dbuf, payload_len);
- ret |= ceph_databuf_append(dbuf, payload, payload_len);
- } else {
- ret |= ceph_databuf_encode_32(dbuf, 0);
- }
- if (ret) {
- ceph_databuf_release(dbuf);
- return -ENOMEM;
- }
+ p = ceph_databuf_enc_start(dbuf);
+ ceph_encode_64(&p, notify_id);
+ ceph_encode_64(&p, cookie);
+ ceph_encode_32(&p, payload_len);
+ if (payload)
+ ceph_encode_copy(&p, payload, payload_len);
+ ceph_databuf_enc_stop(dbuf, p);
ceph_osd_databuf_init(&op->notify_ack.request_data, dbuf);
op->indata_len = ceph_databuf_len(dbuf);
@@ -4840,8 +4838,12 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc,
size_t *preply_len)
{
struct ceph_osd_linger_request *lreq;
+ void *p;
int ret;
+ if (WARN_ON_ONCE(payload_len > PAGE_SIZE - 3 * 4))
+ return -EIO;
+
WARN_ON(!timeout);
if (preply_pages) {
*preply_pages = NULL;
@@ -4852,20 +4854,19 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc,
if (!lreq)
return -ENOMEM;
- lreq->request_pl = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO);
+ lreq->request_pl = ceph_databuf_req_alloc(0, 3 * 4 + payload_len,
+ GFP_NOIO);
if (!lreq->request_pl) {
ret = -ENOMEM;
goto out_put_lreq;
}
- ret = ceph_databuf_encode_32(lreq->request_pl, 1); /* prot_ver */
- ret |= ceph_databuf_encode_32(lreq->request_pl, timeout);
- ret |= ceph_databuf_encode_32(lreq->request_pl, payload_len);
- ret |= ceph_databuf_append(lreq->request_pl, payload, payload_len);
- if (ret) {
- ret = -ENOMEM;
- goto out_put_lreq;
- }
+ p = ceph_databuf_enc_start(lreq->request_pl);
+ ceph_encode_32(&p, 1); /* prot_ver */
+ ceph_encode_32(&p, timeout);
+ ceph_encode_32(&p, payload_len);
+ ceph_encode_copy(&p, payload, payload_len);
+ ceph_databuf_enc_stop(lreq->request_pl, p);
/* for notify_id */
lreq->notify_id_buf = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_NOIO);
@@ -5217,7 +5218,7 @@ int osd_req_op_copy_from_init(struct ceph_osd_request *req,
{
struct ceph_osd_req_op *op;
struct ceph_databuf *dbuf;
- void *p, *end;
+ void *p;
dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_KERNEL);
if (!dbuf)
@@ -5230,15 +5231,13 @@ int osd_req_op_copy_from_init(struct ceph_osd_request *req,
op->copy_from.flags = copy_from_flags;
op->copy_from.src_fadvise_flags = src_fadvise_flags;
- p = kmap_ceph_databuf_page(dbuf, 0);
- end = p + PAGE_SIZE;
+ p = ceph_databuf_enc_start(dbuf);
ceph_encode_string(&p, src_oid->name, src_oid->name_len);
encode_oloc(&p, src_oloc);
ceph_encode_32(&p, truncate_seq);
ceph_encode_64(&p, truncate_size);
- op->indata_len = PAGE_SIZE - (end - p);
- ceph_databuf_added_data(dbuf, op->indata_len);
- kunmap_local(p);
+ ceph_databuf_enc_stop(dbuf, p);
+ op->indata_len = ceph_databuf_len(dbuf);
ceph_osd_databuf_init(&op->copy_from.osd_data, dbuf);
return 0;
* [RFC PATCH 22/35] libceph, rbd: Convert ceph_osdc_notify() reply to ceph_databuf
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (20 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 21/35] libceph: Make notify code use ceph_databuf_enc_start/stop David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-19 0:08 ` Viacheslav Dubeyko
2025-03-20 14:44 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 23/35] rbd: Use ceph_databuf_enc_start/stop() David Howells
` (12 subsequent siblings)
34 siblings, 2 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Convert the reply buffer of ceph_osdc_notify() to a ceph_databuf rather
than an array of pages.
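With this, a caller allocates an empty databuf up front, hands it in and
releases it when done - a sketch mirroring the rbd_request_lock()
conversion in the diff below:

	struct ceph_databuf *reply;
	int ret;

	reply = ceph_databuf_reply_alloc(0, 0, GFP_KERNEL);
	if (!reply)
		return -ENOMEM;
	ret = ceph_osdc_notify(osdc, oid, oloc, payload, payload_len,
			       RBD_NOTIFY_TIMEOUT, reply);
	/* Decode the reply via kmap_ceph_databuf_page(reply, 0)... */
	ceph_databuf_release(reply);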
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
drivers/block/rbd.c | 36 +++++++++++++++++----------
include/linux/ceph/databuf.h | 16 ++++++++++++
include/linux/ceph/osd_client.h | 7 ++----
net/ceph/osd_client.c | 44 +++++++++++----------------------
4 files changed, 55 insertions(+), 48 deletions(-)
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index eea12c7ab2a0..a2674077edea 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3585,8 +3585,7 @@ static void rbd_unlock(struct rbd_device *rbd_dev)
static int __rbd_notify_op_lock(struct rbd_device *rbd_dev,
enum rbd_notify_op notify_op,
- struct page ***preply_pages,
- size_t *preply_len)
+ struct ceph_databuf *reply)
{
struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
struct rbd_client_id cid = rbd_get_cid(rbd_dev);
@@ -3604,13 +3603,13 @@ static int __rbd_notify_op_lock(struct rbd_device *rbd_dev,
return ceph_osdc_notify(osdc, &rbd_dev->header_oid,
&rbd_dev->header_oloc, buf, buf_size,
- RBD_NOTIFY_TIMEOUT, preply_pages, preply_len);
+ RBD_NOTIFY_TIMEOUT, reply);
}
static void rbd_notify_op_lock(struct rbd_device *rbd_dev,
enum rbd_notify_op notify_op)
{
- __rbd_notify_op_lock(rbd_dev, notify_op, NULL, NULL);
+ __rbd_notify_op_lock(rbd_dev, notify_op, NULL);
}
static void rbd_notify_acquired_lock(struct work_struct *work)
@@ -3631,23 +3630,29 @@ static void rbd_notify_released_lock(struct work_struct *work)
static int rbd_request_lock(struct rbd_device *rbd_dev)
{
- struct page **reply_pages;
- size_t reply_len;
+ struct ceph_databuf *reply;
bool lock_owner_responded = false;
int ret;
dout("%s rbd_dev %p\n", __func__, rbd_dev);
- ret = __rbd_notify_op_lock(rbd_dev, RBD_NOTIFY_OP_REQUEST_LOCK,
- &reply_pages, &reply_len);
+ /* The actual reply pages will be allocated in the read path and then
+ * pasted in by handle_watch_notify().
+ */
+ reply = ceph_databuf_reply_alloc(0, 0, GFP_KERNEL);
+ if (!reply)
+ return -ENOMEM;
+
+ ret = __rbd_notify_op_lock(rbd_dev, RBD_NOTIFY_OP_REQUEST_LOCK, reply);
if (ret && ret != -ETIMEDOUT) {
rbd_warn(rbd_dev, "failed to request lock: %d", ret);
goto out;
}
- if (reply_len > 0 && reply_len <= PAGE_SIZE) {
- void *p = page_address(reply_pages[0]);
- void *const end = p + reply_len;
+ if (ceph_databuf_len(reply) > 0 && ceph_databuf_len(reply) <= PAGE_SIZE) {
+ void *s = kmap_ceph_databuf_page(reply, 0);
+ void *p = s;
+ void *const end = p + ceph_databuf_len(reply);
u32 n;
ceph_decode_32_safe(&p, end, n, e_inval); /* num_acks */
@@ -3659,10 +3664,12 @@ static int rbd_request_lock(struct rbd_device *rbd_dev)
p += 8 + 8; /* skip gid and cookie */
ceph_decode_32_safe(&p, end, len, e_inval);
- if (!len)
+ if (!len) {
continue;
+ }
if (lock_owner_responded) {
+ kunmap_local(s);
rbd_warn(rbd_dev,
"duplicate lock owners detected");
ret = -EIO;
@@ -3673,6 +3680,7 @@ static int rbd_request_lock(struct rbd_device *rbd_dev)
ret = ceph_start_decoding(&p, end, 1, "ResponseMessage",
&struct_v, &len);
if (ret) {
+ kunmap_local(s);
rbd_warn(rbd_dev,
"failed to decode ResponseMessage: %d",
ret);
@@ -3681,6 +3689,8 @@ static int rbd_request_lock(struct rbd_device *rbd_dev)
ret = ceph_decode_32(&p);
}
+
+ kunmap_local(s);
}
if (!lock_owner_responded) {
@@ -3689,7 +3699,7 @@ static int rbd_request_lock(struct rbd_device *rbd_dev)
}
out:
- ceph_release_page_vector(reply_pages, calc_pages_for(0, reply_len));
+ ceph_databuf_release(reply);
return ret;
e_inval:
diff --git a/include/linux/ceph/databuf.h b/include/linux/ceph/databuf.h
index 54b76d0c91a0..25154b3d08fa 100644
--- a/include/linux/ceph/databuf.h
+++ b/include/linux/ceph/databuf.h
@@ -150,4 +150,20 @@ static inline bool ceph_databuf_is_all_zero(struct ceph_databuf *dbuf, size_t co
ceph_databuf_scan_for_nonzero) == count;
}
+static inline void ceph_databuf_transfer(struct ceph_databuf *to,
+ struct ceph_databuf *from)
+{
+ BUG_ON(to->nr_bvec || to->bvec);
+ to->bvec = from->bvec;
+ to->nr_bvec = from->nr_bvec;
+ to->max_bvec = from->max_bvec;
+ to->limit = from->limit;
+ to->iter = from->iter;
+
+ from->bvec = NULL;
+ from->nr_bvec = from->max_bvec = 0;
+ from->limit = 0;
+ iov_iter_discard(&from->iter, ITER_DEST, 0);
+}
+
#endif /* __FS_CEPH_DATABUF_H */
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 5a1ee66ca216..7eff589711cc 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -333,9 +333,7 @@ struct ceph_osd_linger_request {
struct ceph_databuf *request_pl;
struct ceph_databuf *notify_id_buf;
-
- struct page ***preply_pages;
- size_t *preply_len;
+ struct ceph_databuf *reply;
};
struct ceph_watch_item {
@@ -589,8 +587,7 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc,
void *payload,
u32 payload_len,
u32 timeout,
- struct page ***preply_pages,
- size_t *preply_len);
+ struct ceph_databuf *reply);
int ceph_osdc_list_watchers(struct ceph_osd_client *osdc,
struct ceph_object_id *oid,
struct ceph_object_locator *oloc,
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 1a0cb2cdcc52..92aaa5ed9145 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -4523,17 +4523,11 @@ static void handle_watch_notify(struct ceph_osd_client *osdc,
dout("lreq %p notify_id %llu != %llu, ignoring\n", lreq,
lreq->notify_id, notify_id);
} else if (!completion_done(&lreq->notify_finish_wait)) {
- struct ceph_msg_data *data =
- msg->num_data_items ? &msg->data[0] : NULL;
-
- if (data) {
- if (lreq->preply_pages) {
- WARN_ON(data->type !=
- CEPH_MSG_DATA_PAGES);
- *lreq->preply_pages = data->pages;
- *lreq->preply_len = data->length;
- data->own_pages = false;
- }
+ if (msg->num_data_items && lreq->reply) {
+ struct ceph_msg_data *data = &msg->data[0];
+
+ WARN_ON(data->type != CEPH_MSG_DATA_DATABUF);
+ ceph_databuf_transfer(lreq->reply, data->dbuf);
}
lreq->notify_finish_error = return_code;
complete_all(&lreq->notify_finish_wait);
@@ -4823,10 +4817,7 @@ EXPORT_SYMBOL(ceph_osdc_notify_ack);
/*
* @timeout: in seconds
*
- * @preply_{pages,len} are initialized both on success and error.
- * The caller is responsible for:
- *
- * ceph_release_page_vector(reply_pages, calc_pages_for(0, reply_len))
+ * @reply should be an empty ceph_databuf.
*/
int ceph_osdc_notify(struct ceph_osd_client *osdc,
struct ceph_object_id *oid,
@@ -4834,8 +4825,7 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc,
void *payload,
u32 payload_len,
u32 timeout,
- struct page ***preply_pages,
- size_t *preply_len)
+ struct ceph_databuf *reply)
{
struct ceph_osd_linger_request *lreq;
void *p;
@@ -4845,10 +4835,6 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc,
return -EIO;
WARN_ON(!timeout);
- if (preply_pages) {
- *preply_pages = NULL;
- *preply_len = 0;
- }
lreq = linger_alloc(osdc);
if (!lreq)
@@ -4875,8 +4861,7 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc,
goto out_put_lreq;
}
- lreq->preply_pages = preply_pages;
- lreq->preply_len = preply_len;
+ lreq->reply = reply;
ceph_oid_copy(&lreq->t.base_oid, oid);
ceph_oloc_copy(&lreq->t.base_oloc, oloc);
@@ -5383,7 +5368,7 @@ static struct ceph_msg *get_reply(struct ceph_connection *con,
return m;
}
-static struct ceph_msg *alloc_msg_with_page_vector(struct ceph_msg_header *hdr)
+static struct ceph_msg *alloc_msg_with_data_buffer(struct ceph_msg_header *hdr)
{
struct ceph_msg *m;
int type = le16_to_cpu(hdr->type);
@@ -5395,16 +5380,15 @@ static struct ceph_msg *alloc_msg_with_page_vector(struct ceph_msg_header *hdr)
return NULL;
if (data_len) {
- struct page **pages;
+ struct ceph_databuf *dbuf;
- pages = ceph_alloc_page_vector(calc_pages_for(0, data_len),
- GFP_NOIO);
- if (IS_ERR(pages)) {
+ dbuf = ceph_databuf_reply_alloc(0, data_len, GFP_NOIO);
+ if (!dbuf) {
ceph_msg_put(m);
return NULL;
}
- ceph_msg_data_add_pages(m, pages, data_len, 0, true);
+ ceph_msg_data_add_databuf(m, dbuf);
}
return m;
@@ -5422,7 +5406,7 @@ static struct ceph_msg *osd_alloc_msg(struct ceph_connection *con,
case CEPH_MSG_OSD_MAP:
case CEPH_MSG_OSD_BACKOFF:
case CEPH_MSG_WATCH_NOTIFY:
- return alloc_msg_with_page_vector(hdr);
+ return alloc_msg_with_data_buffer(hdr);
case CEPH_MSG_OSD_OPREPLY:
return get_reply(con, hdr, skip);
default:
* [RFC PATCH 23/35] rbd: Use ceph_databuf_enc_start/stop()
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (21 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 22/35] libceph, rbd: Convert ceph_osdc_notify() reply to ceph_databuf David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-19 0:32 ` Viacheslav Dubeyko
2025-03-20 14:59 ` Why use plain numbers and totals rather than predef'd constants for RPC sizes? David Howells
2025-03-13 23:33 ` [RFC PATCH 24/35] ceph: Make ceph_calc_file_object_mapping() return size as size_t David Howells
` (11 subsequent siblings)
34 siblings, 2 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Make rbd use ceph_databuf_enc_start() and ceph_databuf_enc_stop() when
filling out the request data. Also use ceph_encode_*() rather than
ceph_databuf_encode_*(), as the latter does an iterator copy to deal with
page crossing and misalignment (misalignment being something that the CPU
will handle anyway on some arches).
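As an example of the explicit sizing this enables, the object_map_update
buffer below is allocated as:

	request = ceph_databuf_req_alloc(1, 8 * 2 + 3 * 1, GFP_NOIO);

where 8 * 2 + 3 * 1 is two u64 object numbers plus at most three one-byte
fields - a reading of the encode sequence in the hunk, not a named
constant from the tree.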
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
drivers/block/rbd.c | 64 ++++++++++++++++++++++-----------------------
1 file changed, 31 insertions(+), 33 deletions(-)
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index a2674077edea..956fc4a8f1da 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1970,19 +1970,19 @@ static int rbd_cls_object_map_update(struct ceph_osd_request *req,
int which, u64 objno, u8 new_state,
const u8 *current_state)
{
- struct ceph_databuf *dbuf;
- void *p, *start;
+ struct ceph_databuf *request;
+ void *p;
int ret;
ret = osd_req_op_cls_init(req, which, "rbd", "object_map_update");
if (ret)
return ret;
- dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO);
- if (!dbuf)
+ request = ceph_databuf_req_alloc(1, 8 * 2 + 3 * 1, GFP_NOIO);
+ if (!request)
return -ENOMEM;
- p = start = kmap_ceph_databuf_page(dbuf, 0);
+ p = ceph_databuf_enc_start(request);
ceph_encode_64(&p, objno);
ceph_encode_64(&p, objno + 1);
ceph_encode_8(&p, new_state);
@@ -1992,10 +1992,9 @@ static int rbd_cls_object_map_update(struct ceph_osd_request *req,
} else {
ceph_encode_8(&p, 0);
}
- kunmap_local(p);
- ceph_databuf_added_data(dbuf, p - start);
+ ceph_databuf_enc_stop(request, p);
- osd_req_op_cls_request_databuf(req, which, dbuf);
+ osd_req_op_cls_request_databuf(req, which, request);
return 0;
}
@@ -2108,7 +2107,7 @@ static int rbd_obj_calc_img_extents(struct rbd_obj_request *obj_req,
static int rbd_osd_setup_stat(struct ceph_osd_request *osd_req, int which)
{
- struct ceph_databuf *dbuf;
+ struct ceph_databuf *request;
/*
* The response data for a STAT call consists of:
@@ -2118,12 +2117,12 @@ static int rbd_osd_setup_stat(struct ceph_osd_request *osd_req, int which)
* le32 tv_nsec;
* } mtime;
*/
- dbuf = ceph_databuf_reply_alloc(1, 8 + sizeof(struct ceph_timespec), GFP_NOIO);
- if (!dbuf)
+ request = ceph_databuf_reply_alloc(1, 8 + sizeof(struct ceph_timespec), GFP_NOIO);
+ if (!request)
return -ENOMEM;
osd_req_op_init(osd_req, which, CEPH_OSD_OP_STAT, 0);
- osd_req_op_raw_data_in_databuf(osd_req, which, dbuf);
+ osd_req_op_raw_data_in_databuf(osd_req, which, request);
return 0;
}
@@ -2964,16 +2963,16 @@ static int rbd_obj_copyup_current_snapc(struct rbd_obj_request *obj_req,
static int setup_copyup_buf(struct rbd_obj_request *obj_req, u64 obj_overlap)
{
- struct ceph_databuf *dbuf;
+ struct ceph_databuf *request;
rbd_assert(!obj_req->copyup_buf);
- dbuf = ceph_databuf_req_alloc(calc_pages_for(0, obj_overlap),
+ request = ceph_databuf_req_alloc(calc_pages_for(0, obj_overlap),
obj_overlap, GFP_NOIO);
- if (!dbuf)
+ if (!request)
return -ENOMEM;
- obj_req->copyup_buf = dbuf;
+ obj_req->copyup_buf = request;
return 0;
}
@@ -4580,10 +4579,9 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev,
if (!request)
return -ENOMEM;
- p = kmap_ceph_databuf_page(request, 0);
- memcpy(p, outbound, outbound_size);
- kunmap_local(p);
- ceph_databuf_added_data(request, outbound_size);
+ p = ceph_databuf_enc_start(request);
+ ceph_encode_copy(&p, outbound, outbound_size);
+ ceph_databuf_enc_stop(request, p);
}
reply = ceph_databuf_reply_alloc(1, inbound_size, GFP_KERNEL);
@@ -4712,7 +4710,7 @@ static void rbd_free_disk(struct rbd_device *rbd_dev)
static int rbd_obj_read_sync(struct rbd_device *rbd_dev,
struct ceph_object_id *oid,
struct ceph_object_locator *oloc,
- struct ceph_databuf *dbuf, int len)
+ struct ceph_databuf *request, int len)
{
struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
struct ceph_osd_request *req;
@@ -4727,7 +4725,7 @@ static int rbd_obj_read_sync(struct rbd_device *rbd_dev,
req->r_flags = CEPH_OSD_FLAG_READ;
osd_req_op_extent_init(req, 0, CEPH_OSD_OP_READ, 0, len, 0, 0);
- osd_req_op_extent_osd_databuf(req, 0, dbuf);
+ osd_req_op_extent_osd_databuf(req, 0, request);
ret = ceph_osdc_alloc_messages(req, GFP_KERNEL);
if (ret)
@@ -4750,16 +4748,16 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev,
bool first_time)
{
struct rbd_image_header_ondisk *ondisk;
- struct ceph_databuf *dbuf = NULL;
+ struct ceph_databuf *request = NULL;
u32 snap_count = 0;
u64 names_size = 0;
u32 want_count;
int ret;
- dbuf = ceph_databuf_req_alloc(1, sizeof(*ondisk), GFP_KERNEL);
- if (!dbuf)
+ request = ceph_databuf_req_alloc(1, sizeof(*ondisk), GFP_KERNEL);
+ if (!request)
return -ENOMEM;
- ondisk = kmap_ceph_databuf_page(dbuf, 0);
+ ondisk = kmap_ceph_databuf_page(request, 0);
/*
* The complete header will include an array of its 64-bit
@@ -4776,13 +4774,13 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev,
size += names_size;
ret = -ENOMEM;
- if (size > dbuf->limit &&
- ceph_databuf_reserve(dbuf, size - dbuf->limit,
+ if (size > request->limit &&
+ ceph_databuf_reserve(request, size - request->limit,
GFP_KERNEL) < 0)
goto out;
ret = rbd_obj_read_sync(rbd_dev, &rbd_dev->header_oid,
- &rbd_dev->header_oloc, dbuf, size);
+ &rbd_dev->header_oloc, request, size);
if (ret < 0)
goto out;
if ((size_t)ret < size) {
@@ -4806,7 +4804,7 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev,
ret = rbd_header_from_disk(header, ondisk, first_time);
out:
kunmap_local(ondisk);
- ceph_databuf_release(dbuf);
+ ceph_databuf_release(request);
return ret;
}
@@ -5625,10 +5623,10 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev,
if (!reply)
goto out_free;
- p = kmap_ceph_databuf_page(request, 0);
+ p = ceph_databuf_enc_start(request);
ceph_encode_64(&p, rbd_dev->spec->snap_id);
- kunmap_local(p);
- ceph_databuf_added_data(request, sizeof(__le64));
+ ceph_databuf_enc_stop(request, p);
+
ret = __get_parent_info(rbd_dev, request, reply, pii);
if (ret > 0)
ret = __get_parent_info_legacy(rbd_dev, request, reply, pii);
* [RFC PATCH 24/35] ceph: Make ceph_calc_file_object_mapping() return size as size_t
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (22 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 23/35] rbd: Use ceph_databuf_enc_start/stop() David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 25/35] ceph: Wrap POSIX_FADV_WILLNEED to get caps David Howells
` (10 subsequent siblings)
34 siblings, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Make ceph_calc_file_object_mapping() return the size as a size_t.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
fs/ceph/addr.c | 4 ++--
fs/ceph/crypto.c | 2 +-
fs/ceph/file.c | 9 ++++-----
fs/ceph/ioctl.c | 2 +-
include/linux/ceph/striper.h | 6 +++---
net/ceph/osd_client.c | 2 +-
net/ceph/striper.c | 4 ++--
7 files changed, 14 insertions(+), 15 deletions(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 482a9f41a685..7c89cafcb91a 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -335,8 +335,8 @@ static int ceph_netfs_prepare_read(struct netfs_io_subrequest *subreq)
struct inode *inode = rreq->inode;
struct ceph_inode_info *ci = ceph_inode(inode);
struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+ size_t xlen;
u64 objno, objoff;
- u32 xlen;
/* Truncate the extent at the end of the current block */
ceph_calc_file_object_mapping(&ci->i_layout, subreq->start, subreq->len,
@@ -1205,9 +1205,9 @@ void ceph_allocate_page_array(struct address_space *mapping,
{
struct inode *inode = mapping->host;
struct ceph_inode_info *ci = ceph_inode(inode);
+ size_t xlen;
u64 objnum;
u64 objoff;
- u32 xlen;
/* prepare async write request */
ceph_wbc->offset = (u64)folio_pos(folio);
diff --git a/fs/ceph/crypto.c b/fs/ceph/crypto.c
index 3b3c4d8d401e..a28dea74ca6f 100644
--- a/fs/ceph/crypto.c
+++ b/fs/ceph/crypto.c
@@ -594,8 +594,8 @@ int ceph_fscrypt_decrypt_extents(struct inode *inode, struct page **page,
struct ceph_client *cl = ceph_inode_to_client(inode);
int i, ret = 0;
struct ceph_inode_info *ci = ceph_inode(inode);
+ size_t xlen;
u64 objno, objoff;
- u32 xlen;
/* Nothing to do for empty array */
if (ext_cnt == 0) {
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index fb4024bc8274..ffd36e00b0de 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1731,12 +1731,11 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
u64 write_pos = pos;
u64 write_len = len;
u64 objnum, objoff;
- u32 xlen;
u64 assert_ver = 0;
bool rmw;
bool first, last;
struct iov_iter saved_iter = *from;
- size_t off;
+ size_t off, xlen;
ceph_fscrypt_adjust_off_and_len(inode, &write_pos, &write_len);
@@ -2870,8 +2869,8 @@ static ssize_t ceph_do_objects_copy(struct ceph_inode_info *src_ci, u64 *src_off
struct ceph_osd_client *osdc;
struct ceph_osd_request *req;
size_t bytes = 0;
+ size_t src_objlen, dst_objlen;
u64 src_objnum, src_objoff, dst_objnum, dst_objoff;
- u32 src_objlen, dst_objlen;
u32 object_size = src_ci->i_layout.object_size;
struct ceph_client *cl = fsc->client;
int ret;
@@ -2948,8 +2947,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
struct ceph_client *cl = src_fsc->client;
loff_t size;
ssize_t ret = -EIO, bytes;
+ size_t src_objlen, dst_objlen;
u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
- u32 src_objlen, dst_objlen;
int src_got = 0, dst_got = 0, err, dirty;
if (src_inode->i_sb != dst_inode->i_sb) {
@@ -3060,7 +3059,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
* starting at the src_off
*/
if (src_objoff) {
- doutc(cl, "Initial partial copy of %u bytes\n", src_objlen);
+ doutc(cl, "Initial partial copy of %zu bytes\n", src_objlen);
/*
* we need to temporarily drop all caps as we'll be calling
diff --git a/fs/ceph/ioctl.c b/fs/ceph/ioctl.c
index e861de3c79b9..fab0e89ad7b4 100644
--- a/fs/ceph/ioctl.c
+++ b/fs/ceph/ioctl.c
@@ -186,7 +186,7 @@ static long ceph_ioctl_get_dataloc(struct file *file, void __user *arg)
&ceph_sb_to_fs_client(inode->i_sb)->client->osdc;
struct ceph_object_locator oloc;
CEPH_DEFINE_OID_ONSTACK(oid);
- u32 xlen;
+ size_t xlen;
u64 tmp;
struct ceph_pg pgid;
int r;
diff --git a/include/linux/ceph/striper.h b/include/linux/ceph/striper.h
index 50bc1b88c5c4..e1036e953d7b 100644
--- a/include/linux/ceph/striper.h
+++ b/include/linux/ceph/striper.h
@@ -10,7 +10,7 @@ struct ceph_file_layout;
void ceph_calc_file_object_mapping(struct ceph_file_layout *l,
u64 off, u64 len,
- u64 *objno, u64 *objoff, u32 *xlen);
+ u64 *objno, u64 *objoff, size_t *xlen);
struct ceph_object_extent {
struct list_head oe_item;
@@ -97,14 +97,14 @@ int ceph_iterate_extents(struct ceph_file_layout *l, u64 off, u64 len,
while (len) {
struct ceph_object_extent *ex;
u64 objno, objoff;
- u32 xlen;
+ size_t xlen;
ceph_calc_file_object_mapping(l, off, len, &objno, &objoff,
&xlen);
ex = ceph_lookup_containing(object_extents, objno, objoff, xlen);
if (!ex) {
- WARN(1, "%s: objno %llu %llu~%u not found!\n",
+ WARN(1, "%s: objno %llu %llu~%zu not found!\n",
__func__, objno, objoff, xlen);
return -EINVAL;
}
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 92aaa5ed9145..f943d4e85a13 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -100,7 +100,7 @@ static int calc_layout(struct ceph_file_layout *layout, u64 off, u64 *plen,
u64 *objnum, u64 *objoff, u64 *objlen)
{
u64 orig_len = *plen;
- u32 xlen;
+ size_t xlen;
/* object extent? */
ceph_calc_file_object_mapping(layout, off, orig_len, objnum,
diff --git a/net/ceph/striper.c b/net/ceph/striper.c
index 3dedbf018fa6..c934c9addc9d 100644
--- a/net/ceph/striper.c
+++ b/net/ceph/striper.c
@@ -23,7 +23,7 @@
*/
void ceph_calc_file_object_mapping(struct ceph_file_layout *l,
u64 off, u64 len,
- u64 *objno, u64 *objoff, u32 *xlen)
+ u64 *objno, u64 *objoff, size_t *xlen)
{
u32 stripes_per_object = l->object_size / l->stripe_unit;
u64 blockno; /* which su in the file (i.e. globally) */
@@ -100,7 +100,7 @@ int ceph_file_to_extents(struct ceph_file_layout *l, u64 off, u64 len,
while (len) {
struct list_head *add_pos = NULL;
u64 objno, objoff;
- u32 xlen;
+ size_t xlen;
ceph_calc_file_object_mapping(l, off, len, &objno, &objoff,
&xlen);
* [RFC PATCH 25/35] ceph: Wrap POSIX_FADV_WILLNEED to get caps
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (23 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 24/35] ceph: Make ceph_calc_file_object_mapping() return size as size_t David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 26/35] ceph: Kill ceph_rw_context David Howells
` (9 subsequent siblings)
34 siblings, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Wrap the handling of fadvise(POSIX_FADV_WILLNEED) so that we get the
appropriate caps needed to do it.
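For reference, the path being wrapped is the one userspace reaches with an
ordinary call such as (illustrative only, not part of the patch):

	/* Ask the kernel to read ahead on a ceph file. */
	posix_fadvise(fd, 0, len, POSIX_FADV_WILLNEED);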
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
fs/ceph/file.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index ffd36e00b0de..b876cecbaba5 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -13,6 +13,7 @@
#include <linux/iversion.h>
#include <linux/ktime.h>
#include <linux/splice.h>
+#include <linux/fadvise.h>
#include "super.h"
#include "mds_client.h"
@@ -3150,6 +3151,49 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
return ret;
}
+/*
+ * If the user wants to manually trigger readahead, we have to get a cap to
+ * allow that.
+ */
+static int ceph_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
+{
+ struct inode *inode = file_inode(file);
+ struct ceph_file_info *fi = file->private_data;
+ struct ceph_client *cl = ceph_inode_to_client(inode);
+ int want = CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO, got = 0;
+ int ret;
+
+ if (advice != POSIX_FADV_WILLNEED)
+ return generic_fadvise(file, offset, len, advice);
+
+ if (!(fi->flags & CEPH_F_SYNC))
+ return -EACCES;
+ if (fi->fmode & CEPH_FILE_MODE_LAZY)
+ return -EACCES;
+
+ ret = ceph_get_caps(file, CEPH_CAP_FILE_RD, want, -1, &got);
+ if (ret < 0) {
+ doutc(cl, "%llx.%llx, error getting cap\n", ceph_vinop(inode));
+ goto out;
+ }
+
+ if ((got & want) == want) {
+ doutc(cl, "fadvise(WILLNEED) %p %llx.%llx %llu~%llu got cap refs on %s\n",
+ inode, ceph_vinop(inode), offset, len,
+ ceph_cap_string(got));
+ ret = generic_fadvise(file, offset, len, advice);
+ } else {
+ doutc(cl, "%llx.%llx, no cache cap\n", ceph_vinop(inode));
+ ret = -EACCES;
+ }
+
+ doutc(cl, "%p %llx.%llx dropping cap refs on %s = %d\n",
+ inode, ceph_vinop(inode), ceph_cap_string(got), ret);
+ ceph_put_cap_refs(ceph_inode(inode), got);
+out:
+ return ret;
+}
+
const struct file_operations ceph_file_fops = {
.open = ceph_open,
.release = ceph_release,
@@ -3167,4 +3211,5 @@ const struct file_operations ceph_file_fops = {
.compat_ioctl = compat_ptr_ioctl,
.fallocate = ceph_fallocate,
.copy_file_range = ceph_copy_file_range,
+ .fadvise = ceph_fadvise,
};
* [RFC PATCH 26/35] ceph: Kill ceph_rw_context
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (24 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 25/35] ceph: Wrap POSIX_FADV_WILLNEED to get caps David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 27/35] netfs: Pass extra write context to write functions David Howells
` (8 subsequent siblings)
34 siblings, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
With all invokers of readahead:
- read() and co.
- splice()
- fadvise(POSIX_FADV_WILLNEED)
- madvise(MADV_WILLNEED)
- fault-in
now getting the FILE_CACHE cap or the LAZYIO cap and holding it across the
readahead invocation, there's no need for the ceph_rw_context. It can be
assumed that we hold one cap or the other - and apparently it doesn't
matter which, as rw_ctx->caps was never actually checked.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
fs/ceph/addr.c | 19 +++++++------------
fs/ceph/file.c | 13 +------------
fs/ceph/super.h | 47 -----------------------------------------------
3 files changed, 8 insertions(+), 71 deletions(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 7c89cafcb91a..27f27ab24446 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -473,18 +473,16 @@ static int ceph_init_request(struct netfs_io_request *rreq, struct file *file)
if (!priv)
return -ENOMEM;
+ /*
+ * If we are doing readahead triggered by a read, fault-in or
+ * MADV/FADV_WILLNEED, someone higher up the stack must be holding the
+ * FILE_CACHE and/or LAZYIO caps.
+ */
if (file) {
- struct ceph_rw_context *rw_ctx;
- struct ceph_file_info *fi = file->private_data;
-
priv->file_ra_pages = file->f_ra.ra_pages;
priv->file_ra_disabled = file->f_mode & FMODE_RANDOM;
-
- rw_ctx = ceph_find_rw_context(fi);
- if (rw_ctx) {
- rreq->netfs_priv = priv;
- return 0;
- }
+ rreq->netfs_priv = priv;
+ return 0;
}
/*
@@ -1982,10 +1980,7 @@ static vm_fault_t ceph_filemap_fault(struct vm_fault *vmf)
if ((got & (CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO)) ||
!ceph_has_inline_data(ci)) {
- CEPH_DEFINE_RW_CONTEXT(rw_ctx, got);
- ceph_add_rw_context(fi, &rw_ctx);
ret = filemap_fault(vmf);
- ceph_del_rw_context(fi, &rw_ctx);
doutc(cl, "%llx.%llx %llu drop cap refs %s ret %x\n",
ceph_vinop(inode), off, ceph_cap_string(got), ret);
} else
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index b876cecbaba5..4512215cccc6 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -229,8 +229,6 @@ static int ceph_init_file_info(struct inode *inode, struct file *file,
ceph_get_fmode(ci, fmode, 1);
fi->fmode = fmode;
- spin_lock_init(&fi->rw_contexts_lock);
- INIT_LIST_HEAD(&fi->rw_contexts);
fi->filp_gen = READ_ONCE(ceph_inode_to_fs_client(inode)->filp_gen);
if ((file->f_mode & FMODE_WRITE) && ceph_has_inline_data(ci)) {
@@ -999,7 +997,6 @@ int ceph_release(struct inode *inode, struct file *file)
struct ceph_dir_file_info *dfi = file->private_data;
doutc(cl, "%p %llx.%llx dir file %p\n", inode,
ceph_vinop(inode), file);
- WARN_ON(!list_empty(&dfi->file_info.rw_contexts));
ceph_put_fmode(ci, dfi->file_info.fmode, 1);
@@ -1012,7 +1009,6 @@ int ceph_release(struct inode *inode, struct file *file)
struct ceph_file_info *fi = file->private_data;
doutc(cl, "%p %llx.%llx regular file %p\n", inode,
ceph_vinop(inode), file);
- WARN_ON(!list_empty(&fi->rw_contexts));
ceph_fscache_unuse_cookie(inode, file->f_mode & FMODE_WRITE);
ceph_put_fmode(ci, fi->fmode, 1);
@@ -2154,13 +2150,10 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to)
retry_op = READ_INLINE;
}
} else {
- CEPH_DEFINE_RW_CONTEXT(rw_ctx, got);
doutc(cl, "async %p %llx.%llx %llu~%u got cap refs on %s\n",
inode, ceph_vinop(inode), iocb->ki_pos, (unsigned)len,
ceph_cap_string(got));
- ceph_add_rw_context(fi, &rw_ctx);
ret = generic_file_read_iter(iocb, to);
- ceph_del_rw_context(fi, &rw_ctx);
}
doutc(cl, "%p %llx.%llx dropping cap refs on %s = %d\n",
@@ -2256,7 +2249,6 @@ static ssize_t ceph_splice_read(struct file *in, loff_t *ppos,
struct ceph_inode_info *ci = ceph_inode(inode);
ssize_t ret;
int want = 0, got = 0;
- CEPH_DEFINE_RW_CONTEXT(rw_ctx, 0);
dout("splice_read %p %llx.%llx %llu~%zu trying to get caps on %p\n",
inode, ceph_vinop(inode), *ppos, len, inode);
@@ -2291,10 +2283,7 @@ static ssize_t ceph_splice_read(struct file *in, loff_t *ppos,
dout("splice_read %p %llx.%llx %llu~%zu got cap refs on %s\n",
inode, ceph_vinop(inode), *ppos, len, ceph_cap_string(got));
- rw_ctx.caps = got;
- ceph_add_rw_context(fi, &rw_ctx);
ret = filemap_splice_read(in, ppos, pipe, len, flags);
- ceph_del_rw_context(fi, &rw_ctx);
dout("splice_read %p %llx.%llx dropping cap refs on %s = %zd\n",
inode, ceph_vinop(inode), ceph_cap_string(got), ret);
@@ -3177,7 +3166,7 @@ static int ceph_fadvise(struct file *file, loff_t offset, loff_t len, int advice
goto out;
}
- if ((got & want) == want) {
+ if (got & want) {
doutc(cl, "fadvise(WILLNEED) %p %llx.%llx %llu~%llu got cap refs on %s\n",
inode, ceph_vinop(inode), offset, len,
ceph_cap_string(got));
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index b072572e2cf4..14784ad86670 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -833,10 +833,6 @@ extern void change_auth_cap_ses(struct ceph_inode_info *ci,
struct ceph_file_info {
short fmode; /* initialized on open */
short flags; /* CEPH_F_* */
-
- spinlock_t rw_contexts_lock;
- struct list_head rw_contexts;
-
u32 filp_gen;
};
@@ -859,49 +855,6 @@ struct ceph_dir_file_info {
int dir_info_len;
};
-struct ceph_rw_context {
- struct list_head list;
- struct task_struct *thread;
- int caps;
-};
-
-#define CEPH_DEFINE_RW_CONTEXT(_name, _caps) \
- struct ceph_rw_context _name = { \
- .thread = current, \
- .caps = _caps, \
- }
-
-static inline void ceph_add_rw_context(struct ceph_file_info *cf,
- struct ceph_rw_context *ctx)
-{
- spin_lock(&cf->rw_contexts_lock);
- list_add(&ctx->list, &cf->rw_contexts);
- spin_unlock(&cf->rw_contexts_lock);
-}
-
-static inline void ceph_del_rw_context(struct ceph_file_info *cf,
- struct ceph_rw_context *ctx)
-{
- spin_lock(&cf->rw_contexts_lock);
- list_del(&ctx->list);
- spin_unlock(&cf->rw_contexts_lock);
-}
-
-static inline struct ceph_rw_context*
-ceph_find_rw_context(struct ceph_file_info *cf)
-{
- struct ceph_rw_context *ctx, *found = NULL;
- spin_lock(&cf->rw_contexts_lock);
- list_for_each_entry(ctx, &cf->rw_contexts, list) {
- if (ctx->thread == current) {
- found = ctx;
- break;
- }
- }
- spin_unlock(&cf->rw_contexts_lock);
- return found;
-}
-
struct ceph_readdir_cache_control {
struct folio *folio;
struct dentry **dentries;
* [RFC PATCH 27/35] netfs: Pass extra write context to write functions
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (25 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 26/35] ceph: Kill ceph_rw_context David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 28/35] netfs: Adjust group handling David Howells
` (7 subsequent siblings)
34 siblings, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel, Xiubo Li
Allow the filesystem to pass in an extra bit of context to certain write
functions so that netfs_page_mkwrite() and netfs_perform_write() can pass
it back to the filesystem's ->post_modify() function.
This can be used by ceph to pass in a preallocated ceph_cap_flush record.
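For instance, a ->page_mkwrite() hook could hand over a preallocated
record like this (a hypothetical sketch; the actual ceph conversion is
done later in the series):

	static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
	{
		struct ceph_cap_flush *cf = ceph_alloc_cap_flush();

		if (!cf)
			return VM_FAULT_OOM;
		/* cf is handed back to ->post_modify() as fs_priv. */
		return netfs_page_mkwrite(vmf, NULL, cf);
	}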
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
fs/9p/vfs_file.c | 2 +-
fs/afs/write.c | 2 +-
fs/netfs/buffered_write.c | 21 ++++++++++++---------
fs/smb/client/file.c | 4 ++--
include/linux/netfs.h | 9 +++++----
5 files changed, 21 insertions(+), 17 deletions(-)
diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index 348cc90bf9c5..838332d5372c 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -477,7 +477,7 @@ v9fs_file_mmap(struct file *filp, struct vm_area_struct *vma)
static vm_fault_t
v9fs_vm_page_mkwrite(struct vm_fault *vmf)
{
- return netfs_page_mkwrite(vmf, NULL);
+ return netfs_page_mkwrite(vmf, NULL, NULL);
}
static void v9fs_mmap_vm_close(struct vm_area_struct *vma)
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 18b0a9f1615e..054f3a07d2a5 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -276,7 +276,7 @@ vm_fault_t afs_page_mkwrite(struct vm_fault *vmf)
if (afs_validate(AFS_FS_I(file_inode(file)), afs_file_key(file)) < 0)
return VM_FAULT_SIGBUS;
- return netfs_page_mkwrite(vmf, NULL);
+ return netfs_page_mkwrite(vmf, NULL, NULL);
}
/*
diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
index f3370846ba18..0245449b93e3 100644
--- a/fs/netfs/buffered_write.c
+++ b/fs/netfs/buffered_write.c
@@ -86,7 +86,8 @@ static void netfs_update_i_size(struct netfs_inode *ctx, struct inode *inode,
* netfs_perform_write - Copy data into the pagecache.
* @iocb: The operation parameters
* @iter: The source buffer
- * @netfs_group: Grouping for dirty folios (eg. ceph snaps).
+ * @netfs_group: Grouping for dirty folios (eg. ceph snaps)
+ * @fs_priv: Private data to be passed to ->post_modify()
*
* Copy data into pagecache folios attached to the inode specified by @iocb.
* The caller must hold appropriate inode locks.
@@ -97,7 +98,7 @@ static void netfs_update_i_size(struct netfs_inode *ctx, struct inode *inode,
* a new one is started.
*/
ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
- struct netfs_group *netfs_group)
+ struct netfs_group *netfs_group, void *fs_priv)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
@@ -382,7 +383,7 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
*/
set_bit(NETFS_ICTX_MODIFIED_ATTR, &ctx->flags);
if (unlikely(ctx->ops->post_modify))
- ctx->ops->post_modify(inode);
+ ctx->ops->post_modify(inode, fs_priv);
}
if (unlikely(wreq)) {
@@ -411,7 +412,8 @@ EXPORT_SYMBOL(netfs_perform_write);
* netfs_buffered_write_iter_locked - write data to a file
* @iocb: IO state structure (file, offset, etc.)
* @from: iov_iter with data to write
- * @netfs_group: Grouping for dirty folios (eg. ceph snaps).
+ * @netfs_group: Grouping for dirty folios (eg. ceph snaps)
+ * @fs_priv: Private data to be passed to ->post_modify()
*
* This function does all the work needed for actually writing data to a
* file. It does all basic checks, removes SUID from the file, updates
@@ -431,7 +433,7 @@ EXPORT_SYMBOL(netfs_perform_write);
* * negative error code if no data has been written at all
*/
ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *from,
- struct netfs_group *netfs_group)
+ struct netfs_group *netfs_group, void *fs_priv)
{
struct file *file = iocb->ki_filp;
ssize_t ret;
@@ -446,7 +448,7 @@ ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *fr
if (ret)
return ret;
- return netfs_perform_write(iocb, from, netfs_group);
+ return netfs_perform_write(iocb, from, netfs_group, fs_priv);
}
EXPORT_SYMBOL(netfs_buffered_write_iter_locked);
@@ -485,7 +487,7 @@ ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
ret = generic_write_checks(iocb, from);
if (ret > 0)
- ret = netfs_buffered_write_iter_locked(iocb, from, NULL);
+ ret = netfs_buffered_write_iter_locked(iocb, from, NULL, NULL);
netfs_end_io_write(inode);
if (ret > 0)
ret = generic_write_sync(iocb, ret);
@@ -499,7 +501,8 @@ EXPORT_SYMBOL(netfs_file_write_iter);
* we only track group on a per-folio basis, so we block more often than
* we might otherwise.
*/
-vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group)
+vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group,
+ void *fs_priv)
{
struct netfs_group *group;
struct folio *folio = page_folio(vmf->page);
@@ -554,7 +557,7 @@ vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_gr
file_update_time(file);
set_bit(NETFS_ICTX_MODIFIED_ATTR, &ictx->flags);
if (ictx->ops->post_modify)
- ictx->ops->post_modify(inode);
+ ictx->ops->post_modify(inode, fs_priv);
ret = VM_FAULT_LOCKED;
out:
sb_end_pagefault(inode->i_sb);
diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index 8582cf61242c..4329c2bbf74f 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -2779,7 +2779,7 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
goto out;
}
- rc = netfs_buffered_write_iter_locked(iocb, from, NULL);
+ rc = netfs_buffered_write_iter_locked(iocb, from, NULL, NULL);
out:
up_read(&cinode->lock_sem);
@@ -2955,7 +2955,7 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
static vm_fault_t cifs_page_mkwrite(struct vm_fault *vmf)
{
- return netfs_page_mkwrite(vmf, NULL);
+ return netfs_page_mkwrite(vmf, NULL, NULL);
}
static const struct vm_operations_struct cifs_file_vm_ops = {
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index ec1c51697c04..a67297de8a20 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -335,7 +335,7 @@ struct netfs_request_ops {
/* Modification handling */
void (*update_i_size)(struct inode *inode, loff_t i_size);
- void (*post_modify)(struct inode *inode);
+ void (*post_modify)(struct inode *inode, void *fs_priv);
/* Write request handling */
void (*begin_writeback)(struct netfs_io_request *wreq);
@@ -435,9 +435,9 @@ ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter);
/* High-level write API */
ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
- struct netfs_group *netfs_group);
+ struct netfs_group *netfs_group, void *fs_priv);
ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *from,
- struct netfs_group *netfs_group);
+ struct netfs_group *netfs_group, void *fs_priv);
ssize_t netfs_unbuffered_write_iter(struct kiocb *iocb, struct iov_iter *from);
ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *iter,
struct netfs_group *netfs_group);
@@ -466,7 +466,8 @@ void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length);
bool netfs_release_folio(struct folio *folio, gfp_t gfp);
/* VMA operations API. */
-vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group);
+vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group,
+ void *fs_priv);
/* (Sub)request management API. */
void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq);
* [RFC PATCH 28/35] netfs: Adjust group handling
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (26 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 27/35] netfs: Pass extra write context to write functions David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-19 18:57 ` Viacheslav Dubeyko
2025-03-20 15:22 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 29/35] netfs: Allow fs-private data to be handed through to request alloc David Howells
` (6 subsequent siblings)
34 siblings, 2 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Make some adjustments to the handling of netfs groups so that ceph can use
them for snap contexts:
- Move netfs_get_group(), netfs_put_group() and netfs_put_group_many() to
linux/netfs.h so that ceph can build its snap context on netfs groups.
- Move netfs_set_group() and __netfs_set_group() to linux/netfs.h so that
ceph_dirty_folio() can call them from inside the locked section in which
it finds the snap context to attach.
- Provide a netfs_writepages_group() that takes a group as a parameter and
attaches it to the request, and make netfs_free_request() drop the ref on
it. netfs_writepages() then becomes a wrapper that passes in a NULL group
(see the sketch after this list).
- In netfs_perform_write(), only consider a folio to have a conflicting
group if the folio's group pointer isn't NULL or the folio is dirty.
- In netfs_perform_write(), interject a small 10ms sleep after every 16
attempts to flush a folio within a single call.
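As a rough sketch of the intended usage (not part of this patch;
my_snap_context and my_find_oldest_group() are made-up names), a filesystem
could flush a single write grouping like this:

	struct my_snap_context {
		struct netfs_group group;	/* ref count + free routine */
		u64 seq;
	};

	static int my_writepages(struct address_space *mapping,
				 struct writeback_control *wbc)
	{
		struct netfs_group *group = my_find_oldest_group(mapping->host);
		int ret;

		/* The request takes its own ref on the group and
		 * netfs_free_request() will put it when the request dies. */
		ret = netfs_writepages_group(mapping, wbc, group);
		netfs_put_group(group);		/* drop the lookup ref */
		return ret;
	}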
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
fs/netfs/buffered_write.c | 25 ++++-------------
fs/netfs/internal.h | 32 ---------------------
fs/netfs/objects.c | 1 +
fs/netfs/write_issue.c | 38 +++++++++++++++++++++----
include/linux/netfs.h | 59 +++++++++++++++++++++++++++++++++++++++
5 files changed, 98 insertions(+), 57 deletions(-)
diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
index 0245449b93e3..12ddbe9bc78b 100644
--- a/fs/netfs/buffered_write.c
+++ b/fs/netfs/buffered_write.c
@@ -11,26 +11,9 @@
#include <linux/pagemap.h>
#include <linux/slab.h>
#include <linux/pagevec.h>
+#include <linux/delay.h>
#include "internal.h"
-static void __netfs_set_group(struct folio *folio, struct netfs_group *netfs_group)
-{
- if (netfs_group)
- folio_attach_private(folio, netfs_get_group(netfs_group));
-}
-
-static void netfs_set_group(struct folio *folio, struct netfs_group *netfs_group)
-{
- void *priv = folio_get_private(folio);
-
- if (unlikely(priv != netfs_group)) {
- if (netfs_group && (!priv || priv == NETFS_FOLIO_COPY_TO_CACHE))
- folio_attach_private(folio, netfs_get_group(netfs_group));
- else if (!netfs_group && priv == NETFS_FOLIO_COPY_TO_CACHE)
- folio_detach_private(folio);
- }
-}
-
/*
* Grab a folio for writing and lock it. Attempt to allocate as large a folio
* as possible to hold as much of the remaining length as possible in one go.
@@ -113,6 +96,7 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
};
struct netfs_io_request *wreq = NULL;
struct folio *folio = NULL, *writethrough = NULL;
+ unsigned int flush_counter = 0;
unsigned int bdp_flags = (iocb->ki_flags & IOCB_NOWAIT) ? BDP_ASYNC : 0;
ssize_t written = 0, ret, ret2;
loff_t i_size, pos = iocb->ki_pos;
@@ -208,7 +192,8 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
group = netfs_folio_group(folio);
if (unlikely(group != netfs_group) &&
- group != NETFS_FOLIO_COPY_TO_CACHE)
+ group != NETFS_FOLIO_COPY_TO_CACHE &&
+ (group || folio_test_dirty(folio)))
goto flush_content;
if (folio_test_uptodate(folio)) {
@@ -341,6 +326,8 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
trace_netfs_folio(folio, netfs_flush_content);
folio_unlock(folio);
folio_put(folio);
+ if ((++flush_counter & 0xf) == 0xf)
+ msleep(10);
ret = filemap_write_and_wait_range(mapping, fpos, fpos + flen - 1);
if (ret < 0)
goto error_folio_unlock;
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index eebb4f0f660e..2a6123c4da35 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -261,38 +261,6 @@ static inline bool netfs_is_cache_enabled(struct netfs_inode *ctx)
#endif
}
-/*
- * Get a ref on a netfs group attached to a dirty page (e.g. a ceph snap).
- */
-static inline struct netfs_group *netfs_get_group(struct netfs_group *netfs_group)
-{
- if (netfs_group && netfs_group != NETFS_FOLIO_COPY_TO_CACHE)
- refcount_inc(&netfs_group->ref);
- return netfs_group;
-}
-
-/*
- * Dispose of a netfs group attached to a dirty page (e.g. a ceph snap).
- */
-static inline void netfs_put_group(struct netfs_group *netfs_group)
-{
- if (netfs_group &&
- netfs_group != NETFS_FOLIO_COPY_TO_CACHE &&
- refcount_dec_and_test(&netfs_group->ref))
- netfs_group->free(netfs_group);
-}
-
-/*
- * Dispose of a netfs group attached to a dirty page (e.g. a ceph snap).
- */
-static inline void netfs_put_group_many(struct netfs_group *netfs_group, int nr)
-{
- if (netfs_group &&
- netfs_group != NETFS_FOLIO_COPY_TO_CACHE &&
- refcount_sub_and_test(nr, &netfs_group->ref))
- netfs_group->free(netfs_group);
-}
-
/*
* Check to see if a buffer aligns with the crypto block size. If it doesn't
* the crypto layer is going to copy all the data - in which case relying on
diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
index 52d6fce70837..7fdbaa5c5cab 100644
--- a/fs/netfs/objects.c
+++ b/fs/netfs/objects.c
@@ -153,6 +153,7 @@ static void netfs_free_request(struct work_struct *work)
kvfree(rreq->direct_bv);
}
+ netfs_put_group(rreq->group);
rolling_buffer_clear(&rreq->buffer);
rolling_buffer_clear(&rreq->bounce);
if (test_bit(NETFS_RREQ_PUT_RMW_TAIL, &rreq->flags))
diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
index 93601033ba08..3921fcf4f859 100644
--- a/fs/netfs/write_issue.c
+++ b/fs/netfs/write_issue.c
@@ -418,7 +418,7 @@ static int netfs_write_folio(struct netfs_io_request *wreq,
netfs_issue_write(wreq, upload);
} else if (fgroup != wreq->group) {
/* We can't write this page to the server yet. */
- kdebug("wrong group");
+ kdebug("wrong group %px != %px", fgroup, wreq->group);
folio_redirty_for_writepage(wbc, folio);
folio_unlock(folio);
netfs_issue_write(wreq, upload);
@@ -593,11 +593,19 @@ static void netfs_end_issue_write(struct netfs_io_request *wreq)
netfs_wake_write_collector(wreq, false);
}
-/*
- * Write some of the pending data back to the server
+/**
+ * netfs_writepages_group - Flush data from the pagecache for a file
+ * @mapping: The file to flush from
+ * @wbc: Details of what should be flushed
+ * @group: The write grouping to flush (or NULL)
+ *
+ * Start asynchronous write back operations to flush dirty data belonging to a
+ * particular group in a file's pagecache back to the server and to the local
+ * cache.
*/
-int netfs_writepages(struct address_space *mapping,
- struct writeback_control *wbc)
+int netfs_writepages_group(struct address_space *mapping,
+ struct writeback_control *wbc,
+ struct netfs_group *group)
{
struct netfs_inode *ictx = netfs_inode(mapping->host);
struct netfs_io_request *wreq = NULL;
@@ -618,12 +626,15 @@ int netfs_writepages(struct address_space *mapping,
if (!folio)
goto out;
- wreq = netfs_create_write_req(mapping, NULL, folio_pos(folio), NETFS_WRITEBACK);
+ wreq = netfs_create_write_req(mapping, NULL, folio_pos(folio),
+ NETFS_WRITEBACK);
if (IS_ERR(wreq)) {
error = PTR_ERR(wreq);
goto couldnt_start;
}
+ wreq->group = netfs_get_group(group);
+
trace_netfs_write(wreq, netfs_write_trace_writeback);
netfs_stat(&netfs_n_wh_writepages);
@@ -659,6 +670,21 @@ int netfs_writepages(struct address_space *mapping,
_leave(" = %d", error);
return error;
}
+EXPORT_SYMBOL(netfs_writepages_group);
+
+/**
+ * netfs_writepages - Flush data from the pagecache for a file
+ * @mapping: The file to flush from
+ * @wbc: Details of what should be flushed
+ *
+ * Start asynchronous write back operations to flush dirty data in a file's
+ * pagecache back to the server and to the local cache.
+ */
+int netfs_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ return netfs_writepages_group(mapping, wbc, NULL);
+}
EXPORT_SYMBOL(netfs_writepages);
/*
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index a67297de8a20..69052ac47ab1 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -457,6 +457,9 @@ int netfs_read_folio(struct file *, struct folio *);
int netfs_write_begin(struct netfs_inode *, struct file *,
struct address_space *, loff_t pos, unsigned int len,
struct folio **, void **fsdata);
+int netfs_writepages_group(struct address_space *mapping,
+ struct writeback_control *wbc,
+ struct netfs_group *group);
int netfs_writepages(struct address_space *mapping,
struct writeback_control *wbc);
bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio);
@@ -597,4 +600,60 @@ static inline void netfs_wait_for_outstanding_io(struct inode *inode)
wait_var_event(&ictx->io_count, atomic_read(&ictx->io_count) == 0);
}
+/*
+ * Get a ref on a netfs group attached to a dirty page (e.g. a ceph snap).
+ */
+static inline struct netfs_group *netfs_get_group(struct netfs_group *netfs_group)
+{
+ if (netfs_group && netfs_group != NETFS_FOLIO_COPY_TO_CACHE)
+ refcount_inc(&netfs_group->ref);
+ return netfs_group;
+}
+
+/*
+ * Dispose of a netfs group attached to a dirty page (e.g. a ceph snap).
+ */
+static inline void netfs_put_group(struct netfs_group *netfs_group)
+{
+ if (netfs_group &&
+ netfs_group != NETFS_FOLIO_COPY_TO_CACHE &&
+ refcount_dec_and_test(&netfs_group->ref))
+ netfs_group->free(netfs_group);
+}
+
+/*
+ * Dispose of a netfs group attached to a dirty page (e.g. a ceph snap).
+ */
+static inline void netfs_put_group_many(struct netfs_group *netfs_group, int nr)
+{
+ if (netfs_group &&
+ netfs_group != NETFS_FOLIO_COPY_TO_CACHE &&
+ refcount_sub_and_test(nr, &netfs_group->ref))
+ netfs_group->free(netfs_group);
+}
+
+/*
+ * Set the group pointer directly on a folio.
+ */
+static inline void __netfs_set_group(struct folio *folio, struct netfs_group *netfs_group)
+{
+ if (netfs_group)
+ folio_attach_private(folio, netfs_get_group(netfs_group));
+}
+
+/*
+ * Set the group pointer on a folio or the folio info record.
+ */
+static inline void netfs_set_group(struct folio *folio, struct netfs_group *netfs_group)
+{
+ void *priv = folio_get_private(folio);
+
+ if (unlikely(priv != netfs_group)) {
+ if (netfs_group && (!priv || priv == NETFS_FOLIO_COPY_TO_CACHE))
+ folio_attach_private(folio, netfs_get_group(netfs_group));
+ else if (!netfs_group && priv == NETFS_FOLIO_COPY_TO_CACHE)
+ folio_detach_private(folio);
+ }
+}
+
#endif /* _LINUX_NETFS_H */
* [RFC PATCH 29/35] netfs: Allow fs-private data to be handed through to request alloc
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (27 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 28/35] netfs: Adjust group handling David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 30/35] netfs: Make netfs_page_mkwrite() use folio_mkwrite_check_truncate() David Howells
` (5 subsequent siblings)
34 siblings, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Allow an fs-private pointer to be handed through to request alloc and
stashed in the netfs_io_request struct for the filesystem to retrieve.
This will be used by ceph to pass a pointer to the ceph_writeback_ctl to
the netfs operation functions.
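As a hedged illustration of the intent (ceph_writeback_ctl's contents and
the function names here are assumptions for the example):

	static int my_writepages(struct address_space *mapping,
				 struct writeback_control *wbc)
	{
		struct ceph_writeback_ctl ceph_wbc = {};

		/* The pointer is stashed in wreq->netfs_priv2 at
		 * request allocation time. */
		return netfs_writepages_group(mapping, wbc, NULL, &ceph_wbc);
	}

	static void my_begin_writeback(struct netfs_io_request *wreq)
	{
		struct ceph_writeback_ctl *ceph_wbc = wreq->netfs_priv2;

		/* ... use the fs-private context when issuing writes ... */
	}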
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
fs/netfs/buffered_read.c | 11 ++++++-----
fs/netfs/direct_read.c | 2 +-
fs/netfs/direct_write.c | 2 +-
fs/netfs/internal.h | 2 ++
fs/netfs/objects.c | 2 ++
fs/netfs/read_pgpriv2.c | 2 +-
fs/netfs/read_single.c | 2 +-
fs/netfs/write_issue.c | 17 +++++++++++------
fs/netfs/write_retry.c | 2 +-
include/linux/netfs.h | 3 ++-
10 files changed, 28 insertions(+), 17 deletions(-)
diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c
index 4dd505053fba..10daf2452324 100644
--- a/fs/netfs/buffered_read.c
+++ b/fs/netfs/buffered_read.c
@@ -343,7 +343,7 @@ void netfs_readahead(struct readahead_control *ractl)
int ret;
rreq = netfs_alloc_request(ractl->mapping, ractl->file, start, size,
- NETFS_READAHEAD);
+ NULL, NETFS_READAHEAD);
if (IS_ERR(rreq))
return;
@@ -414,7 +414,8 @@ static int netfs_read_gaps(struct file *file, struct folio *folio)
_enter("%lx", folio->index);
- rreq = netfs_alloc_request(mapping, file, folio_pos(folio), flen, NETFS_READ_GAPS);
+ rreq = netfs_alloc_request(mapping, file, folio_pos(folio), flen,
+ NULL, NETFS_READ_GAPS);
if (IS_ERR(rreq)) {
ret = PTR_ERR(rreq);
goto alloc_error;
@@ -510,7 +511,7 @@ int netfs_read_folio(struct file *file, struct folio *folio)
rreq = netfs_alloc_request(mapping, file,
folio_pos(folio), folio_size(folio),
- NETFS_READPAGE);
+ NULL, NETFS_READPAGE);
if (IS_ERR(rreq)) {
ret = PTR_ERR(rreq);
goto alloc_error;
@@ -665,7 +666,7 @@ int netfs_write_begin(struct netfs_inode *ctx,
rreq = netfs_alloc_request(mapping, file,
folio_pos(folio), folio_size(folio),
- NETFS_READ_FOR_WRITE);
+ NULL, NETFS_READ_FOR_WRITE);
if (IS_ERR(rreq)) {
ret = PTR_ERR(rreq);
goto error;
@@ -730,7 +731,7 @@ int netfs_prefetch_for_write(struct file *file, struct folio *folio,
ret = -ENOMEM;
rreq = netfs_alloc_request(mapping, file, start, flen,
- NETFS_READ_FOR_WRITE);
+ NULL, NETFS_READ_FOR_WRITE);
if (IS_ERR(rreq)) {
ret = PTR_ERR(rreq);
goto error;
diff --git a/fs/netfs/direct_read.c b/fs/netfs/direct_read.c
index fc0a053ad5a8..15a6923a92ca 100644
--- a/fs/netfs/direct_read.c
+++ b/fs/netfs/direct_read.c
@@ -264,7 +264,7 @@ ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *i
rreq = netfs_alloc_request(iocb->ki_filp->f_mapping, iocb->ki_filp,
iocb->ki_pos, orig_count,
- NETFS_DIO_READ);
+ NULL, NETFS_DIO_READ);
if (IS_ERR(rreq))
return PTR_ERR(rreq);
diff --git a/fs/netfs/direct_write.c b/fs/netfs/direct_write.c
index e41614687e49..83c5c06c4710 100644
--- a/fs/netfs/direct_write.c
+++ b/fs/netfs/direct_write.c
@@ -300,7 +300,7 @@ ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *
_debug("uw %llx-%llx", start, end);
- wreq = netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp, start,
+ wreq = netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp, start, NULL,
iocb->ki_flags & IOCB_DIRECT ?
NETFS_DIO_WRITE : NETFS_UNBUFFERED_WRITE);
if (IS_ERR(wreq))
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index 2a6123c4da35..9724d5a1ddc7 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -101,6 +101,7 @@ int netfs_alloc_bounce(struct netfs_io_request *wreq, unsigned long long to, gfp
struct netfs_io_request *netfs_alloc_request(struct address_space *mapping,
struct file *file,
loff_t start, size_t len,
+ void *netfs_priv2,
enum netfs_io_origin origin);
void netfs_get_request(struct netfs_io_request *rreq, enum netfs_rreq_ref_trace what);
void netfs_clear_subrequests(struct netfs_io_request *rreq, bool was_async);
@@ -218,6 +219,7 @@ void netfs_wake_write_collector(struct netfs_io_request *wreq, bool was_async);
struct netfs_io_request *netfs_create_write_req(struct address_space *mapping,
struct file *file,
loff_t start,
+ void *netfs_priv2,
enum netfs_io_origin origin);
void netfs_prepare_write(struct netfs_io_request *wreq,
struct netfs_io_stream *stream,
diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
index 7fdbaa5c5cab..4606e830c116 100644
--- a/fs/netfs/objects.c
+++ b/fs/netfs/objects.c
@@ -16,6 +16,7 @@
struct netfs_io_request *netfs_alloc_request(struct address_space *mapping,
struct file *file,
loff_t start, size_t len,
+ void *netfs_priv2,
enum netfs_io_origin origin)
{
static atomic_t debug_ids;
@@ -38,6 +39,7 @@ struct netfs_io_request *netfs_alloc_request(struct address_space *mapping,
rreq->len = len;
rreq->origin = origin;
rreq->netfs_ops = ctx->ops;
+ rreq->netfs_priv2 = netfs_priv2;
rreq->mapping = mapping;
rreq->inode = inode;
rreq->i_size = i_size_read(inode);
diff --git a/fs/netfs/read_pgpriv2.c b/fs/netfs/read_pgpriv2.c
index cf7727060215..e94140ebc6fb 100644
--- a/fs/netfs/read_pgpriv2.c
+++ b/fs/netfs/read_pgpriv2.c
@@ -103,7 +103,7 @@ static struct netfs_io_request *netfs_pgpriv2_begin_copy_to_cache(
goto cancel;
creq = netfs_create_write_req(rreq->mapping, NULL, folio_pos(folio),
- NETFS_PGPRIV2_COPY_TO_CACHE);
+ NULL, NETFS_PGPRIV2_COPY_TO_CACHE);
if (IS_ERR(creq))
goto cancel;
diff --git a/fs/netfs/read_single.c b/fs/netfs/read_single.c
index b36a3020bb90..3a20e8340e06 100644
--- a/fs/netfs/read_single.c
+++ b/fs/netfs/read_single.c
@@ -169,7 +169,7 @@ ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_ite
ssize_t ret;
rreq = netfs_alloc_request(inode->i_mapping, file, 0, iov_iter_count(iter),
- NETFS_READ_SINGLE);
+ NULL, NETFS_READ_SINGLE);
if (IS_ERR(rreq))
return PTR_ERR(rreq);
diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
index 3921fcf4f859..9b8d99477405 100644
--- a/fs/netfs/write_issue.c
+++ b/fs/netfs/write_issue.c
@@ -90,6 +90,7 @@ static void netfs_kill_dirty_pages(struct address_space *mapping,
struct netfs_io_request *netfs_create_write_req(struct address_space *mapping,
struct file *file,
loff_t start,
+ void *netfs_priv2,
enum netfs_io_origin origin)
{
struct netfs_io_request *wreq;
@@ -99,7 +100,7 @@ struct netfs_io_request *netfs_create_write_req(struct address_space *mapping,
origin == NETFS_WRITETHROUGH ||
origin == NETFS_PGPRIV2_COPY_TO_CACHE);
- wreq = netfs_alloc_request(mapping, file, start, 0, origin);
+ wreq = netfs_alloc_request(mapping, file, start, 0, netfs_priv2, origin);
if (IS_ERR(wreq))
return wreq;
@@ -598,14 +599,18 @@ static void netfs_end_issue_write(struct netfs_io_request *wreq)
* @mapping: The file to flush from
* @wbc: Details of what should be flushed
* @group: The write grouping to flush (or NULL)
+ * @netfs_priv2: Private data specific to the netfs (or NULL)
*
* Start asynchronous write back operations to flush dirty data belonging to a
* particular group in a file's pagecache back to the server and to the local
* cache.
+ *
+ * If not NULL, @netfs_priv2 will be stored in wreq->netfs_priv2.
*/
int netfs_writepages_group(struct address_space *mapping,
struct writeback_control *wbc,
- struct netfs_group *group)
+ struct netfs_group *group,
+ void *netfs_priv2)
{
struct netfs_inode *ictx = netfs_inode(mapping->host);
struct netfs_io_request *wreq = NULL;
@@ -627,7 +632,7 @@ int netfs_writepages_group(struct address_space *mapping,
goto out;
wreq = netfs_create_write_req(mapping, NULL, folio_pos(folio),
- NETFS_WRITEBACK);
+ netfs_priv2, NETFS_WRITEBACK);
if (IS_ERR(wreq)) {
error = PTR_ERR(wreq);
goto couldnt_start;
@@ -683,7 +688,7 @@ EXPORT_SYMBOL(netfs_writepages_group);
int netfs_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- return netfs_writepages_group(mapping, wbc, NULL);
+ return netfs_writepages_group(mapping, wbc, NULL, NULL);
}
EXPORT_SYMBOL(netfs_writepages);
@@ -698,7 +703,7 @@ struct netfs_io_request *netfs_begin_writethrough(struct kiocb *iocb, size_t len
mutex_lock(&ictx->wb_lock);
wreq = netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp,
- iocb->ki_pos, NETFS_WRITETHROUGH);
+ iocb->ki_pos, NULL, NETFS_WRITETHROUGH);
if (IS_ERR(wreq)) {
mutex_unlock(&ictx->wb_lock);
return wreq;
@@ -953,7 +958,7 @@ int netfs_writeback_single(struct address_space *mapping,
mutex_lock(&ictx->wb_lock);
}
- wreq = netfs_create_write_req(mapping, NULL, 0, NETFS_WRITEBACK_SINGLE);
+ wreq = netfs_create_write_req(mapping, NULL, 0, NULL, NETFS_WRITEBACK_SINGLE);
if (IS_ERR(wreq)) {
ret = PTR_ERR(wreq);
goto couldnt_start;
diff --git a/fs/netfs/write_retry.c b/fs/netfs/write_retry.c
index 187882801d57..f727b48e2bfe 100644
--- a/fs/netfs/write_retry.c
+++ b/fs/netfs/write_retry.c
@@ -328,7 +328,7 @@ ssize_t netfs_rmw_read(struct netfs_io_request *wreq, struct file *file,
bufsize = bsize * 2;
}
- rreq = netfs_alloc_request(wreq->mapping, file, start, len, NETFS_RMW_READ);
+ rreq = netfs_alloc_request(wreq->mapping, file, start, len, NULL, NETFS_RMW_READ);
if (IS_ERR(rreq))
return PTR_ERR(rreq);
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 69052ac47ab1..9d17d4bd9753 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -459,7 +459,8 @@ int netfs_write_begin(struct netfs_inode *, struct file *,
struct folio **, void **fsdata);
int netfs_writepages_group(struct address_space *mapping,
struct writeback_control *wbc,
- struct netfs_group *group);
+ struct netfs_group *group,
+ void *netfs_priv2);
int netfs_writepages(struct address_space *mapping,
struct writeback_control *wbc);
bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio);
* [RFC PATCH 30/35] netfs: Make netfs_page_mkwrite() use folio_mkwrite_check_truncate()
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (28 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 29/35] netfs: Allow fs-private data to be handed through to request alloc David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 31/35] netfs: Fix netfs_unbuffered_read() to return ssize_t rather than int David Howells
` (4 subsequent siblings)
34 siblings, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Make netfs_page_mkwrite() use folio_mkwrite_check_truncate() rather than
doing the checks itself (it currently only checks that the folio is still
attached to the mapping, not that it still lies within i_size).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
fs/netfs/buffered_write.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
index 12ddbe9bc78b..64a0f0620399 100644
--- a/fs/netfs/buffered_write.c
+++ b/fs/netfs/buffered_write.c
@@ -506,7 +506,7 @@ vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_gr
if (folio_lock_killable(folio) < 0)
goto out;
- if (folio->mapping != mapping)
+ if (folio_mkwrite_check_truncate(folio, inode) < 0)
goto unlock;
if (folio_wait_writeback_killable(folio) < 0)
goto unlock;
* [RFC PATCH 31/35] netfs: Fix netfs_unbuffered_read() to return ssize_t rather than int
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (29 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 30/35] netfs: Make netfs_page_mkwrite() use folio_mkwrite_check_truncate() David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 32/35] netfs: Add some more RMW support for ceph David Howells
` (3 subsequent siblings)
34 siblings, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Fix netfs_unbuffered_read() to return an ssize_t rather than an int since
netfs_wait_for_read() returns ssize_t and its result would otherwise get
implicitly truncated.
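For illustration only (the VFS normally caps reads below INT_MAX, but the
narrowing is still an implicit-truncation hazard):

	ssize_t n = netfs_wait_for_read(rreq);	/* full ssize_t result */
	int ret = n;	/* a >31-bit byte count would be silently chopped */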
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
fs/netfs/direct_read.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/netfs/direct_read.c b/fs/netfs/direct_read.c
index 15a6923a92ca..5e4bd1e5a378 100644
--- a/fs/netfs/direct_read.c
+++ b/fs/netfs/direct_read.c
@@ -201,9 +201,9 @@ static int netfs_dispatch_unbuffered_reads(struct netfs_io_request *rreq)
* Perform a read to an application buffer, bypassing the pagecache and the
* local disk cache.
*/
-static int netfs_unbuffered_read(struct netfs_io_request *rreq, bool sync)
+static ssize_t netfs_unbuffered_read(struct netfs_io_request *rreq, bool sync)
{
- int ret;
+ ssize_t ret;
_enter("R=%x %llx-%llx",
rreq->debug_id, rreq->start, rreq->start + rreq->len - 1);
@@ -231,7 +231,7 @@ static int netfs_unbuffered_read(struct netfs_io_request *rreq, bool sync)
else
ret = -EIOCBQUEUED;
out:
- _leave(" = %d", ret);
+ _leave(" = %zd", ret);
return ret;
}
* [RFC PATCH 32/35] netfs: Add some more RMW support for ceph
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (30 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 31/35] netfs: Fix netfs_unbuffered_read() to return ssize_t rather than int David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-19 19:14 ` Viacheslav Dubeyko
2025-03-20 15:25 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 33/35] ceph: Use netfslib [INCOMPLETE] David Howells
` (2 subsequent siblings)
34 siblings, 2 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Add some support for RMW in ceph:
(1) Add netfs_unbuffered_read_from_inode() to allow reading from an inode
without having a file pointer so that truncate can modify a
now-partial tail block of a content-encrypted file (see the sketch after
this list).
This takes an additional argument to cause it to fail or give a short
read if a hole is encountered. This is noted on the request with
NETFS_RREQ_NO_READ_HOLE for the filesystem to pick up.
(2) Set NETFS_RREQ_RMW when doing an RMW as part of a request.
(3) Provide a ->rmw_read_done() op for netfslib to tell the filesystem
that it has completed the read required for RMW.
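A minimal sketch of how (1) might be called from a truncate path, assuming
a kernel buffer of one crypto block (the local variable names are
illustrative):

	struct kvec kv = { .iov_base = buf, .iov_len = CEPH_FSCRYPT_BLOCK_SIZE };
	struct iov_iter iter;
	ssize_t ret;

	iov_iter_kvec(&iter, ITER_DEST, &kv, 1, kv.iov_len);
	ret = netfs_unbuffered_read_from_inode(inode, pos, &iter, true);
	if (ret == -ENODATA) {
		/* The read began with a hole: nothing to re-encrypt. */
	}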
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
fs/netfs/direct_read.c | 75 ++++++++++++++++++++++++++++++++++++
fs/netfs/direct_write.c | 1 +
fs/netfs/main.c | 1 +
fs/netfs/objects.c | 1 +
fs/netfs/read_collect.c | 2 +
fs/netfs/write_retry.c | 3 ++
include/linux/netfs.h | 7 ++++
include/trace/events/netfs.h | 3 ++
8 files changed, 93 insertions(+)
diff --git a/fs/netfs/direct_read.c b/fs/netfs/direct_read.c
index 5e4bd1e5a378..4061f934dfe6 100644
--- a/fs/netfs/direct_read.c
+++ b/fs/netfs/direct_read.c
@@ -373,3 +373,78 @@ ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter)
return ret;
}
EXPORT_SYMBOL(netfs_unbuffered_read_iter);
+
+/**
+ * netfs_unbuffered_read_from_inode - Perform an unbuffered sync I/O read
+ * @inode: The inode being accessed
+ * @pos: The file position to read from
+ * @iter: The output buffer (also specifies read length)
+ * @nohole: True to return short/ENODATA if hole encountered
+ *
+ * Perform a synchronous unbuffered I/O from the inode to the output buffer.
+ * No use is made of the pagecache. The output buffer must be suitably aligned
+ * if content encryption is to be used. If @nohole is true then the read will
+ * stop short if a hole is encountered and return -ENODATA if the read begins
+ * with a hole.
+ *
+ * The caller must hold any appropriate locks.
+ */
+ssize_t netfs_unbuffered_read_from_inode(struct inode *inode, loff_t pos,
+ struct iov_iter *iter, bool nohole)
+{
+ struct netfs_io_request *rreq;
+ ssize_t ret;
+ size_t orig_count = iov_iter_count(iter);
+
+ _enter("");
+
+ if (WARN_ON(user_backed_iter(iter)))
+ return -EIO;
+
+ if (!orig_count)
+ return 0; /* Don't update atime */
+
+ ret = filemap_write_and_wait_range(inode->i_mapping, pos, pos + orig_count - 1);
+ if (ret < 0)
+ return ret;
+ inode_update_time(inode, S_ATIME);
+
+ rreq = netfs_alloc_request(inode->i_mapping, NULL, pos, orig_count,
+ NULL, NETFS_UNBUFFERED_READ);
+ if (IS_ERR(rreq))
+ return PTR_ERR(rreq);
+
+ ret = -EIO;
+ if (test_bit(NETFS_RREQ_CONTENT_ENCRYPTION, &rreq->flags) &&
+ WARN_ON(!netfs_is_crypto_aligned(rreq, iter)))
+ goto out;
+
+ netfs_stat(&netfs_n_rh_dio_read);
+ trace_netfs_read(rreq, rreq->start, rreq->len,
+ netfs_read_trace_unbuffered_read_from_inode);
+
+ rreq->buffer.iter = *iter;
+ rreq->len = orig_count;
+ rreq->direct_bv_unpin = false;
+ iov_iter_advance(iter, orig_count);
+
+ if (nohole)
+ __set_bit(NETFS_RREQ_NO_READ_HOLE, &rreq->flags);
+
+ /* We're going to do the crypto in place in the destination buffer. */
+ if (test_bit(NETFS_RREQ_CONTENT_ENCRYPTION, &rreq->flags))
+ __set_bit(NETFS_RREQ_CRYPT_IN_PLACE, &rreq->flags);
+
+ ret = netfs_dispatch_unbuffered_reads(rreq);
+
+ if (!rreq->submitted) {
+ netfs_put_request(rreq, false, netfs_rreq_trace_put_no_submit);
+ goto out;
+ }
+
+ ret = netfs_wait_for_read(rreq);
+out:
+ netfs_put_request(rreq, false, netfs_rreq_trace_put_return);
+ return ret;
+}
+EXPORT_SYMBOL(netfs_unbuffered_read_from_inode);
diff --git a/fs/netfs/direct_write.c b/fs/netfs/direct_write.c
index 83c5c06c4710..a99722f90c71 100644
--- a/fs/netfs/direct_write.c
+++ b/fs/netfs/direct_write.c
@@ -145,6 +145,7 @@ static ssize_t netfs_write_through_bounce_buffer(struct netfs_io_request *wreq,
wreq->start = gstart;
wreq->len = gend - gstart;
+ __set_bit(NETFS_RREQ_RMW, &ictx->flags);
if (gstart >= end) {
/* At or after EOF, nothing to read. */
} else {
diff --git a/fs/netfs/main.c b/fs/netfs/main.c
index 07f8cffbda8c..0900dea53e4a 100644
--- a/fs/netfs/main.c
+++ b/fs/netfs/main.c
@@ -39,6 +39,7 @@ static const char *netfs_origins[nr__netfs_io_origin] = {
[NETFS_READ_GAPS] = "RG",
[NETFS_READ_SINGLE] = "R1",
[NETFS_READ_FOR_WRITE] = "RW",
+ [NETFS_UNBUFFERED_READ] = "UR",
[NETFS_DIO_READ] = "DR",
[NETFS_WRITEBACK] = "WB",
[NETFS_WRITEBACK_SINGLE] = "W1",
diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
index 4606e830c116..958c4d460d07 100644
--- a/fs/netfs/objects.c
+++ b/fs/netfs/objects.c
@@ -60,6 +60,7 @@ struct netfs_io_request *netfs_alloc_request(struct address_space *mapping,
origin == NETFS_READ_GAPS ||
origin == NETFS_READ_SINGLE ||
origin == NETFS_READ_FOR_WRITE ||
+ origin == NETFS_UNBUFFERED_READ ||
origin == NETFS_DIO_READ) {
INIT_WORK(&rreq->work, netfs_read_collection_worker);
rreq->io_streams[0].avail = true;
diff --git a/fs/netfs/read_collect.c b/fs/netfs/read_collect.c
index 0a0bff90ca9e..013a90738dcd 100644
--- a/fs/netfs/read_collect.c
+++ b/fs/netfs/read_collect.c
@@ -462,6 +462,7 @@ static void netfs_read_collection(struct netfs_io_request *rreq)
//netfs_rreq_is_still_valid(rreq);
switch (rreq->origin) {
+ case NETFS_UNBUFFERED_READ:
case NETFS_DIO_READ:
case NETFS_READ_GAPS:
case NETFS_RMW_READ:
@@ -681,6 +682,7 @@ ssize_t netfs_wait_for_read(struct netfs_io_request *rreq)
if (ret == 0) {
ret = rreq->transferred;
switch (rreq->origin) {
+ case NETFS_UNBUFFERED_READ:
case NETFS_DIO_READ:
case NETFS_READ_SINGLE:
ret = rreq->transferred;
diff --git a/fs/netfs/write_retry.c b/fs/netfs/write_retry.c
index f727b48e2bfe..9e4e79d5a403 100644
--- a/fs/netfs/write_retry.c
+++ b/fs/netfs/write_retry.c
@@ -386,6 +386,9 @@ ssize_t netfs_rmw_read(struct netfs_io_request *wreq, struct file *file,
ret = 0;
}
+ if (ret == 0 && rreq->netfs_ops->rmw_read_done)
+ rreq->netfs_ops->rmw_read_done(wreq, rreq);
+
error:
netfs_put_request(rreq, false, netfs_rreq_trace_put_return);
return ret;
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 9d17d4bd9753..4049c985b9b4 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -220,6 +220,7 @@ enum netfs_io_origin {
NETFS_READ_GAPS, /* This read is a synchronous read to fill gaps */
NETFS_READ_SINGLE, /* This read should be treated as a single object */
NETFS_READ_FOR_WRITE, /* This read is to prepare a write */
+ NETFS_UNBUFFERED_READ, /* This is an unbuffered I/O read */
NETFS_DIO_READ, /* This is a direct I/O read */
NETFS_WRITEBACK, /* This write was triggered by writepages */
NETFS_WRITEBACK_SINGLE, /* This monolithic write was triggered by writepages */
@@ -308,6 +309,9 @@ struct netfs_io_request {
#define NETFS_RREQ_CONTENT_ENCRYPTION 16 /* Content encryption is in use */
#define NETFS_RREQ_CRYPT_IN_PLACE 17 /* Do decryption in place */
#define NETFS_RREQ_PUT_RMW_TAIL 18 /* Need to put ->rmw_tail */
+#define NETFS_RREQ_RMW 19 /* Performing RMW cycle */
+#define NETFS_RREQ_REPEAT_RMW 20 /* Need to perform an RMW cycle */
+#define NETFS_RREQ_NO_READ_HOLE 21 /* Give short read/error if hole encountered */
#define NETFS_RREQ_USE_PGPRIV2 31 /* [DEPRECATED] Use PG_private_2 to mark
* write to cache on read */
const struct netfs_request_ops *netfs_ops;
@@ -336,6 +340,7 @@ struct netfs_request_ops {
/* Modification handling */
void (*update_i_size)(struct inode *inode, loff_t i_size);
void (*post_modify)(struct inode *inode, void *fs_priv);
+ void (*rmw_read_done)(struct netfs_io_request *wreq, struct netfs_io_request *rreq);
/* Write request handling */
void (*begin_writeback)(struct netfs_io_request *wreq);
@@ -432,6 +437,8 @@ ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *i
ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
ssize_t netfs_buffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+ssize_t netfs_unbuffered_read_from_inode(struct inode *inode, loff_t pos,
+ struct iov_iter *iter, bool nohole);
/* High-level write API */
ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 74af82d773bd..9254c6f0e604 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -23,6 +23,7 @@
EM(netfs_read_trace_read_gaps, "READ-GAPS") \
EM(netfs_read_trace_read_single, "READ-SNGL") \
EM(netfs_read_trace_prefetch_for_write, "PREFETCHW") \
+ EM(netfs_read_trace_unbuffered_read_from_inode, "READ-INOD") \
E_(netfs_read_trace_write_begin, "WRITEBEGN")
#define netfs_write_traces \
@@ -38,6 +39,7 @@
EM(NETFS_READ_GAPS, "RG") \
EM(NETFS_READ_SINGLE, "R1") \
EM(NETFS_READ_FOR_WRITE, "RW") \
+ EM(NETFS_UNBUFFERED_READ, "UR") \
EM(NETFS_DIO_READ, "DR") \
EM(NETFS_WRITEBACK, "WB") \
EM(NETFS_WRITEBACK_SINGLE, "W1") \
@@ -104,6 +106,7 @@
EM(netfs_sreq_trace_io_progress, "IO ") \
EM(netfs_sreq_trace_limited, "LIMIT") \
EM(netfs_sreq_trace_need_clear, "N-CLR") \
+ EM(netfs_sreq_trace_need_rmw, "N-RMW") \
EM(netfs_sreq_trace_partial_read, "PARTR") \
EM(netfs_sreq_trace_need_retry, "ND-RT") \
EM(netfs_sreq_trace_pending, "PEND ") \
* [RFC PATCH 33/35] ceph: Use netfslib [INCOMPLETE]
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (31 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 32/35] netfs: Add some more RMW support for ceph David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-19 19:54 ` Viacheslav Dubeyko
2025-03-20 15:38 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 34/35] ceph: Enable multipage folios for ceph files David Howells
2025-03-13 23:33 ` [RFC PATCH 35/35] ceph: Remove old I/O API bits David Howells
34 siblings, 2 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Implement netfslib support for ceph.
Note that I've put the new code into its own file for now rather than
attempting to modify the old code or putting it into an existing file. The
old code is just #if'd out for removal in a subsequent patch to make this
patch easier to review.
Note also that this is incomplete as sparse map support and content crypto
support are currently non-functional - but plain I/O should work.
There may also be an inode ref leak due to the way ceph sometimes takes and
holds on to an extra inode ref. I'm not sure these extra refs are actually
necessary. For instance, ceph_dirty_folio() will ihold the inode if
ci->i_wrbuffer_ref is 0.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
drivers/block/rbd.c | 2 +-
fs/ceph/Makefile | 2 +-
fs/ceph/addr.c | 46 +-
fs/ceph/cache.h | 5 +
fs/ceph/caps.c | 2 +-
fs/ceph/crypto.c | 54 ++
fs/ceph/file.c | 15 +-
fs/ceph/inode.c | 30 +-
fs/ceph/rdwr.c | 1006 +++++++++++++++++++++++++++++++
fs/ceph/super.h | 39 +-
fs/netfs/internal.h | 6 +-
fs/netfs/main.c | 4 +-
fs/netfs/write_issue.c | 6 +-
include/linux/ceph/libceph.h | 3 +-
include/linux/ceph/osd_client.h | 1 +
include/linux/netfs.h | 13 +-
net/ceph/snapshot.c | 20 +-
17 files changed, 1190 insertions(+), 64 deletions(-)
create mode 100644 fs/ceph/rdwr.c
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 956fc4a8f1da..94bb29c95b0d 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -468,7 +468,7 @@ static DEFINE_IDA(rbd_dev_id_ida);
static struct workqueue_struct *rbd_wq;
static struct ceph_snap_context rbd_empty_snapc = {
- .nref = REFCOUNT_INIT(1),
+ .group.ref = REFCOUNT_INIT(1),
};
/*
diff --git a/fs/ceph/Makefile b/fs/ceph/Makefile
index 1f77ca04c426..e4d3c2d6e9c2 100644
--- a/fs/ceph/Makefile
+++ b/fs/ceph/Makefile
@@ -5,7 +5,7 @@
obj-$(CONFIG_CEPH_FS) += ceph.o
-ceph-y := super.o inode.o dir.o file.o locks.o addr.o ioctl.o \
+ceph-y := super.o inode.o dir.o file.o locks.o addr.o rdwr.o ioctl.o \
export.o caps.o snap.o xattr.o quota.o io.o \
mds_client.o mdsmap.o strings.o ceph_frag.o \
debugfs.o util.o metric.o
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 27f27ab24446..325fbbce1eaa 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -64,27 +64,30 @@
(CONGESTION_ON_THRESH(congestion_kb) - \
(CONGESTION_ON_THRESH(congestion_kb) >> 2))
+#if 0 // TODO: Remove after netfs conversion
static int ceph_netfs_check_write_begin(struct file *file, loff_t pos, unsigned int len,
struct folio **foliop, void **_fsdata);
-static inline struct ceph_snap_context *page_snap_context(struct page *page)
+static struct ceph_snap_context *page_snap_context(struct page *page)
{
if (PagePrivate(page))
return (void *)page->private;
return NULL;
}
+#endif // TODO: Remove after netfs conversion
/*
* Dirty a page. Optimistically adjust accounting, on the assumption
* that we won't race with invalidate. If we do, readjust.
*/
-static bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio)
+bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio)
{
struct inode *inode = mapping->host;
struct ceph_client *cl = ceph_inode_to_client(inode);
struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
struct ceph_inode_info *ci;
struct ceph_snap_context *snapc;
+ struct netfs_group *group;
if (folio_test_dirty(folio)) {
doutc(cl, "%llx.%llx %p idx %lu -- already dirty\n",
@@ -101,16 +104,28 @@ static bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio)
spin_lock(&ci->i_ceph_lock);
if (__ceph_have_pending_cap_snap(ci)) {
struct ceph_cap_snap *capsnap =
- list_last_entry(&ci->i_cap_snaps,
- struct ceph_cap_snap,
- ci_item);
- snapc = ceph_get_snap_context(capsnap->context);
+ list_last_entry(&ci->i_cap_snaps,
+ struct ceph_cap_snap,
+ ci_item);
+ snapc = capsnap->context;
capsnap->dirty_pages++;
} else {
- BUG_ON(!ci->i_head_snapc);
- snapc = ceph_get_snap_context(ci->i_head_snapc);
+ snapc = ci->i_head_snapc;
+ BUG_ON(!snapc);
++ci->i_wrbuffer_ref_head;
}
+
+ /* Attach a reference to the snap/group to the folio. */
+ group = netfs_folio_group(folio);
+ if (group != &snapc->group) {
+ netfs_set_group(folio, &snapc->group);
+ if (group) {
+ doutc(cl, "Different group %px != %px\n",
+ group, &snapc->group);
+ netfs_put_group(group);
+ }
+ }
+
if (ci->i_wrbuffer_ref == 0)
ihold(inode);
++ci->i_wrbuffer_ref;
@@ -122,16 +137,10 @@ static bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio)
snapc, snapc->seq, snapc->num_snaps);
spin_unlock(&ci->i_ceph_lock);
- /*
- * Reference snap context in folio->private. Also set
- * PagePrivate so that we get invalidate_folio callback.
- */
- VM_WARN_ON_FOLIO(folio->private, folio);
- folio_attach_private(folio, snapc);
-
- return ceph_fscache_dirty_folio(mapping, folio);
+ return netfs_dirty_folio(mapping, folio);
}
+#if 0 // TODO: Remove after netfs conversion
/*
* If we are truncating the full folio (i.e. offset == 0), adjust the
* dirty folio counters appropriately. Only called if there is private
@@ -1236,6 +1245,7 @@ bool is_num_ops_too_big(struct ceph_writeback_ctl *ceph_wbc)
return ceph_wbc->num_ops >=
(ceph_wbc->from_pool ? CEPH_OSD_SLAB_OPS : CEPH_OSD_MAX_OPS);
}
+#endif // TODO: Remove after netfs conversion
static inline
bool is_write_congestion_happened(struct ceph_fs_client *fsc)
@@ -1244,6 +1254,7 @@ bool is_write_congestion_happened(struct ceph_fs_client *fsc)
CONGESTION_ON_THRESH(fsc->mount_options->congestion_kb);
}
+#if 0 // TODO: Remove after netfs conversion
static inline int move_dirty_folio_in_page_array(struct address_space *mapping,
struct writeback_control *wbc,
struct ceph_writeback_ctl *ceph_wbc, struct folio *folio)
@@ -1930,6 +1941,7 @@ const struct address_space_operations ceph_aops = {
.direct_IO = noop_direct_IO,
.migrate_folio = filemap_migrate_folio,
};
+#endif // TODO: Remove after netfs conversion
static void ceph_block_sigs(sigset_t *oldset)
{
@@ -2034,6 +2046,7 @@ static vm_fault_t ceph_filemap_fault(struct vm_fault *vmf)
return ret;
}
+#if 0 // TODO: Remove after netfs conversion
static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
@@ -2137,6 +2150,7 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
ret = vmf_error(err);
return ret;
}
+#endif // TODO: Remove after netfs conversion
void ceph_fill_inline_data(struct inode *inode, struct page *locked_page,
char *data, size_t len)
diff --git a/fs/ceph/cache.h b/fs/ceph/cache.h
index 20efac020394..d6afca292f08 100644
--- a/fs/ceph/cache.h
+++ b/fs/ceph/cache.h
@@ -43,6 +43,8 @@ static inline void ceph_fscache_resize(struct inode *inode, loff_t to)
}
}
+#if 0 // TODO: Remove after netfs conversion
+
static inline int ceph_fscache_unpin_writeback(struct inode *inode,
struct writeback_control *wbc)
{
@@ -50,6 +52,7 @@ static inline int ceph_fscache_unpin_writeback(struct inode *inode,
}
#define ceph_fscache_dirty_folio netfs_dirty_folio
+#endif // TODO: Remove after netfs conversion
static inline bool ceph_is_cache_enabled(struct inode *inode)
{
@@ -100,6 +103,7 @@ static inline void ceph_fscache_resize(struct inode *inode, loff_t to)
{
}
+#if 0 // TODO: Remove after netfs conversion
static inline int ceph_fscache_unpin_writeback(struct inode *inode,
struct writeback_control *wbc)
{
@@ -107,6 +111,7 @@ static inline int ceph_fscache_unpin_writeback(struct inode *inode,
}
#define ceph_fscache_dirty_folio filemap_dirty_folio
+#endif // TODO: Remove after netfs conversion
static inline bool ceph_is_cache_enabled(struct inode *inode)
{
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index a8d8b56cf9d2..53f23f351003 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2536,7 +2536,7 @@ int ceph_write_inode(struct inode *inode, struct writeback_control *wbc)
int wait = (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync);
doutc(cl, "%p %llx.%llx wait=%d\n", inode, ceph_vinop(inode), wait);
- ceph_fscache_unpin_writeback(inode, wbc);
+ netfs_unpin_writeback(inode, wbc);
if (wait) {
err = ceph_wait_on_async_create(inode);
if (err)
diff --git a/fs/ceph/crypto.c b/fs/ceph/crypto.c
index a28dea74ca6f..8d4e908da7d8 100644
--- a/fs/ceph/crypto.c
+++ b/fs/ceph/crypto.c
@@ -636,6 +636,60 @@ int ceph_fscrypt_decrypt_extents(struct inode *inode, struct page **page,
return ret;
}
+#if 0
+int ceph_decrypt_block(struct netfs_io_request *rreq, loff_t pos, size_t len,
+ struct scatterlist *source_sg, unsigned int n_source,
+ struct scatterlist *dest_sg, unsigned int n_dest)
+{
+ struct ceph_sparse_extent *map = op->extent.sparse_ext;
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ size_t xlen;
+ u64 objno, objoff;
+ u32 ext_cnt = op->extent.sparse_ext_cnt;
+ int i, ret = 0;
+
+ /* Nothing to do for empty array */
+ if (ext_cnt == 0) {
+ dout("%s: empty array, ret 0\n", __func__);
+ return 0;
+ }
+
+ ceph_calc_file_object_mapping(&ci->i_layout, pos, map[0].len,
+ &objno, &objoff, &xlen);
+
+ for (i = 0; i < ext_cnt; ++i) {
+ struct ceph_sparse_extent *ext = &map[i];
+ int pgsoff = ext->off - objoff;
+ int pgidx = pgsoff >> PAGE_SHIFT;
+ int fret;
+
+ if ((ext->off | ext->len) & ~CEPH_FSCRYPT_BLOCK_MASK) {
+ pr_warn("%s: bad encrypted sparse extent idx %d off %llx len %llx\n",
+ __func__, i, ext->off, ext->len);
+ return -EIO;
+ }
+ fret = ceph_fscrypt_decrypt_pages(inode, &page[pgidx],
+ off + pgsoff, ext->len);
+ dout("%s: [%d] 0x%llx~0x%llx fret %d\n", __func__, i,
+ ext->off, ext->len, fret);
+ if (fret < 0) {
+ if (ret == 0)
+ ret = fret;
+ break;
+ }
+ ret = pgsoff + fret;
+ }
+ dout("%s: ret %d\n", __func__, ret);
+ return ret;
+}
+
+int ceph_encrypt_block(struct netfs_io_request *wreq, loff_t pos, size_t len,
+ struct scatterlist *source_sg, unsigned int n_source,
+ struct scatterlist *dest_sg, unsigned int n_dest)
+{
+}
+#endif
+
/**
* ceph_fscrypt_encrypt_pages - encrypt an array of pages
* @inode: pointer to inode associated with these pages
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 4512215cccc6..94b91b5bc843 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -77,6 +77,7 @@ static __le32 ceph_flags_sys2wire(struct ceph_mds_client *mdsc, u32 flags)
* need to wait for MDS acknowledgement.
*/
+#if 0 // TODO: Remove after netfs conversion
/*
* How many pages to get in one call to iov_iter_get_pages(). This
* determines the size of the on-stack array used as a buffer.
@@ -165,6 +166,7 @@ static void ceph_dirty_pages(struct ceph_databuf *dbuf)
if (bvec[i].bv_page)
set_page_dirty_lock(bvec[i].bv_page);
}
+#endif // TODO: Remove after netfs conversion
/*
* Prepare an open request. Preallocate ceph_cap to avoid an
@@ -1021,6 +1023,7 @@ int ceph_release(struct inode *inode, struct file *file)
return 0;
}
+#if 0 // TODO: Remove after netfs conversion
enum {
HAVE_RETRIED = 1,
CHECK_EOF = 2,
@@ -2234,6 +2237,7 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to)
return ret;
}
+#endif // TODO: Remove after netfs conversion
/*
* Wrap filemap_splice_read with checks for cap bits on the inode.
@@ -2294,6 +2298,7 @@ static ssize_t ceph_splice_read(struct file *in, loff_t *ppos,
return ret;
}
+#if 0 // TODO: Remove after netfs conversion
/*
* Take cap references to avoid releasing caps to MDS mid-write.
*
@@ -2488,6 +2493,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
ceph_free_cap_flush(prealloc_cf);
return written ? written : err;
}
+#endif // TODO: Remove after netfs conversion
/*
* llseek. be sure to verify file size on SEEK_END.
@@ -3160,6 +3166,10 @@ static int ceph_fadvise(struct file *file, loff_t offset, loff_t len, int advice
if (fi->fmode & CEPH_FILE_MODE_LAZY)
return -EACCES;
+ ret = netfs_start_io_read(inode);
+ if (ret < 0)
+ return ret;
+
ret = ceph_get_caps(file, CEPH_CAP_FILE_RD, want, -1, &got);
if (ret < 0) {
doutc(cl, "%llx.%llx, error getting cap\n", ceph_vinop(inode));
@@ -3180,6 +3190,7 @@ static int ceph_fadvise(struct file *file, loff_t offset, loff_t len, int advice
inode, ceph_vinop(inode), ceph_cap_string(got), ret);
ceph_put_cap_refs(ceph_inode(inode), got);
out:
+ netfs_end_io_read(inode);
return ret;
}
@@ -3187,8 +3198,8 @@ const struct file_operations ceph_file_fops = {
.open = ceph_open,
.release = ceph_release,
.llseek = ceph_llseek,
- .read_iter = ceph_read_iter,
- .write_iter = ceph_write_iter,
+ .read_iter = ceph_netfs_read_iter,
+ .write_iter = ceph_netfs_write_iter,
.mmap = ceph_mmap,
.fsync = ceph_fsync,
.lock = ceph_lock,
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index ec9b80fec7be..8f73f3a55a3e 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2345,11 +2345,9 @@ static int fill_fscrypt_truncate(struct inode *inode,
struct iov_iter iter;
struct ceph_fscrypt_truncate_size_header *header;
void *p;
- int retry_op = 0;
int len = CEPH_FSCRYPT_BLOCK_SIZE;
loff_t i_size = i_size_read(inode);
int got, ret, issued;
- u64 objver;
ret = __ceph_get_caps(inode, NULL, CEPH_CAP_FILE_RD, 0, -1, &got);
if (ret < 0)
@@ -2361,16 +2359,6 @@ static int fill_fscrypt_truncate(struct inode *inode,
i_size, attr->ia_size, ceph_cap_string(got),
ceph_cap_string(issued));
- /* Try to writeback the dirty pagecaches */
- if (issued & (CEPH_CAP_FILE_BUFFER)) {
- loff_t lend = orig_pos + CEPH_FSCRYPT_BLOCK_SIZE - 1;
-
- ret = filemap_write_and_wait_range(inode->i_mapping,
- orig_pos, lend);
- if (ret < 0)
- goto out;
- }
-
ret = -ENOMEM;
dbuf = ceph_databuf_req_alloc(2, 0, GFP_KERNEL);
if (!dbuf)
@@ -2382,10 +2370,8 @@ static int fill_fscrypt_truncate(struct inode *inode,
goto out;
iov_iter_bvec(&iter, ITER_DEST, &dbuf->bvec[1], 1, len);
-
- pos = orig_pos;
- ret = __ceph_sync_read(inode, &pos, &iter, &retry_op, &objver);
- if (ret < 0)
+ ret = netfs_unbuffered_read_from_inode(inode, orig_pos, &iter, true);
+ if (ret < 0 && ret != -ENODATA)
goto out;
header = kmap_ceph_databuf_page(dbuf, 0);
@@ -2402,16 +2388,14 @@ static int fill_fscrypt_truncate(struct inode *inode,
header->block_size = cpu_to_le32(CEPH_FSCRYPT_BLOCK_SIZE);
/*
- * If we hit a hole here, we should just skip filling
- * the fscrypt for the request, because once the fscrypt
- * is enabled, the file will be split into many blocks
- * with the size of CEPH_FSCRYPT_BLOCK_SIZE, if there
- * has a hole, the hole size should be multiple of block
- * size.
+ * If we hit a hole here, we should just skip filling the fscrypt for
+ * the request, because once the fscrypt is enabled, the file will be
+ * split into many blocks with the size of CEPH_FSCRYPT_BLOCK_SIZE. If
+ * there was a hole, the hole size should be a multiple of the block size.
*
* If the Rados object doesn't exist, it will be set to 0.
*/
- if (!objver) {
+ if (ret == -ENODATA) {
doutc(cl, "hit hole, ppos %lld < size %lld\n", pos, i_size);
header->data_len = cpu_to_le32(8 + 8 + 4);
diff --git a/fs/ceph/rdwr.c b/fs/ceph/rdwr.c
new file mode 100644
index 000000000000..952c36be2cd9
--- /dev/null
+++ b/fs/ceph/rdwr.c
@@ -0,0 +1,1006 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Ceph netfs-based file read-write operations.
+ *
+ * There are a few funny things going on here.
+ *
+ * The page->private field is used to reference a struct ceph_snap_context for
+ * _every_ dirty page. This indicates which snapshot the page was logically
+ * dirtied in, and thus which snap context needs to be associated with the osd
+ * write during writeback.
+ *
+ * Similarly, struct ceph_inode_info maintains a set of counters to count dirty
+ * pages on the inode. In the absence of snapshots, i_wrbuffer_ref ==
+ * i_wrbuffer_ref_head == the dirty page count.
+ *
+ * When a snapshot is taken (that is, when the client receives notification
+ * that a snapshot was taken), each inode with caps and with dirty pages (dirty
+ * pages implies there is a cap) gets a new ceph_cap_snap in the i_cap_snaps
+ * list (which is sorted in ascending order, new snaps go to the tail). The
+ * i_wrbuffer_ref_head count is moved to capsnap->dirty. (Unless a sync write
+ * is currently in progress. In that case, the capsnap is said to be
+ * "pending", new writes cannot start, and the capsnap isn't "finalized" until
+ * the write completes (or fails) and a final size/mtime for the inode for that
+ * snap can be settled upon.) i_wrbuffer_ref_head is reset to 0.
+ *
+ * On writeback, we must submit writes to the osd IN SNAP ORDER. So, we look
+ * for the first capsnap in i_cap_snaps and write out pages in that snap
+ * context _only_. Then we move on to the next capsnap, eventually reaching
+ * the "live" or "head" context (i.e., pages that are not yet snapped) and are
+ * writing the most recently dirtied pages.
+ *
+ * Invalidate and so forth must take care to ensure the dirty page accounting
+ * is preserved.
+ *
+ * Copyright (C) 2025 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+#include <linux/ceph/ceph_debug.h>
+
+#include <linux/backing-dev.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/pagevec.h>
+#include <linux/task_io_accounting_ops.h>
+#include <linux/signal.h>
+#include <linux/iversion.h>
+#include <linux/ktime.h>
+#include <linux/netfs.h>
+#include <trace/events/netfs.h>
+
+#include "super.h"
+#include "mds_client.h"
+#include "cache.h"
+#include "metric.h"
+#include "crypto.h"
+#include <linux/ceph/osd_client.h>
+#include <linux/ceph/striper.h>
+
+struct ceph_writeback_ctl
+{
+ loff_t i_size;
+ u64 truncate_size;
+ u32 truncate_seq;
+ bool size_stable;
+ bool head_snapc;
+};
+
+struct kmem_cache *ceph_io_request_cachep;
+struct kmem_cache *ceph_io_subrequest_cachep;
+
+static struct ceph_io_subrequest *ceph_sreq2io(struct netfs_io_subrequest *subreq)
+{
+ BUILD_BUG_ON(sizeof(struct ceph_io_request) > NETFS_DEF_IO_REQUEST_SIZE);
+ BUILD_BUG_ON(sizeof(struct ceph_io_subrequest) > NETFS_DEF_IO_SUBREQUEST_SIZE);
+
+ return container_of(subreq, struct ceph_io_subrequest, sreq);
+}
+
+/*
+ * Get the snapc from the group attached to a request
+ */
+static struct ceph_snap_context *ceph_wreq_snapc(struct netfs_io_request *wreq)
+{
+ struct ceph_snap_context *snapc =
+ container_of(wreq->group, struct ceph_snap_context, group);
+ return snapc;
+}
+
+#if 0
+static void ceph_put_many_snap_context(struct ceph_snap_context *sc, unsigned int nr)
+{
+ if (sc)
+ netfs_put_group_many(&sc->group, nr);
+}
+#endif
+
+/*
+ * Handle the termination of a write to the server.
+ */
+static void ceph_netfs_write_callback(struct ceph_osd_request *req)
+{
+ struct netfs_io_subrequest *subreq = req->r_subreq;
+ struct ceph_io_subrequest *csub = ceph_sreq2io(subreq);
+ struct ceph_io_request *creq = csub->creq;
+ struct inode *inode = creq->rreq.inode;
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+ struct ceph_client *cl = ceph_inode_to_client(inode);
+ size_t wrote = req->r_result ? 0 : subreq->len;
+ int err = req->r_result;
+
+ trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress);
+
+ ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency,
+ req->r_end_latency, wrote, err);
+
+ if (err) {
+ doutc(cl, "sync_write osd write returned %d\n", err);
+ /* Version changed! Must re-do the rmw cycle */
+ if ((creq->rmw_assert_version && (err == -ERANGE || err == -EOVERFLOW)) ||
+ (!creq->rmw_assert_version && err == -EEXIST)) {
+ /* We should only ever see this on a rmw */
+ WARN_ON_ONCE(!test_bit(NETFS_RREQ_RMW, &ci->netfs.flags));
+
+ /* The version should never go backward */
+ WARN_ON_ONCE(err == -EOVERFLOW);
+
+ /* FIXME: limit number of times we loop? */
+ set_bit(NETFS_RREQ_REPEAT_RMW, &creq->rreq.flags);
+ trace_netfs_sreq(subreq, netfs_sreq_trace_need_rmw);
+ }
+ ceph_set_error_write(ci);
+ } else {
+ ceph_clear_error_write(ci);
+ }
+
+ csub->req = NULL;
+ ceph_osdc_put_request(req);
+ netfs_write_subrequest_terminated(subreq, err ?: wrote, true);
+}
+
+/*
+ * Issue a subrequest to upload to the server.
+ */
+static void ceph_issue_write(struct netfs_io_subrequest *subreq)
+{
+ struct ceph_io_subrequest *csub = ceph_sreq2io(subreq);
+ struct ceph_snap_context *snapc = ceph_wreq_snapc(subreq->rreq);
+ struct ceph_osd_request *req;
+ struct ceph_io_request *creq = csub->creq;
+ struct ceph_fs_client *fsc = ceph_inode_to_fs_client(subreq->rreq->inode);
+ struct ceph_osd_client *osdc = &fsc->client->osdc;
+ struct inode *inode = subreq->rreq->inode;
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_client *cl = ceph_inode_to_client(inode);
+ unsigned long long len;
+ unsigned int rmw = test_bit(NETFS_RREQ_RMW, &ci->netfs.flags) ? 1 : 0;
+
+ doutc(cl, "issue_write R=%08x[%x] ino %llx %lld~%zu -- %srmw\n",
+ subreq->rreq->debug_id, subreq->debug_index, ci->i_vino.ino,
+ subreq->start, subreq->len,
+ rmw ? "" : "no ");
+
+ len = subreq->len;
+ req = ceph_osdc_new_request(osdc, &ci->i_layout, ci->i_vino,
+ subreq->start, &len,
+ rmw, /* which: 0 or 1 */
+ rmw + 1, /* num_ops: 1 or 2 */
+ CEPH_OSD_OP_WRITE,
+ CEPH_OSD_FLAG_WRITE,
+ snapc,
+ ci->i_truncate_seq,
+ ci->i_truncate_size, false);
+ if (IS_ERR(req)) {
+ netfs_write_subrequest_terminated(subreq, PTR_ERR(req), false);
+ netfs_prepare_write_failed(subreq);
+ return;
+ }
+
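+ /* ceph_osdc_new_request() may have clamped the I/O to the object
+ * boundary; propagate the shortened length to the subrequest and its
+ * iterator. */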
+ subreq->len = len;
+ doutc(cl, "write op %lld~%zu\n", subreq->start, subreq->len);
+ iov_iter_truncate(&subreq->io_iter, len);
+ osd_req_op_extent_osd_iter(req, rmw, &subreq->io_iter);
+ req->r_inode = inode;
+ req->r_mtime = current_time(inode);
+ req->r_callback = ceph_netfs_write_callback;
+ req->r_subreq = subreq;
+ csub->req = req;
+
+ /*
+ * If we're doing an RMW cycle, set up an assertion that the remote
+ * data hasn't changed. If we don't have a version number, then the
+ * object doesn't exist yet. Use an exclusive create instead of a
+ * version assertion in that case.
+ */
+ if (rmw) {
+ if (creq->rmw_assert_version) {
+ osd_req_op_init(req, 0, CEPH_OSD_OP_ASSERT_VER, 0);
+ req->r_ops[0].assert_ver.ver = creq->rmw_assert_version;
+ } else {
+ osd_req_op_init(req, 0, CEPH_OSD_OP_CREATE,
+ CEPH_OSD_OP_FLAG_EXCL);
+ }
+ }
+
+ trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
+ ceph_osdc_start_request(osdc, req);
+}
+
+/*
+ * Prepare a subrequest to upload to the server.
+ */
+static void ceph_prepare_write(struct netfs_io_subrequest *subreq)
+{
+ struct ceph_inode_info *ci = ceph_inode(subreq->rreq->inode);
+ struct ceph_fs_client *fsc = ceph_inode_to_fs_client(subreq->rreq->inode);
+ u64 objnum, objoff;
+
+ /* Clamp the length to the next object boundary. */
+ ceph_calc_file_object_mapping(&ci->i_layout, subreq->start,
+ fsc->mount_options->wsize,
+ &objnum, &objoff,
+ &subreq->rreq->io_streams[0].sreq_max_len);
+}
+
+/*
+ * Mark the caps as dirty
+ */
+static void ceph_netfs_post_modify(struct inode *inode, void *fs_priv)
+{
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_cap_flush **prealloc_cf = fs_priv;
+ int dirty;
+
+ spin_lock(&ci->i_ceph_lock);
+ dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR, prealloc_cf);
+ spin_unlock(&ci->i_ceph_lock);
+ if (dirty)
+ __mark_inode_dirty(inode, dirty);
+}
+
+static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq)
+{
+ struct inode *inode = rreq->inode;
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_file_layout *lo = &ci->i_layout;
+ unsigned long max_pages = inode->i_sb->s_bdi->ra_pages;
+ loff_t end = rreq->start + rreq->len, new_end;
+ struct ceph_io_request *priv = container_of(rreq, struct ceph_io_request, rreq);
+ unsigned long max_len;
+ u32 blockoff;
+
+ if (priv) {
+ /* Readahead is disabled by posix_fadvise POSIX_FADV_RANDOM */
+ if (priv->file_ra_disabled)
+ max_pages = 0;
+ else
+ max_pages = priv->file_ra_pages;
+ }
+
+ /* Readahead is disabled */
+ if (!max_pages)
+ return;
+
+ max_len = max_pages << PAGE_SHIFT;
+
+ /*
+ * Try to expand the length forward by rounding it up to the next
+ * block, but do not exceed the file size, unless the original
+ * request already exceeds it.
+ */
+ new_end = umin(round_up(end, lo->stripe_unit), rreq->i_size);
+ if (new_end > end && new_end <= rreq->start + max_len)
+ rreq->len = new_end - rreq->start;
+
+ /* Try to expand the start downward */
+ div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
+ if (rreq->len + blockoff <= max_len) {
+ rreq->start -= blockoff;
+ rreq->len += blockoff;
+ }
+}
+
+static int ceph_netfs_prepare_read(struct netfs_io_subrequest *subreq)
+{
+ struct netfs_io_request *rreq = subreq->rreq;
+ struct ceph_inode_info *ci = ceph_inode(rreq->inode);
+ struct ceph_fs_client *fsc = ceph_inode_to_fs_client(rreq->inode);
+ size_t xlen;
+ u64 objno, objoff;
+
+ /* Truncate the extent at the end of the current block */
+ ceph_calc_file_object_mapping(&ci->i_layout, subreq->start, subreq->len,
+ &objno, &objoff, &xlen);
+ rreq->io_streams[0].sreq_max_len = umin(xlen, fsc->mount_options->rsize);
+ return 0;
+}
+
+static void ceph_netfs_read_callback(struct ceph_osd_request *req)
+{
+ struct inode *inode = req->r_inode;
+ struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+ struct ceph_client *cl = fsc->client;
+ struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0);
+ struct netfs_io_subrequest *subreq = req->r_priv;
+ struct ceph_osd_req_op *op = &req->r_ops[0];
+ bool sparse = (op->op == CEPH_OSD_OP_SPARSE_READ);
+ int err = req->r_result;
+
+ ceph_update_read_metrics(&fsc->mdsc->metric, req->r_start_latency,
+ req->r_end_latency, osd_data->iter.count, err);
+
+ doutc(cl, "result %d subreq->len=%zu i_size=%lld\n", req->r_result,
+ subreq->len, i_size_read(req->r_inode));
+
+ /* no object means success but no data */
+ if (err == -ENOENT)
+ err = 0;
+ else if (err == -EBLOCKLISTED)
+ fsc->blocklisted = true;
+
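+ /* A successful sparse read returns an extent map; convert it to an
+ * effective byte count. A short read (other than DIO) has its tail
+ * zeroed by netfs via NETFS_SREQ_CLEAR_TAIL. */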
+ if (err >= 0) {
+ if (sparse && err > 0)
+ err = ceph_sparse_ext_map_end(op);
+ if (err < subreq->len &&
+ subreq->rreq->origin != NETFS_DIO_READ)
+ __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
+ if (IS_ENCRYPTED(inode) && err > 0) {
+#if 0
+ err = ceph_fscrypt_decrypt_extents(inode, osd_data->dbuf,
+ subreq->start,
+ op->extent.sparse_ext,
+ op->extent.sparse_ext_cnt);
+ if (err > subreq->len)
+ err = subreq->len;
+#else
+ pr_err("TODO: Content-decrypt currently disabled\n");
+ err = -EOPNOTSUPP;
+#endif
+ }
+ }
+
+ if (err > 0) {
+ subreq->transferred = err;
+ err = 0;
+ }
+
+ subreq->error = err;
+ trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress);
+ ceph_dec_osd_stopping_blocker(fsc->mdsc);
+ netfs_read_subreq_terminated(subreq);
+}
+
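+/*
+ * The read phase of an RMW cycle has completed. Carry the object version we
+ * observed over to the write request so that ceph_issue_write() can assert
+ * that the object hasn't changed in the meantime.
+ */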
+static void ceph_rmw_read_done(struct netfs_io_request *wreq, struct netfs_io_request *rreq)
+{
+ struct ceph_io_request *cwreq = container_of(wreq, struct ceph_io_request, rreq);
+ struct ceph_io_request *crreq = container_of(rreq, struct ceph_io_request, rreq);
+
+ cwreq->rmw_assert_version = crreq->rmw_assert_version;
+}
+
+static bool ceph_netfs_issue_read_inline(struct netfs_io_subrequest *subreq)
+{
+ struct netfs_io_request *rreq = subreq->rreq;
+ struct inode *inode = rreq->inode;
+ struct ceph_mds_reply_info_parsed *rinfo;
+ struct ceph_mds_reply_info_in *iinfo;
+ struct ceph_mds_request *req;
+ struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ ssize_t err = 0;
+ size_t len, copied;
+ int mode;
+
+ __clear_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags);
+
+ if (subreq->start >= inode->i_size)
+ goto out;
+
+ /* We need to fetch the inline data. */
+ mode = ceph_try_to_choose_auth_mds(inode, CEPH_STAT_CAP_INLINE_DATA);
+ req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_GETATTR, mode);
+ if (IS_ERR(req)) {
+ err = PTR_ERR(req);
+ goto out;
+ }
+ req->r_ino1 = ci->i_vino;
+ req->r_args.getattr.mask = cpu_to_le32(CEPH_STAT_CAP_INLINE_DATA);
+ req->r_num_caps = 2;
+
+ trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
+ err = ceph_mdsc_do_request(mdsc, NULL, req);
+ if (err < 0)
+ goto out;
+
+ rinfo = &req->r_reply_info;
+ iinfo = &rinfo->targeti;
+ if (iinfo->inline_version == CEPH_INLINE_NONE) {
+ /* The data got uninlined */
+ ceph_mdsc_put_request(req);
+ return false;
+ }
+
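+ /* Copy the part of the inline data that overlaps this subrequest
+ * straight into the subrequest's iterator. */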
+ len = umin(iinfo->inline_len - subreq->start, subreq->len);
+ copied = copy_to_iter(iinfo->inline_data + subreq->start, len, &subreq->io_iter);
+ if (copied) {
+ subreq->transferred += copied;
+ if (copied == len)
+ __set_bit(NETFS_SREQ_HIT_EOF, &subreq->flags);
+ subreq->error = 0;
+ } else {
+ subreq->error = -EFAULT;
+ }
+
+ ceph_mdsc_put_request(req);
+out:
+ netfs_read_subreq_terminated(subreq);
+ return true;
+}
+
+static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq)
+{
+ struct netfs_io_request *rreq = subreq->rreq;
+ struct inode *inode = rreq->inode;
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+ struct ceph_client *cl = fsc->client;
+ struct ceph_osd_request *req = NULL;
+ struct ceph_vino vino = ceph_vino(inode);
+ int extent_cnt;
+ bool sparse = IS_ENCRYPTED(inode) || ceph_test_mount_opt(fsc, SPARSEREAD);
+ u64 off = subreq->start, len = subreq->len;
+ int err = 0;
+
+ if (ceph_inode_is_shutdown(inode)) {
+ err = -EIO;
+ goto out;
+ }
+
+ if (ceph_has_inline_data(ci) && ceph_netfs_issue_read_inline(subreq))
+ return;
+
+ req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout, vino,
+ off, &len, 0, 1,
+ sparse ? CEPH_OSD_OP_SPARSE_READ : CEPH_OSD_OP_READ,
+ CEPH_OSD_FLAG_READ, /* read_from_replica will be or'd in */
+ NULL, ci->i_truncate_seq, ci->i_truncate_size, false);
+ if (IS_ERR(req)) {
+ err = PTR_ERR(req);
+ req = NULL;
+ goto out;
+ }
+
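+ /* Allocate the extent map for a sparse read. Encrypted inodes are
+ * always read sparsely so that the returned extent map can drive
+ * decryption. */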
+ if (sparse) {
+ extent_cnt = __ceph_sparse_read_ext_count(inode, len);
+ err = ceph_alloc_sparse_ext_map(&req->r_ops[0], extent_cnt);
+ if (err)
+ goto out;
+ }
+
+ doutc(cl, "%llx.%llx pos=%llu orig_len=%zu len=%llu\n",
+ ceph_vinop(inode), subreq->start, subreq->len, len);
+
+ osd_req_op_extent_osd_iter(req, 0, &subreq->io_iter);
+ if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
+ err = -EIO;
+ goto out;
+ }
+ req->r_callback = ceph_netfs_read_callback;
+ req->r_priv = subreq;
+ req->r_inode = inode;
+
+ trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
+ ceph_osdc_start_request(req->r_osdc, req);
+out:
+ ceph_osdc_put_request(req);
+ doutc(cl, "%llx.%llx result %d\n", ceph_vinop(inode), err);
+ if (err) {
+ subreq->error = err;
+ netfs_read_subreq_terminated(subreq);
+ }
+}
+
+static int ceph_init_request(struct netfs_io_request *rreq, struct file *file)
+{
+ struct ceph_io_request *priv = container_of(rreq, struct ceph_io_request, rreq);
+ struct inode *inode = rreq->inode;
+ struct ceph_client *cl = ceph_inode_to_client(inode);
+ struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+ int got = 0, want = CEPH_CAP_FILE_CACHE;
+ int ret = 0;
+
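+ /* Clamp reads to 1MiB per subrequest and writes to the smaller of the
+ * fs block size and the wsize mount option. */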
+ rreq->rsize = 1024 * 1024;
+ rreq->wsize = umin(i_blocksize(inode), fsc->mount_options->wsize);
+
+ switch (rreq->origin) {
+ case NETFS_READAHEAD:
+ goto init_readahead;
+ case NETFS_WRITEBACK:
+ case NETFS_WRITETHROUGH:
+ case NETFS_UNBUFFERED_WRITE:
+ case NETFS_DIO_WRITE:
+ if (S_ISREG(rreq->inode->i_mode))
+ rreq->io_streams[0].avail = true;
+ return 0;
+ default:
+ return 0;
+ }
+
+init_readahead:
+ /*
+ * If we are doing readahead triggered by a read, fault-in or
+ * MADV/FADV_WILLNEED, someone higher up the stack must be holding the
+ * FILE_CACHE and/or LAZYIO caps.
+ */
+ if (file) {
+ priv->file_ra_pages = file->f_ra.ra_pages;
+ priv->file_ra_disabled = file->f_mode & FMODE_RANDOM;
+ rreq->netfs_priv = priv;
+ return 0;
+ }
+
+ /*
+ * readahead callers do not necessarily hold Fcb caps
+ * (e.g. fadvise, madvise).
+ */
+ ret = ceph_try_get_caps(inode, CEPH_CAP_FILE_RD, want, true, &got);
+ if (ret < 0) {
+ doutc(cl, "%llx.%llx, error getting cap\n", ceph_vinop(inode));
+ goto out;
+ }
+
+ if (!(got & want)) {
+ doutc(cl, "%llx.%llx, no cache cap\n", ceph_vinop(inode));
+ ret = -EACCES;
+ goto out;
+ }
+ if (ret > 0)
+ priv->caps = got;
+ else
+ ret = -EACCES;
+
+ rreq->io_streams[0].sreq_max_len = fsc->mount_options->rsize;
+out:
+ return ret;
+}
+
+static void ceph_netfs_free_request(struct netfs_io_request *rreq)
+{
+ struct ceph_io_request *creq = container_of(rreq, struct ceph_io_request, rreq);
+
+ if (creq->caps)
+ ceph_put_cap_refs(ceph_inode(rreq->inode), creq->caps);
+}
+
+const struct netfs_request_ops ceph_netfs_ops = {
+ .init_request = ceph_init_request,
+ .free_request = ceph_netfs_free_request,
+ .expand_readahead = ceph_netfs_expand_readahead,
+ .prepare_read = ceph_netfs_prepare_read,
+ .issue_read = ceph_netfs_issue_read,
+ .rmw_read_done = ceph_rmw_read_done,
+ .post_modify = ceph_netfs_post_modify,
+ .prepare_write = ceph_prepare_write,
+ .issue_write = ceph_issue_write,
+};
+
+/*
+ * Get ref for the oldest snapc for an inode with dirty data... that is, the
+ * only snap context we are allowed to write back.
+ */
+static struct ceph_snap_context *
+ceph_get_oldest_context(struct inode *inode, struct ceph_writeback_ctl *ctl,
+ struct ceph_snap_context *folio_snapc)
+{
+ struct ceph_snap_context *snapc = NULL;
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_cap_snap *capsnap = NULL;
+ struct ceph_client *cl = ceph_inode_to_client(inode);
+
+ spin_lock(&ci->i_ceph_lock);
+ list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) {
+ doutc(cl, " capsnap %p snapc %p has %d dirty pages\n",
+ capsnap, capsnap->context, capsnap->dirty_pages);
+ if (!capsnap->dirty_pages)
+ continue;
+
+ /* get i_size, truncate_{seq,size} for folio_snapc? */
+ if (snapc && capsnap->context != folio_snapc)
+ continue;
+
+ if (ctl) {
+ if (capsnap->writing) {
+ ctl->i_size = i_size_read(inode);
+ ctl->size_stable = false;
+ } else {
+ ctl->i_size = capsnap->size;
+ ctl->size_stable = true;
+ }
+ ctl->truncate_size = capsnap->truncate_size;
+ ctl->truncate_seq = capsnap->truncate_seq;
+ ctl->head_snapc = false;
+ }
+
+ if (snapc)
+ break;
+
+ snapc = ceph_get_snap_context(capsnap->context);
+ if (!folio_snapc ||
+ folio_snapc == snapc ||
+ folio_snapc->seq > snapc->seq)
+ break;
+ }
+ if (!snapc && ci->i_wrbuffer_ref_head) {
+ snapc = ceph_get_snap_context(ci->i_head_snapc);
+ doutc(cl, " head snapc %p has %d dirty pages\n", snapc,
+ ci->i_wrbuffer_ref_head);
+ if (ctl) {
+ ctl->i_size = i_size_read(inode);
+ ctl->truncate_size = ci->i_truncate_size;
+ ctl->truncate_seq = ci->i_truncate_seq;
+ ctl->size_stable = false;
+ ctl->head_snapc = true;
+ }
+ }
+ spin_unlock(&ci->i_ceph_lock);
+ return snapc;
+}
+
+/*
+ * Flush dirty data. We have to start with the oldest snap as that's the only
+ * one we're allowed to write back.
+ */
+static int ceph_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ struct ceph_writeback_ctl ceph_wbc;
+ struct ceph_snap_context *snapc;
+ struct ceph_inode_info *ci = ceph_inode(mapping->host);
+ loff_t actual_start = wbc->range_start, actual_end = wbc->range_end;
+ int ret;
+
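+ /* Flush snap contexts oldest-first. Only the head snapc is allowed to
+ * honour the caller's byte range; older snapcs must be written out in
+ * full first. */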
+ do {
+ snapc = ceph_get_oldest_context(mapping->host, &ceph_wbc, NULL);
+ if (snapc == ci->i_head_snapc) {
+ wbc->range_start = actual_start;
+ wbc->range_end = actual_end;
+ } else {
+ /* Do not respect wbc->range_{start,end}. Dirty pages in
+ * that range may be associated with a newer snapc and are
+ * not writeable until all dirty pages associated with
+ * older snapcs have been written.
+ */
+ wbc->range_start = 0;
+ wbc->range_end = LLONG_MAX;
+ }
+
+ ret = netfs_writepages_group(mapping, wbc, &snapc->group, &ceph_wbc);
+ ceph_put_snap_context(snapc);
+ if (snapc == ci->i_head_snapc)
+ break;
+ } while (ret == 0 && wbc->nr_to_write > 0);
+
+ return ret;
+}
+
+const struct address_space_operations ceph_aops = {
+ .read_folio = netfs_read_folio,
+ .readahead = netfs_readahead,
+ .writepages = ceph_writepages,
+ .dirty_folio = ceph_dirty_folio,
+ .invalidate_folio = netfs_invalidate_folio,
+ .release_folio = netfs_release_folio,
+ .direct_IO = noop_direct_IO,
+ .migrate_folio = filemap_migrate_folio,
+};
+
+/*
+ * Wrap the read paths (filemap_read() or netfs_unbuffered_read_iter()) with
+ * checks for cap bits on the inode. Atomically grab references, so that
+ * those bits are not released back to the MDS mid-read.
+ *
+ * Hmm, the sync read case isn't actually async... should it be?
+ */
+ssize_t ceph_netfs_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+ struct file *filp = iocb->ki_filp;
+ struct inode *inode = file_inode(filp);
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_file_info *fi = filp->private_data;
+ struct ceph_client *cl = ceph_inode_to_client(inode);
+ ssize_t ret;
+ size_t len = iov_iter_count(to);
+ bool dio = iocb->ki_flags & IOCB_DIRECT;
+ int want = 0, got = 0;
+
+ doutc(cl, "%llu~%zu trying to get caps on %p %llx.%llx\n",
+ iocb->ki_pos, len, inode, ceph_vinop(inode));
+
+ if (ceph_inode_is_shutdown(inode))
+ return -ESTALE;
+
+ if (dio)
+ ret = netfs_start_io_direct(inode);
+ else
+ ret = netfs_start_io_read(inode);
+ if (ret < 0)
+ return ret;
+
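+ /* Only ask for the FILE_CACHE cap if the page cache may be used for
+ * this read (i.e. neither O_DIRECT nor CEPH_F_SYNC). */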
+ if (!(fi->flags & CEPH_F_SYNC) && !dio)
+ want |= CEPH_CAP_FILE_CACHE;
+ if (fi->fmode & CEPH_FILE_MODE_LAZY)
+ want |= CEPH_CAP_FILE_LAZYIO;
+
+ ret = ceph_get_caps(filp, CEPH_CAP_FILE_RD, want, -1, &got);
+ if (ret < 0)
+ goto out;
+
+ if ((got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) == 0 ||
+ dio ||
+ (fi->flags & CEPH_F_SYNC)) {
+ doutc(cl, "sync %p %llx.%llx %llu~%zu got cap refs on %s\n",
+ inode, ceph_vinop(inode), iocb->ki_pos, len,
+ ceph_cap_string(got));
+
+ ret = netfs_unbuffered_read_iter(iocb, to);
+ } else {
+ doutc(cl, "async %p %llx.%llx %llu~%zu got cap refs on %s\n",
+ inode, ceph_vinop(inode), iocb->ki_pos, len,
+ ceph_cap_string(got));
+ ret = filemap_read(iocb, to, 0);
+ }
+
+ doutc(cl, "%p %llx.%llx dropping cap refs on %s = %zd\n",
+ inode, ceph_vinop(inode), ceph_cap_string(got), ret);
+ ceph_put_cap_refs(ci, got);
+
+out:
+ if (dio)
+ netfs_end_io_direct(inode);
+ else
+ netfs_end_io_read(inode);
+ return ret;
+}
+
+/*
+ * Get the most recent snap context in the list to which the inode subscribes.
+ * This is the only one we are allowed to modify. If a folio points to an
+ * earlier snapshot, it must be flushed first.
+ */
+static struct ceph_snap_context *ceph_get_most_recent_snapc(struct inode *inode)
+{
+ struct ceph_snap_context *snapc;
+ struct ceph_inode_info *ci = ceph_inode(inode);
+
+ /* Get the snap this write is going to belong to. */
+ spin_lock(&ci->i_ceph_lock);
+ if (__ceph_have_pending_cap_snap(ci)) {
+ struct ceph_cap_snap *capsnap =
+ list_last_entry(&ci->i_cap_snaps,
+ struct ceph_cap_snap, ci_item);
+
+ snapc = ceph_get_snap_context(capsnap->context);
+ } else {
+ BUG_ON(!ci->i_head_snapc);
+ snapc = ceph_get_snap_context(ci->i_head_snapc);
+ }
+ spin_unlock(&ci->i_ceph_lock);
+
+ return snapc;
+}
+
+/*
+ * Take cap references to avoid releasing caps to MDS mid-write.
+ *
+ * If we are synchronous, and write with an old snap context, the OSD
+ * may return EOLDSNAPC. In that case, retry the write _after_
+ * dropping our cap refs and allowing the pending snap to logically
+ * complete _before_ this write occurs.
+ *
+ * If we are near ENOSPC, write synchronously.
+ */
+ssize_t ceph_netfs_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file_inode(file);
+ struct ceph_snap_context *snapc;
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+ struct ceph_file_info *fi = file->private_data;
+ struct ceph_osd_client *osdc = &fsc->client->osdc;
+ struct ceph_cap_flush *prealloc_cf;
+ struct ceph_client *cl = fsc->client;
+ ssize_t count, written = 0;
+ loff_t limit = max(i_size_read(inode), fsc->max_file_size);
+ loff_t pos;
+ bool direct_lock = false;
+ u64 pool_flags;
+ u32 map_flags;
+ int err, want = 0, got;
+
+ if (ceph_inode_is_shutdown(inode))
+ return -ESTALE;
+
+ if (ceph_snap(inode) != CEPH_NOSNAP)
+ return -EROFS;
+
+ prealloc_cf = ceph_alloc_cap_flush();
+ if (!prealloc_cf)
+ return -ENOMEM;
+
+ if ((iocb->ki_flags & (IOCB_DIRECT | IOCB_APPEND)) == IOCB_DIRECT)
+ direct_lock = true;
+
+retry_snap:
+ if (direct_lock)
+ netfs_start_io_direct(inode);
+ else
+ netfs_start_io_write(inode);
+
+ if (iocb->ki_flags & IOCB_APPEND) {
+ err = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE, false);
+ if (err < 0)
+ goto out;
+ }
+
+ err = generic_write_checks(iocb, from);
+ if (err <= 0)
+ goto out;
+
+ pos = iocb->ki_pos;
+ if (unlikely(pos >= limit)) {
+ err = -EFBIG;
+ goto out;
+ } else {
+ iov_iter_truncate(from, limit - pos);
+ }
+
+ count = iov_iter_count(from);
+ if (ceph_quota_is_max_bytes_exceeded(inode, pos + count)) {
+ err = -EDQUOT;
+ goto out;
+ }
+
+ down_read(&osdc->lock);
+ map_flags = osdc->osdmap->flags;
+ pool_flags = ceph_pg_pool_flags(osdc->osdmap, ci->i_layout.pool_id);
+ up_read(&osdc->lock);
+ if ((map_flags & CEPH_OSDMAP_FULL) ||
+ (pool_flags & CEPH_POOL_FLAG_FULL)) {
+ err = -ENOSPC;
+ goto out;
+ }
+
+ err = file_remove_privs(file);
+ if (err)
+ goto out;
+
+ doutc(cl, "%p %llx.%llx %llu~%zd getting caps. i_size %llu\n",
+ inode, ceph_vinop(inode), pos, count,
+ i_size_read(inode));
+ if (!(fi->flags & CEPH_F_SYNC) && !direct_lock)
+ want |= CEPH_CAP_FILE_BUFFER;
+ if (fi->fmode & CEPH_FILE_MODE_LAZY)
+ want |= CEPH_CAP_FILE_LAZYIO;
+ got = 0;
+ err = ceph_get_caps(file, CEPH_CAP_FILE_WR, want, pos + count, &got);
+ if (err < 0)
+ goto out;
+
+ err = file_update_time(file);
+ if (err)
+ goto out_caps;
+
+ inode_inc_iversion_raw(inode);
+
+ doutc(cl, "%p %llx.%llx %llu~%zd got cap refs on %s\n",
+ inode, ceph_vinop(inode), pos, count, ceph_cap_string(got));
+
+ /* Get the snap this write is going to belong to. */
+ snapc = ceph_get_most_recent_snapc(inode);
+
+ if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) == 0 ||
+ (iocb->ki_flags & IOCB_DIRECT) || (fi->flags & CEPH_F_SYNC) ||
+ (ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) {
+ struct iov_iter data;
+
+ /* we might need to revert to this point */
+ data = *from;
+ written = netfs_unbuffered_write_iter_locked(iocb, &data, &snapc->group);
+ if (direct_lock)
+ netfs_end_io_direct(inode);
+ else
+ netfs_end_io_write(inode);
+ if (written > 0)
+ iov_iter_advance(from, written);
+ ceph_put_snap_context(snapc);
+ } else {
+ /*
+ * No need to acquire i_truncate_mutex: the MDS revokes Fwb
+ * caps before sending a truncate message to us, and we can't
+ * get the Fwb cap while a vmtruncate is pending, so a write
+ * and a vmtruncate cannot run at the same time.
+ */
+ written = netfs_perform_write(iocb, from, &snapc->group, &prealloc_cf);
+ netfs_end_io_write(inode);
+ ceph_put_snap_context(snapc);
+ }
+
+ if (written >= 0) {
+ int dirty;
+
+ spin_lock(&ci->i_ceph_lock);
+ dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR,
+ &prealloc_cf);
+ spin_unlock(&ci->i_ceph_lock);
+ if (dirty)
+ __mark_inode_dirty(inode, dirty);
+ if (ceph_quota_is_max_bytes_approaching(inode, iocb->ki_pos))
+ ceph_check_caps(ci, CHECK_CAPS_FLUSH);
+ }
+
+ doutc(cl, "%p %llx.%llx %llu~%u dropping cap refs on %s\n",
+ inode, ceph_vinop(inode), pos, (unsigned)count,
+ ceph_cap_string(got));
+ ceph_put_cap_refs(ci, got);
+
+ if (written == -EOLDSNAPC) {
+ doutc(cl, "%p %llx.%llx %llu~%u" "got EOLDSNAPC, retrying\n",
+ inode, ceph_vinop(inode), pos, (unsigned)count);
+ goto retry_snap;
+ }
+
+ if (written >= 0) {
+ if ((map_flags & CEPH_OSDMAP_NEARFULL) ||
+ (pool_flags & CEPH_POOL_FLAG_NEARFULL))
+ iocb->ki_flags |= IOCB_DSYNC;
+ written = generic_write_sync(iocb, written);
+ }
+
+ goto out_unlocked;
+out_caps:
+ ceph_put_cap_refs(ci, got);
+out:
+ if (direct_lock)
+ netfs_end_io_direct(inode);
+ else
+ netfs_end_io_write(inode);
+out_unlocked:
+ ceph_free_cap_flush(prealloc_cf);
+ return written ? written : err;
+}
+
+vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
+{
+ struct ceph_snap_context *snapc;
+ struct vm_area_struct *vma = vmf->vma;
+ struct inode *inode = file_inode(vma->vm_file);
+ struct ceph_client *cl = ceph_inode_to_client(inode);
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_file_info *fi = vma->vm_file->private_data;
+ struct ceph_cap_flush *prealloc_cf;
+ struct folio *folio = page_folio(vmf->page);
+ loff_t size = i_size_read(inode);
+ loff_t off = folio_pos(folio);
+ size_t len = folio_size(folio);
+ int want, got, err;
+ vm_fault_t ret = VM_FAULT_SIGBUS;
+
+ if (ceph_inode_is_shutdown(inode))
+ return ret;
+
+ prealloc_cf = ceph_alloc_cap_flush();
+ if (!prealloc_cf)
+ return VM_FAULT_OOM;
+
+ doutc(cl, "%llx.%llx %llu~%zd getting caps i_size %llu\n",
+ ceph_vinop(inode), off, len, size);
+ if (fi->fmode & CEPH_FILE_MODE_LAZY)
+ want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
+ else
+ want = CEPH_CAP_FILE_BUFFER;
+
+ got = 0;
+ err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, off + len, &got);
+ if (err < 0)
+ goto out_free;
+
+ doutc(cl, "%llx.%llx %llu~%zd got cap refs on %s\n", ceph_vinop(inode),
+ off, len, ceph_cap_string(got));
+
+ /* Get the snap this write is going to belong to. */
+ snapc = ceph_get_most_recent_snapc(inode);
+
+ ret = netfs_page_mkwrite(vmf, &snapc->group, &prealloc_cf);
+
+ doutc(cl, "%llx.%llx %llu~%zd dropping cap refs on %s ret %x\n",
+ ceph_vinop(inode), off, len, ceph_cap_string(got), ret);
+ ceph_put_cap_refs_async(ci, got);
+out_free:
+ ceph_free_cap_flush(prealloc_cf);
+ if (err < 0)
+ ret = vmf_error(err);
+ return ret;
+}
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 14784ad86670..acd5c4821ded 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -470,7 +470,7 @@ struct ceph_inode_info {
#endif
};
-struct ceph_netfs_request_data {
+struct ceph_netfs_request_data { // TODO: Remove
int caps;
/*
@@ -483,6 +483,29 @@ struct ceph_netfs_request_data {
bool file_ra_disabled;
};
+struct ceph_io_request {
+ struct netfs_io_request rreq;
+ u64 rmw_assert_version;
+ int caps;
+
+ /*
+ * Maximum size of a file readahead request.
+ * The fadvise could update the bdi's default ra_pages.
+ */
+ unsigned int file_ra_pages;
+
+ /* Set it if fadvise disables file readahead entirely */
+ bool file_ra_disabled;
+};
+
+struct ceph_io_subrequest {
+ union {
+ struct netfs_io_subrequest sreq;
+ struct ceph_io_request *creq;
+ };
+ struct ceph_osd_request *req;
+};
+
static inline struct ceph_inode_info *
ceph_inode(const struct inode *inode)
{
@@ -1237,8 +1260,10 @@ extern void __ceph_touch_fmode(struct ceph_inode_info *ci,
struct ceph_mds_client *mdsc, int fmode);
/* addr.c */
-extern const struct address_space_operations ceph_aops;
+#if 0 // TODO: Remove after netfs conversion
extern const struct netfs_request_ops ceph_netfs_ops;
+#endif // TODO: Remove after netfs conversion
+bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio);
extern int ceph_mmap(struct file *file, struct vm_area_struct *vma);
extern int ceph_uninline_data(struct file *file);
extern int ceph_pool_perm_check(struct inode *inode, int need);
@@ -1253,6 +1278,14 @@ static inline bool ceph_has_inline_data(struct ceph_inode_info *ci)
return true;
}
+/* rdwr.c */
+extern const struct netfs_request_ops ceph_netfs_ops;
+extern const struct address_space_operations ceph_aops;
+
+ssize_t ceph_netfs_read_iter(struct kiocb *iocb, struct iov_iter *to);
+ssize_t ceph_netfs_write_iter(struct kiocb *iocb, struct iov_iter *from);
+vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf);
+
/* file.c */
extern const struct file_operations ceph_file_fops;
@@ -1260,9 +1293,11 @@ extern int ceph_renew_caps(struct inode *inode, int fmode);
extern int ceph_open(struct inode *inode, struct file *file);
extern int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
struct file *file, unsigned flags, umode_t mode);
+#if 0 // TODO: Remove after netfs conversion
extern ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
struct iov_iter *to, int *retry_op,
u64 *last_objver);
+#endif
extern int ceph_release(struct inode *inode, struct file *filp);
extern void ceph_fill_inline_data(struct inode *inode, struct page *locked_page,
char *data, size_t len);
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index 9724d5a1ddc7..a82eb3be9737 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -264,9 +264,9 @@ static inline bool netfs_is_cache_enabled(struct netfs_inode *ctx)
}
/*
- * Check to see if a buffer aligns with the crypto block size. If it doesn't
- * the crypto layer is going to copy all the data - in which case relying on
- * the crypto op for a free copy is pointless.
+ * Check to see if a buffer aligns with the crypto unit block size. If it
+ * doesn't, the crypto layer is going to copy all the data - in which case
+ * relying on the crypto op for a free copy is pointless.
*/
static inline bool netfs_is_crypto_aligned(struct netfs_io_request *rreq,
struct iov_iter *iter)
diff --git a/fs/netfs/main.c b/fs/netfs/main.c
index 0900dea53e4a..d431ba261920 100644
--- a/fs/netfs/main.c
+++ b/fs/netfs/main.c
@@ -139,7 +139,7 @@ static int __init netfs_init(void)
goto error_folio_pool;
netfs_request_slab = kmem_cache_create("netfs_request",
- sizeof(struct netfs_io_request), 0,
+ NETFS_DEF_IO_REQUEST_SIZE, 0,
SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
NULL);
if (!netfs_request_slab)
@@ -149,7 +149,7 @@ static int __init netfs_init(void)
goto error_reqpool;
netfs_subrequest_slab = kmem_cache_create("netfs_subrequest",
- sizeof(struct netfs_io_subrequest) + 16, 0,
+ NETFS_DEF_IO_SUBREQUEST_SIZE, 0,
SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
NULL);
if (!netfs_subrequest_slab)
diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
index 9b8d99477405..091328596533 100644
--- a/fs/netfs/write_issue.c
+++ b/fs/netfs/write_issue.c
@@ -652,7 +652,8 @@ int netfs_writepages_group(struct address_space *mapping,
if (netfs_folio_group(folio) != NETFS_FOLIO_COPY_TO_CACHE &&
unlikely(!test_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags))) {
set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags);
- wreq->netfs_ops->begin_writeback(wreq);
+ if (wreq->netfs_ops->begin_writeback)
+ wreq->netfs_ops->begin_writeback(wreq);
}
error = netfs_write_folio(wreq, wbc, folio);
@@ -967,7 +968,8 @@ int netfs_writeback_single(struct address_space *mapping,
trace_netfs_write(wreq, netfs_write_trace_writeback);
netfs_stat(&netfs_n_wh_writepages);
- if (__test_and_set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags))
+ if (__test_and_set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags) &&
+ wreq->netfs_ops->begin_writeback)
wreq->netfs_ops->begin_writeback(wreq);
for (fq = (struct folio_queue *)iter->folioq; fq; fq = fq->next) {
diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
index 733e7f93db66..0c626a7d32f4 100644
--- a/include/linux/ceph/libceph.h
+++ b/include/linux/ceph/libceph.h
@@ -16,6 +16,7 @@
#include <linux/writeback.h>
#include <linux/slab.h>
#include <linux/refcount.h>
+#include <linux/netfs.h>
#include <linux/ceph/types.h>
#include <linux/ceph/messenger.h>
@@ -161,7 +162,7 @@ static inline bool ceph_msgr2(struct ceph_client *client)
* dirtied.
*/
struct ceph_snap_context {
- refcount_t nref;
+ struct netfs_group group;
u64 seq;
u32 num_snaps;
u64 snaps[];
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 7eff589711cc..7f8d28b2c41b 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -246,6 +246,7 @@ struct ceph_osd_request {
struct completion r_completion; /* private to osd_client.c */
ceph_osdc_callback_t r_callback;
+ struct netfs_io_subrequest *r_subreq;
struct inode *r_inode; /* for use by callbacks */
struct list_head r_private_item; /* ditto */
void *r_priv; /* ditto */
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 4049c985b9b4..3253352fcbfa 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -26,6 +26,14 @@ enum netfs_sreq_ref_trace;
typedef struct mempool_s mempool_t;
struct folio_queue;
+/*
+ * Size of allocations for default netfs_io_(sub)request object slabs and
+ * mempools. If a filesystem's request and subrequest objects fit within this
+ * size, they can use these otherwise they must provide their own.
+ */
+#define NETFS_DEF_IO_REQUEST_SIZE (sizeof(struct netfs_io_request) + 24)
+#define NETFS_DEF_IO_SUBREQUEST_SIZE (sizeof(struct netfs_io_subrequest) + 16)
+
/**
* folio_start_private_2 - Start an fscache write on a folio. [DEPRECATED]
* @folio: The folio.
@@ -184,7 +192,10 @@ struct netfs_io_subrequest {
struct list_head rreq_link; /* Link in req/stream::subrequests */
struct list_head ioq_link; /* Link in io_stream::io_queue */
union {
- struct iov_iter io_iter; /* Iterator for this subrequest */
+ struct {
+ struct iov_iter io_iter; /* Iterator for this subrequest */
+ void *fs_private; /* Filesystem specific */
+ };
struct {
struct scatterlist src_sg; /* Source for crypto subreq */
struct scatterlist dst_sg; /* Dest for crypto subreq */
diff --git a/net/ceph/snapshot.c b/net/ceph/snapshot.c
index e24315937c45..92f63cbca183 100644
--- a/net/ceph/snapshot.c
+++ b/net/ceph/snapshot.c
@@ -17,6 +17,11 @@
* the entire structure is freed.
*/
+static void ceph_snap_context_kfree(struct netfs_group *group)
+{
+ kfree(group);
+}
+
/*
* Create a new ceph snapshot context large enough to hold the
* indicated number of snapshot ids (which can be 0). Caller has
@@ -36,8 +41,9 @@ struct ceph_snap_context *ceph_create_snap_context(u32 snap_count,
if (!snapc)
return NULL;
- refcount_set(&snapc->nref, 1);
- snapc->num_snaps = snap_count;
+ refcount_set(&snapc->group.ref, 1);
+ snapc->group.free = ceph_snap_context_kfree;
+ snapc->num_snaps = snap_count;
return snapc;
}
@@ -46,18 +52,14 @@ EXPORT_SYMBOL(ceph_create_snap_context);
struct ceph_snap_context *ceph_get_snap_context(struct ceph_snap_context *sc)
{
if (sc)
- refcount_inc(&sc->nref);
+ netfs_get_group(&sc->group);
return sc;
}
EXPORT_SYMBOL(ceph_get_snap_context);
void ceph_put_snap_context(struct ceph_snap_context *sc)
{
- if (!sc)
- return;
- if (refcount_dec_and_test(&sc->nref)) {
- /*printk(" deleting snap_context %p\n", sc);*/
- kfree(sc);
- }
+ if (sc)
+ netfs_put_group(&sc->group);
}
EXPORT_SYMBOL(ceph_put_snap_context);
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [RFC PATCH 34/35] ceph: Enable multipage folios for ceph files
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (32 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 33/35] ceph: Use netfslib [INCOMPLETE] David Howells
@ 2025-03-13 23:33 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 35/35] ceph: Remove old I/O API bits David Howells
34 siblings, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Enable multipage folio support for ceph regular files.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
fs/ceph/inode.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 8f73f3a55a3e..d9215423e011 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1184,6 +1184,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
case S_IFREG:
inode->i_op = &ceph_file_iops;
inode->i_fop = &ceph_file_fops;
+ mapping_set_large_folios(inode->i_mapping);
break;
case S_IFLNK:
if (!ci->i_symlink) {
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [RFC PATCH 35/35] ceph: Remove old I/O API bits
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
` (33 preceding siblings ...)
2025-03-13 23:33 ` [RFC PATCH 34/35] ceph: Enable multipage folios for ceph files David Howells
@ 2025-03-13 23:33 ` David Howells
34 siblings, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-13 23:33 UTC (permalink / raw)
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel
Remove the #if'd out bits of the old I/O API. This is separate to the
implementation to reduce the size of the reviewable patch.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
fs/ceph/addr.c | 2018 ++---------------------------------------------
fs/ceph/file.c | 1504 -----------------------------------
fs/ceph/super.h | 21 -
3 files changed, 46 insertions(+), 3497 deletions(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 325fbbce1eaa..b3ba102af60b 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -59,1890 +59,70 @@
* accounting is preserved.
*/
-#define CONGESTION_ON_THRESH(congestion_kb) (congestion_kb >> (PAGE_SHIFT-10))
-#define CONGESTION_OFF_THRESH(congestion_kb) \
- (CONGESTION_ON_THRESH(congestion_kb) - \
- (CONGESTION_ON_THRESH(congestion_kb) >> 2))
-
-#if 0 // TODO: Remove after netfs conversion
-static int ceph_netfs_check_write_begin(struct file *file, loff_t pos, unsigned int len,
- struct folio **foliop, void **_fsdata);
-
-static struct ceph_snap_context *page_snap_context(struct page *page)
-{
- if (PagePrivate(page))
- return (void *)page->private;
- return NULL;
-}
-#endif // TODO: Remove after netfs conversion
-
-/*
- * Dirty a page. Optimistically adjust accounting, on the assumption
- * that we won't race with invalidate. If we do, readjust.
- */
-bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio)
-{
- struct inode *inode = mapping->host;
- struct ceph_client *cl = ceph_inode_to_client(inode);
- struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
- struct ceph_inode_info *ci;
- struct ceph_snap_context *snapc;
- struct netfs_group *group;
-
- if (folio_test_dirty(folio)) {
- doutc(cl, "%llx.%llx %p idx %lu -- already dirty\n",
- ceph_vinop(inode), folio, folio->index);
- VM_BUG_ON_FOLIO(!folio_test_private(folio), folio);
- return false;
- }
-
- atomic64_inc(&mdsc->dirty_folios);
-
- ci = ceph_inode(inode);
-
- /* dirty the head */
- spin_lock(&ci->i_ceph_lock);
- if (__ceph_have_pending_cap_snap(ci)) {
- struct ceph_cap_snap *capsnap =
- list_last_entry(&ci->i_cap_snaps,
- struct ceph_cap_snap,
- ci_item);
- snapc = capsnap->context;
- capsnap->dirty_pages++;
- } else {
- snapc = ci->i_head_snapc;
- BUG_ON(!snapc);
- ++ci->i_wrbuffer_ref_head;
- }
-
- /* Attach a reference to the snap/group to the folio. */
- group = netfs_folio_group(folio);
- if (group != &snapc->group) {
- netfs_set_group(folio, &snapc->group);
- if (group) {
- doutc(cl, "Different group %px != %px\n",
- group, &snapc->group);
- netfs_put_group(group);
- }
- }
-
- if (ci->i_wrbuffer_ref == 0)
- ihold(inode);
- ++ci->i_wrbuffer_ref;
- doutc(cl, "%llx.%llx %p idx %lu head %d/%d -> %d/%d "
- "snapc %p seq %lld (%d snaps)\n",
- ceph_vinop(inode), folio, folio->index,
- ci->i_wrbuffer_ref-1, ci->i_wrbuffer_ref_head-1,
- ci->i_wrbuffer_ref, ci->i_wrbuffer_ref_head,
- snapc, snapc->seq, snapc->num_snaps);
- spin_unlock(&ci->i_ceph_lock);
-
- return netfs_dirty_folio(mapping, folio);
-}
-
-#if 0 // TODO: Remove after netfs conversion
-/*
- * If we are truncating the full folio (i.e. offset == 0), adjust the
- * dirty folio counters appropriately. Only called if there is private
- * data on the folio.
- */
-static void ceph_invalidate_folio(struct folio *folio, size_t offset,
- size_t length)
-{
- struct inode *inode = folio->mapping->host;
- struct ceph_client *cl = ceph_inode_to_client(inode);
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct ceph_snap_context *snapc;
-
-
- if (offset != 0 || length != folio_size(folio)) {
- doutc(cl, "%llx.%llx idx %lu partial dirty page %zu~%zu\n",
- ceph_vinop(inode), folio->index, offset, length);
- return;
- }
-
- WARN_ON(!folio_test_locked(folio));
- if (folio_test_private(folio)) {
- doutc(cl, "%llx.%llx idx %lu full dirty page\n",
- ceph_vinop(inode), folio->index);
-
- snapc = folio_detach_private(folio);
- ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
- ceph_put_snap_context(snapc);
- }
-
- netfs_invalidate_folio(folio, offset, length);
-}
-
-static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq)
-{
- struct inode *inode = rreq->inode;
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct ceph_file_layout *lo = &ci->i_layout;
- unsigned long max_pages = inode->i_sb->s_bdi->ra_pages;
- loff_t end = rreq->start + rreq->len, new_end;
- struct ceph_netfs_request_data *priv = rreq->netfs_priv;
- unsigned long max_len;
- u32 blockoff;
-
- if (priv) {
- /* Readahead is disabled by posix_fadvise POSIX_FADV_RANDOM */
- if (priv->file_ra_disabled)
- max_pages = 0;
- else
- max_pages = priv->file_ra_pages;
-
- }
-
- /* Readahead is disabled */
- if (!max_pages)
- return;
-
- max_len = max_pages << PAGE_SHIFT;
-
- /*
- * Try to expand the length forward by rounding up it to the next
- * block, but do not exceed the file size, unless the original
- * request already exceeds it.
- */
- new_end = umin(round_up(end, lo->stripe_unit), rreq->i_size);
- if (new_end > end && new_end <= rreq->start + max_len)
- rreq->len = new_end - rreq->start;
-
- /* Try to expand the start downward */
- div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
- if (rreq->len + blockoff <= max_len) {
- rreq->start -= blockoff;
- rreq->len += blockoff;
- }
-}
-
-static void finish_netfs_read(struct ceph_osd_request *req)
-{
- struct inode *inode = req->r_inode;
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- struct ceph_client *cl = fsc->client;
- struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0);
- struct netfs_io_subrequest *subreq = req->r_priv;
- struct ceph_osd_req_op *op = &req->r_ops[0];
- int err = req->r_result;
- bool sparse = (op->op == CEPH_OSD_OP_SPARSE_READ);
-
- ceph_update_read_metrics(&fsc->mdsc->metric, req->r_start_latency,
- req->r_end_latency, osd_data->length, err);
-
- doutc(cl, "result %d subreq->len=%zu i_size=%lld\n", req->r_result,
- subreq->len, i_size_read(req->r_inode));
-
- /* no object means success but no data */
- if (err == -ENOENT) {
- __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
- __set_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
- err = 0;
- } else if (err == -EBLOCKLISTED) {
- fsc->blocklisted = true;
- }
-
- if (err >= 0) {
- if (sparse && err > 0)
- err = ceph_sparse_ext_map_end(op);
- if (err < subreq->len &&
- subreq->rreq->origin != NETFS_DIO_READ)
- __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
- if (IS_ENCRYPTED(inode) && err > 0) {
- err = ceph_fscrypt_decrypt_extents(inode,
- osd_data->pages, subreq->start,
- op->extent.sparse_ext,
- op->extent.sparse_ext_cnt);
- if (err > subreq->len)
- err = subreq->len;
- }
- if (err > 0)
- __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
- }
-
- if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES) {
- ceph_put_page_vector(osd_data->pages,
- calc_pages_for(osd_data->offset,
- osd_data->length), false);
- }
- if (err > 0) {
- subreq->transferred = err;
- err = 0;
- }
- subreq->error = err;
- trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress);
- netfs_read_subreq_terminated(subreq);
- iput(req->r_inode);
- ceph_dec_osd_stopping_blocker(fsc->mdsc);
-}
-
-static bool ceph_netfs_issue_op_inline(struct netfs_io_subrequest *subreq)
-{
- struct netfs_io_request *rreq = subreq->rreq;
- struct inode *inode = rreq->inode;
- struct ceph_mds_reply_info_parsed *rinfo;
- struct ceph_mds_reply_info_in *iinfo;
- struct ceph_mds_request *req;
- struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
- struct ceph_inode_info *ci = ceph_inode(inode);
- ssize_t err = 0;
- size_t len;
- int mode;
-
- if (rreq->origin != NETFS_DIO_READ)
- __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
- __clear_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags);
-
- if (subreq->start >= inode->i_size)
- goto out;
-
- /* We need to fetch the inline data. */
- mode = ceph_try_to_choose_auth_mds(inode, CEPH_STAT_CAP_INLINE_DATA);
- req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_GETATTR, mode);
- if (IS_ERR(req)) {
- err = PTR_ERR(req);
- goto out;
- }
- req->r_ino1 = ci->i_vino;
- req->r_args.getattr.mask = cpu_to_le32(CEPH_STAT_CAP_INLINE_DATA);
- req->r_num_caps = 2;
-
- trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
- err = ceph_mdsc_do_request(mdsc, NULL, req);
- if (err < 0)
- goto out;
-
- rinfo = &req->r_reply_info;
- iinfo = &rinfo->targeti;
- if (iinfo->inline_version == CEPH_INLINE_NONE) {
- /* The data got uninlined */
- ceph_mdsc_put_request(req);
- return false;
- }
-
- len = min_t(size_t, iinfo->inline_len - subreq->start, subreq->len);
- err = copy_to_iter(iinfo->inline_data + subreq->start, len, &subreq->io_iter);
- if (err == 0) {
- err = -EFAULT;
- } else {
- subreq->transferred += err;
- err = 0;
- }
-
- ceph_mdsc_put_request(req);
-out:
- subreq->error = err;
- trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress);
- netfs_read_subreq_terminated(subreq);
- return true;
-}
-
-static int ceph_netfs_prepare_read(struct netfs_io_subrequest *subreq)
-{
- struct netfs_io_request *rreq = subreq->rreq;
- struct inode *inode = rreq->inode;
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- size_t xlen;
- u64 objno, objoff;
-
- /* Truncate the extent at the end of the current block */
- ceph_calc_file_object_mapping(&ci->i_layout, subreq->start, subreq->len,
- &objno, &objoff, &xlen);
- rreq->io_streams[0].sreq_max_len = umin(xlen, fsc->mount_options->rsize);
- return 0;
-}
-
-static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq)
-{
- struct netfs_io_request *rreq = subreq->rreq;
- struct inode *inode = rreq->inode;
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- struct ceph_client *cl = fsc->client;
- struct ceph_osd_request *req = NULL;
- struct ceph_vino vino = ceph_vino(inode);
- int err;
- u64 len;
- bool sparse = IS_ENCRYPTED(inode) || ceph_test_mount_opt(fsc, SPARSEREAD);
- u64 off = subreq->start;
- int extent_cnt;
-
- if (ceph_inode_is_shutdown(inode)) {
- err = -EIO;
- goto out;
- }
-
- if (ceph_has_inline_data(ci) && ceph_netfs_issue_op_inline(subreq))
- return;
-
- // TODO: This rounding here is slightly dodgy. It *should* work, for
- // now, as the cache only deals in blocks that are a multiple of
- // PAGE_SIZE and fscrypt blocks are at most PAGE_SIZE. What needs to
- // happen is for the fscrypt driving to be moved into netfslib and the
- // data in the cache also to be stored encrypted.
- len = subreq->len;
- ceph_fscrypt_adjust_off_and_len(inode, &off, &len);
-
- req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout, vino,
- off, &len, 0, 1, sparse ? CEPH_OSD_OP_SPARSE_READ : CEPH_OSD_OP_READ,
- CEPH_OSD_FLAG_READ, NULL, ci->i_truncate_seq,
- ci->i_truncate_size, false);
- if (IS_ERR(req)) {
- err = PTR_ERR(req);
- req = NULL;
- goto out;
- }
-
- if (sparse) {
- extent_cnt = __ceph_sparse_read_ext_count(inode, len);
- err = ceph_alloc_sparse_ext_map(&req->r_ops[0], extent_cnt);
- if (err)
- goto out;
- }
-
- doutc(cl, "%llx.%llx pos=%llu orig_len=%zu len=%llu\n",
- ceph_vinop(inode), subreq->start, subreq->len, len);
-
- /*
- * FIXME: For now, use CEPH_OSD_DATA_TYPE_PAGES instead of _ITER for
- * encrypted inodes. We'd need infrastructure that handles an iov_iter
- * instead of page arrays, and we don't have that as of yet. Once the
- * dust settles on the write helpers and encrypt/decrypt routines for
- * netfs, we should be able to rework this.
- */
- if (IS_ENCRYPTED(inode)) {
- struct page **pages;
- size_t page_off;
-
- /*
- * The io_iter.count needs to be corrected to aligned length.
- * Otherwise, iov_iter_get_pages_alloc2() operates with
- * the initial unaligned length value. As a result,
- * ceph_msg_data_cursor_init() triggers BUG_ON() in the case
- * if msg->sparse_read_total > msg->data_length.
- */
- subreq->io_iter.count = len;
-
- err = iov_iter_get_pages_alloc2(&subreq->io_iter, &pages, len, &page_off);
- if (err < 0) {
- doutc(cl, "%llx.%llx failed to allocate pages, %d\n",
- ceph_vinop(inode), err);
- goto out;
- }
-
- /* should always give us a page-aligned read */
- WARN_ON_ONCE(page_off);
-
- len = err;
- err = 0;
-
- osd_req_op_extent_osd_data_pages(req, 0, pages, len, 0, false,
- false);
- } else {
- osd_req_op_extent_osd_iter(req, 0, &subreq->io_iter);
- }
- if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
- err = -EIO;
- goto out;
- }
- req->r_callback = finish_netfs_read;
- req->r_priv = subreq;
- req->r_inode = inode;
- ihold(inode);
-
- trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
- ceph_osdc_start_request(req->r_osdc, req);
-out:
- ceph_osdc_put_request(req);
- if (err) {
- subreq->error = err;
- netfs_read_subreq_terminated(subreq);
- }
- doutc(cl, "%llx.%llx result %d\n", ceph_vinop(inode), err);
-}
-
-static int ceph_init_request(struct netfs_io_request *rreq, struct file *file)
-{
- struct inode *inode = rreq->inode;
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- struct ceph_client *cl = ceph_inode_to_client(inode);
- int got = 0, want = CEPH_CAP_FILE_CACHE;
- struct ceph_netfs_request_data *priv;
- int ret = 0;
-
- /* [DEPRECATED] Use PG_private_2 to mark folio being written to the cache. */
- __set_bit(NETFS_RREQ_USE_PGPRIV2, &rreq->flags);
-
- if (rreq->origin != NETFS_READAHEAD)
- return 0;
-
- priv = kzalloc(sizeof(*priv), GFP_NOFS);
- if (!priv)
- return -ENOMEM;
-
- /*
- * If we are doing readahead triggered by a read, fault-in or
- * MADV/FADV_WILLNEED, someone higher up the stack must be holding the
- * FILE_CACHE and/or LAZYIO caps.
- */
- if (file) {
- priv->file_ra_pages = file->f_ra.ra_pages;
- priv->file_ra_disabled = file->f_mode & FMODE_RANDOM;
- rreq->netfs_priv = priv;
- return 0;
- }
-
- /*
- * readahead callers do not necessarily hold Fcb caps
- * (e.g. fadvise, madvise).
- */
- ret = ceph_try_get_caps(inode, CEPH_CAP_FILE_RD, want, true, &got);
- if (ret < 0) {
- doutc(cl, "%llx.%llx, error getting cap\n", ceph_vinop(inode));
- goto out;
- }
-
- if (!(got & want)) {
- doutc(cl, "%llx.%llx, no cache cap\n", ceph_vinop(inode));
- ret = -EACCES;
- goto out;
- }
- if (ret == 0) {
- ret = -EACCES;
- goto out;
- }
-
- priv->caps = got;
- rreq->netfs_priv = priv;
- rreq->io_streams[0].sreq_max_len = fsc->mount_options->rsize;
-
-out:
- if (ret < 0) {
- if (got)
- ceph_put_cap_refs(ceph_inode(inode), got);
- kfree(priv);
- }
-
- return ret;
-}
-
-static void ceph_netfs_free_request(struct netfs_io_request *rreq)
-{
- struct ceph_netfs_request_data *priv = rreq->netfs_priv;
-
- if (!priv)
- return;
-
- if (priv->caps)
- ceph_put_cap_refs(ceph_inode(rreq->inode), priv->caps);
- kfree(priv);
- rreq->netfs_priv = NULL;
-}
-
-const struct netfs_request_ops ceph_netfs_ops = {
- .init_request = ceph_init_request,
- .free_request = ceph_netfs_free_request,
- .prepare_read = ceph_netfs_prepare_read,
- .issue_read = ceph_netfs_issue_read,
- .expand_readahead = ceph_netfs_expand_readahead,
- .check_write_begin = ceph_netfs_check_write_begin,
-};
-
-#ifdef CONFIG_CEPH_FSCACHE
-static void ceph_set_page_fscache(struct page *page)
-{
- folio_start_private_2(page_folio(page)); /* [DEPRECATED] */
-}
-
-static void ceph_fscache_write_terminated(void *priv, ssize_t error, bool was_async)
-{
- struct inode *inode = priv;
-
- if (IS_ERR_VALUE(error) && error != -ENOBUFS)
- ceph_fscache_invalidate(inode, false);
-}
-
-static void ceph_fscache_write_to_cache(struct inode *inode, u64 off, u64 len, bool caching)
-{
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct fscache_cookie *cookie = ceph_fscache_cookie(ci);
-
- fscache_write_to_cache(cookie, inode->i_mapping, off, len, i_size_read(inode),
- ceph_fscache_write_terminated, inode, true, caching);
-}
-#else
-static inline void ceph_set_page_fscache(struct page *page)
-{
-}
-
-static inline void ceph_fscache_write_to_cache(struct inode *inode, u64 off, u64 len, bool caching)
-{
-}
-#endif /* CONFIG_CEPH_FSCACHE */
-
-struct ceph_writeback_ctl
-{
- loff_t i_size;
- u64 truncate_size;
- u32 truncate_seq;
- bool size_stable;
-
- bool head_snapc;
- struct ceph_snap_context *snapc;
- struct ceph_snap_context *last_snapc;
-
- bool done;
- bool should_loop;
- bool range_whole;
- pgoff_t start_index;
- pgoff_t index;
- pgoff_t end;
- xa_mark_t tag;
-
- pgoff_t strip_unit_end;
- unsigned int wsize;
- unsigned int nr_folios;
- unsigned int max_pages;
- unsigned int locked_pages;
-
- int op_idx;
- int num_ops;
- u64 offset;
- u64 len;
-
- struct folio_batch fbatch;
- unsigned int processed_in_fbatch;
-
- bool from_pool;
- struct page **pages;
- struct page **data_pages;
-};
-
-/*
- * Get ref for the oldest snapc for an inode with dirty data... that is, the
- * only snap context we are allowed to write back.
- */
-static struct ceph_snap_context *
-get_oldest_context(struct inode *inode, struct ceph_writeback_ctl *ctl,
- struct ceph_snap_context *page_snapc)
-{
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct ceph_client *cl = ceph_inode_to_client(inode);
- struct ceph_snap_context *snapc = NULL;
- struct ceph_cap_snap *capsnap = NULL;
-
- spin_lock(&ci->i_ceph_lock);
- list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) {
- doutc(cl, " capsnap %p snapc %p has %d dirty pages\n",
- capsnap, capsnap->context, capsnap->dirty_pages);
- if (!capsnap->dirty_pages)
- continue;
-
- /* get i_size, truncate_{seq,size} for page_snapc? */
- if (snapc && capsnap->context != page_snapc)
- continue;
-
- if (ctl) {
- if (capsnap->writing) {
- ctl->i_size = i_size_read(inode);
- ctl->size_stable = false;
- } else {
- ctl->i_size = capsnap->size;
- ctl->size_stable = true;
- }
- ctl->truncate_size = capsnap->truncate_size;
- ctl->truncate_seq = capsnap->truncate_seq;
- ctl->head_snapc = false;
- }
-
- if (snapc)
- break;
-
- snapc = ceph_get_snap_context(capsnap->context);
- if (!page_snapc ||
- page_snapc == snapc ||
- page_snapc->seq > snapc->seq)
- break;
- }
- if (!snapc && ci->i_wrbuffer_ref_head) {
- snapc = ceph_get_snap_context(ci->i_head_snapc);
- doutc(cl, " head snapc %p has %d dirty pages\n", snapc,
- ci->i_wrbuffer_ref_head);
- if (ctl) {
- ctl->i_size = i_size_read(inode);
- ctl->truncate_size = ci->i_truncate_size;
- ctl->truncate_seq = ci->i_truncate_seq;
- ctl->size_stable = false;
- ctl->head_snapc = true;
- }
- }
- spin_unlock(&ci->i_ceph_lock);
- return snapc;
-}
-
-static u64 get_writepages_data_length(struct inode *inode,
- struct page *page, u64 start)
-{
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct ceph_snap_context *snapc;
- struct ceph_cap_snap *capsnap = NULL;
- u64 end = i_size_read(inode);
- u64 ret;
-
- snapc = page_snap_context(ceph_fscrypt_pagecache_page(page));
- if (snapc != ci->i_head_snapc) {
- bool found = false;
- spin_lock(&ci->i_ceph_lock);
- list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) {
- if (capsnap->context == snapc) {
- if (!capsnap->writing)
- end = capsnap->size;
- found = true;
- break;
- }
- }
- spin_unlock(&ci->i_ceph_lock);
- WARN_ON(!found);
- }
- if (end > ceph_fscrypt_page_offset(page) + thp_size(page))
- end = ceph_fscrypt_page_offset(page) + thp_size(page);
- ret = end > start ? end - start : 0;
- if (ret && fscrypt_is_bounce_page(page))
- ret = round_up(ret, CEPH_FSCRYPT_BLOCK_SIZE);
- return ret;
-}
-
-/*
- * Write a folio, but leave it locked.
- *
- * If we get a write error, mark the mapping for error, but still adjust the
- * dirty page accounting (i.e., folio is no longer dirty).
- */
-static int write_folio_nounlock(struct folio *folio,
- struct writeback_control *wbc)
-{
- struct page *page = &folio->page;
- struct inode *inode = folio->mapping->host;
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- struct ceph_client *cl = fsc->client;
- struct ceph_snap_context *snapc, *oldest;
- loff_t page_off = folio_pos(folio);
- int err;
- loff_t len = folio_size(folio);
- loff_t wlen;
- struct ceph_writeback_ctl ceph_wbc;
- struct ceph_osd_client *osdc = &fsc->client->osdc;
- struct ceph_osd_request *req;
- bool caching = ceph_is_cache_enabled(inode);
- struct page *bounce_page = NULL;
-
- doutc(cl, "%llx.%llx folio %p idx %lu\n", ceph_vinop(inode), folio,
- folio->index);
-
- if (ceph_inode_is_shutdown(inode))
- return -EIO;
-
- /* verify this is a writeable snap context */
- snapc = page_snap_context(&folio->page);
- if (!snapc) {
- doutc(cl, "%llx.%llx folio %p not dirty?\n", ceph_vinop(inode),
- folio);
- return 0;
- }
- oldest = get_oldest_context(inode, &ceph_wbc, snapc);
- if (snapc->seq > oldest->seq) {
- doutc(cl, "%llx.%llx folio %p snapc %p not writeable - noop\n",
- ceph_vinop(inode), folio, snapc);
- /* we should only noop if called by kswapd */
- WARN_ON(!(current->flags & PF_MEMALLOC));
- ceph_put_snap_context(oldest);
- folio_redirty_for_writepage(wbc, folio);
- return 0;
- }
- ceph_put_snap_context(oldest);
-
- /* is this a partial page at end of file? */
- if (page_off >= ceph_wbc.i_size) {
- doutc(cl, "%llx.%llx folio at %lu beyond eof %llu\n",
- ceph_vinop(inode), folio->index, ceph_wbc.i_size);
- folio_invalidate(folio, 0, folio_size(folio));
- return 0;
- }
-
- if (ceph_wbc.i_size < page_off + len)
- len = ceph_wbc.i_size - page_off;
-
- wlen = IS_ENCRYPTED(inode) ? round_up(len, CEPH_FSCRYPT_BLOCK_SIZE) : len;
- doutc(cl, "%llx.%llx folio %p index %lu on %llu~%llu snapc %p seq %lld\n",
- ceph_vinop(inode), folio, folio->index, page_off, wlen, snapc,
- snapc->seq);
-
- if (atomic_long_inc_return(&fsc->writeback_count) >
- CONGESTION_ON_THRESH(fsc->mount_options->congestion_kb))
- fsc->write_congested = true;
-
- req = ceph_osdc_new_request(osdc, &ci->i_layout, ceph_vino(inode),
- page_off, &wlen, 0, 1, CEPH_OSD_OP_WRITE,
- CEPH_OSD_FLAG_WRITE, snapc,
- ceph_wbc.truncate_seq,
- ceph_wbc.truncate_size, true);
- if (IS_ERR(req)) {
- folio_redirty_for_writepage(wbc, folio);
- return PTR_ERR(req);
- }
-
- if (wlen < len)
- len = wlen;
-
- folio_start_writeback(folio);
- if (caching)
- ceph_set_page_fscache(&folio->page);
- ceph_fscache_write_to_cache(inode, page_off, len, caching);
-
- if (IS_ENCRYPTED(inode)) {
- bounce_page = fscrypt_encrypt_pagecache_blocks(&folio->page,
- CEPH_FSCRYPT_BLOCK_SIZE, 0,
- GFP_NOFS);
- if (IS_ERR(bounce_page)) {
- folio_redirty_for_writepage(wbc, folio);
- folio_end_writeback(folio);
- ceph_osdc_put_request(req);
- return PTR_ERR(bounce_page);
- }
- }
-
- /* it may be a short write due to an object boundary */
- WARN_ON_ONCE(len > folio_size(folio));
- osd_req_op_extent_osd_data_pages(req, 0,
- bounce_page ? &bounce_page : &page, wlen, 0,
- false, false);
- doutc(cl, "%llx.%llx %llu~%llu (%llu bytes, %sencrypted)\n",
- ceph_vinop(inode), page_off, len, wlen,
- IS_ENCRYPTED(inode) ? "" : "not ");
-
- req->r_mtime = inode_get_mtime(inode);
- ceph_osdc_start_request(osdc, req);
- err = ceph_osdc_wait_request(osdc, req);
-
- ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency,
- req->r_end_latency, len, err);
- fscrypt_free_bounce_page(bounce_page);
- ceph_osdc_put_request(req);
- if (err == 0)
- err = len;
-
- if (err < 0) {
- struct writeback_control tmp_wbc;
- if (!wbc)
- wbc = &tmp_wbc;
- if (err == -ERESTARTSYS) {
- /* killed by SIGKILL */
- doutc(cl, "%llx.%llx interrupted page %p\n",
- ceph_vinop(inode), folio);
- folio_redirty_for_writepage(wbc, folio);
- folio_end_writeback(folio);
- return err;
- }
- if (err == -EBLOCKLISTED)
- fsc->blocklisted = true;
- doutc(cl, "%llx.%llx setting mapping error %d %p\n",
- ceph_vinop(inode), err, folio);
- mapping_set_error(&inode->i_data, err);
- wbc->pages_skipped++;
- } else {
- doutc(cl, "%llx.%llx cleaned page %p\n",
- ceph_vinop(inode), folio);
- err = 0; /* vfs expects us to return 0 */
- }
- oldest = folio_detach_private(folio);
- WARN_ON_ONCE(oldest != snapc);
- folio_end_writeback(folio);
- ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
- ceph_put_snap_context(snapc); /* page's reference */
-
- if (atomic_long_dec_return(&fsc->writeback_count) <
- CONGESTION_OFF_THRESH(fsc->mount_options->congestion_kb))
- fsc->write_congested = false;
-
- return err;
-}
-
-/*
- * async writeback completion handler.
- *
- * If we get an error, set the mapping error bit, but not the individual
- * page error bits.
- */
-static void writepages_finish(struct ceph_osd_request *req)
-{
- struct inode *inode = req->r_inode;
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct ceph_client *cl = ceph_inode_to_client(inode);
- struct ceph_osd_data *osd_data;
- struct page *page;
- int num_pages, total_pages = 0;
- int i, j;
- int rc = req->r_result;
- struct ceph_snap_context *snapc = req->r_snapc;
- struct address_space *mapping = inode->i_mapping;
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
- unsigned int len = 0;
- bool remove_page;
-
- doutc(cl, "%llx.%llx rc %d\n", ceph_vinop(inode), rc);
- if (rc < 0) {
- mapping_set_error(mapping, rc);
- ceph_set_error_write(ci);
- if (rc == -EBLOCKLISTED)
- fsc->blocklisted = true;
- } else {
- ceph_clear_error_write(ci);
- }
-
- /*
- * We lost the cache cap, need to truncate the page before
- * it is unlocked, otherwise we'd truncate it later in the
- * page truncation thread, possibly losing some data that
- * raced its way in
- */
- remove_page = !(ceph_caps_issued(ci) &
- (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO));
-
- /* clean all pages */
- for (i = 0; i < req->r_num_ops; i++) {
- if (req->r_ops[i].op != CEPH_OSD_OP_WRITE) {
- pr_warn_client(cl,
- "%llx.%llx incorrect op %d req %p index %d tid %llu\n",
- ceph_vinop(inode), req->r_ops[i].op, req, i,
- req->r_tid);
- break;
- }
-
- osd_data = osd_req_op_extent_osd_data(req, i);
- BUG_ON(osd_data->type != CEPH_OSD_DATA_TYPE_PAGES);
- len += osd_data->length;
- num_pages = calc_pages_for((u64)osd_data->offset,
- (u64)osd_data->length);
- total_pages += num_pages;
- for (j = 0; j < num_pages; j++) {
- page = osd_data->pages[j];
- if (fscrypt_is_bounce_page(page)) {
- page = fscrypt_pagecache_page(page);
- fscrypt_free_bounce_page(osd_data->pages[j]);
- osd_data->pages[j] = page;
- }
- BUG_ON(!page);
- WARN_ON(!PageUptodate(page));
-
- if (atomic_long_dec_return(&fsc->writeback_count) <
- CONGESTION_OFF_THRESH(
- fsc->mount_options->congestion_kb))
- fsc->write_congested = false;
-
- ceph_put_snap_context(detach_page_private(page));
- end_page_writeback(page);
-
- if (atomic64_dec_return(&mdsc->dirty_folios) <= 0) {
- wake_up_all(&mdsc->flush_end_wq);
- WARN_ON(atomic64_read(&mdsc->dirty_folios) < 0);
- }
-
- doutc(cl, "unlocking %p\n", page);
-
- if (remove_page)
- generic_error_remove_folio(inode->i_mapping,
- page_folio(page));
-
- unlock_page(page);
- }
- doutc(cl, "%llx.%llx wrote %llu bytes cleaned %d pages\n",
- ceph_vinop(inode), osd_data->length,
- rc >= 0 ? num_pages : 0);
-
- release_pages(osd_data->pages, num_pages);
- }
-
- ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency,
- req->r_end_latency, len, rc);
-
- ceph_put_wrbuffer_cap_refs(ci, total_pages, snapc);
-
- osd_data = osd_req_op_extent_osd_data(req, 0);
- if (osd_data->pages_from_pool)
- mempool_free(osd_data->pages, ceph_wb_pagevec_pool);
- else
- kfree(osd_data->pages);
- ceph_osdc_put_request(req);
- ceph_dec_osd_stopping_blocker(fsc->mdsc);
-}
-
-static inline
-bool is_forced_umount(struct address_space *mapping)
-{
- struct inode *inode = mapping->host;
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- struct ceph_client *cl = fsc->client;
-
- if (ceph_inode_is_shutdown(inode)) {
- if (ci->i_wrbuffer_ref > 0) {
- pr_warn_ratelimited_client(cl,
- "%llx.%llx %lld forced umount\n",
- ceph_vinop(inode), ceph_ino(inode));
- }
- mapping_set_error(mapping, -EIO);
- return true;
- }
-
- return false;
-}
-
-static inline
-unsigned int ceph_define_write_size(struct address_space *mapping)
-{
- struct inode *inode = mapping->host;
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- unsigned int wsize = i_blocksize(inode);
-
- if (fsc->mount_options->wsize < wsize)
- wsize = fsc->mount_options->wsize;
-
- return wsize;
-}
-
-static inline
-void ceph_folio_batch_init(struct ceph_writeback_ctl *ceph_wbc)
-{
- folio_batch_init(&ceph_wbc->fbatch);
- ceph_wbc->processed_in_fbatch = 0;
-}
-
-static inline
-void ceph_folio_batch_reinit(struct ceph_writeback_ctl *ceph_wbc)
-{
- folio_batch_release(&ceph_wbc->fbatch);
- ceph_folio_batch_init(ceph_wbc);
-}
-
-static inline
-void ceph_init_writeback_ctl(struct address_space *mapping,
- struct writeback_control *wbc,
- struct ceph_writeback_ctl *ceph_wbc)
-{
- ceph_wbc->snapc = NULL;
- ceph_wbc->last_snapc = NULL;
-
- ceph_wbc->strip_unit_end = 0;
- ceph_wbc->wsize = ceph_define_write_size(mapping);
-
- ceph_wbc->nr_folios = 0;
- ceph_wbc->max_pages = 0;
- ceph_wbc->locked_pages = 0;
-
- ceph_wbc->done = false;
- ceph_wbc->should_loop = false;
- ceph_wbc->range_whole = false;
-
- ceph_wbc->start_index = wbc->range_cyclic ? mapping->writeback_index : 0;
- ceph_wbc->index = ceph_wbc->start_index;
- ceph_wbc->end = -1;
-
- if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) {
- ceph_wbc->tag = PAGECACHE_TAG_TOWRITE;
- } else {
- ceph_wbc->tag = PAGECACHE_TAG_DIRTY;
- }
-
- ceph_wbc->op_idx = -1;
- ceph_wbc->num_ops = 0;
- ceph_wbc->offset = 0;
- ceph_wbc->len = 0;
- ceph_wbc->from_pool = false;
-
- ceph_folio_batch_init(ceph_wbc);
-
- ceph_wbc->pages = NULL;
- ceph_wbc->data_pages = NULL;
-}
-
-static inline
-int ceph_define_writeback_range(struct address_space *mapping,
- struct writeback_control *wbc,
- struct ceph_writeback_ctl *ceph_wbc)
-{
- struct inode *inode = mapping->host;
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- struct ceph_client *cl = fsc->client;
-
- /* find oldest snap context with dirty data */
- ceph_wbc->snapc = get_oldest_context(inode, ceph_wbc, NULL);
- if (!ceph_wbc->snapc) {
- /* hmm, why does writepages get called when there
- is no dirty data? */
- doutc(cl, " no snap context with dirty data?\n");
- return -ENODATA;
- }
-
- doutc(cl, " oldest snapc is %p seq %lld (%d snaps)\n",
- ceph_wbc->snapc, ceph_wbc->snapc->seq,
- ceph_wbc->snapc->num_snaps);
-
- ceph_wbc->should_loop = false;
-
- if (ceph_wbc->head_snapc && ceph_wbc->snapc != ceph_wbc->last_snapc) {
- /* where to start/end? */
- if (wbc->range_cyclic) {
- ceph_wbc->index = ceph_wbc->start_index;
- ceph_wbc->end = -1;
- if (ceph_wbc->index > 0)
- ceph_wbc->should_loop = true;
- doutc(cl, " cyclic, start at %lu\n", ceph_wbc->index);
- } else {
- ceph_wbc->index = wbc->range_start >> PAGE_SHIFT;
- ceph_wbc->end = wbc->range_end >> PAGE_SHIFT;
- if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
- ceph_wbc->range_whole = true;
- doutc(cl, " not cyclic, %lu to %lu\n",
- ceph_wbc->index, ceph_wbc->end);
- }
- } else if (!ceph_wbc->head_snapc) {
- /* Do not respect wbc->range_{start,end}. Dirty pages
- * in that range can be associated with a newer snapc.
- * They are not writeable until all dirty pages associated
- * with 'snapc' have been written. */
- if (ceph_wbc->index > 0)
- ceph_wbc->should_loop = true;
- doutc(cl, " non-head snapc, range whole\n");
- }
-
- ceph_put_snap_context(ceph_wbc->last_snapc);
- ceph_wbc->last_snapc = ceph_wbc->snapc;
-
- return 0;
-}
-
-static inline
-bool has_writeback_done(struct ceph_writeback_ctl *ceph_wbc)
-{
- return ceph_wbc->done && ceph_wbc->index > ceph_wbc->end;
-}
-
-static inline
-bool can_next_page_be_processed(struct ceph_writeback_ctl *ceph_wbc,
- unsigned index)
-{
- return index < ceph_wbc->nr_folios &&
- ceph_wbc->locked_pages < ceph_wbc->max_pages;
-}
-
-static
-int ceph_check_page_before_write(struct address_space *mapping,
- struct writeback_control *wbc,
- struct ceph_writeback_ctl *ceph_wbc,
- struct folio *folio)
-{
- struct inode *inode = mapping->host;
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- struct ceph_client *cl = fsc->client;
- struct ceph_snap_context *pgsnapc;
-
- /* only dirty folios, or our accounting breaks */
- if (unlikely(!folio_test_dirty(folio) || folio->mapping != mapping)) {
- doutc(cl, "!dirty or !mapping %p\n", folio);
- return -ENODATA;
- }
-
- /* only if matching snap context */
- pgsnapc = page_snap_context(&folio->page);
- if (pgsnapc != ceph_wbc->snapc) {
- doutc(cl, "folio snapc %p %lld != oldest %p %lld\n",
- pgsnapc, pgsnapc->seq,
- ceph_wbc->snapc, ceph_wbc->snapc->seq);
-
- if (!ceph_wbc->should_loop && !ceph_wbc->head_snapc &&
- wbc->sync_mode != WB_SYNC_NONE)
- ceph_wbc->should_loop = true;
-
- return -ENODATA;
- }
-
- if (folio_pos(folio) >= ceph_wbc->i_size) {
- doutc(cl, "folio at %lu beyond eof %llu\n",
- folio->index, ceph_wbc->i_size);
-
- if ((ceph_wbc->size_stable ||
- folio_pos(folio) >= i_size_read(inode)) &&
- folio_clear_dirty_for_io(folio))
- folio_invalidate(folio, 0, folio_size(folio));
-
- return -ENODATA;
- }
-
- if (ceph_wbc->strip_unit_end &&
- (folio->index > ceph_wbc->strip_unit_end)) {
- doutc(cl, "end of strip unit %p\n", folio);
- return -E2BIG;
- }
-
- return 0;
-}
-
-static inline
-void __ceph_allocate_page_array(struct ceph_writeback_ctl *ceph_wbc,
- unsigned int max_pages)
-{
- ceph_wbc->pages = kmalloc_array(max_pages,
- sizeof(*ceph_wbc->pages),
- GFP_NOFS);
- if (!ceph_wbc->pages) {
- ceph_wbc->from_pool = true;
- ceph_wbc->pages = mempool_alloc(ceph_wb_pagevec_pool, GFP_NOFS);
- BUG_ON(!ceph_wbc->pages);
- }
-}
-
-static inline
-void ceph_allocate_page_array(struct address_space *mapping,
- struct ceph_writeback_ctl *ceph_wbc,
- struct folio *folio)
-{
- struct inode *inode = mapping->host;
- struct ceph_inode_info *ci = ceph_inode(inode);
- size_t xlen;
- u64 objnum;
- u64 objoff;
-
- /* prepare async write request */
- ceph_wbc->offset = (u64)folio_pos(folio);
- ceph_calc_file_object_mapping(&ci->i_layout,
- ceph_wbc->offset, ceph_wbc->wsize,
- &objnum, &objoff, &xlen);
-
- ceph_wbc->num_ops = 1;
- ceph_wbc->strip_unit_end = folio->index + ((xlen - 1) >> PAGE_SHIFT);
-
- BUG_ON(ceph_wbc->pages);
- ceph_wbc->max_pages = calc_pages_for(0, (u64)xlen);
- __ceph_allocate_page_array(ceph_wbc, ceph_wbc->max_pages);
-
- ceph_wbc->len = 0;
-}
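
For reference, calc_pages_for() used above is the usual libceph helper that
counts the pages an extent spans. A minimal userspace sketch of that
arithmetic, assuming the definition in include/linux/ceph/libceph.h and
4KiB pages:

	#include <stdio.h>
	#include <stdint.h>

	#define PAGE_SHIFT	12
	#define PAGE_SIZE	(1UL << PAGE_SHIFT)

	/* Pages spanned by [off, off + len): last page index - first + 1. */
	static int calc_pages_for(uint64_t off, uint64_t len)
	{
		return ((off + len + PAGE_SIZE - 1) >> PAGE_SHIFT) -
			(off >> PAGE_SHIFT);
	}

	int main(void)
	{
		/* An extent three pages long starting mid-page straddles four. */
		printf("%d\n", calc_pages_for(2048, 3 * PAGE_SIZE)); /* 4 */
		return 0;
	}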
-
-static inline
-bool is_folio_index_contiguous(const struct ceph_writeback_ctl *ceph_wbc,
- const struct folio *folio)
-{
- return folio->index == (ceph_wbc->offset + ceph_wbc->len) >> PAGE_SHIFT;
-}
-
-static inline
-bool is_num_ops_too_big(struct ceph_writeback_ctl *ceph_wbc)
-{
- return ceph_wbc->num_ops >=
- (ceph_wbc->from_pool ? CEPH_OSD_SLAB_OPS : CEPH_OSD_MAX_OPS);
-}
-#endif // TODO: Remove after netfs conversion
-
-static inline
-bool is_write_congestion_happened(struct ceph_fs_client *fsc)
-{
- return atomic_long_inc_return(&fsc->writeback_count) >
- CONGESTION_ON_THRESH(fsc->mount_options->congestion_kb);
-}
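
For context, the writeback_count checks here and in write_folio_nounlock()
and writepages_finish() implement simple hysteresis: congestion is flagged
when the number of pages under writeback crosses the congestion_kb mount
option and cleared again at 75% of that. A small userspace sketch of the
threshold arithmetic, assuming the CONGESTION_*_THRESH() definitions at the
top of fs/ceph/addr.c and 4KiB pages:

	#include <stdio.h>

	#define PAGE_SHIFT	12

	/* congestion_kb is in KiB; convert to a page count. */
	#define CONGESTION_ON_THRESH(kb)	((kb) >> (PAGE_SHIFT - 10))
	/* Back off at 75% of the "on" threshold. */
	#define CONGESTION_OFF_THRESH(kb) \
		(CONGESTION_ON_THRESH(kb) - (CONGESTION_ON_THRESH(kb) >> 2))

	int main(void)
	{
		long kb = 32768;	/* e.g. a 32MiB congestion limit */

		printf("on=%ld pages, off=%ld pages\n",
		       CONGESTION_ON_THRESH(kb), CONGESTION_OFF_THRESH(kb));
		/* prints: on=8192 pages, off=6144 pages */
		return 0;
	}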
-
-#if 0 // TODO: Remove after netfs conversion
-static inline int move_dirty_folio_in_page_array(struct address_space *mapping,
- struct writeback_control *wbc,
- struct ceph_writeback_ctl *ceph_wbc, struct folio *folio)
-{
- struct inode *inode = mapping->host;
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- struct ceph_client *cl = fsc->client;
- struct page **pages = ceph_wbc->pages;
- unsigned int index = ceph_wbc->locked_pages;
- gfp_t gfp_flags = ceph_wbc->locked_pages ? GFP_NOWAIT : GFP_NOFS;
-
- if (IS_ENCRYPTED(inode)) {
- pages[index] = fscrypt_encrypt_pagecache_blocks(&folio->page,
- PAGE_SIZE,
- 0,
- gfp_flags);
- if (IS_ERR(pages[index])) {
- int err = PTR_ERR(pages[index]);
-
- if (err == -EINVAL) {
- pr_err_client(cl, "inode->i_blkbits=%hhu\n",
- inode->i_blkbits);
- }
-
- /* better not fail on first page! */
- BUG_ON(ceph_wbc->locked_pages == 0);
-
- /* take the error before clearing the slot */
- pages[index] = NULL;
- return err;
- }
- } else {
- pages[index] = &folio->page;
- }
-
- ceph_wbc->locked_pages++;
-
- return 0;
-}
-
-static
-int ceph_process_folio_batch(struct address_space *mapping,
- struct writeback_control *wbc,
- struct ceph_writeback_ctl *ceph_wbc)
-{
- struct inode *inode = mapping->host;
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- struct ceph_client *cl = fsc->client;
- struct folio *folio = NULL;
- unsigned i;
- int rc = 0;
-
- for (i = 0; can_next_page_be_processed(ceph_wbc, i); i++) {
- folio = ceph_wbc->fbatch.folios[i];
-
- if (!folio)
- continue;
-
- doutc(cl, "? %p idx %lu, folio_test_writeback %#x, "
- "folio_test_dirty %#x, folio_test_locked %#x\n",
- folio, folio->index, folio_test_writeback(folio),
- folio_test_dirty(folio),
- folio_test_locked(folio));
-
- if (folio_test_writeback(folio) ||
- folio_test_private_2(folio) /* [DEPRECATED] */) {
- doutc(cl, "waiting on writeback %p\n", folio);
- folio_wait_writeback(folio);
- folio_wait_private_2(folio); /* [DEPRECATED] */
- continue;
- }
-
- if (ceph_wbc->locked_pages == 0)
- folio_lock(folio);
- else if (!folio_trylock(folio))
- break;
-
- rc = ceph_check_page_before_write(mapping, wbc,
- ceph_wbc, folio);
- if (rc == -ENODATA) {
- rc = 0;
- folio_unlock(folio);
- ceph_wbc->fbatch.folios[i] = NULL;
- continue;
- } else if (rc == -E2BIG) {
- rc = 0;
- folio_unlock(folio);
- ceph_wbc->fbatch.folios[i] = NULL;
- break;
- }
-
- if (!folio_clear_dirty_for_io(folio)) {
- doutc(cl, "%p !folio_clear_dirty_for_io\n", folio);
- folio_unlock(folio);
- ceph_wbc->fbatch.folios[i] = NULL;
- continue;
- }
-
- /*
- * We have something to write. If this is
- * the first locked page this time through,
- * calculate max possible write size and
- * allocate a page array
- */
- if (ceph_wbc->locked_pages == 0) {
- ceph_allocate_page_array(mapping, ceph_wbc, folio);
- } else if (!is_folio_index_contiguous(ceph_wbc, folio)) {
- if (is_num_ops_too_big(ceph_wbc)) {
- folio_redirty_for_writepage(wbc, folio);
- folio_unlock(folio);
- break;
- }
-
- ceph_wbc->num_ops++;
- ceph_wbc->offset = (u64)folio_pos(folio);
- ceph_wbc->len = 0;
- }
-
- /* note position of first page in fbatch */
- doutc(cl, "%llx.%llx will write folio %p idx %lu\n",
- ceph_vinop(inode), folio, folio->index);
-
- fsc->write_congested = is_write_congestion_happened(fsc);
-
- rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc,
- folio);
- if (rc) {
- folio_redirty_for_writepage(wbc, folio);
- folio_unlock(folio);
- break;
- }
-
- ceph_wbc->fbatch.folios[i] = NULL;
- ceph_wbc->len += folio_size(folio);
- }
-
- ceph_wbc->processed_in_fbatch = i;
-
- return rc;
-}
-
-static inline
-void ceph_shift_unused_folios_left(struct folio_batch *fbatch)
-{
- unsigned j, n = 0;
-
- /* compact the used folios to the beginning of the fbatch */
- for (j = 0; j < folio_batch_count(fbatch); j++) {
- if (!fbatch->folios[j])
- continue;
-
- if (n < j) {
- fbatch->folios[n] = fbatch->folios[j];
- }
-
- n++;
- }
-
- fbatch->nr = n;
-}
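
The helper above is a stable compaction: surviving (non-NULL) folio
pointers keep their relative order while the holes left by processed
entries close up. The same idea in a self-contained userspace sketch:

	#include <stdio.h>
	#include <stddef.h>

	/* Stable in-place compaction of non-NULL entries; returns new count. */
	static unsigned compact(const char **slots, unsigned nr)
	{
		unsigned j, n = 0;

		for (j = 0; j < nr; j++) {
			if (!slots[j])
				continue;
			if (n < j)
				slots[n] = slots[j];
			n++;
		}
		return n;
	}

	int main(void)
	{
		const char *batch[] = { NULL, "B", NULL, "D", "E" };
		unsigned n = compact(batch, 5);

		for (unsigned i = 0; i < n; i++)
			printf("%s ", batch[i]);	/* prints: B D E */
		printf("\n");
		return 0;
	}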
-
-static
-int ceph_submit_write(struct address_space *mapping,
- struct writeback_control *wbc,
- struct ceph_writeback_ctl *ceph_wbc)
-{
- struct inode *inode = mapping->host;
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- struct ceph_client *cl = fsc->client;
- struct ceph_vino vino = ceph_vino(inode);
- struct ceph_osd_request *req = NULL;
- struct page *page = NULL;
- bool caching = ceph_is_cache_enabled(inode);
- u64 offset;
- u64 len;
- unsigned i;
-
-new_request:
- offset = ceph_fscrypt_page_offset(ceph_wbc->pages[0]);
- len = ceph_wbc->wsize;
-
- req = ceph_osdc_new_request(&fsc->client->osdc,
- &ci->i_layout, vino,
- offset, &len, 0, ceph_wbc->num_ops,
- CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE,
- ceph_wbc->snapc, ceph_wbc->truncate_seq,
- ceph_wbc->truncate_size, false);
- if (IS_ERR(req)) {
- req = ceph_osdc_new_request(&fsc->client->osdc,
- &ci->i_layout, vino,
- offset, &len, 0,
- min(ceph_wbc->num_ops,
- CEPH_OSD_SLAB_OPS),
- CEPH_OSD_OP_WRITE,
- CEPH_OSD_FLAG_WRITE,
- ceph_wbc->snapc,
- ceph_wbc->truncate_seq,
- ceph_wbc->truncate_size,
- true);
- BUG_ON(IS_ERR(req));
- }
-
- page = ceph_wbc->pages[ceph_wbc->locked_pages - 1];
- BUG_ON(len < ceph_fscrypt_page_offset(page) + thp_size(page) - offset);
-
- if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
- for (i = 0; i < folio_batch_count(&ceph_wbc->fbatch); i++) {
- struct folio *folio = ceph_wbc->fbatch.folios[i];
-
- if (!folio)
- continue;
-
- page = &folio->page;
- redirty_page_for_writepage(wbc, page);
- unlock_page(page);
- }
-
- for (i = 0; i < ceph_wbc->locked_pages; i++) {
- page = ceph_fscrypt_pagecache_page(ceph_wbc->pages[i]);
-
- if (!page)
- continue;
-
- redirty_page_for_writepage(wbc, page);
- unlock_page(page);
- }
-
- ceph_osdc_put_request(req);
- return -EIO;
- }
-
- req->r_callback = writepages_finish;
- req->r_inode = inode;
-
- /* Format the osd request message and submit the write */
- len = 0;
- ceph_wbc->data_pages = ceph_wbc->pages;
- ceph_wbc->op_idx = 0;
- for (i = 0; i < ceph_wbc->locked_pages; i++) {
- u64 cur_offset;
-
- page = ceph_fscrypt_pagecache_page(ceph_wbc->pages[i]);
- cur_offset = page_offset(page);
-
- /*
- * Discontinuity in page range? Ceph can handle that by just passing
- * multiple extents in the write op.
- */
- if (offset + len != cur_offset) {
- /* If it's full, stop here */
- if (ceph_wbc->op_idx + 1 == req->r_num_ops)
- break;
-
- /* Kick off an fscache write with what we have so far. */
- ceph_fscache_write_to_cache(inode, offset, len, caching);
-
- /* Start a new extent */
- osd_req_op_extent_dup_last(req, ceph_wbc->op_idx,
- cur_offset - offset);
-
- doutc(cl, "got pages at %llu~%llu\n", offset, len);
-
- osd_req_op_extent_osd_data_pages(req, ceph_wbc->op_idx,
- ceph_wbc->data_pages,
- len, 0,
- ceph_wbc->from_pool,
- false);
- osd_req_op_extent_update(req, ceph_wbc->op_idx, len);
-
- len = 0;
- offset = cur_offset;
- ceph_wbc->data_pages = ceph_wbc->pages + i;
- ceph_wbc->op_idx++;
- }
-
- set_page_writeback(page);
-
- if (caching)
- ceph_set_page_fscache(page);
-
- len += thp_size(page);
- }
-
- ceph_fscache_write_to_cache(inode, offset, len, caching);
-
- if (ceph_wbc->size_stable) {
- len = min(len, ceph_wbc->i_size - offset);
- } else if (i == ceph_wbc->locked_pages) {
- /* writepages_finish() clears writeback pages
- * according to the data length, so make sure
- * data length covers all locked pages */
- u64 min_len = len + 1 - thp_size(page);
- len = get_writepages_data_length(inode,
- ceph_wbc->pages[i - 1],
- offset);
- len = max(len, min_len);
- }
-
- if (IS_ENCRYPTED(inode))
- len = round_up(len, CEPH_FSCRYPT_BLOCK_SIZE);
-
- doutc(cl, "got pages at %llu~%llu\n", offset, len);
-
- if (IS_ENCRYPTED(inode) &&
- ((offset | len) & ~CEPH_FSCRYPT_BLOCK_MASK)) {
- pr_warn_client(cl,
- "bad encrypted write offset=%lld len=%llu\n",
- offset, len);
- }
-
- osd_req_op_extent_osd_data_pages(req, ceph_wbc->op_idx,
- ceph_wbc->data_pages, len,
- 0, ceph_wbc->from_pool, false);
- osd_req_op_extent_update(req, ceph_wbc->op_idx, len);
-
- BUG_ON(ceph_wbc->op_idx + 1 != req->r_num_ops);
-
- ceph_wbc->from_pool = false;
- if (i < ceph_wbc->locked_pages) {
- BUG_ON(ceph_wbc->num_ops <= req->r_num_ops);
- ceph_wbc->num_ops -= req->r_num_ops;
- ceph_wbc->locked_pages -= i;
-
- /* allocate new pages array for next request */
- ceph_wbc->data_pages = ceph_wbc->pages;
- __ceph_allocate_page_array(ceph_wbc, ceph_wbc->locked_pages);
- memcpy(ceph_wbc->pages, ceph_wbc->data_pages + i,
- ceph_wbc->locked_pages * sizeof(*ceph_wbc->pages));
- memset(ceph_wbc->data_pages + i, 0,
- ceph_wbc->locked_pages * sizeof(*ceph_wbc->pages));
- } else {
- BUG_ON(ceph_wbc->num_ops != req->r_num_ops);
- /* request message now owns the pages array */
- ceph_wbc->pages = NULL;
- }
-
- req->r_mtime = inode_get_mtime(inode);
- ceph_osdc_start_request(&fsc->client->osdc, req);
- req = NULL;
-
- wbc->nr_to_write -= i;
- if (ceph_wbc->pages)
- goto new_request;
-
- return 0;
-}
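
The submit loop above packs the locked pages into the write op, starting a
new extent whenever the page offsets stop being contiguous. A rough
userspace sketch of just that grouping logic, assuming fixed 4KiB pages
(the real code sizes extents with osd_req_op_extent_dup_last() and
osd_req_op_extent_update()):

	#include <stdio.h>
	#include <stdint.h>

	#define PAGE_SIZE	4096UL

	int main(void)
	{
		/* Byte offsets of the locked pages, in file order. */
		uint64_t pgs[] = { 0, 4096, 8192, 20480, 24576 };
		uint64_t off = pgs[0], len = 0;

		for (unsigned i = 0; i < sizeof(pgs) / sizeof(pgs[0]); i++) {
			if (off + len != pgs[i]) {
				/* Discontinuity: emit the extent built so far. */
				printf("extent %llu~%llu\n",
				       (unsigned long long)off,
				       (unsigned long long)len);
				off = pgs[i];
				len = 0;
			}
			len += PAGE_SIZE;
		}
		printf("extent %llu~%llu\n",
		       (unsigned long long)off, (unsigned long long)len);
		/* prints: extent 0~12288, then extent 20480~8192 */
		return 0;
	}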
-
-static
-void ceph_wait_until_current_writes_complete(struct address_space *mapping,
- struct writeback_control *wbc,
- struct ceph_writeback_ctl *ceph_wbc)
-{
- struct page *page;
- unsigned i, nr;
-
- if (wbc->sync_mode != WB_SYNC_NONE &&
- ceph_wbc->start_index == 0 && /* all dirty pages were checked */
- !ceph_wbc->head_snapc) {
- ceph_wbc->index = 0;
-
- while ((ceph_wbc->index <= ceph_wbc->end) &&
- (nr = filemap_get_folios_tag(mapping,
- &ceph_wbc->index,
- (pgoff_t)-1,
- PAGECACHE_TAG_WRITEBACK,
- &ceph_wbc->fbatch))) {
- for (i = 0; i < nr; i++) {
- page = &ceph_wbc->fbatch.folios[i]->page;
- if (page_snap_context(page) != ceph_wbc->snapc)
- continue;
- wait_on_page_writeback(page);
- }
-
- folio_batch_release(&ceph_wbc->fbatch);
- cond_resched();
- }
- }
-}
-
/*
- * initiate async writeback
+ * Dirty a page. Optimistically adjust accounting, on the assumption
+ * that we won't race with invalidate. If we do, readjust.
*/
-static int ceph_writepages_start(struct address_space *mapping,
- struct writeback_control *wbc)
+bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio)
{
struct inode *inode = mapping->host;
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- struct ceph_client *cl = fsc->client;
- struct ceph_writeback_ctl ceph_wbc;
- int rc = 0;
-
- if (wbc->sync_mode == WB_SYNC_NONE && fsc->write_congested)
- return 0;
-
- doutc(cl, "%llx.%llx (mode=%s)\n", ceph_vinop(inode),
- wbc->sync_mode == WB_SYNC_NONE ? "NONE" :
- (wbc->sync_mode == WB_SYNC_ALL ? "ALL" : "HOLD"));
-
- if (is_forced_umount(mapping)) {
- /* we're in a forced umount, don't write! */
- return -EIO;
- }
-
- ceph_init_writeback_ctl(mapping, wbc, &ceph_wbc);
-
- if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
- rc = -EIO;
- goto out;
- }
-
-retry:
- rc = ceph_define_writeback_range(mapping, wbc, &ceph_wbc);
- if (rc == -ENODATA) {
- /* hmm, why does writepages get called when there
- is no dirty data? */
- rc = 0;
- goto dec_osd_stopping_blocker;
- }
-
- if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
- tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);
-
- while (!has_writeback_done(&ceph_wbc)) {
- ceph_wbc.locked_pages = 0;
- ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT;
-
-get_more_pages:
- ceph_folio_batch_reinit(&ceph_wbc);
-
- ceph_wbc.nr_folios = filemap_get_folios_tag(mapping,
- &ceph_wbc.index,
- ceph_wbc.end,
- ceph_wbc.tag,
- &ceph_wbc.fbatch);
- doutc(cl, "pagevec_lookup_range_tag for tag %#x got %d\n",
- ceph_wbc.tag, ceph_wbc.nr_folios);
-
- if (!ceph_wbc.nr_folios && !ceph_wbc.locked_pages)
- break;
-
-process_folio_batch:
- rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
- if (rc)
- goto release_folios;
-
- /* did we get anything? */
- if (!ceph_wbc.locked_pages)
- goto release_folios;
-
- if (ceph_wbc.processed_in_fbatch) {
- ceph_shift_unused_folios_left(&ceph_wbc.fbatch);
-
- if (folio_batch_count(&ceph_wbc.fbatch) == 0 &&
- ceph_wbc.locked_pages < ceph_wbc.max_pages) {
- doutc(cl, "reached end fbatch, trying for more\n");
- goto get_more_pages;
- }
- }
-
- rc = ceph_submit_write(mapping, wbc, &ceph_wbc);
- if (rc)
- goto release_folios;
-
- ceph_wbc.locked_pages = 0;
- ceph_wbc.strip_unit_end = 0;
-
- if (folio_batch_count(&ceph_wbc.fbatch) > 0) {
- ceph_wbc.nr_folios =
- folio_batch_count(&ceph_wbc.fbatch);
- goto process_folio_batch;
- }
-
- /*
- * We stop writing back only if we are not doing
- * integrity sync. In case of integrity sync we have to
- * keep going until we have written all the pages
- * we tagged for writeback prior to entering this loop.
- */
- if (wbc->nr_to_write <= 0 && wbc->sync_mode == WB_SYNC_NONE)
- ceph_wbc.done = true;
-
-release_folios:
- doutc(cl, "folio_batch release on %d folios (%p)\n",
- (int)ceph_wbc.fbatch.nr,
- ceph_wbc.fbatch.nr ? ceph_wbc.fbatch.folios[0] : NULL);
- folio_batch_release(&ceph_wbc.fbatch);
- }
-
- if (ceph_wbc.should_loop && !ceph_wbc.done) {
- /* more to do; loop back to beginning of file */
- doutc(cl, "looping back to beginning of file\n");
- /* OK even when start_index == 0 */
- ceph_wbc.end = ceph_wbc.start_index - 1;
-
- /* to write dirty pages associated with next snapc,
- * we need to wait until current writes complete */
- ceph_wait_until_current_writes_complete(mapping, wbc, &ceph_wbc);
-
- ceph_wbc.start_index = 0;
- ceph_wbc.index = 0;
- goto retry;
- }
-
- if (wbc->range_cyclic || (ceph_wbc.range_whole && wbc->nr_to_write > 0))
- mapping->writeback_index = ceph_wbc.index;
-
-dec_osd_stopping_blocker:
- ceph_dec_osd_stopping_blocker(fsc->mdsc);
-
-out:
- ceph_put_snap_context(ceph_wbc.last_snapc);
- doutc(cl, "%llx.%llx dend - startone, rc = %d\n", ceph_vinop(inode),
- rc);
-
- return rc;
-}
-
-/*
- * See if a given @snapc is either writeable, or already written.
- */
-static int context_is_writeable_or_written(struct inode *inode,
- struct ceph_snap_context *snapc)
-{
- struct ceph_snap_context *oldest = get_oldest_context(inode, NULL, NULL);
- int ret = !oldest || snapc->seq <= oldest->seq;
-
- ceph_put_snap_context(oldest);
- return ret;
-}
-
-/**
- * ceph_find_incompatible - find an incompatible context and return it
- * @folio: folio being dirtied
- *
- * We are only allowed to write into/dirty a folio if the folio is
- * clean, or already dirty within the same snap context. Returns a
- * conflicting context if there is one, NULL if there isn't, or an
- * ERR_PTR-encoded negative error code on other errors.
- *
- * Must be called with folio lock held.
- */
-static struct ceph_snap_context *
-ceph_find_incompatible(struct folio *folio)
-{
- struct inode *inode = folio->mapping->host;
struct ceph_client *cl = ceph_inode_to_client(inode);
- struct ceph_inode_info *ci = ceph_inode(inode);
-
- if (ceph_inode_is_shutdown(inode)) {
- doutc(cl, " %llx.%llx folio %p is shutdown\n",
- ceph_vinop(inode), folio);
- return ERR_PTR(-ESTALE);
- }
-
- for (;;) {
- struct ceph_snap_context *snapc, *oldest;
-
- folio_wait_writeback(folio);
-
- snapc = page_snap_context(&folio->page);
- if (!snapc || snapc == ci->i_head_snapc)
- break;
-
- /*
- * this folio is already dirty in another (older) snap
- * context! is it writeable now?
- */
- oldest = get_oldest_context(inode, NULL, NULL);
- if (snapc->seq > oldest->seq) {
- /* not writeable -- return it for the caller to deal with */
- ceph_put_snap_context(oldest);
- doutc(cl, " %llx.%llx folio %p snapc %p not current or oldest\n",
- ceph_vinop(inode), folio, snapc);
- return ceph_get_snap_context(snapc);
- }
- ceph_put_snap_context(oldest);
-
- /* yay, writeable, do it now (without dropping folio lock) */
- doutc(cl, " %llx.%llx folio %p snapc %p not current, but oldest\n",
- ceph_vinop(inode), folio, snapc);
- if (folio_clear_dirty_for_io(folio)) {
- int r = write_folio_nounlock(folio, NULL);
- if (r < 0)
- return ERR_PTR(r);
- }
- }
- return NULL;
-}
-
-static int ceph_netfs_check_write_begin(struct file *file, loff_t pos, unsigned int len,
- struct folio **foliop, void **_fsdata)
-{
- struct inode *inode = file_inode(file);
- struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
+ struct ceph_inode_info *ci;
struct ceph_snap_context *snapc;
+ struct netfs_group *group;
- snapc = ceph_find_incompatible(*foliop);
- if (snapc) {
- int r;
-
- folio_unlock(*foliop);
- folio_put(*foliop);
- *foliop = NULL;
- if (IS_ERR(snapc))
- return PTR_ERR(snapc);
-
- ceph_queue_writeback(inode);
- r = wait_event_killable(ci->i_cap_wq,
- context_is_writeable_or_written(inode, snapc));
- ceph_put_snap_context(snapc);
- return r == 0 ? -EAGAIN : r;
+ if (folio_test_dirty(folio)) {
+ doutc(cl, "%llx.%llx %p idx %lu -- already dirty\n",
+ ceph_vinop(inode), folio, folio->index);
+ VM_BUG_ON_FOLIO(!folio_test_private(folio), folio);
+ return false;
}
- return 0;
-}
-
-/*
- * We are only allowed to write into/dirty the page if the page is
- * clean, or already dirty within the same snap context.
- */
-static int ceph_write_begin(struct file *file, struct address_space *mapping,
- loff_t pos, unsigned len,
- struct folio **foliop, void **fsdata)
-{
- struct inode *inode = file_inode(file);
- struct ceph_inode_info *ci = ceph_inode(inode);
- int r;
-
- r = netfs_write_begin(&ci->netfs, file, inode->i_mapping, pos, len, foliop, NULL);
- if (r < 0)
- return r;
- folio_wait_private_2(*foliop); /* [DEPRECATED] */
- WARN_ON_ONCE(!folio_test_locked(*foliop));
- return 0;
-}
+ atomic64_inc(&mdsc->dirty_folios);
-/*
- * we don't do anything in here that simple_write_end doesn't do
- * except adjust dirty page accounting
- */
-static int ceph_write_end(struct file *file, struct address_space *mapping,
- loff_t pos, unsigned len, unsigned copied,
- struct folio *folio, void *fsdata)
-{
- struct inode *inode = file_inode(file);
- struct ceph_client *cl = ceph_inode_to_client(inode);
- bool check_cap = false;
+ ci = ceph_inode(inode);
- doutc(cl, "%llx.%llx file %p folio %p %d~%d (%d)\n", ceph_vinop(inode),
- file, folio, (int)pos, (int)copied, (int)len);
+ /* dirty the head */
+ spin_lock(&ci->i_ceph_lock);
+ if (__ceph_have_pending_cap_snap(ci)) {
+ struct ceph_cap_snap *capsnap =
+ list_last_entry(&ci->i_cap_snaps,
+ struct ceph_cap_snap,
+ ci_item);
+ snapc = capsnap->context;
+ capsnap->dirty_pages++;
+ } else {
+ snapc = ci->i_head_snapc;
+ BUG_ON(!snapc);
+ ++ci->i_wrbuffer_ref_head;
+ }
- if (!folio_test_uptodate(folio)) {
- /* just return that nothing was copied on a short copy */
- if (copied < len) {
- copied = 0;
- goto out;
+ /* Attach a reference to the snap/group to the folio. */
+ group = netfs_folio_group(folio);
+ if (group != &snapc->group) {
+ netfs_set_group(folio, &snapc->group);
+ if (group) {
+ doutc(cl, "Different group %px != %px\n",
+ group, &snapc->group);
+ netfs_put_group(group);
}
- folio_mark_uptodate(folio);
}
- /* did file size increase? */
- if (pos+copied > i_size_read(inode))
- check_cap = ceph_inode_set_size(inode, pos+copied);
-
- folio_mark_dirty(folio);
-
-out:
- folio_unlock(folio);
- folio_put(folio);
-
- if (check_cap)
- ceph_check_caps(ceph_inode(inode), CHECK_CAPS_AUTHONLY);
+ if (ci->i_wrbuffer_ref == 0)
+ ihold(inode);
+ ++ci->i_wrbuffer_ref;
+ doutc(cl, "%llx.%llx %p idx %lu head %d/%d -> %d/%d "
+ "snapc %p seq %lld (%d snaps)\n",
+ ceph_vinop(inode), folio, folio->index,
+ ci->i_wrbuffer_ref-1, ci->i_wrbuffer_ref_head-1,
+ ci->i_wrbuffer_ref, ci->i_wrbuffer_ref_head,
+ snapc, snapc->seq, snapc->num_snaps);
+ spin_unlock(&ci->i_ceph_lock);
- return copied;
+ return netfs_dirty_folio(mapping, folio);
}
-const struct address_space_operations ceph_aops = {
- .read_folio = netfs_read_folio,
- .readahead = netfs_readahead,
- .writepages = ceph_writepages_start,
- .write_begin = ceph_write_begin,
- .write_end = ceph_write_end,
- .dirty_folio = ceph_dirty_folio,
- .invalidate_folio = ceph_invalidate_folio,
- .release_folio = netfs_release_folio,
- .direct_IO = noop_direct_IO,
- .migrate_folio = filemap_migrate_folio,
-};
-#endif // TODO: Remove after netfs conversion
-
static void ceph_block_sigs(sigset_t *oldset)
{
sigset_t mask;
@@ -2046,112 +226,6 @@ static vm_fault_t ceph_filemap_fault(struct vm_fault *vmf)
return ret;
}
-#if 0 // TODO: Remove after netfs conversion
-static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
-{
- struct vm_area_struct *vma = vmf->vma;
- struct inode *inode = file_inode(vma->vm_file);
- struct ceph_client *cl = ceph_inode_to_client(inode);
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct ceph_file_info *fi = vma->vm_file->private_data;
- struct ceph_cap_flush *prealloc_cf;
- struct folio *folio = page_folio(vmf->page);
- loff_t off = folio_pos(folio);
- loff_t size = i_size_read(inode);
- size_t len;
- int want, got, err;
- sigset_t oldset;
- vm_fault_t ret = VM_FAULT_SIGBUS;
-
- if (ceph_inode_is_shutdown(inode))
- return ret;
-
- prealloc_cf = ceph_alloc_cap_flush();
- if (!prealloc_cf)
- return VM_FAULT_OOM;
-
- sb_start_pagefault(inode->i_sb);
- ceph_block_sigs(&oldset);
-
- if (off + folio_size(folio) <= size)
- len = folio_size(folio);
- else
- len = offset_in_folio(folio, size);
-
- doutc(cl, "%llx.%llx %llu~%zd getting caps i_size %llu\n",
- ceph_vinop(inode), off, len, size);
- if (fi->fmode & CEPH_FILE_MODE_LAZY)
- want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
- else
- want = CEPH_CAP_FILE_BUFFER;
-
- got = 0;
- err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, off + len, &got);
- if (err < 0)
- goto out_free;
-
- doutc(cl, "%llx.%llx %llu~%zd got cap refs on %s\n", ceph_vinop(inode),
- off, len, ceph_cap_string(got));
-
- /* Update time before taking folio lock */
- file_update_time(vma->vm_file);
- inode_inc_iversion_raw(inode);
-
- do {
- struct ceph_snap_context *snapc;
-
- folio_lock(folio);
-
- if (folio_mkwrite_check_truncate(folio, inode) < 0) {
- folio_unlock(folio);
- ret = VM_FAULT_NOPAGE;
- break;
- }
-
- snapc = ceph_find_incompatible(folio);
- if (!snapc) {
- /* success. we'll keep the folio locked. */
- folio_mark_dirty(folio);
- ret = VM_FAULT_LOCKED;
- break;
- }
-
- folio_unlock(folio);
-
- if (IS_ERR(snapc)) {
- ret = VM_FAULT_SIGBUS;
- break;
- }
-
- ceph_queue_writeback(inode);
- err = wait_event_killable(ci->i_cap_wq,
- context_is_writeable_or_written(inode, snapc));
- ceph_put_snap_context(snapc);
- } while (err == 0);
-
- if (ret == VM_FAULT_LOCKED) {
- int dirty;
- spin_lock(&ci->i_ceph_lock);
- dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR,
- &prealloc_cf);
- spin_unlock(&ci->i_ceph_lock);
- if (dirty)
- __mark_inode_dirty(inode, dirty);
- }
-
- doutc(cl, "%llx.%llx %llu~%zd dropping cap refs on %s ret %x\n",
- ceph_vinop(inode), off, len, ceph_cap_string(got), ret);
- ceph_put_cap_refs_async(ci, got);
-out_free:
- ceph_restore_sigs(&oldset);
- sb_end_pagefault(inode->i_sb);
- ceph_free_cap_flush(prealloc_cf);
- if (err < 0)
- ret = vmf_error(err);
- return ret;
-}
-#endif // TODO: Remove after netfs conversion
-
void ceph_fill_inline_data(struct inode *inode, struct page *locked_page,
char *data, size_t len)
{
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 94b91b5bc843..d7684f4b2e10 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -77,97 +77,6 @@ static __le32 ceph_flags_sys2wire(struct ceph_mds_client *mdsc, u32 flags)
* need to wait for MDS acknowledgement.
*/
-#if 0 // TODO: Remove after netfs conversion
-/*
- * How many pages to get in one call to iov_iter_get_pages(). This
- * determines the size of the on-stack array used as a buffer.
- */
-#define ITER_GET_BVECS_PAGES 64
-
-static int __iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
- struct ceph_databuf *dbuf)
-{
- size_t size = 0;
-
- if (maxsize > iov_iter_count(iter))
- maxsize = iov_iter_count(iter);
-
- while (size < maxsize) {
- struct page *pages[ITER_GET_BVECS_PAGES];
- ssize_t bytes;
- size_t start;
- int idx = 0;
-
- bytes = iov_iter_get_pages2(iter, pages, maxsize - size,
- ITER_GET_BVECS_PAGES, &start);
- if (bytes < 0) {
- if (size == 0)
- return bytes;
- break;
- }
-
- while (bytes) {
- int len = min_t(int, bytes, PAGE_SIZE - start);
-
- ceph_databuf_append_page(dbuf, pages[idx++], start, len);
- bytes -= len;
- size += len;
- start = 0;
- }
- }
-
- return 0;
-}
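
Each batch of pinned pages comes back with a single starting offset, so the
byte count has to be chopped into per-page segments before being appended
to the databuf. A standalone sketch of that inner loop, with printf standing
in for ceph_databuf_append_page():

	#include <stdio.h>

	#define PAGE_SIZE	4096UL

	int main(void)
	{
		/* e.g. iov_iter_get_pages2() pinned 10000 bytes at offset 300 */
		unsigned long bytes = 10000, start = 300;
		int idx = 0;

		while (bytes) {
			unsigned long len = bytes < PAGE_SIZE - start ?
					    bytes : PAGE_SIZE - start;

			/* would be ceph_databuf_append_page(dbuf, pages[idx],
			 * start, len) in the code above */
			printf("page %d: offset %lu, len %lu\n",
			       idx++, start, len);
			bytes -= len;
			start = 0;
		}
		/* prints 3796-, 4096- and 2108-byte segments on three pages */
		return 0;
	}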
-
-/*
- * iov_iter_get_pages() only considers one iov_iter segment, no matter
- * what maxsize or maxpages are given. For ITER_BVEC that is a single
- * page.
- *
- * Attempt to get up to @maxsize bytes worth of pages from @iter.
- * Return a ceph_databuf describing the pinned pages, or an ERR_PTR on error.
- */
-static struct ceph_databuf *iter_get_bvecs_alloc(struct iov_iter *iter,
- size_t maxsize, bool write)
-{
- struct ceph_databuf *dbuf;
- size_t orig_count = iov_iter_count(iter);
- int npages, ret;
-
- iov_iter_truncate(iter, maxsize);
- npages = iov_iter_npages(iter, INT_MAX);
- iov_iter_reexpand(iter, orig_count);
-
- if (write)
- dbuf = ceph_databuf_req_alloc(npages, 0, GFP_KERNEL);
- else
- dbuf = ceph_databuf_reply_alloc(npages, 0, GFP_KERNEL);
- if (!dbuf)
- return ERR_PTR(-ENOMEM);
-
- ret = __iter_get_bvecs(iter, maxsize, dbuf);
- if (ret < 0) {
- /*
- * No pages were pinned -- just free the array.
- */
- ceph_databuf_release(dbuf);
- return ERR_PTR(ret);
- }
-
- return dbuf;
-}
-
-static void ceph_dirty_pages(struct ceph_databuf *dbuf)
-{
- struct bio_vec *bvec = dbuf->bvec;
- int i;
-
- for (i = 0; i < dbuf->nr_bvec; i++)
- if (bvec[i].bv_page)
- set_page_dirty_lock(bvec[i].bv_page);
-}
-#endif // TODO: Remove after netfs conversion
-
/*
* Prepare an open request. Preallocate ceph_cap to avoid an
* inopportune ENOMEM later.
@@ -1023,1222 +932,6 @@ int ceph_release(struct inode *inode, struct file *file)
return 0;
}
-#if 0 // TODO: Remove after netfs conversion
-enum {
- HAVE_RETRIED = 1,
- CHECK_EOF = 2,
- READ_INLINE = 3,
-};
-
-/*
- * Completely synchronous read and write methods. Direct from __user
- * buffer to osd, or directly to user pages (if O_DIRECT).
- *
- * If the read spans object boundary, just do multiple reads. (That's not
- * atomic, but good enough for now.)
- *
- * If we get a short result from the OSD, check against i_size; we need to
- * only return a short read to the caller if we hit EOF.
- */
-ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
- struct iov_iter *to, int *retry_op,
- u64 *last_objver)
-{
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- struct ceph_client *cl = fsc->client;
- struct ceph_osd_client *osdc = &fsc->client->osdc;
- ssize_t ret;
- u64 off = *ki_pos;
- u64 len = iov_iter_count(to);
- u64 i_size = i_size_read(inode);
- bool sparse = IS_ENCRYPTED(inode) || ceph_test_mount_opt(fsc, SPARSEREAD);
- u64 objver = 0;
-
- doutc(cl, "on inode %p %llx.%llx %llx~%llx\n", inode,
- ceph_vinop(inode), *ki_pos, len);
-
- if (ceph_inode_is_shutdown(inode))
- return -EIO;
-
- if (!len || !i_size)
- return 0;
- /*
- * flush any page cache pages in this range. this
- * will make concurrent normal and sync io slow,
- * but it will at least behave sensibly when they are
- * in sequence.
- */
- ret = filemap_write_and_wait_range(inode->i_mapping,
- off, off + len - 1);
- if (ret < 0)
- return ret;
-
- ret = 0;
- while ((len = iov_iter_count(to)) > 0) {
- struct ceph_osd_request *req;
- struct page **pages;
- int num_pages;
- size_t page_off;
- bool more;
- int idx = 0;
- size_t left;
- struct ceph_osd_req_op *op;
- u64 read_off = off;
- u64 read_len = len;
- int extent_cnt;
-
- /* determine new offset/length if encrypted */
- ceph_fscrypt_adjust_off_and_len(inode, &read_off, &read_len);
-
- doutc(cl, "orig %llu~%llu reading %llu~%llu", off, len,
- read_off, read_len);
-
- req = ceph_osdc_new_request(osdc, &ci->i_layout,
- ci->i_vino, read_off, &read_len, 0, 1,
- sparse ? CEPH_OSD_OP_SPARSE_READ :
- CEPH_OSD_OP_READ,
- CEPH_OSD_FLAG_READ,
- NULL, ci->i_truncate_seq,
- ci->i_truncate_size, false);
- if (IS_ERR(req)) {
- ret = PTR_ERR(req);
- break;
- }
-
- /* adjust len downward if the request truncated the len */
- if (off + len > read_off + read_len)
- len = read_off + read_len - off;
- more = len < iov_iter_count(to);
-
- op = &req->r_ops[0];
- if (sparse) {
- extent_cnt = __ceph_sparse_read_ext_count(inode, read_len);
- ret = ceph_alloc_sparse_ext_map(op, extent_cnt);
- if (ret) {
- ceph_osdc_put_request(req);
- break;
- }
- }
-
- num_pages = calc_pages_for(read_off, read_len);
- page_off = offset_in_page(off);
- pages = ceph_alloc_page_vector(num_pages, GFP_KERNEL);
- if (IS_ERR(pages)) {
- ceph_osdc_put_request(req);
- ret = PTR_ERR(pages);
- break;
- }
-
- osd_req_op_extent_osd_data_pages(req, 0, pages, read_len,
- offset_in_page(read_off),
- false, true);
-
- ceph_osdc_start_request(osdc, req);
- ret = ceph_osdc_wait_request(osdc, req);
-
- ceph_update_read_metrics(&fsc->mdsc->metric,
- req->r_start_latency,
- req->r_end_latency,
- read_len, ret);
-
- if (ret > 0)
- objver = req->r_version;
-
- i_size = i_size_read(inode);
- doutc(cl, "%llu~%llu got %zd i_size %llu%s\n", off, len,
- ret, i_size, (more ? " MORE" : ""));
-
- /* Fix it to go to end of extent map */
- if (sparse && ret >= 0)
- ret = ceph_sparse_ext_map_end(op);
- else if (ret == -ENOENT)
- ret = 0;
-
- if (ret < 0) {
- ceph_osdc_put_request(req);
- if (ret == -EBLOCKLISTED)
- fsc->blocklisted = true;
- break;
- }
-
- if (IS_ENCRYPTED(inode)) {
- int fret;
-
- fret = ceph_fscrypt_decrypt_extents(inode, pages,
- read_off, op->extent.sparse_ext,
- op->extent.sparse_ext_cnt);
- if (fret < 0) {
- ret = fret;
- ceph_osdc_put_request(req);
- break;
- }
-
- /* account for any partial block at the beginning */
- fret -= (off - read_off);
-
- /*
- * Short read after big offset adjustment?
- * Nothing is usable, just call it a zero
- * len read.
- */
- fret = max(fret, 0);
-
- /* account for partial block at the end */
- ret = min_t(ssize_t, fret, len);
- }
-
- /* Short read but not EOF? Zero out the remainder. */
- if (ret < len && (off + ret < i_size)) {
- int zlen = min(len - ret, i_size - off - ret);
- int zoff = page_off + ret;
-
- doutc(cl, "zero gap %llu~%llu\n", off + ret,
- off + ret + zlen);
- ceph_zero_page_vector_range(zoff, zlen, pages);
- ret += zlen;
- }
-
- if (off + ret > i_size)
- left = (i_size > off) ? i_size - off : 0;
- else
- left = ret;
-
- while (left > 0) {
- size_t plen, copied;
-
- plen = min_t(size_t, left, PAGE_SIZE - page_off);
- SetPageUptodate(pages[idx]);
- copied = copy_page_to_iter(pages[idx++],
- page_off, plen, to);
- off += copied;
- left -= copied;
- page_off = 0;
- if (copied < plen) {
- ret = -EFAULT;
- break;
- }
- }
-
- ceph_osdc_put_request(req);
-
- if (off >= i_size || !more)
- break;
- }
-
- if (ret > 0) {
- if (off >= i_size) {
- *retry_op = CHECK_EOF;
- ret = i_size - *ki_pos;
- *ki_pos = i_size;
- } else {
- ret = off - *ki_pos;
- *ki_pos = off;
- }
-
- if (last_objver)
- *last_objver = objver;
- }
- doutc(cl, "result %zd retry_op %d\n", ret, *retry_op);
- return ret;
-}
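
The short-read handling above zeroes the gap between what the OSD returned
and the smaller of the requested length and EOF. A hypothetical standalone
check of that zlen computation (not ceph code, just the arithmetic):

	#include <stdio.h>
	#include <stdint.h>

	static uint64_t min_u64(uint64_t a, uint64_t b)
	{
		return a < b ? a : b;
	}

	int main(void)
	{
		uint64_t off = 0, len = 8192;	/* requested range */
		uint64_t ret = 1000;		/* bytes the OSD returned */
		uint64_t i_size = 5000;		/* current EOF */

		if (ret < len && off + ret < i_size) {
			/* zero up to the request length, never past EOF */
			uint64_t zlen = min_u64(len - ret, i_size - off - ret);

			printf("zero %llu bytes at %llu\n",
			       (unsigned long long)zlen,
			       (unsigned long long)(off + ret));
			/* prints: zero 4000 bytes at 1000 */
		}
		return 0;
	}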
-
-static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
- int *retry_op)
-{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file_inode(file);
- struct ceph_client *cl = ceph_inode_to_client(inode);
-
- doutc(cl, "on file %p %llx~%zx %s\n", file, iocb->ki_pos,
- iov_iter_count(to),
- (file->f_flags & O_DIRECT) ? "O_DIRECT" : "");
-
- return __ceph_sync_read(inode, &iocb->ki_pos, to, retry_op, NULL);
-}
-
-struct ceph_aio_request {
- struct kiocb *iocb;
- size_t total_len;
- bool write;
- bool should_dirty;
- int error;
- struct list_head osd_reqs;
- unsigned num_reqs;
- atomic_t pending_reqs;
- struct timespec64 mtime;
- struct ceph_cap_flush *prealloc_cf;
-};
-
-struct ceph_aio_work {
- struct work_struct work;
- struct ceph_osd_request *req;
-};
-
-static void ceph_aio_retry_work(struct work_struct *work);
-
-static void ceph_aio_complete(struct inode *inode,
- struct ceph_aio_request *aio_req)
-{
- struct ceph_client *cl = ceph_inode_to_client(inode);
- struct ceph_inode_info *ci = ceph_inode(inode);
- int ret;
-
- if (!atomic_dec_and_test(&aio_req->pending_reqs))
- return;
-
- if (aio_req->iocb->ki_flags & IOCB_DIRECT)
- inode_dio_end(inode);
-
- ret = aio_req->error;
- if (!ret)
- ret = aio_req->total_len;
-
- doutc(cl, "%p %llx.%llx rc %d\n", inode, ceph_vinop(inode), ret);
-
- if (ret >= 0 && aio_req->write) {
- int dirty;
-
- loff_t endoff = aio_req->iocb->ki_pos + aio_req->total_len;
- if (endoff > i_size_read(inode)) {
- if (ceph_inode_set_size(inode, endoff))
- ceph_check_caps(ci, CHECK_CAPS_AUTHONLY);
- }
-
- spin_lock(&ci->i_ceph_lock);
- dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR,
- &aio_req->prealloc_cf);
- spin_unlock(&ci->i_ceph_lock);
- if (dirty)
- __mark_inode_dirty(inode, dirty);
-
- }
-
- ceph_put_cap_refs(ci, (aio_req->write ? CEPH_CAP_FILE_WR :
- CEPH_CAP_FILE_RD));
-
- aio_req->iocb->ki_complete(aio_req->iocb, ret);
-
- ceph_free_cap_flush(aio_req->prealloc_cf);
- kfree(aio_req);
-}
-
-static void ceph_aio_complete_req(struct ceph_osd_request *req)
-{
- int rc = req->r_result;
- struct inode *inode = req->r_inode;
- struct ceph_aio_request *aio_req = req->r_priv;
- struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0);
- struct ceph_osd_req_op *op = &req->r_ops[0];
- struct ceph_client_metric *metric = &ceph_sb_to_mdsc(inode->i_sb)->metric;
- size_t len = osd_data->iter.count;
- bool sparse = (op->op == CEPH_OSD_OP_SPARSE_READ);
- struct ceph_client *cl = ceph_inode_to_client(inode);
-
- doutc(cl, "req %p inode %p %llx.%llx, rc %d bytes %zu\n", req,
- inode, ceph_vinop(inode), rc, len);
-
- if (rc == -EOLDSNAPC) {
- struct ceph_aio_work *aio_work;
- BUG_ON(!aio_req->write);
-
- aio_work = kmalloc(sizeof(*aio_work), GFP_NOFS);
- if (aio_work) {
- INIT_WORK(&aio_work->work, ceph_aio_retry_work);
- aio_work->req = req;
- queue_work(ceph_inode_to_fs_client(inode)->inode_wq,
- &aio_work->work);
- return;
- }
- rc = -ENOMEM;
- } else if (!aio_req->write) {
- if (sparse && rc >= 0)
- rc = ceph_sparse_ext_map_end(op);
- if (rc == -ENOENT)
- rc = 0;
- if (rc >= 0 && len > rc) {
- int zlen = len - rc;
-
- /*
- * If read is satisfied by single OSD request,
- * it can pass EOF. Otherwise read is within
- * i_size.
- */
- if (aio_req->num_reqs == 1) {
- loff_t i_size = i_size_read(inode);
- loff_t endoff = aio_req->iocb->ki_pos + rc;
- if (endoff < i_size)
- zlen = min_t(size_t, zlen,
- i_size - endoff);
- aio_req->total_len = rc + zlen;
- }
-
- iov_iter_advance(&osd_data->iter, rc);
- iov_iter_zero(zlen, &osd_data->iter);
- }
- }
-
- /* r_start_latency == 0 means the request was not submitted */
- if (req->r_start_latency) {
- if (aio_req->write)
- ceph_update_write_metrics(metric, req->r_start_latency,
- req->r_end_latency, len, rc);
- else
- ceph_update_read_metrics(metric, req->r_start_latency,
- req->r_end_latency, len, rc);
- }
-
- if (aio_req->should_dirty)
- ceph_dirty_pages(osd_data->dbuf);
- ceph_osdc_put_request(req);
-
- if (rc < 0)
- cmpxchg(&aio_req->error, 0, rc);
-
- ceph_aio_complete(inode, aio_req);
- return;
-}
-
-static void ceph_aio_retry_work(struct work_struct *work)
-{
- struct ceph_aio_work *aio_work =
- container_of(work, struct ceph_aio_work, work);
- struct ceph_osd_request *orig_req = aio_work->req;
- struct ceph_aio_request *aio_req = orig_req->r_priv;
- struct inode *inode = orig_req->r_inode;
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct ceph_snap_context *snapc;
- struct ceph_osd_request *req;
- int ret;
-
- spin_lock(&ci->i_ceph_lock);
- if (__ceph_have_pending_cap_snap(ci)) {
- struct ceph_cap_snap *capsnap =
- list_last_entry(&ci->i_cap_snaps,
- struct ceph_cap_snap,
- ci_item);
- snapc = ceph_get_snap_context(capsnap->context);
- } else {
- BUG_ON(!ci->i_head_snapc);
- snapc = ceph_get_snap_context(ci->i_head_snapc);
- }
- spin_unlock(&ci->i_ceph_lock);
-
- req = ceph_osdc_alloc_request(orig_req->r_osdc, snapc, 1,
- false, GFP_NOFS);
- if (!req) {
- ret = -ENOMEM;
- req = orig_req;
- goto out;
- }
-
- req->r_flags = /* CEPH_OSD_FLAG_ORDERSNAP | */ CEPH_OSD_FLAG_WRITE;
- ceph_oloc_copy(&req->r_base_oloc, &orig_req->r_base_oloc);
- ceph_oid_copy(&req->r_base_oid, &orig_req->r_base_oid);
-
- req->r_ops[0] = orig_req->r_ops[0];
-
- req->r_mtime = aio_req->mtime;
- req->r_data_offset = req->r_ops[0].extent.offset;
-
- ret = ceph_osdc_alloc_messages(req, GFP_NOFS);
- if (ret) {
- ceph_osdc_put_request(req);
- req = orig_req;
- goto out;
- }
-
- ceph_osdc_put_request(orig_req);
-
- req->r_callback = ceph_aio_complete_req;
- req->r_inode = inode;
- req->r_priv = aio_req;
-
- ceph_osdc_start_request(req->r_osdc, req);
-out:
- if (ret < 0) {
- req->r_result = ret;
- ceph_aio_complete_req(req);
- }
-
- ceph_put_snap_context(snapc);
- kfree(aio_work);
-}
-
-static ssize_t
-ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
- struct ceph_snap_context *snapc,
- struct ceph_cap_flush **pcf)
-{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file_inode(file);
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- struct ceph_client *cl = fsc->client;
- struct ceph_client_metric *metric = &fsc->mdsc->metric;
- struct ceph_vino vino;
- struct ceph_osd_request *req;
- struct ceph_aio_request *aio_req = NULL;
- struct ceph_databuf *dbuf = NULL;
- int flags;
- int ret = 0;
- struct timespec64 mtime = current_time(inode);
- size_t count = iov_iter_count(iter);
- loff_t pos = iocb->ki_pos;
- bool write = iov_iter_rw(iter) == WRITE;
- bool should_dirty = !write && user_backed_iter(iter);
- bool sparse = ceph_test_mount_opt(fsc, SPARSEREAD);
-
- if (write && ceph_snap(file_inode(file)) != CEPH_NOSNAP)
- return -EROFS;
-
- doutc(cl, "sync_direct_%s on file %p %lld~%u snapc %p seq %lld\n",
- (write ? "write" : "read"), file, pos, (unsigned)count,
- snapc, snapc ? snapc->seq : 0);
-
- if (write) {
- int ret2;
-
- ceph_fscache_invalidate(inode, true);
-
- ret2 = invalidate_inode_pages2_range(inode->i_mapping,
- pos >> PAGE_SHIFT,
- (pos + count - 1) >> PAGE_SHIFT);
- if (ret2 < 0)
- doutc(cl, "invalidate_inode_pages2_range returned %d\n",
- ret2);
-
- flags = /* CEPH_OSD_FLAG_ORDERSNAP | */ CEPH_OSD_FLAG_WRITE;
- } else {
- flags = CEPH_OSD_FLAG_READ;
- }
-
- while (iov_iter_count(iter) > 0) {
- u64 size = iov_iter_count(iter);
- struct ceph_osd_req_op *op;
- size_t len;
- int readop = sparse ? CEPH_OSD_OP_SPARSE_READ : CEPH_OSD_OP_READ;
- int extent_cnt;
-
- if (write)
- size = min_t(u64, size, fsc->mount_options->wsize);
- else
- size = min_t(u64, size, fsc->mount_options->rsize);
-
- vino = ceph_vino(inode);
- req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
- vino, pos, &size, 0,
- 1,
- write ? CEPH_OSD_OP_WRITE : readop,
- flags, snapc,
- ci->i_truncate_seq,
- ci->i_truncate_size,
- false);
- if (IS_ERR(req)) {
- ret = PTR_ERR(req);
- break;
- }
-
- op = &req->r_ops[0];
- if (!write && sparse) {
- extent_cnt = __ceph_sparse_read_ext_count(inode, size);
- ret = ceph_alloc_sparse_ext_map(op, extent_cnt);
- if (ret) {
- ceph_osdc_put_request(req);
- break;
- }
- }
-
- dbuf = iter_get_bvecs_alloc(iter, size, write);
- if (IS_ERR(dbuf)) {
- ceph_osdc_put_request(req);
- ret = PTR_ERR(dbuf);
- break;
- }
- len = ceph_databuf_len(dbuf);
- if (len != size)
- osd_req_op_extent_update(req, 0, len);
-
- osd_req_op_extent_osd_databuf(req, 0, dbuf);
-
- /*
- * To simplify error handling, allow AIO only when the IO is
- * within i_size or can be satisfied by a single OSD request.
- */
- if (pos == iocb->ki_pos && !is_sync_kiocb(iocb) &&
- (len == count || pos + count <= i_size_read(inode))) {
- aio_req = kzalloc(sizeof(*aio_req), GFP_KERNEL);
- if (aio_req) {
- aio_req->iocb = iocb;
- aio_req->write = write;
- aio_req->should_dirty = should_dirty;
- INIT_LIST_HEAD(&aio_req->osd_reqs);
- if (write) {
- aio_req->mtime = mtime;
- swap(aio_req->prealloc_cf, *pcf);
- }
- }
- /* ignore error */
- }
-
- if (write) {
- /*
- * throw out any page cache pages in this range. this
- * may block.
- */
- truncate_inode_pages_range(inode->i_mapping, pos,
- PAGE_ALIGN(pos + len) - 1);
-
- req->r_mtime = mtime;
- }
-
- if (aio_req) {
- aio_req->total_len += len;
- aio_req->num_reqs++;
- atomic_inc(&aio_req->pending_reqs);
-
- req->r_callback = ceph_aio_complete_req;
- req->r_inode = inode;
- req->r_priv = aio_req;
- list_add_tail(&req->r_private_item, &aio_req->osd_reqs);
-
- pos += len;
- continue;
- }
-
- ceph_osdc_start_request(req->r_osdc, req);
- ret = ceph_osdc_wait_request(&fsc->client->osdc, req);
-
- if (write)
- ceph_update_write_metrics(metric, req->r_start_latency,
- req->r_end_latency, len, ret);
- else
- ceph_update_read_metrics(metric, req->r_start_latency,
- req->r_end_latency, len, ret);
-
- size = i_size_read(inode);
- if (!write) {
- if (sparse && ret >= 0)
- ret = ceph_sparse_ext_map_end(op);
- else if (ret == -ENOENT)
- ret = 0;
-
- if (ret >= 0 && ret < len && pos + ret < size) {
- int zlen = min_t(size_t, len - ret,
- size - pos - ret);
-
- iov_iter_advance(&dbuf->iter, ret);
- iov_iter_zero(zlen, &dbuf->iter);
- ret += zlen;
- }
- if (ret >= 0)
- len = ret;
- }
-
- ceph_osdc_put_request(req);
- if (ret < 0)
- break;
-
- pos += len;
- if (!write && pos >= size)
- break;
-
- if (write && pos > size) {
- if (ceph_inode_set_size(inode, pos))
- ceph_check_caps(ceph_inode(inode),
- CHECK_CAPS_AUTHONLY);
- }
- }
-
- if (aio_req) {
- LIST_HEAD(osd_reqs);
-
- if (aio_req->num_reqs == 0) {
- kfree(aio_req);
- return ret;
- }
-
- ceph_get_cap_refs(ci, write ? CEPH_CAP_FILE_WR :
- CEPH_CAP_FILE_RD);
-
- list_splice(&aio_req->osd_reqs, &osd_reqs);
- inode_dio_begin(inode);
- while (!list_empty(&osd_reqs)) {
- req = list_first_entry(&osd_reqs,
- struct ceph_osd_request,
- r_private_item);
- list_del_init(&req->r_private_item);
- if (ret >= 0)
- ceph_osdc_start_request(req->r_osdc, req);
- if (ret < 0) {
- req->r_result = ret;
- ceph_aio_complete_req(req);
- }
- }
- return -EIOCBQUEUED;
- }
-
- if (ret != -EOLDSNAPC && pos > iocb->ki_pos) {
- ret = pos - iocb->ki_pos;
- iocb->ki_pos = pos;
- }
- return ret;
-}
-
-/*
- * Synchronous write, straight from __user pointer or user pages.
- *
- * If write spans object boundary, just do multiple writes. (For a
- * correct atomic write, we should e.g. take write locks on all
- * objects, rollback on failure, etc.)
- */
-static ssize_t
-ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
- struct ceph_snap_context *snapc)
-{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file_inode(file);
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- struct ceph_client *cl = fsc->client;
- struct ceph_osd_client *osdc = &fsc->client->osdc;
- struct ceph_osd_request *req;
- struct page **pages;
- u64 len;
- int num_pages;
- int written = 0;
- int ret;
- bool check_caps = false;
- struct timespec64 mtime = current_time(inode);
- size_t count = iov_iter_count(from);
-
- if (ceph_snap(file_inode(file)) != CEPH_NOSNAP)
- return -EROFS;
-
- doutc(cl, "on file %p %lld~%u snapc %p seq %lld\n", file, pos,
- (unsigned)count, snapc, snapc->seq);
-
- ret = filemap_write_and_wait_range(inode->i_mapping,
- pos, pos + count - 1);
- if (ret < 0)
- return ret;
-
- ceph_fscache_invalidate(inode, false);
-
- while ((len = iov_iter_count(from)) > 0) {
- size_t left;
- int n;
- u64 write_pos = pos;
- u64 write_len = len;
- u64 objnum, objoff;
- u64 assert_ver = 0;
- bool rmw;
- bool first, last;
- struct iov_iter saved_iter = *from;
- size_t off, xlen;
-
- ceph_fscrypt_adjust_off_and_len(inode, &write_pos, &write_len);
-
- /* clamp the length to the end of first object */
- ceph_calc_file_object_mapping(&ci->i_layout, write_pos,
- write_len, &objnum, &objoff,
- &xlen);
- write_len = xlen;
-
- /* adjust len downward if it goes beyond current object */
- if (pos + len > write_pos + write_len)
- len = write_pos + write_len - pos;
-
- /*
- * If we had to adjust the length or position to align with a
- * crypto block, then we must do a read/modify/write cycle. We
- * use a version assertion to redrive the thing if something
- * changes in between.
- */
- first = pos != write_pos;
- last = (pos + len) != (write_pos + write_len);
- rmw = first || last;
-
- doutc(cl, "ino %llx %lld~%llu adjusted %lld~%llu -- %srmw\n",
- ci->i_vino.ino, pos, len, write_pos, write_len,
- rmw ? "" : "no ");
-
- /*
- * The data is emplaced into the page as it would be if it were
- * in an array of pagecache pages.
- */
- num_pages = calc_pages_for(write_pos, write_len);
- pages = ceph_alloc_page_vector(num_pages, GFP_KERNEL);
- if (IS_ERR(pages)) {
- ret = PTR_ERR(pages);
- break;
- }
-
- /* Do we need to preload the pages? */
- if (rmw) {
- u64 first_pos = write_pos;
- u64 last_pos = (write_pos + write_len) - CEPH_FSCRYPT_BLOCK_SIZE;
- u64 read_len = CEPH_FSCRYPT_BLOCK_SIZE;
- struct ceph_osd_req_op *op;
-
- /* We should only need to do this for encrypted inodes */
- WARN_ON_ONCE(!IS_ENCRYPTED(inode));
-
- /* No need to do two reads if first and last blocks are same */
- if (first && last_pos == first_pos)
- last = false;
-
- /*
- * Allocate a read request for one or two extents,
- * depending on how the request was aligned.
- */
- req = ceph_osdc_new_request(osdc, &ci->i_layout,
- ci->i_vino, first ? first_pos : last_pos,
- &read_len, 0, (first && last) ? 2 : 1,
- CEPH_OSD_OP_SPARSE_READ, CEPH_OSD_FLAG_READ,
- NULL, ci->i_truncate_seq,
- ci->i_truncate_size, false);
- if (IS_ERR(req)) {
- ceph_release_page_vector(pages, num_pages);
- ret = PTR_ERR(req);
- break;
- }
-
- /* Something is misaligned! */
- if (read_len != CEPH_FSCRYPT_BLOCK_SIZE) {
- ceph_osdc_put_request(req);
- ceph_release_page_vector(pages, num_pages);
- ret = -EIO;
- break;
- }
-
- /* Add extent for first block? */
- op = &req->r_ops[0];
-
- if (first) {
- osd_req_op_extent_osd_data_pages(req, 0, pages,
- CEPH_FSCRYPT_BLOCK_SIZE,
- offset_in_page(first_pos),
- false, false);
- /* We only expect a single extent here */
- ret = __ceph_alloc_sparse_ext_map(op, 1);
- if (ret) {
- ceph_osdc_put_request(req);
- ceph_release_page_vector(pages, num_pages);
- break;
- }
- }
-
- /* Add extent for last block */
- if (last) {
- /* Init the other extent if first extent has been used */
- if (first) {
- op = &req->r_ops[1];
- osd_req_op_extent_init(req, 1,
- CEPH_OSD_OP_SPARSE_READ,
- last_pos, CEPH_FSCRYPT_BLOCK_SIZE,
- ci->i_truncate_size,
- ci->i_truncate_seq);
- }
-
- ret = __ceph_alloc_sparse_ext_map(op, 1);
- if (ret) {
- ceph_osdc_put_request(req);
- ceph_release_page_vector(pages, num_pages);
- break;
- }
-
- osd_req_op_extent_osd_data_pages(req, first ? 1 : 0,
- &pages[num_pages - 1],
- CEPH_FSCRYPT_BLOCK_SIZE,
- offset_in_page(last_pos),
- false, false);
- }
-
- ceph_osdc_start_request(osdc, req);
- ret = ceph_osdc_wait_request(osdc, req);
-
- /* FIXME: length field is wrong if there are 2 extents */
- ceph_update_read_metrics(&fsc->mdsc->metric,
- req->r_start_latency,
- req->r_end_latency,
- read_len, ret);
-
- /* Ok if object is not already present */
- if (ret == -ENOENT) {
- /*
- * If there is no object, then we can't assert
- * on its version. Set it to 0, and we'll use an
- * exclusive create instead.
- */
- ceph_osdc_put_request(req);
- ret = 0;
-
- /*
- * zero out the soon-to-be uncopied parts of the
- * first and last pages.
- */
- if (first)
- zero_user_segment(pages[0], 0,
- offset_in_page(first_pos));
- if (last)
- zero_user_segment(pages[num_pages - 1],
- offset_in_page(last_pos),
- PAGE_SIZE);
- } else {
- if (ret < 0) {
- ceph_osdc_put_request(req);
- ceph_release_page_vector(pages, num_pages);
- break;
- }
-
- op = &req->r_ops[0];
- if (op->extent.sparse_ext_cnt == 0) {
- if (first)
- zero_user_segment(pages[0], 0,
- offset_in_page(first_pos));
- else
- zero_user_segment(pages[num_pages - 1],
- offset_in_page(last_pos),
- PAGE_SIZE);
- } else if (op->extent.sparse_ext_cnt != 1 ||
- ceph_sparse_ext_map_end(op) !=
- CEPH_FSCRYPT_BLOCK_SIZE) {
- ret = -EIO;
- ceph_osdc_put_request(req);
- ceph_release_page_vector(pages, num_pages);
- break;
- }
-
- if (first && last) {
- op = &req->r_ops[1];
- if (op->extent.sparse_ext_cnt == 0) {
- zero_user_segment(pages[num_pages - 1],
- offset_in_page(last_pos),
- PAGE_SIZE);
- } else if (op->extent.sparse_ext_cnt != 1 ||
- ceph_sparse_ext_map_end(op) !=
- CEPH_FSCRYPT_BLOCK_SIZE) {
- ret = -EIO;
- ceph_osdc_put_request(req);
- ceph_release_page_vector(pages, num_pages);
- break;
- }
- }
-
- /* Grab assert version. It must be non-zero. */
- assert_ver = req->r_version;
- WARN_ON_ONCE(ret > 0 && assert_ver == 0);
-
- ceph_osdc_put_request(req);
- if (first) {
- ret = ceph_fscrypt_decrypt_block_inplace(inode,
- pages[0], CEPH_FSCRYPT_BLOCK_SIZE,
- offset_in_page(first_pos),
- first_pos >> CEPH_FSCRYPT_BLOCK_SHIFT);
- if (ret < 0) {
- ceph_release_page_vector(pages, num_pages);
- break;
- }
- }
- if (last) {
- ret = ceph_fscrypt_decrypt_block_inplace(inode,
- pages[num_pages - 1],
- CEPH_FSCRYPT_BLOCK_SIZE,
- offset_in_page(last_pos),
- last_pos >> CEPH_FSCRYPT_BLOCK_SHIFT);
- if (ret < 0) {
- ceph_release_page_vector(pages, num_pages);
- break;
- }
- }
- }
- }
-
- left = len;
- off = offset_in_page(pos);
- for (n = 0; n < num_pages; n++) {
- size_t plen = min_t(size_t, left, PAGE_SIZE - off);
-
- /* copy the data */
- ret = copy_page_from_iter(pages[n], off, plen, from);
- if (ret != plen) {
- ret = -EFAULT;
- break;
- }
- off = 0;
- left -= ret;
- }
- if (ret < 0) {
- doutc(cl, "write failed with %d\n", ret);
- ceph_release_page_vector(pages, num_pages);
- break;
- }
-
- if (IS_ENCRYPTED(inode)) {
- ret = ceph_fscrypt_encrypt_pages(inode, pages,
- write_pos, write_len,
- GFP_KERNEL);
- if (ret < 0) {
- doutc(cl, "encryption failed with %d\n", ret);
- ceph_release_page_vector(pages, num_pages);
- break;
- }
- }
-
- req = ceph_osdc_new_request(osdc, &ci->i_layout,
- ci->i_vino, write_pos, &write_len,
- rmw ? 1 : 0, rmw ? 2 : 1,
- CEPH_OSD_OP_WRITE,
- CEPH_OSD_FLAG_WRITE,
- snapc, ci->i_truncate_seq,
- ci->i_truncate_size, false);
- if (IS_ERR(req)) {
- ret = PTR_ERR(req);
- ceph_release_page_vector(pages, num_pages);
- break;
- }
-
- doutc(cl, "write op %lld~%llu\n", write_pos, write_len);
- osd_req_op_extent_osd_data_pages(req, rmw ? 1 : 0, pages, write_len,
- offset_in_page(write_pos), false,
- true);
- req->r_inode = inode;
- req->r_mtime = mtime;
-
- /* Set up the assertion */
- if (rmw) {
- /*
- * Set up the assertion. If we don't have a version
- * number, then the object doesn't exist yet. Use an
- * exclusive create instead of a version assertion in
- * that case.
- */
- if (assert_ver) {
- osd_req_op_init(req, 0, CEPH_OSD_OP_ASSERT_VER, 0);
- req->r_ops[0].assert_ver.ver = assert_ver;
- } else {
- osd_req_op_init(req, 0, CEPH_OSD_OP_CREATE,
- CEPH_OSD_OP_FLAG_EXCL);
- }
- }
-
- ceph_osdc_start_request(osdc, req);
- ret = ceph_osdc_wait_request(osdc, req);
-
- ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency,
- req->r_end_latency, len, ret);
- ceph_osdc_put_request(req);
- if (ret != 0) {
- doutc(cl, "osd write returned %d\n", ret);
- /* Version changed! Must re-do the rmw cycle */
- if ((assert_ver && (ret == -ERANGE || ret == -EOVERFLOW)) ||
- (!assert_ver && ret == -EEXIST)) {
- /* We should only ever see this on a rmw */
- WARN_ON_ONCE(!rmw);
-
- /* The version should never go backward */
- WARN_ON_ONCE(ret == -EOVERFLOW);
-
- *from = saved_iter;
-
- /* FIXME: limit number of times we loop? */
- continue;
- }
- ceph_set_error_write(ci);
- break;
- }
-
- ceph_clear_error_write(ci);
-
- /*
- * We successfully wrote to a range of the file. Declare
- * that region of the pagecache invalid.
- */
- ret = invalidate_inode_pages2_range(
- inode->i_mapping,
- pos >> PAGE_SHIFT,
- (pos + len - 1) >> PAGE_SHIFT);
- if (ret < 0) {
- doutc(cl, "invalidate_inode_pages2_range returned %d\n",
- ret);
- ret = 0;
- }
- pos += len;
- written += len;
- doutc(cl, "written %d\n", written);
- if (pos > i_size_read(inode)) {
- check_caps = ceph_inode_set_size(inode, pos);
- if (check_caps)
- ceph_check_caps(ceph_inode(inode),
- CHECK_CAPS_AUTHONLY);
- }
-
- }
-
- if (ret != -EOLDSNAPC && written > 0) {
- ret = written;
- iocb->ki_pos = pos;
- }
- doutc(cl, "returning %d\n", ret);
- return ret;
-}
-
-/*
- * Wrap generic_file_aio_read with checks for cap bits on the inode.
- * Atomically grab references, so that those bits are not released
- * back to the MDS mid-read.
- *
- * Hmm, the sync read case isn't actually async... should it be?
- */
-static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to)
-{
- struct file *filp = iocb->ki_filp;
- struct ceph_file_info *fi = filp->private_data;
- size_t len = iov_iter_count(to);
- struct inode *inode = file_inode(filp);
- struct ceph_inode_info *ci = ceph_inode(inode);
- bool direct_lock = iocb->ki_flags & IOCB_DIRECT;
- struct ceph_client *cl = ceph_inode_to_client(inode);
- ssize_t ret;
- int want = 0, got = 0;
- int retry_op = 0, read = 0;
-
-again:
- doutc(cl, "%llu~%u trying to get caps on %p %llx.%llx\n",
- iocb->ki_pos, (unsigned)len, inode, ceph_vinop(inode));
-
- if (ceph_inode_is_shutdown(inode))
- return -ESTALE;
-
- if (direct_lock)
- ceph_start_io_direct(inode);
- else
- ceph_start_io_read(inode);
-
- if (!(fi->flags & CEPH_F_SYNC) && !direct_lock)
- want |= CEPH_CAP_FILE_CACHE;
- if (fi->fmode & CEPH_FILE_MODE_LAZY)
- want |= CEPH_CAP_FILE_LAZYIO;
-
- ret = ceph_get_caps(filp, CEPH_CAP_FILE_RD, want, -1, &got);
- if (ret < 0) {
- if (direct_lock)
- ceph_end_io_direct(inode);
- else
- ceph_end_io_read(inode);
- return ret;
- }
-
- if ((got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) == 0 ||
- (iocb->ki_flags & IOCB_DIRECT) ||
- (fi->flags & CEPH_F_SYNC)) {
-
- doutc(cl, "sync %p %llx.%llx %llu~%u got cap refs on %s\n",
- inode, ceph_vinop(inode), iocb->ki_pos, (unsigned)len,
- ceph_cap_string(got));
-
- if (!ceph_has_inline_data(ci)) {
- if (!retry_op &&
- (iocb->ki_flags & IOCB_DIRECT) &&
- !IS_ENCRYPTED(inode)) {
- ret = ceph_direct_read_write(iocb, to,
- NULL, NULL);
- if (ret >= 0 && ret < len)
- retry_op = CHECK_EOF;
- } else {
- ret = ceph_sync_read(iocb, to, &retry_op);
- }
- } else {
- retry_op = READ_INLINE;
- }
- } else {
- doutc(cl, "async %p %llx.%llx %llu~%u got cap refs on %s\n",
- inode, ceph_vinop(inode), iocb->ki_pos, (unsigned)len,
- ceph_cap_string(got));
- ret = generic_file_read_iter(iocb, to);
- }
-
- doutc(cl, "%p %llx.%llx dropping cap refs on %s = %d\n",
- inode, ceph_vinop(inode), ceph_cap_string(got), (int)ret);
- ceph_put_cap_refs(ci, got);
-
- if (direct_lock)
- ceph_end_io_direct(inode);
- else
- ceph_end_io_read(inode);
-
- if (retry_op > HAVE_RETRIED && ret >= 0) {
- int statret;
- struct page *page = NULL;
- loff_t i_size;
- int mask = CEPH_STAT_CAP_SIZE;
- if (retry_op == READ_INLINE) {
- page = __page_cache_alloc(GFP_KERNEL);
- if (!page)
- return -ENOMEM;
-
- mask = CEPH_STAT_CAP_INLINE_DATA;
- }
-
- statret = __ceph_do_getattr(inode, page, mask, !!page);
- if (statret < 0) {
- if (page)
- __free_page(page);
- if (statret == -ENODATA) {
- BUG_ON(retry_op != READ_INLINE);
- goto again;
- }
- return statret;
- }
-
- i_size = i_size_read(inode);
- if (retry_op == READ_INLINE) {
- BUG_ON(ret > 0 || read > 0);
- if (iocb->ki_pos < i_size &&
- iocb->ki_pos < PAGE_SIZE) {
- loff_t end = min_t(loff_t, i_size,
- iocb->ki_pos + len);
- end = min_t(loff_t, end, PAGE_SIZE);
- if (statret < end)
- zero_user_segment(page, statret, end);
- ret = copy_page_to_iter(page,
- iocb->ki_pos & ~PAGE_MASK,
- end - iocb->ki_pos, to);
- iocb->ki_pos += ret;
- read += ret;
- }
- if (iocb->ki_pos < i_size && read < len) {
- size_t zlen = min_t(size_t, len - read,
- i_size - iocb->ki_pos);
- ret = iov_iter_zero(zlen, to);
- iocb->ki_pos += ret;
- read += ret;
- }
- __free_pages(page, 0);
- return read;
- }
-
- /* hit EOF or hole? */
- if (retry_op == CHECK_EOF && iocb->ki_pos < i_size &&
- ret < len) {
- doutc(cl, "may hit hole, ppos %lld < size %lld, reading more\n",
- iocb->ki_pos, i_size);
-
- read += ret;
- len -= ret;
- retry_op = HAVE_RETRIED;
- goto again;
- }
- }
-
- if (ret >= 0)
- ret += read;
-
- return ret;
-}
-#endif // TODO: Remove after netfs conversion
-
/*
* Wrap filemap_splice_read with checks for cap bits on the inode.
* Atomically grab references, so that those bits are not released
@@ -2298,203 +991,6 @@ static ssize_t ceph_splice_read(struct file *in, loff_t *ppos,
return ret;
}
-#if 0 // TODO: Remove after netfs conversion
-/*
- * Take cap references to avoid releasing caps to MDS mid-write.
- *
- * If we are synchronous, and write with an old snap context, the OSD
- * may return EOLDSNAPC. In that case, retry the write.. _after_
- * dropping our cap refs and allowing the pending snap to logically
- * complete _before_ this write occurs.
- *
- * If we are near ENOSPC, write synchronously.
- */
-static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
-{
- struct file *file = iocb->ki_filp;
- struct ceph_file_info *fi = file->private_data;
- struct inode *inode = file_inode(file);
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- struct ceph_client *cl = fsc->client;
- struct ceph_osd_client *osdc = &fsc->client->osdc;
- struct ceph_cap_flush *prealloc_cf;
- ssize_t count, written = 0;
- int err, want = 0, got;
- bool direct_lock = false;
- u32 map_flags;
- u64 pool_flags;
- loff_t pos;
- loff_t limit = max(i_size_read(inode), fsc->max_file_size);
-
- if (ceph_inode_is_shutdown(inode))
- return -ESTALE;
-
- if (ceph_snap(inode) != CEPH_NOSNAP)
- return -EROFS;
-
- prealloc_cf = ceph_alloc_cap_flush();
- if (!prealloc_cf)
- return -ENOMEM;
-
- if ((iocb->ki_flags & (IOCB_DIRECT | IOCB_APPEND)) == IOCB_DIRECT)
- direct_lock = true;
-
-retry_snap:
- if (direct_lock)
- ceph_start_io_direct(inode);
- else
- ceph_start_io_write(inode);
-
- if (iocb->ki_flags & IOCB_APPEND) {
- err = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE, false);
- if (err < 0)
- goto out;
- }
-
- err = generic_write_checks(iocb, from);
- if (err <= 0)
- goto out;
-
- pos = iocb->ki_pos;
- if (unlikely(pos >= limit)) {
- err = -EFBIG;
- goto out;
- } else {
- iov_iter_truncate(from, limit - pos);
- }
-
- count = iov_iter_count(from);
- if (ceph_quota_is_max_bytes_exceeded(inode, pos + count)) {
- err = -EDQUOT;
- goto out;
- }
-
- down_read(&osdc->lock);
- map_flags = osdc->osdmap->flags;
- pool_flags = ceph_pg_pool_flags(osdc->osdmap, ci->i_layout.pool_id);
- up_read(&osdc->lock);
- if ((map_flags & CEPH_OSDMAP_FULL) ||
- (pool_flags & CEPH_POOL_FLAG_FULL)) {
- err = -ENOSPC;
- goto out;
- }
-
- err = file_remove_privs(file);
- if (err)
- goto out;
-
- doutc(cl, "%p %llx.%llx %llu~%zd getting caps. i_size %llu\n",
- inode, ceph_vinop(inode), pos, count,
- i_size_read(inode));
- if (!(fi->flags & CEPH_F_SYNC) && !direct_lock)
- want |= CEPH_CAP_FILE_BUFFER;
- if (fi->fmode & CEPH_FILE_MODE_LAZY)
- want |= CEPH_CAP_FILE_LAZYIO;
- got = 0;
- err = ceph_get_caps(file, CEPH_CAP_FILE_WR, want, pos + count, &got);
- if (err < 0)
- goto out;
-
- err = file_update_time(file);
- if (err)
- goto out_caps;
-
- inode_inc_iversion_raw(inode);
-
- doutc(cl, "%p %llx.%llx %llu~%zd got cap refs on %s\n",
- inode, ceph_vinop(inode), pos, count, ceph_cap_string(got));
-
- if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) == 0 ||
- (iocb->ki_flags & IOCB_DIRECT) || (fi->flags & CEPH_F_SYNC) ||
- (ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) {
- struct ceph_snap_context *snapc;
- struct iov_iter data;
-
- spin_lock(&ci->i_ceph_lock);
- if (__ceph_have_pending_cap_snap(ci)) {
- struct ceph_cap_snap *capsnap =
- list_last_entry(&ci->i_cap_snaps,
- struct ceph_cap_snap,
- ci_item);
- snapc = ceph_get_snap_context(capsnap->context);
- } else {
- BUG_ON(!ci->i_head_snapc);
- snapc = ceph_get_snap_context(ci->i_head_snapc);
- }
- spin_unlock(&ci->i_ceph_lock);
-
- /* we might need to revert back to that point */
- data = *from;
- if ((iocb->ki_flags & IOCB_DIRECT) && !IS_ENCRYPTED(inode))
- written = ceph_direct_read_write(iocb, &data, snapc,
- &prealloc_cf);
- else
- written = ceph_sync_write(iocb, &data, pos, snapc);
- if (direct_lock)
- ceph_end_io_direct(inode);
- else
- ceph_end_io_write(inode);
- if (written > 0)
- iov_iter_advance(from, written);
- ceph_put_snap_context(snapc);
- } else {
- /*
- * No need to acquire the i_truncate_mutex. Because
- * the MDS revokes Fwb caps before sending truncate
- * message to us. We can't get Fwb cap while there
- * are pending vmtruncate. So write and vmtruncate
- * can not run at the same time
- */
- written = generic_perform_write(iocb, from);
- ceph_end_io_write(inode);
- }
-
- if (written >= 0) {
- int dirty;
-
- spin_lock(&ci->i_ceph_lock);
- dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR,
- &prealloc_cf);
- spin_unlock(&ci->i_ceph_lock);
- if (dirty)
- __mark_inode_dirty(inode, dirty);
- if (ceph_quota_is_max_bytes_approaching(inode, iocb->ki_pos))
- ceph_check_caps(ci, CHECK_CAPS_FLUSH);
- }
-
- doutc(cl, "%p %llx.%llx %llu~%u dropping cap refs on %s\n",
- inode, ceph_vinop(inode), pos, (unsigned)count,
- ceph_cap_string(got));
- ceph_put_cap_refs(ci, got);
-
- if (written == -EOLDSNAPC) {
- doutc(cl, "%p %llx.%llx %llu~%u" "got EOLDSNAPC, retrying\n",
- inode, ceph_vinop(inode), pos, (unsigned)count);
- goto retry_snap;
- }
-
- if (written >= 0) {
- if ((map_flags & CEPH_OSDMAP_NEARFULL) ||
- (pool_flags & CEPH_POOL_FLAG_NEARFULL))
- iocb->ki_flags |= IOCB_DSYNC;
- written = generic_write_sync(iocb, written);
- }
-
- goto out_unlocked;
-out_caps:
- ceph_put_cap_refs(ci, got);
-out:
- if (direct_lock)
- ceph_end_io_direct(inode);
- else
- ceph_end_io_write(inode);
-out_unlocked:
- ceph_free_cap_flush(prealloc_cf);
- return written ? written : err;
-}
-#endif // TODO: Remove after netfs conversion
-
/*
* llseek. be sure to verify file size on SEEK_END.
*/
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index acd5c4821ded..97eddbf9dae9 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -470,19 +470,6 @@ struct ceph_inode_info {
#endif
};
-struct ceph_netfs_request_data { // TODO: Remove
- int caps;
-
- /*
- * Maximum size of a file readahead request.
- * The fadvise could update the bdi's default ra_pages.
- */
- unsigned int file_ra_pages;
-
- /* Set it if fadvise disables file readahead entirely */
- bool file_ra_disabled;
-};
-
struct ceph_io_request {
struct netfs_io_request rreq;
u64 rmw_assert_version;
@@ -1260,9 +1247,6 @@ extern void __ceph_touch_fmode(struct ceph_inode_info *ci,
struct ceph_mds_client *mdsc, int fmode);
/* addr.c */
-#if 0 // TODO: Remove after netfs conversion
-extern const struct netfs_request_ops ceph_netfs_ops;
-#endif // TODO: Remove after netfs conversion
bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio);
extern int ceph_mmap(struct file *file, struct vm_area_struct *vma);
extern int ceph_uninline_data(struct file *file);
@@ -1293,11 +1277,6 @@ extern int ceph_renew_caps(struct inode *inode, int fmode);
extern int ceph_open(struct inode *inode, struct file *file);
extern int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
struct file *file, unsigned flags, umode_t mode);
-#if 0 // TODO: Remove after netfs conversion
-extern ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
- struct iov_iter *to, int *retry_op,
- u64 *last_objver);
-#endif
extern int ceph_release(struct inode *inode, struct file *filp);
extern void ceph_fill_inline_data(struct inode *inode, struct page *locked_page,
char *data, size_t len);
^ permalink raw reply related [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 02/35] libceph: Rename alignment to offset
2025-03-13 23:32 ` [RFC PATCH 02/35] libceph: Rename alignment to offset David Howells
@ 2025-03-14 19:04 ` Viacheslav Dubeyko
2025-03-14 20:01 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-14 19:04 UTC (permalink / raw)
To: Alex Markuze, slava@dubeyko.com, David Howells
Cc: linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
On Thu, 2025-03-13 at 23:32 +0000, David Howells wrote:
> Rename 'alignment' to 'offset' in a number of places where it seems to be
> talking about the offset into the first page of a sequence of pages.
>
Yeah, offset sounds clearer than alignment.
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Viacheslav Dubeyko <slava@dubeyko.com>
> cc: Alex Markuze <amarkuze@redhat.com>
> cc: Ilya Dryomov <idryomov@gmail.com>
> cc: ceph-devel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> ---
> fs/ceph/addr.c | 4 ++--
> include/linux/ceph/messenger.h | 4 ++--
> include/linux/ceph/osd_client.h | 10 +++++-----
> net/ceph/messenger.c | 10 +++++-----
> net/ceph/osd_client.c | 24 ++++++++++++------------
> 5 files changed, 26 insertions(+), 26 deletions(-)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 20b6bd8cd004..482a9f41a685 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -254,7 +254,7 @@ static void finish_netfs_read(struct ceph_osd_request *req)
>
> if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES) {
> ceph_put_page_vector(osd_data->pages,
> - calc_pages_for(osd_data->alignment,
> + calc_pages_for(osd_data->offset,
> osd_data->length), false);
> }
> if (err > 0) {
> @@ -918,7 +918,7 @@ static void writepages_finish(struct ceph_osd_request *req)
> osd_data = osd_req_op_extent_osd_data(req, i);
> BUG_ON(osd_data->type != CEPH_OSD_DATA_TYPE_PAGES);
> len += osd_data->length;
> - num_pages = calc_pages_for((u64)osd_data->alignment,
> + num_pages = calc_pages_for((u64)osd_data->offset,
> (u64)osd_data->length);
> total_pages += num_pages;
> for (j = 0; j < num_pages; j++) {
> diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
> index 1717cc57cdac..db2aba32b8a0 100644
> --- a/include/linux/ceph/messenger.h
> +++ b/include/linux/ceph/messenger.h
> @@ -221,7 +221,7 @@ struct ceph_msg_data {
> struct {
> struct page **pages;
Do we still operate on pages here? It looks like we need to rework it somehow.
> size_t length; /* total # bytes */
> - unsigned int alignment; /* first page */
> + unsigned int offset; /* first page */
Maybe we need to change the comment to say "first folio" here?
> bool own_pages;
We are mentioning pages everywhere. :)
> };
> struct ceph_pagelist *pagelist;
> @@ -602,7 +602,7 @@ extern bool ceph_con_keepalive_expired(struct ceph_connection *con,
> unsigned long interval);
>
> void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
> - size_t length, size_t alignment, bool own_pages);
> + size_t length, size_t offset, bool own_pages);
> extern void ceph_msg_data_add_pagelist(struct ceph_msg *msg,
> struct ceph_pagelist *pagelist);
> #ifdef CONFIG_BLOCK
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index d55b30057a45..8fc84f389aad 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -118,7 +118,7 @@ struct ceph_osd_data {
> struct {
> struct page **pages;
Yeah, pages, pages, pages... :)
> u64 length;
> - u32 alignment;
> + u32 offset;
> bool pages_from_pool;
> bool own_pages;
> };
> @@ -469,7 +469,7 @@ struct ceph_osd_req_op *osd_req_op_init(struct ceph_osd_request *osd_req,
> extern void osd_req_op_raw_data_in_pages(struct ceph_osd_request *,
> unsigned int which,
> struct page **pages, u64 length,
> - u32 alignment, bool pages_from_pool,
> + u32 offset, bool pages_from_pool,
> bool own_pages);
>
> extern void osd_req_op_extent_init(struct ceph_osd_request *osd_req,
> @@ -488,7 +488,7 @@ extern struct ceph_osd_data *osd_req_op_extent_osd_data(
> extern void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *,
> unsigned int which,
> struct page **pages, u64 length,
> - u32 alignment, bool pages_from_pool,
> + u32 offset, bool pages_from_pool,
> bool own_pages);
> extern void osd_req_op_extent_osd_data_pagelist(struct ceph_osd_request *,
> unsigned int which,
> @@ -515,7 +515,7 @@ extern void osd_req_op_cls_request_data_pagelist(struct ceph_osd_request *,
> extern void osd_req_op_cls_request_data_pages(struct ceph_osd_request *,
> unsigned int which,
> struct page **pages, u64 length,
> - u32 alignment, bool pages_from_pool,
> + u32 offset, bool pages_from_pool,
> bool own_pages);
> void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req,
> unsigned int which,
> @@ -524,7 +524,7 @@ void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req,
> extern void osd_req_op_cls_response_data_pages(struct ceph_osd_request *,
> unsigned int which,
> struct page **pages, u64 length,
> - u32 alignment, bool pages_from_pool,
> + u32 offset, bool pages_from_pool,
> bool own_pages);
> int osd_req_op_cls_init(struct ceph_osd_request *osd_req, unsigned int which,
> const char *class, const char *method);
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index d1b5705dc0c6..1df4291cc80b 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -840,8 +840,8 @@ static void ceph_msg_data_pages_cursor_init(struct ceph_msg_data_cursor *cursor,
> BUG_ON(!data->length);
>
> cursor->resid = min(length, data->length);
> - page_count = calc_pages_for(data->alignment, (u64)data->length);
> - cursor->page_offset = data->alignment & ~PAGE_MASK;
> + page_count = calc_pages_for(data->offset, (u64)data->length);
> + cursor->page_offset = data->offset & ~PAGE_MASK;
We still have a lot of work to do converting to folios.
> cursor->page_index = 0;
> BUG_ON(page_count > (int)USHRT_MAX);
> cursor->page_count = (unsigned short)page_count;
> @@ -1873,7 +1873,7 @@ static struct ceph_msg_data *ceph_msg_data_add(struct ceph_msg *msg)
> static void ceph_msg_data_destroy(struct ceph_msg_data *data)
> {
> if (data->type == CEPH_MSG_DATA_PAGES && data->own_pages) {
> - int num_pages = calc_pages_for(data->alignment, data->length);
> + int num_pages = calc_pages_for(data->offset, data->length);
> ceph_release_page_vector(data->pages, num_pages);
> } else if (data->type == CEPH_MSG_DATA_PAGELIST) {
> ceph_pagelist_release(data->pagelist);
> @@ -1881,7 +1881,7 @@ static void ceph_msg_data_destroy(struct ceph_msg_data *data)
> }
>
> void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
> - size_t length, size_t alignment, bool own_pages)
> + size_t length, size_t offset, bool own_pages)
I assume the sequence "size_t offset, size_t length" would look more logical here. But
it's not critical at all.
> {
> struct ceph_msg_data *data;
>
> @@ -1892,7 +1892,7 @@ void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
> data->type = CEPH_MSG_DATA_PAGES;
> data->pages = pages;
> data->length = length;
> - data->alignment = alignment & ~PAGE_MASK;
> + data->offset = offset & ~PAGE_MASK;
> data->own_pages = own_pages;
>
> msg->data_length += length;
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index b24afec24138..e359e70ad47e 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -130,13 +130,13 @@ static void ceph_osd_data_init(struct ceph_osd_data *osd_data)
> * Consumes @pages if @own_pages is true.
> */
> static void ceph_osd_data_pages_init(struct ceph_osd_data *osd_data,
> - struct page **pages, u64 length, u32 alignment,
> + struct page **pages, u64 length, u32 offset,
And here too...
> bool pages_from_pool, bool own_pages)
> {
> osd_data->type = CEPH_OSD_DATA_TYPE_PAGES;
> osd_data->pages = pages;
> osd_data->length = length;
> - osd_data->alignment = alignment;
> + osd_data->offset = offset;
> osd_data->pages_from_pool = pages_from_pool;
> osd_data->own_pages = own_pages;
> }
> @@ -196,26 +196,26 @@ EXPORT_SYMBOL(osd_req_op_extent_osd_data);
>
> void osd_req_op_raw_data_in_pages(struct ceph_osd_request *osd_req,
> unsigned int which, struct page **pages,
> - u64 length, u32 alignment,
> + u64 length, u32 offset,
Interesting... We have a 64-bit length but only a 32-bit offset. I assume
that the length is in bytes but the offset is in pages. Still, this difference in
types looks slightly strange.
> bool pages_from_pool, bool own_pages)
> {
> struct ceph_osd_data *osd_data;
>
> osd_data = osd_req_op_raw_data_in(osd_req, which);
> - ceph_osd_data_pages_init(osd_data, pages, length, alignment,
> + ceph_osd_data_pages_init(osd_data, pages, length, offset,
> pages_from_pool, own_pages);
> }
> EXPORT_SYMBOL(osd_req_op_raw_data_in_pages);
>
> void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *osd_req,
> unsigned int which, struct page **pages,
> - u64 length, u32 alignment,
> + u64 length, u32 offset,
The same strange thing here...
> bool pages_from_pool, bool own_pages)
> {
> struct ceph_osd_data *osd_data;
>
> osd_data = osd_req_op_data(osd_req, which, extent, osd_data);
> - ceph_osd_data_pages_init(osd_data, pages, length, alignment,
> + ceph_osd_data_pages_init(osd_data, pages, length, offset,
> pages_from_pool, own_pages);
> }
> EXPORT_SYMBOL(osd_req_op_extent_osd_data_pages);
> @@ -312,12 +312,12 @@ EXPORT_SYMBOL(osd_req_op_cls_request_data_pagelist);
>
> void osd_req_op_cls_request_data_pages(struct ceph_osd_request *osd_req,
> unsigned int which, struct page **pages, u64 length,
> - u32 alignment, bool pages_from_pool, bool own_pages)
> + u32 offset, bool pages_from_pool, bool own_pages)
> {
> struct ceph_osd_data *osd_data;
>
> osd_data = osd_req_op_data(osd_req, which, cls, request_data);
> - ceph_osd_data_pages_init(osd_data, pages, length, alignment,
> + ceph_osd_data_pages_init(osd_data, pages, length, offset,
> pages_from_pool, own_pages);
> osd_req->r_ops[which].cls.indata_len += length;
> osd_req->r_ops[which].indata_len += length;
> @@ -344,12 +344,12 @@ EXPORT_SYMBOL(osd_req_op_cls_request_data_bvecs);
>
> void osd_req_op_cls_response_data_pages(struct ceph_osd_request *osd_req,
> unsigned int which, struct page **pages, u64 length,
> - u32 alignment, bool pages_from_pool, bool own_pages)
> + u32 offset, bool pages_from_pool, bool own_pages)
> {
> struct ceph_osd_data *osd_data;
>
> osd_data = osd_req_op_data(osd_req, which, cls, response_data);
> - ceph_osd_data_pages_init(osd_data, pages, length, alignment,
> + ceph_osd_data_pages_init(osd_data, pages, length, offset,
> pages_from_pool, own_pages);
> }
> EXPORT_SYMBOL(osd_req_op_cls_response_data_pages);
> @@ -382,7 +382,7 @@ static void ceph_osd_data_release(struct ceph_osd_data *osd_data)
> if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES && osd_data->own_pages) {
> int num_pages;
>
> - num_pages = calc_pages_for((u64)osd_data->alignment,
> + num_pages = calc_pages_for((u64)osd_data->offset,
> (u64)osd_data->length);
As far as I can see, length is already u64, but offset is u32. Why don't we
have u64 for both fields? Then we wouldn't need the (u64) casts on
osd_data->length/offset here.
> ceph_release_page_vector(osd_data->pages, num_pages);
> } else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGELIST) {
> @@ -969,7 +969,7 @@ static void ceph_osdc_msg_data_add(struct ceph_msg *msg,
> BUG_ON(length > (u64) SIZE_MAX);
> if (length)
> ceph_msg_data_add_pages(msg, osd_data->pages,
> - length, osd_data->alignment, false);
> + length, osd_data->offset, false);
> } else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGELIST) {
> BUG_ON(!length);
> ceph_msg_data_add_pagelist(msg, osd_data->pagelist);
>
>
Thanks,
Slava.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 02/35] libceph: Rename alignment to offset
2025-03-13 23:32 ` [RFC PATCH 02/35] libceph: Rename alignment to offset David Howells
2025-03-14 19:04 ` Viacheslav Dubeyko
@ 2025-03-14 20:01 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-14 20:01 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: dhowells, Alex Markuze, slava@dubeyko.com,
linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > struct {
> > struct page **pages;
>
> Do we still operate on pages here? It looks like we need to rework it somehow.
One of the points of these patches is to rework this, working towards reducing
everything to just an iterator where possible, using a segmented list as the
actual buffers.
One of the things hopefully to be discussed at LSF/MM is how we might combine
struct folio_queue, struct bvec[] and struct scatterlist into something that
can hold references to more general pieces of memory and not just folios - and
that might be something we can use here for handing buffers about.
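To make that concrete, a strawman of such a generalised segment might look
something like this (entirely hypothetical; nothing like it exists today):

	struct mem_seg {
		union {
			struct folio	*folio;
			struct page	*page;
			phys_addr_t	 phys;	/* e.g. device memory */
		};
		unsigned int	offset;
		unsigned int	len;
		u8		type;	/* says which union member is live */
	};

An iterator could then walk an array of these without caring what kind of
memory each segment refers to.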
Anyway, my aim is to get all references to pages and folios (as far as
possible) out of 9p, afs, cifs and ceph - delegating all of that to netfslib
for ceph (rbd is slightly different - but I've completed the transformation
there).
Netfslib will pass an iterator to each subrequest describing the buffer, and
we might need to go from there to another iterator describing a bounce buffer
for transport encryption, but from there, we should pass the iterator directly
to the socket.
Further, I would like to make it so that we can link these buffers together
such that we can fabricate an entire message within a single iterator - and
then we no longer need to cork the TCP socket.
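For illustration, the transmission side could then collapse to roughly this
per data item (hypothetical helper, not part of this series):

	static int ceph_tcp_send_data(struct socket *sock,
				      struct ceph_msg_data *data)
	{
		struct msghdr msg = {
			.msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL,
			.msg_iter  = data->iter,
		};

		/* The iov_iter describes the entire buffer, whatever
		 * backs it, so a single sendmsg() per ceph_msg_data
		 * suffices.
		 */
		return sock_sendmsg(sock, &msg);
	}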
David
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 03/35] libceph: Add a new data container type, ceph_databuf
2025-03-13 23:32 ` [RFC PATCH 03/35] libceph: Add a new data container type, ceph_databuf David Howells
@ 2025-03-14 20:06 ` Viacheslav Dubeyko
2025-03-17 11:27 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-14 20:06 UTC (permalink / raw)
To: Alex Markuze, slava@dubeyko.com, David Howells
Cc: linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
On Thu, 2025-03-13 at 23:32 +0000, David Howells wrote:
> Add a new ceph data container type, ceph_databuf, that can carry a list of
> pages in a bvec and use an iov_iter to describe the data to the next
> layer down. The iterator can also be used to refer to other types, such as
> ITER_FOLIOQ.
>
> There are two ways of loading the bvec. One way is to allocate a buffer
> with space in it and then add data, expanding the space as needed; the
> other is to splice in pages, expanding the bvec[] as needed.
>
> This is intended to replace all other types.
>
We definitely need to think about unit-tests or self-tests here.
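For instance, a minimal KUnit case against the API added by this patch
might look like this (just a sketch):

	static void ceph_databuf_append_test(struct kunit *test)
	{
		static const char msg[] = "hello";
		struct ceph_databuf *dbuf;

		dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_KERNEL);
		KUNIT_ASSERT_NOT_NULL(test, dbuf);
		KUNIT_EXPECT_EQ(test,
				ceph_databuf_append(dbuf, msg, sizeof(msg)), 0);
		KUNIT_EXPECT_EQ(test, ceph_databuf_len(dbuf), sizeof(msg));
		ceph_databuf_release(dbuf);
	}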
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Viacheslav Dubeyko <slava@dubeyko.com>
> cc: Alex Markuze <amarkuze@redhat.com>
> cc: Ilya Dryomov <idryomov@gmail.com>
> cc: ceph-devel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> ---
> include/linux/ceph/databuf.h | 131 +++++++++++++++++++++
> include/linux/ceph/messenger.h | 6 +-
> include/linux/ceph/osd_client.h | 3 +
> net/ceph/Makefile | 3 +-
> net/ceph/databuf.c | 200 ++++++++++++++++++++++++++++++++
> net/ceph/messenger.c | 20 +++-
> net/ceph/osd_client.c | 11 +-
> 7 files changed, 369 insertions(+), 5 deletions(-)
> create mode 100644 include/linux/ceph/databuf.h
> create mode 100644 net/ceph/databuf.c
>
> diff --git a/include/linux/ceph/databuf.h b/include/linux/ceph/databuf.h
> new file mode 100644
> index 000000000000..14c7a6449467
> --- /dev/null
> +++ b/include/linux/ceph/databuf.h
> @@ -0,0 +1,131 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __FS_CEPH_DATABUF_H
> +#define __FS_CEPH_DATABUF_H
> +
> +#include <asm/byteorder.h>
> +#include <linux/refcount.h>
> +#include <linux/blk_types.h>
> +
> +struct ceph_databuf {
> + struct bio_vec *bvec; /* List of pages */
So, maybe we need to think about folios now?
> + struct bio_vec inline_bvec[1]; /* Inline bvecs for small buffers */
> + struct iov_iter iter; /* Iterator defining occupied data */
> + size_t limit; /* Maximum length before expansion required */
> + size_t nr_bvec; /* Number of bvec[] that have pages */
Folios? :)
> + size_t max_bvec; /* Size of bvec[] */
> + refcount_t refcnt;
> + bool put_pages; /* T if pages in bvec[] need to be put*/
Maybe folios? :)
> +};
> +
> +struct ceph_databuf *ceph_databuf_alloc(size_t min_bvec, size_t space,
> + unsigned int data_source, gfp_t gfp);
> +struct ceph_databuf *ceph_databuf_get(struct ceph_databuf *dbuf);
> +void ceph_databuf_release(struct ceph_databuf *dbuf);
> +int ceph_databuf_append(struct ceph_databuf *dbuf, const void *d, size_t l);
I think the declaration is important and the argument names need to be clear
enough. A short name is good, but it can be confusing. Why not len instead of l?
And I am still guessing what d means. :)
> +int ceph_databuf_reserve(struct ceph_databuf *dbuf, size_t space, gfp_t gfp);
> +int ceph_databuf_insert_frag(struct ceph_databuf *dbuf, unsigned int ix,
> + size_t len, gfp_t gfp);
> +
> +static inline
> +struct ceph_databuf *ceph_databuf_req_alloc(size_t min_bvec, size_t space, gfp_t gfp)
> +{
> + return ceph_databuf_alloc(min_bvec, space, ITER_SOURCE, gfp);
> +}
> +
> +static inline
> +struct ceph_databuf *ceph_databuf_reply_alloc(size_t min_bvec, size_t space, gfp_t gfp)
> +{
> + struct ceph_databuf *dbuf;
> +
> + dbuf = ceph_databuf_alloc(min_bvec, space, ITER_DEST, gfp);
> + if (dbuf)
> + iov_iter_reexpand(&dbuf->iter, space);
> + return dbuf;
> +}
> +
> +static inline struct page *ceph_databuf_page(struct ceph_databuf *dbuf,
> + unsigned int ix)
> +{
> + return dbuf->bvec[ix].bv_page;
> +}
> +
> +#define kmap_ceph_databuf_page(dbuf, ix) \
> + kmap_local_page(ceph_databuf_page(dbuf, ix));
> +
I am still thinking that we need to base the new code on folios.
> +static inline int ceph_databuf_encode_64(struct ceph_databuf *dbuf, u64 v)
> +{
> + __le64 ev = cpu_to_le64(v);
> + return ceph_databuf_append(dbuf, &ev, sizeof(ev));
> +}
> +static inline int ceph_databuf_encode_32(struct ceph_databuf *dbuf, u32 v)
> +{
> + __le32 ev = cpu_to_le32(v);
> + return ceph_databuf_append(dbuf, &ev, sizeof(ev));
> +}
> +static inline int ceph_databuf_encode_16(struct ceph_databuf *dbuf, u16 v)
> +{
> + __le16 ev = cpu_to_le16(v);
> + return ceph_databuf_append(dbuf, &ev, sizeof(ev));
> +}
> +static inline int ceph_databuf_encode_8(struct ceph_databuf *dbuf, u8 v)
> +{
> + return ceph_databuf_append(dbuf, &v, 1);
> +}
Maybe encode_8, encode_16, encode_32, encode_64? I mean the reverse sequence here.
> +static inline int ceph_databuf_encode_string(struct ceph_databuf *dbuf,
> + const char *s, u32 len)
> +{
> + int ret = ceph_databuf_encode_32(dbuf, len);
> + if (ret)
> + return ret;
> + if (len)
> + return ceph_databuf_append(dbuf, s, len);
> + return 0;
> +}
> +
> +static inline size_t ceph_databuf_len(struct ceph_databuf *dbuf)
> +{
> + return dbuf->iter.count;
> +}
> +
> +static inline void ceph_databuf_added_data(struct ceph_databuf *dbuf,
> + size_t len)
> +{
> + dbuf->iter.count += len;
> +}
> +
> +static inline void ceph_databuf_reply_ready(struct ceph_databuf *reply,
> + size_t len)
> +{
> + reply->iter.data_source = ITER_SOURCE;
> + iov_iter_truncate(&reply->iter, len);
> +}
> +
> +static inline void ceph_databuf_reset_reply(struct ceph_databuf *reply)
> +{
> + iov_iter_bvec(&reply->iter, ITER_DEST,
> + reply->bvec, reply->nr_bvec, reply->limit);
> +}
> +
> +static inline void ceph_databuf_append_page(struct ceph_databuf *dbuf,
> + struct page *page,
> + unsigned int offset,
> + unsigned int len)
> +{
> + BUG_ON(dbuf->nr_bvec >= dbuf->max_bvec);
> + bvec_set_page(&dbuf->bvec[dbuf->nr_bvec++], page, len, offset);
> + dbuf->iter.count += len;
> + dbuf->iter.nr_segs++;
Why do we add len to dbuf->iter.count but only increment dbuf->iter.nr_segs by one?
> +}
> +
> +static inline void *ceph_databuf_enc_start(struct ceph_databuf *dbuf)
> +{
> + return page_address(ceph_databuf_page(dbuf, 0)) + dbuf->iter.count;
> +}
> +
> +static inline void ceph_databuf_enc_stop(struct ceph_databuf *dbuf, void *p)
> +{
> + dbuf->iter.count = p - page_address(ceph_databuf_page(dbuf, 0));
> + BUG_ON(dbuf->iter.count > dbuf->limit);
> +}
The same about page...
> +
> +#endif /* __FS_CEPH_DATABUF_H */
> diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
> index db2aba32b8a0..864aad369c91 100644
> --- a/include/linux/ceph/messenger.h
> +++ b/include/linux/ceph/messenger.h
> @@ -117,6 +117,7 @@ struct ceph_messenger {
>
> enum ceph_msg_data_type {
> CEPH_MSG_DATA_NONE, /* message contains no data payload */
> + CEPH_MSG_DATA_DATABUF, /* data source/destination is a data buffer */
> CEPH_MSG_DATA_PAGES, /* data source/destination is a page array */
> CEPH_MSG_DATA_PAGELIST, /* data source/destination is a pagelist */
So, the final replacement with databuf will happen in the future?
> #ifdef CONFIG_BLOCK
> @@ -210,7 +211,10 @@ struct ceph_bvec_iter {
>
> struct ceph_msg_data {
> enum ceph_msg_data_type type;
> + struct iov_iter iter;
> + bool release_dbuf;
> union {
> + struct ceph_databuf *dbuf;
> #ifdef CONFIG_BLOCK
> struct {
> struct ceph_bio_iter bio_pos;
> @@ -225,7 +229,6 @@ struct ceph_msg_data {
> bool own_pages;
> };
> struct ceph_pagelist *pagelist;
> - struct iov_iter iter;
> };
> };
>
> @@ -601,6 +604,7 @@ extern void ceph_con_keepalive(struct ceph_connection *con);
> extern bool ceph_con_keepalive_expired(struct ceph_connection *con,
> unsigned long interval);
>
> +void ceph_msg_data_add_databuf(struct ceph_msg *msg, struct ceph_databuf *dbuf);
> void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
> size_t length, size_t offset, bool own_pages);
> extern void ceph_msg_data_add_pagelist(struct ceph_msg *msg,
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 8fc84f389aad..b8fb5a71dd57 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -16,6 +16,7 @@
> #include <linux/ceph/msgpool.h>
> #include <linux/ceph/auth.h>
> #include <linux/ceph/pagelist.h>
> +#include <linux/ceph/databuf.h>
>
> struct ceph_msg;
> struct ceph_snap_context;
> @@ -103,6 +104,7 @@ struct ceph_osd {
>
> enum ceph_osd_data_type {
> CEPH_OSD_DATA_TYPE_NONE = 0,
> + CEPH_OSD_DATA_TYPE_DATABUF,
> CEPH_OSD_DATA_TYPE_PAGES,
> CEPH_OSD_DATA_TYPE_PAGELIST,
The same question about the replacement with databuf here. Is it future work?
> #ifdef CONFIG_BLOCK
> @@ -115,6 +117,7 @@ enum ceph_osd_data_type {
> struct ceph_osd_data {
> enum ceph_osd_data_type type;
> union {
> + struct ceph_databuf *dbuf;
> struct {
> struct page **pages;
> u64 length;
> diff --git a/net/ceph/Makefile b/net/ceph/Makefile
> index 8802a0c0155d..4b2e0b654e45 100644
> --- a/net/ceph/Makefile
> +++ b/net/ceph/Makefile
> @@ -15,4 +15,5 @@ libceph-y := ceph_common.o messenger.o msgpool.o buffer.o pagelist.o \
> auth_x.o \
> ceph_strings.o ceph_hash.o \
> pagevec.o snapshot.o string_table.o \
> - messenger_v1.o messenger_v2.o
> + messenger_v1.o messenger_v2.o \
> + databuf.o
> diff --git a/net/ceph/databuf.c b/net/ceph/databuf.c
> new file mode 100644
> index 000000000000..9d108fff5a4f
> --- /dev/null
> +++ b/net/ceph/databuf.c
> @@ -0,0 +1,200 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Data container
> + *
> + * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + */
> +
> +#include <linux/export.h>
> +#include <linux/gfp.h>
> +#include <linux/slab.h>
> +#include <linux/uio.h>
> +#include <linux/pagemap.h>
> +#include <linux/highmem.h>
> +#include <linux/ceph/databuf.h>
> +
> +struct ceph_databuf *ceph_databuf_alloc(size_t min_bvec, size_t space,
> + unsigned int data_source, gfp_t gfp)
> +{
> + struct ceph_databuf *dbuf;
> + size_t inl = ARRAY_SIZE(dbuf->inline_bvec);
> +
> + dbuf = kzalloc(sizeof(*dbuf), gfp);
> + if (!dbuf)
> + return NULL;
I am guessing... Should we return an error code here?
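If so, the usual ERR_PTR() convention would give us something like (sketch):

	dbuf = kzalloc(sizeof(*dbuf), gfp);
	if (!dbuf)
		return ERR_PTR(-ENOMEM);

with the callers switched over to IS_ERR()/PTR_ERR() instead of NULL checks.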
> +
> + refcount_set(&dbuf->refcnt, 1);
> +
> + if (min_bvec == 0 && space == 0) {
> + /* Do nothing */
> + } else if (min_bvec <= inl && space <= inl * PAGE_SIZE) {
> + dbuf->bvec = dbuf->inline_bvec;
> + dbuf->max_bvec = inl;
> + dbuf->limit = space;
> + } else if (min_bvec) {
> + min_bvec = umax(min_bvec, 16);
Why 16 here? Maybe we need to introduce a well-explained constant?
> +
> + dbuf->bvec = kcalloc(min_bvec, sizeof(struct bio_vec), gfp);
> + if (!dbuf->bvec) {
> + kfree(dbuf);
> + return NULL;
Ditto. Should we return an error code here?
> + }
> +
> + dbuf->max_bvec = min_bvec;
Why do we assign min_bvec to max_bvec? I am slightly confused that the
function argument is named min_bvec, but we end up saving that value into
max_bvec.
> + }
> +
> + iov_iter_bvec(&dbuf->iter, data_source, dbuf->bvec, 0, 0);
> +
> + if (space) {
> + if (ceph_databuf_reserve(dbuf, space, gfp) < 0) {
> + ceph_databuf_release(dbuf);
> + return NULL;
Ditto. Should we return an error code here?
> + }
> + }
> + return dbuf;
> +}
> +EXPORT_SYMBOL(ceph_databuf_alloc);
> +
> +struct ceph_databuf *ceph_databuf_get(struct ceph_databuf *dbuf)
I see the point here. But do we really need to return a pointer? Why not simply:
void ceph_databuf_get(struct ceph_databuf *dbuf)
> +{
> + if (!dbuf)
> + return NULL;
> + refcount_inc(&dbuf->refcnt);
> + return dbuf;
> +}
> +EXPORT_SYMBOL(ceph_databuf_get);
> +
> +void ceph_databuf_release(struct ceph_databuf *dbuf)
> +{
> + size_t i;
> +
> + if (!dbuf || !refcount_dec_and_test(&dbuf->refcnt))
> + return;
> +
> + if (dbuf->put_pages)
> + for (i = 0; i < dbuf->nr_bvec; i++)
> + put_page(dbuf->bvec[i].bv_page);
> + if (dbuf->bvec != dbuf->inline_bvec)
> + kfree(dbuf->bvec);
> + kfree(dbuf);
> +}
> +EXPORT_SYMBOL(ceph_databuf_release);
> +
> +/*
> + * Expand the bvec[] in the dbuf.
> + */
> +static int ceph_databuf_expand(struct ceph_databuf *dbuf, size_t req_bvec,
> + gfp_t gfp)
> +{
> + struct bio_vec *bvec = dbuf->bvec, *old = bvec;
I think that the assignment (*old = bvec) looks confusing when it is kept on
the same line as the declaration and initialization of bvec. Why not declare
and initialize it on the next line?
> + size_t size, max_bvec, off = dbuf->iter.bvec - old;
I think there are too many declarations on the same line. Why not:
size_t size, max_bvec;
size_t off = dbuf->iter.bvec - old;
> + size_t inl = ARRAY_SIZE(dbuf->inline_bvec);
> +
> + if (req_bvec <= inl) {
> + dbuf->bvec = dbuf->inline_bvec;
> + dbuf->max_bvec = inl;
> + dbuf->iter.bvec = dbuf->inline_bvec + off;
> + return 0;
> + }
> +
> + max_bvec = roundup_pow_of_two(req_bvec);
> + size = array_size(max_bvec, sizeof(struct bio_vec));
> +
> + if (old == dbuf->inline_bvec) {
> + bvec = kmalloc_array(max_bvec, sizeof(struct bio_vec), gfp);
> + if (!bvec)
> + return -ENOMEM;
> + memcpy(bvec, old, inl);
> + } else {
> + bvec = krealloc(old, size, gfp);
> + if (!bvec)
> + return -ENOMEM;
> + }
> + dbuf->bvec = bvec;
> + dbuf->max_bvec = max_bvec;
> + dbuf->iter.bvec = bvec + off;
> + return 0;
> +}
> +
> +/* Allocate enough pages for a dbuf to append the given amount
> + * of dbuf without allocating.
> + * Returns: 0 on success, -ENOMEM on error.
> + */
> +int ceph_databuf_reserve(struct ceph_databuf *dbuf, size_t add_space,
> + gfp_t gfp)
> +{
> + struct bio_vec *bvec;
> + size_t i, req_bvec = DIV_ROUND_UP(dbuf->iter.count + add_space, PAGE_SIZE);
Why not:
size_t req_bvec = DIV_ROUND_UP(dbuf->iter.count + add_space, PAGE_SIZE);
size_t i;
> + int ret;
> +
> + dbuf->put_pages = true;
> + if (req_bvec > dbuf->max_bvec) {
> + ret = ceph_databuf_expand(dbuf, req_bvec, gfp);
> + if (ret < 0)
> + return ret;
> + }
> +
> + bvec = dbuf->bvec;
> + while (dbuf->nr_bvec < req_bvec) {
> + struct page *pages[16];
Why do we hardcode 16 here instead of using some well-defined constant?
And, again, why not folios?
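Something like this, say (the name is invented here):

	/* Batch size for alloc_pages_bulk() when filling a databuf. */
	#define CEPH_DATABUF_BULK_PAGES 16

	struct page *pages[CEPH_DATABUF_BULK_PAGES];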
> + size_t want = min(req_bvec, ARRAY_SIZE(pages)), got;
> +
> + memset(pages, 0, sizeof(pages));
> + got = alloc_pages_bulk(gfp, want, pages);
> + if (!got)
> + return -ENOMEM;
> + for (i = 0; i < got; i++)
Why do we use size_t for i and got? Why not int, for example?
> + bvec_set_page(&bvec[dbuf->nr_bvec + i], pages[i],
> + PAGE_SIZE, 0);
> + dbuf->iter.nr_segs += got;
> + dbuf->nr_bvec += got;
If I understood correctly, ceph_databuf_append_page() uses slightly
different logic.
+ dbuf->iter.count += len;
+ dbuf->iter.nr_segs++;
But here we add the number of allocated pages to nr_segs without touching
iter.count. It is slightly confusing; I think I am missing something here.
> + dbuf->limit = dbuf->nr_bvec * PAGE_SIZE;
> + }
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(ceph_databuf_reserve);
> +
> +int ceph_databuf_append(struct ceph_databuf *dbuf, const void *buf, size_t len)
> +{
> + struct iov_iter temp_iter;
> +
> + if (!len)
> + return 0;
> + if (dbuf->limit - dbuf->iter.count > len &&
> + ceph_databuf_reserve(dbuf, len, GFP_NOIO) < 0)
> + return -ENOMEM;
> +
> + iov_iter_bvec(&temp_iter, ITER_DEST,
> + dbuf->bvec, dbuf->nr_bvec, dbuf->limit);
> + iov_iter_advance(&temp_iter, dbuf->iter.count);
> +
> + if (copy_to_iter(buf, len, &temp_iter) != len)
> + return -EFAULT;
> + dbuf->iter.count += len;
> + return 0;
> +}
> +EXPORT_SYMBOL(ceph_databuf_append);
> +
> +/*
> + * Allocate a fragment and insert it into the buffer at the specified index.
> + */
> +int ceph_databuf_insert_frag(struct ceph_databuf *dbuf, unsigned int ix,
> + size_t len, gfp_t gfp)
> +{
> + struct page *page;
> +
Why not folio?
> + page = alloc_page(gfp);
> + if (!page)
> + return -ENOMEM;
> +
> + bvec_set_page(&dbuf->bvec[ix], page, len, 0);
> +
> + if (dbuf->nr_bvec == ix) {
> + dbuf->iter.nr_segs = ix + 1;
> + dbuf->nr_bvec = ix + 1;
> + dbuf->iter.count += len;
> + }
> + return 0;
> +}
> +EXPORT_SYMBOL(ceph_databuf_insert_frag);
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index 1df4291cc80b..802f0b222131 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -1872,7 +1872,9 @@ static struct ceph_msg_data *ceph_msg_data_add(struct ceph_msg *msg)
>
> static void ceph_msg_data_destroy(struct ceph_msg_data *data)
> {
> - if (data->type == CEPH_MSG_DATA_PAGES && data->own_pages) {
> + if (data->type == CEPH_MSG_DATA_DATABUF) {
> + ceph_databuf_release(data->dbuf);
> + } else if (data->type == CEPH_MSG_DATA_PAGES && data->own_pages) {
> int num_pages = calc_pages_for(data->offset, data->length);
> ceph_release_page_vector(data->pages, num_pages);
> } else if (data->type == CEPH_MSG_DATA_PAGELIST) {
> @@ -1880,6 +1882,22 @@ static void ceph_msg_data_destroy(struct ceph_msg_data *data)
> }
> }
>
> +void ceph_msg_data_add_databuf(struct ceph_msg *msg, struct ceph_databuf *dbuf)
> +{
> + struct ceph_msg_data *data;
> +
> + BUG_ON(!dbuf);
> + BUG_ON(!ceph_databuf_len(dbuf));
> +
> + data = ceph_msg_data_add(msg);
> + data->type = CEPH_MSG_DATA_DATABUF;
> + data->dbuf = ceph_databuf_get(dbuf);
> + data->iter = dbuf->iter;
> +
> + msg->data_length += ceph_databuf_len(dbuf);
> +}
> +EXPORT_SYMBOL(ceph_msg_data_add_databuf);
> +
> void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
> size_t length, size_t offset, bool own_pages)
> {
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index e359e70ad47e..c84634264377 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -359,6 +359,8 @@ static u64 ceph_osd_data_length(struct ceph_osd_data *osd_data)
> switch (osd_data->type) {
> case CEPH_OSD_DATA_TYPE_NONE:
> return 0;
> + case CEPH_OSD_DATA_TYPE_DATABUF:
> + return ceph_databuf_len(osd_data->dbuf);
> case CEPH_OSD_DATA_TYPE_PAGES:
> return osd_data->length;
> case CEPH_OSD_DATA_TYPE_PAGELIST:
> @@ -379,7 +381,9 @@ static u64 ceph_osd_data_length(struct ceph_osd_data *osd_data)
>
> static void ceph_osd_data_release(struct ceph_osd_data *osd_data)
> {
> - if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES && osd_data->own_pages) {
> + if (osd_data->type == CEPH_OSD_DATA_TYPE_DATABUF) {
> + ceph_databuf_release(osd_data->dbuf);
> + } else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES && osd_data->own_pages) {
> int num_pages;
>
> num_pages = calc_pages_for((u64)osd_data->offset,
> @@ -965,7 +969,10 @@ static void ceph_osdc_msg_data_add(struct ceph_msg *msg,
> {
> u64 length = ceph_osd_data_length(osd_data);
>
> - if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES) {
> + if (osd_data->type == CEPH_OSD_DATA_TYPE_DATABUF) {
> + BUG_ON(!length);
> + ceph_msg_data_add_databuf(msg, osd_data->dbuf);
> + } else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES) {
> BUG_ON(length > (u64) SIZE_MAX);
> if (length)
> ceph_msg_data_add_pages(msg, osd_data->pages,
>
>
Thanks,
Slava.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 04/35] ceph: Convert ceph_mds_request::r_pagelist to a databuf
2025-03-13 23:32 ` [RFC PATCH 04/35] ceph: Convert ceph_mds_request::r_pagelist to a databuf David Howells
@ 2025-03-14 22:27 ` slava
2025-03-17 11:52 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: slava @ 2025-03-14 22:27 UTC (permalink / raw)
To: David Howells, Alex Markuze
Cc: Ilya Dryomov, Jeff Layton, Dongsheng Yang, ceph-devel,
linux-fsdevel, linux-block, linux-kernel, Slava.Dubeyko
On Thu, 2025-03-13 at 23:32 +0000, David Howells wrote:
> Convert ceph_mds_request::r_pagelist to a databuf, along with the
> stuff
> that uses it such as setxattr ops.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Viacheslav Dubeyko <slava@dubeyko.com>
> cc: Alex Markuze <amarkuze@redhat.com>
> cc: Ilya Dryomov <idryomov@gmail.com>
> cc: ceph-devel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> ---
> fs/ceph/acl.c | 39 ++++++++++----------
> fs/ceph/file.c | 12 ++++---
> fs/ceph/inode.c | 85 +++++++++++++++++++-------------------------
> fs/ceph/mds_client.c | 11 +++---
> fs/ceph/mds_client.h | 2 +-
> fs/ceph/super.h | 2 +-
> fs/ceph/xattr.c | 68 +++++++++++++++--------------------
> 7 files changed, 96 insertions(+), 123 deletions(-)
>
> diff --git a/fs/ceph/acl.c b/fs/ceph/acl.c
> index 1564eacc253d..d6da650db83e 100644
> --- a/fs/ceph/acl.c
> +++ b/fs/ceph/acl.c
> @@ -171,7 +171,7 @@ int ceph_pre_init_acls(struct inode *dir, umode_t *mode,
> {
> struct posix_acl *acl, *default_acl;
> size_t val_size1 = 0, val_size2 = 0;
> - struct ceph_pagelist *pagelist = NULL;
> + struct ceph_databuf *dbuf = NULL;
> void *tmp_buf = NULL;
> int err;
>
> @@ -201,58 +201,55 @@ int ceph_pre_init_acls(struct inode *dir, umode_t *mode,
> tmp_buf = kmalloc(max(val_size1, val_size2), GFP_KERNEL);
> if (!tmp_buf)
> goto out_err;
> - pagelist = ceph_pagelist_alloc(GFP_KERNEL);
> - if (!pagelist)
> + dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_KERNEL);
> + if (!dbuf)
> goto out_err;
>
> - err = ceph_pagelist_reserve(pagelist, PAGE_SIZE);
> - if (err)
> - goto out_err;
> -
> - ceph_pagelist_encode_32(pagelist, acl && default_acl ? 2 : 1);
> + ceph_databuf_encode_32(dbuf, acl && default_acl ? 2 : 1);
>
> if (acl) {
> size_t len = strlen(XATTR_NAME_POSIX_ACL_ACCESS);
> - err = ceph_pagelist_reserve(pagelist, len + val_size1 + 8);
> + err = ceph_databuf_reserve(dbuf, len + val_size1 + 8,
> + GFP_KERNEL);
I know it's a simple change. But this len + val_size1 + 8 looks
confusing anyway. What does this hardcoded 8 mean? :)
> if (err)
> goto out_err;
> - ceph_pagelist_encode_string(pagelist, XATTR_NAME_POSIX_ACL_ACCESS,
> - len);
> + ceph_databuf_encode_string(dbuf, XATTR_NAME_POSIX_ACL_ACCESS,
> + len);
> err = posix_acl_to_xattr(&init_user_ns, acl,
> tmp_buf, val_size1);
> if (err < 0)
> goto out_err;
> - ceph_pagelist_encode_32(pagelist, val_size1);
> - ceph_pagelist_append(pagelist, tmp_buf, val_size1);
> + ceph_databuf_encode_32(dbuf, val_size1);
> + ceph_databuf_append(dbuf, tmp_buf, val_size1);
> }
> if (default_acl) {
> size_t len = strlen(XATTR_NAME_POSIX_ACL_DEFAULT);
> - err = ceph_pagelist_reserve(pagelist, len + val_size2 + 8);
> + err = ceph_databuf_reserve(dbuf, len + val_size2 + 8,
> + GFP_KERNEL);
Same question here. :) What does this hardcoded 8 mean? :)
> if (err)
> goto out_err;
> - ceph_pagelist_encode_string(pagelist,
> - XATTR_NAME_POSIX_ACL_DEFAULT, len);
> + ceph_databuf_encode_string(dbuf,
> + XATTR_NAME_POSIX_ACL_DEFAULT, len);
> err = posix_acl_to_xattr(&init_user_ns, default_acl,
> tmp_buf, val_size2);
> if (err < 0)
> goto out_err;
> - ceph_pagelist_encode_32(pagelist, val_size2);
> - ceph_pagelist_append(pagelist, tmp_buf, val_size2);
> + ceph_databuf_encode_32(dbuf, val_size2);
> + ceph_databuf_append(dbuf, tmp_buf, val_size2);
> }
>
> kfree(tmp_buf);
>
> as_ctx->acl = acl;
> as_ctx->default_acl = default_acl;
> - as_ctx->pagelist = pagelist;
> + as_ctx->dbuf = dbuf;
> return 0;
>
> out_err:
> posix_acl_release(acl);
> posix_acl_release(default_acl);
> kfree(tmp_buf);
> - if (pagelist)
> - ceph_pagelist_release(pagelist);
> + ceph_databuf_release(dbuf);
> return err;
> }
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 851d70200c6b..9de2960748b9 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -679,9 +679,9 @@ static int ceph_finish_async_create(struct inode *dir, struct inode *inode,
> iinfo.change_attr = 1;
> ceph_encode_timespec64(&iinfo.btime, &now);
>
> - if (req->r_pagelist) {
> - iinfo.xattr_len = req->r_pagelist->length;
> - iinfo.xattr_data = req->r_pagelist->mapped_tail;
> + if (req->r_dbuf) {
> + iinfo.xattr_len = ceph_databuf_len(req->r_dbuf);
> + iinfo.xattr_data = kmap_ceph_databuf_page(req-
> >r_dbuf, 0);
Possibly, it's in another patch. Have we removed req->r_pagelist from
the structure?
Do we always have memory pages in ceph_databuf? How will
kmap_ceph_databuf_page() behave if it's not a memory page?
> } else {
> /* fake it */
> iinfo.xattr_len = ARRAY_SIZE(xattr_buf);
> @@ -731,6 +731,8 @@ static int ceph_finish_async_create(struct inode
> *dir, struct inode *inode,
> ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req-
> >r_session,
> req->r_fmode, NULL);
> up_read(&mdsc->snap_rwsem);
> + if (req->r_dbuf)
> + kunmap_local(iinfo.xattr_data);
Maybe we need to hide kunmap_local() inside something like
kunmap_ceph_databuf_page()?
> if (ret) {
> doutc(cl, "failed to fill inode: %d\n", ret);
> ceph_dir_clear_complete(dir);
> @@ -849,8 +851,8 @@ int ceph_atomic_open(struct inode *dir, struct
> dentry *dentry,
> goto out_ctx;
> }
> /* Async create can't handle more than a page of
> xattrs */
> - if (as_ctx.pagelist &&
> - !list_is_singular(&as_ctx.pagelist->head))
> + if (as_ctx.dbuf &&
> + as_ctx.dbuf->nr_bvec > 1)
Maybe it makes sense to call something like ceph_databuf_length()
instead of low-level access to dbuf->nr_bvec?
> try_async = false;
> } else if (!d_in_lookup(dentry)) {
> /* If it's not being looked up, it's negative */
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index b060f765ad20..ec9b80fec7be 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -112,9 +112,9 @@ struct inode *ceph_new_inode(struct inode *dir,
> struct dentry *dentry,
> void ceph_as_ctx_to_req(struct ceph_mds_request *req,
> struct ceph_acl_sec_ctx *as_ctx)
> {
> - if (as_ctx->pagelist) {
> - req->r_pagelist = as_ctx->pagelist;
> - as_ctx->pagelist = NULL;
> + if (as_ctx->dbuf) {
> + req->r_dbuf = as_ctx->dbuf;
> + as_ctx->dbuf = NULL;
Maybe we need something like a swap() method? :)
> }
> ceph_fscrypt_as_ctx_to_req(req, as_ctx);
> }
> @@ -2341,11 +2341,10 @@ static int fill_fscrypt_truncate(struct inode
> *inode,
> loff_t pos, orig_pos = round_down(attr->ia_size,
> CEPH_FSCRYPT_BLOCK_SIZE);
> u64 block = orig_pos >> CEPH_FSCRYPT_BLOCK_SHIFT;
> - struct ceph_pagelist *pagelist = NULL;
> - struct kvec iov = {0};
> + struct ceph_databuf *dbuf = NULL;
> struct iov_iter iter;
> - struct page *page = NULL;
> - struct ceph_fscrypt_truncate_size_header header;
> + struct ceph_fscrypt_truncate_size_header *header;
> + void *p;
> int retry_op = 0;
> int len = CEPH_FSCRYPT_BLOCK_SIZE;
> loff_t i_size = i_size_read(inode);
> @@ -2372,37 +2371,35 @@ static int fill_fscrypt_truncate(struct inode
> *inode,
> goto out;
> }
>
> - page = __page_cache_alloc(GFP_KERNEL);
> - if (page == NULL) {
> - ret = -ENOMEM;
> + ret = -ENOMEM;
> + dbuf = ceph_databuf_req_alloc(2, 0, GFP_KERNEL);
So, do we allocate 2 items of zero length here?
> + if (!dbuf)
> goto out;
> - }
>
> - pagelist = ceph_pagelist_alloc(GFP_KERNEL);
> - if (!pagelist) {
> - ret = -ENOMEM;
> + if (ceph_databuf_insert_frag(dbuf, 0, sizeof(*header),
> GFP_KERNEL) < 0)
> + goto out;
> + if (ceph_databuf_insert_frag(dbuf, 1, PAGE_SIZE, GFP_KERNEL)
> < 0)
> goto out;
> - }
>
> - iov.iov_base = kmap_local_page(page);
> - iov.iov_len = len;
> - iov_iter_kvec(&iter, READ, &iov, 1, len);
> + iov_iter_bvec(&iter, ITER_DEST, &dbuf->bvec[1], 1, len);
Is &dbuf->bvec[1] correct? Why do we work with item #1? I think it
looks confusing.
>
> pos = orig_pos;
> ret = __ceph_sync_read(inode, &pos, &iter, &retry_op,
> &objver);
> if (ret < 0)
> goto out;
>
> + header = kmap_ceph_databuf_page(dbuf, 0);
> +
> /* Insert the header first */
> - header.ver = 1;
> - header.compat = 1;
> - header.change_attr =
> cpu_to_le64(inode_peek_iversion_raw(inode));
> + header->ver = 1;
> + header->compat = 1;
> + header->change_attr =
> cpu_to_le64(inode_peek_iversion_raw(inode));
>
> /*
> * Always set the block_size to CEPH_FSCRYPT_BLOCK_SIZE,
> * because in MDS it may need this to do the truncate.
> */
> - header.block_size = cpu_to_le32(CEPH_FSCRYPT_BLOCK_SIZE);
> + header->block_size = cpu_to_le32(CEPH_FSCRYPT_BLOCK_SIZE);
>
> /*
> * If we hit a hole here, we should just skip filling
> @@ -2417,51 +2414,41 @@ static int fill_fscrypt_truncate(struct inode
> *inode,
> if (!objver) {
> doutc(cl, "hit hole, ppos %lld < size %lld\n", pos,
> i_size);
>
> - header.data_len = cpu_to_le32(8 + 8 + 4);
> - header.file_offset = 0;
> + header->data_len = cpu_to_le32(8 + 8 + 4);
I have the same comprehension problem here. What does this hardcoded
8 + 8 + 4 value mean? :)
> + header->file_offset = 0;
> ret = 0;
> } else {
> - header.data_len = cpu_to_le32(8 + 8 + 4 +
> CEPH_FSCRYPT_BLOCK_SIZE);
> - header.file_offset = cpu_to_le64(orig_pos);
> + header->data_len = cpu_to_le32(8 + 8 + 4 +
> CEPH_FSCRYPT_BLOCK_SIZE);
Ditto.
> + header->file_offset = cpu_to_le64(orig_pos);
>
> doutc(cl, "encrypt block boff/bsize %d/%lu\n", boff,
> CEPH_FSCRYPT_BLOCK_SIZE);
>
> /* truncate and zero out the extra contents for the
> last block */
> - memset(iov.iov_base + boff, 0, PAGE_SIZE - boff);
> + p = kmap_ceph_databuf_page(dbuf, 1);
Maybe we need to introduce some constants to address pages #0 and #1?
Because #0 is the header and I assume #1 is some content.
> + memset(p + boff, 0, PAGE_SIZE - boff);
> + kunmap_local(p);
>
> /* encrypt the last block */
> - ret = ceph_fscrypt_encrypt_block_inplace(inode,
> page,
> -
> CEPH_FSCRYPT_BLOCK_SIZE,
> - 0, block,
> - GFP_KERNEL);
> + ret = ceph_fscrypt_encrypt_block_inplace(
> + inode, ceph_databuf_page(dbuf, 1),
> + CEPH_FSCRYPT_BLOCK_SIZE, 0, block,
> GFP_KERNEL);
> if (ret)
> goto out;
> }
>
> - /* Insert the header */
> - ret = ceph_pagelist_append(pagelist, &header,
> sizeof(header));
> - if (ret)
> - goto out;
> + ceph_databuf_added_data(dbuf, sizeof(*header));
> + if (header->block_size)
> + ceph_databuf_added_data(dbuf,
> CEPH_FSCRYPT_BLOCK_SIZE);
>
> - if (header.block_size) {
> - /* Append the last block contents to pagelist */
> - ret = ceph_pagelist_append(pagelist, iov.iov_base,
> - CEPH_FSCRYPT_BLOCK_SIZE);
> - if (ret)
> - goto out;
> - }
> - req->r_pagelist = pagelist;
> + req->r_dbuf = dbuf;
> out:
> doutc(cl, "%p %llx.%llx size dropping cap refs on %s\n",
> inode,
> ceph_vinop(inode), ceph_cap_string(got));
> ceph_put_cap_refs(ci, got);
> - if (iov.iov_base)
> - kunmap_local(iov.iov_base);
> - if (page)
> - __free_pages(page, 0);
> - if (ret && pagelist)
> - ceph_pagelist_release(pagelist);
> + kunmap_local(header);
> + if (ret)
> + ceph_databuf_release(dbuf);
> return ret;
> }
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 230e0c3f341f..09661a34f287 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -1125,8 +1125,7 @@ void ceph_mdsc_release_request(struct kref
> *kref)
> put_cred(req->r_cred);
> if (req->r_mnt_idmap)
> mnt_idmap_put(req->r_mnt_idmap);
> - if (req->r_pagelist)
> - ceph_pagelist_release(req->r_pagelist);
> + ceph_databuf_release(req->r_dbuf);
> kfree(req->r_fscrypt_auth);
> kfree(req->r_altname);
> put_request_session(req);
> @@ -3207,10 +3206,10 @@ static struct ceph_msg
> *create_request_message(struct ceph_mds_session *session,
> msg->front.iov_len = p - msg->front.iov_base;
> msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);
>
> - if (req->r_pagelist) {
> - struct ceph_pagelist *pagelist = req->r_pagelist;
> - ceph_msg_data_add_pagelist(msg, pagelist);
> - msg->hdr.data_len = cpu_to_le32(pagelist->length);
> + if (req->r_dbuf) {
> + struct ceph_databuf *dbuf = req->r_dbuf;
> + ceph_msg_data_add_databuf(msg, dbuf);
> + msg->hdr.data_len =
> cpu_to_le32(ceph_databuf_len(dbuf));
> } else {
> msg->hdr.data_len = 0;
> }
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index 3e2a6fa7c19a..a7ee8da07ce7 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -333,7 +333,7 @@ struct ceph_mds_request {
> u32 r_direct_hash; /* choose dir frag based on this
> dentry hash */
>
> /* data payload is used for xattr ops */
> - struct ceph_pagelist *r_pagelist;
> + struct ceph_databuf *r_dbuf;
>
> /* what caps shall we drop? */
> int r_inode_drop, r_inode_unless;
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index bb0db0cc8003..984a6d2a5378 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -1137,7 +1137,7 @@ struct ceph_acl_sec_ctx {
> #ifdef CONFIG_FS_ENCRYPTION
> struct ceph_fscrypt_auth *fscrypt_auth;
> #endif
> - struct ceph_pagelist *pagelist;
> + struct ceph_databuf *dbuf;
> };
>
> #ifdef CONFIG_SECURITY
> diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
> index 537165db4519..b083cd3b3974 100644
> --- a/fs/ceph/xattr.c
> +++ b/fs/ceph/xattr.c
> @@ -1114,17 +1114,17 @@ static int ceph_sync_setxattr(struct inode
> *inode, const char *name,
> struct ceph_mds_request *req;
> struct ceph_mds_client *mdsc = fsc->mdsc;
> struct ceph_osd_client *osdc = &fsc->client->osdc;
> - struct ceph_pagelist *pagelist = NULL;
> + struct ceph_databuf *dbuf = NULL;
> int op = CEPH_MDS_OP_SETXATTR;
> int err;
>
> if (size > 0) {
> - /* copy value into pagelist */
> - pagelist = ceph_pagelist_alloc(GFP_NOFS);
> - if (!pagelist)
> + /* copy value into dbuf */
> + dbuf = ceph_databuf_req_alloc(1, size, GFP_NOFS);
> + if (!dbuf)
> return -ENOMEM;
>
> - err = ceph_pagelist_append(pagelist, value, size);
> + err = ceph_databuf_append(dbuf, value, size);
> if (err)
> goto out;
> } else if (!value) {
> @@ -1154,8 +1154,8 @@ static int ceph_sync_setxattr(struct inode
> *inode, const char *name,
> req->r_args.setxattr.flags = cpu_to_le32(flags);
> req->r_args.setxattr.osdmap_epoch =
> cpu_to_le32(osdc->osdmap->epoch);
> - req->r_pagelist = pagelist;
> - pagelist = NULL;
> + req->r_dbuf = dbuf;
> + dbuf = NULL;
> }
>
> req->r_inode = inode;
> @@ -1169,8 +1169,7 @@ static int ceph_sync_setxattr(struct inode
> *inode, const char *name,
> doutc(cl, "xattr.ver (after): %lld\n", ci-
> >i_xattrs.version);
>
> out:
> - if (pagelist)
> - ceph_pagelist_release(pagelist);
> + ceph_databuf_release(dbuf);
> return err;
> }
>
> @@ -1377,7 +1376,7 @@ bool ceph_security_xattr_deadlock(struct inode
> *in)
> int ceph_security_init_secctx(struct dentry *dentry, umode_t mode,
> struct ceph_acl_sec_ctx *as_ctx)
> {
> - struct ceph_pagelist *pagelist = as_ctx->pagelist;
> + struct ceph_databuf *dbuf = as_ctx->dbuf;
> const char *name;
> size_t name_len;
> int err;
> @@ -1391,14 +1390,11 @@ int ceph_security_init_secctx(struct dentry
> *dentry, umode_t mode,
> }
>
> err = -ENOMEM;
> - if (!pagelist) {
> - pagelist = ceph_pagelist_alloc(GFP_KERNEL);
> - if (!pagelist)
> + if (!dbuf) {
> + dbuf = ceph_databuf_req_alloc(0, PAGE_SIZE,
> GFP_KERNEL);
> + if (!dbuf)
> goto out;
> - err = ceph_pagelist_reserve(pagelist, PAGE_SIZE);
> - if (err)
> - goto out;
> - ceph_pagelist_encode_32(pagelist, 1);
> + ceph_databuf_encode_32(dbuf, 1);
> }
>
> /*
> @@ -1407,38 +1403,31 @@ int ceph_security_init_secctx(struct dentry
> *dentry, umode_t mode,
> * dentry_init_security hook.
> */
> name_len = strlen(name);
> - err = ceph_pagelist_reserve(pagelist,
> - 4 * 2 + name_len + as_ctx-
> >lsmctx.len);
> + err = ceph_databuf_reserve(dbuf, 4 * 2 + name_len + as_ctx-
> >lsmctx.len,
> + GFP_KERNEL);
The 4 * 2 + name_len + as_ctx->lsmctx.len looks unclear to me. It will
be good to have some well-defined constants here.
> if (err)
> goto out;
>
> - if (as_ctx->pagelist) {
> + if (as_ctx->dbuf) {
> /* update count of KV pairs */
> - BUG_ON(pagelist->length <= sizeof(__le32));
> - if (list_is_singular(&pagelist->head)) {
> - le32_add_cpu((__le32*)pagelist->mapped_tail,
> 1);
> - } else {
> - struct page *page =
> list_first_entry(&pagelist->head,
> - struct
> page, lru);
> - void *addr = kmap_atomic(page);
> - le32_add_cpu((__le32*)addr, 1);
> - kunmap_atomic(addr);
> - }
> + BUG_ON(ceph_databuf_len(dbuf) <= sizeof(__le32));
> + __le32 *addr = kmap_ceph_databuf_page(dbuf, 0);
> + le32_add_cpu(addr, 1);
> + kunmap_local(addr);
> } else {
> - as_ctx->pagelist = pagelist;
> + as_ctx->dbuf = dbuf;
> }
>
> - ceph_pagelist_encode_32(pagelist, name_len);
> - ceph_pagelist_append(pagelist, name, name_len);
> + ceph_databuf_encode_32(dbuf, name_len);
> + ceph_databuf_append(dbuf, name, name_len);
>
> - ceph_pagelist_encode_32(pagelist, as_ctx->lsmctx.len);
> - ceph_pagelist_append(pagelist, as_ctx->lsmctx.context,
> - as_ctx->lsmctx.len);
> + ceph_databuf_encode_32(dbuf, as_ctx->lsmctx.len);
> + ceph_databuf_append(dbuf, as_ctx->lsmctx.context, as_ctx-
> >lsmctx.len);
>
> err = 0;
> out:
> - if (pagelist && !as_ctx->pagelist)
> - ceph_pagelist_release(pagelist);
> + if (dbuf && !as_ctx->dbuf)
> + ceph_databuf_release(dbuf);
> return err;
> }
> #endif /* CONFIG_CEPH_FS_SECURITY_LABEL */
> @@ -1456,8 +1445,7 @@ void ceph_release_acl_sec_ctx(struct
> ceph_acl_sec_ctx *as_ctx)
> #ifdef CONFIG_FS_ENCRYPTION
> kfree(as_ctx->fscrypt_auth);
> #endif
> - if (as_ctx->pagelist)
> - ceph_pagelist_release(as_ctx->pagelist);
> + ceph_databuf_release(as_ctx->dbuf);
> }
>
> /*
>
Thanks,
Slava.
* Re: [RFC PATCH 03/35] libceph: Add a new data container type, ceph_databuf
2025-03-13 23:32 ` [RFC PATCH 03/35] libceph: Add a new data container type, ceph_databuf David Howells
2025-03-14 20:06 ` Viacheslav Dubeyko
@ 2025-03-17 11:27 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-17 11:27 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: dhowells, Alex Markuze, slava@dubeyko.com,
linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > +struct ceph_databuf {
> > + struct bio_vec *bvec; /* List of pages */
>
> So, maybe we need to think about folios now?
Yeah, I know... but struct bio_vec has a page pointer and may point to
non-folio type pages. This stuff is still undergoing evolution as Willy works
on reducing struct page.
What I'm pondering is changing struct folio_queue to take a list of { folio,
offset, len } rather than using a folio_batch with a simple list of folios.
It doesn't necessarily help with DIO, though, but there we're given an
iterator we're required to use.
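Something like this, perhaps (a sketch of the idea only; the struct and
field names are made up):

	struct folio_queue_entry {
		struct folio	*folio;		/* the folio itself */
		unsigned int	offset;		/* start of data within it */
		unsigned int	len;		/* amount of data used */
	};

Each segment would then carry its own offset and length rather than those
being implicit in the folio_batch.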
One of the things I'd like to look at for ceph as well is using the page frag
allocator[*] to get small pieces of memory for stashing protocol data in
rather than allocating full-page buffers.
[*] Memory allocated from the page frag allocator can be used with
MSG_SPLICE_PAGES as its lifetime is controlled by the refcount. Now, we could
probably have a page frag allocator that uses folios rather than non-folio
pages for network filesystem use. That could be of use to afs and cifs also.
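By way of illustration only, a minimal sketch of what I have in mind (the
cache placement is an assumption; page_frag_alloc() and page_frag_free()
are the existing mm API):

	/* A cache from which small protocol blobs get carved. */
	static struct page_frag_cache ceph_proto_frag_cache;

	void *p = page_frag_alloc(&ceph_proto_frag_cache, hdr_len, GFP_NOIO);
	if (!p)
		return -ENOMEM;
	memcpy(p, hdr, hdr_len);
	/* ...splice into the socket with MSG_SPLICE_PAGES... */
	page_frag_free(p);	/* drop our ref; the skb holds its own */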
As I mentioned in a previous reply, how to better integrate folioq/bvec is
hopefully up for discussion at LSF/MM next week.
> > +static inline void ceph_databuf_append_page(struct ceph_databuf *dbuf,
> > + struct page *page,
> > + unsigned int offset,
> > + unsigned int len)
> > +{
> > + BUG_ON(dbuf->nr_bvec >= dbuf->max_bvec);
> > + bvec_set_page(&dbuf->bvec[dbuf->nr_bvec++], page, len, offset);
> > + dbuf->iter.count += len;
> > + dbuf->iter.nr_segs++;
>
> Why do we assign len to dbuf->iter.count but only increment
> dbuf->iter.nr_segs?
Um, because it doesn't? It adds len to dbuf->iter.count.
> > enum ceph_msg_data_type {
> > CEPH_MSG_DATA_NONE, /* message contains no data payload */
> > + CEPH_MSG_DATA_DATABUF, /* data source/destination is a data buffer */
> > CEPH_MSG_DATA_PAGES, /* data source/destination is a page array */
> > CEPH_MSG_DATA_PAGELIST, /* data source/destination is a pagelist */
>
> So, the final replacement on databuf will be in the future?
The result of each patch has to compile and work, right? But yes, various of
the patches in this series reduce the use of those other data types. I have
patches in progress to finally remove PAGES and PAGELIST, but they're not
quite compiling yet.
> > + dbuf = kzalloc(sizeof(*dbuf), gfp);
> > + if (!dbuf)
> > + return NULL;
>
> I am guessing... Should we return error code here?
The only error this function can return is ENOMEM, so it just returns NULL
like many other alloc functions.
> > + } else if (min_bvec) {
> > + min_bvec = umax(min_bvec, 16);
>
> Why 16 here? Maybe, do we need to introduce some well explained constant?
Fair point.
> > + dbuf->max_bvec = min_bvec;
>
> Why do we assign min_bvec to max_bvec? I am simply slightly confused why
> argument of function is named as min_bvec, but finally we are saving min_bvec
> value into max_bvec.
The 'min_bvec' argument is the minimum number of bvecs that the caller needs
to be allocated. This may get rounded up to include all of the piece of
memory we're going to be given by the slab.
'dbuf->max_bvec' is the maximum number of entries that can be used in
dbuf->bvec[] and is a property of the databuf object.
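So the allocation is, schematically (a sketch only, assuming a kmalloc'd
bvec[]; kmalloc_size_roundup() is the existing slab helper):

	size_t size = kmalloc_size_roundup(umax(min_bvec, 16) *
					   sizeof(struct bio_vec));

	dbuf->bvec = kzalloc(size, gfp);
	if (!dbuf->bvec)
		return NULL;
	/* Use all the space the slab gave us, so max_bvec >= min_bvec. */
	dbuf->max_bvec = size / sizeof(struct bio_vec);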
> > +struct ceph_databuf *ceph_databuf_get(struct ceph_databuf *dbuf)
>
> I see the point here. But do we really need to return pointer? Why not simply:
>
> void ceph_databuf_get(struct ceph_databuf *dbuf)
It means I can do:
foo->databuf = ceph_databuf_get(dbuf);
rather than:
ceph_databuf_get(dbuf);
foo->databuf = dbuf;
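(The getter itself is presumably just the usual kernel pattern - a sketch,
with the refcount field name assumed:

	struct ceph_databuf *ceph_databuf_get(struct ceph_databuf *dbuf)
	{
		refcount_inc(&dbuf->refcnt);
		return dbuf;
	}

- the same shape as skb_get() and friends.)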
> > +static int ceph_databuf_expand(struct ceph_databuf *dbuf, size_t req_bvec,
> > + gfp_t gfp)
> > +{
> > + struct bio_vec *bvec = dbuf->bvec, *old = bvec;
>
> I think that assigning (*old = bvec) looks confusing if we keep it on the same
> line as bvec declaration and initialization. Why do not declare and not
> initialize it on the next line?
>
> > + size_t size, max_bvec, off = dbuf->iter.bvec - old;
>
> I think it's too much declarations on the same line. Why not:
>
> size_t size, max_bvec;
> size_t off = dbuf->iter.bvec - old;
A matter of personal preference, I guess.
> > + bvec = dbuf->bvec;
> > + while (dbuf->nr_bvec < req_bvec) {
> > + struct page *pages[16];
>
> Why do we hardcoded 16 here but using some well defined constant?
Because this is only about stack usage. alloc_pages_bulk() gets a straight
array of page*; we have a bvec[], so we need an intermediate. Now, I could
actually just overlay the array over the tail of the bvec[] and do a single
bulk allocation since sizeof(struct page *) < sizeof(struct bio_vec).
> And, again, why not folio?
I don't think there's a bulk folio allocator. Quite possibly there *should*
be so that readahead can use it - one that allocates different sizes of folios
to fill the space required.
> > + size_t want = min(req_bvec, ARRAY_SIZE(pages)), got;
> > +
> > + memset(pages, 0, sizeof(pages));
> > + got = alloc_pages_bulk(gfp, want, pages);
> > + if (!got)
> > + return -ENOMEM;
> > + for (i = 0; i < got; i++)
>
> Why do we use size_t for i and got? Why not int, for example?
alloc_pages_bulk() doesn't return an int. Now, one could legitimately argue
that I should use "unsigned long" rather than "size_t", but I wouldn't use int
here. int is smaller and signed. Granted, it's unlikely we'll be asked for >2G
pages, but if we're going to assign it down to an int, it probably needs to be
checked first.
> > + bvec_set_page(&bvec[dbuf->nr_bvec + i], pages[i],
> > + PAGE_SIZE, 0);
> > + dbuf->iter.nr_segs += got;
> > + dbuf->nr_bvec += got;
>
> If I understood correctly, the ceph_databuf_append_page() uses slightly
> different logic.
Can you elaborate?
> + dbuf->iter.count += len;
> + dbuf->iter.nr_segs++;
>
> But here we assign number of allocated pages to nr_segs. It is slightly
> confusing. I think I am missing something here.
Um - it's an increment?
I think part of the problem might be that we're mixing two things within the
same container: Partial pages that get kmapped and accessed directly
(e.g. protocol bits) and pages that get accessed indirectly (e.g. data
buffers). Maybe this needs to be made more explicit in the API.
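One possible shape for making the split explicit (the names are entirely
made up, just to illustrate):

	/* Protocol bits: small fragments, kmapped and accessed directly. */
	void *ceph_databuf_proto_ptr(struct ceph_databuf *dbuf, unsigned int ix);

	/* Bulk data: only ever reached through the iov_iter. */
	int ceph_databuf_append_data_page(struct ceph_databuf *dbuf,
					  struct page *page,
					  unsigned int offset, unsigned int len);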
David
* Re: [RFC PATCH 04/35] ceph: Convert ceph_mds_request::r_pagelist to a databuf
2025-03-13 23:32 ` [RFC PATCH 04/35] ceph: Convert ceph_mds_request::r_pagelist to a databuf David Howells
2025-03-14 22:27 ` slava
@ 2025-03-17 11:52 ` David Howells
2025-03-20 20:34 ` Viacheslav Dubeyko
2025-03-20 22:01 ` David Howells
1 sibling, 2 replies; 72+ messages in thread
From: David Howells @ 2025-03-17 11:52 UTC (permalink / raw)
To: slava
Cc: dhowells, Alex Markuze, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
ceph-devel, linux-fsdevel, linux-block, linux-kernel,
Slava.Dubeyko
slava@dubeyko.com wrote:
> > - err = ceph_pagelist_reserve(pagelist, len +
> > val_size1 + 8);
> > + err = ceph_databuf_reserve(dbuf, len + val_size1 +
> > 8,
> > + GFP_KERNEL);
>
> I know that it's a simple change. But this len + val_size1 + 8 looks
> confusing anyway. What does this hardcoded 8 mean? :)
You tell me. The '8' is pre-existing.
> > - if (req->r_pagelist) {
> > - iinfo.xattr_len = req->r_pagelist->length;
> > - iinfo.xattr_data = req->r_pagelist->mapped_tail;
> > + if (req->r_dbuf) {
> > + iinfo.xattr_len = ceph_databuf_len(req->r_dbuf);
> > + iinfo.xattr_data = kmap_ceph_databuf_page(req-
> > >r_dbuf, 0);
>
> Possibly, it's in another patch. Have we removed req->r_pagelist from
> the structure?
See patch 20 "libceph: Remove ceph_pagelist".
It cannot be removed here as the kernel must still compile and work at this
point.
> Do we always have memory pages in ceph_databuf? How will
> kmap_ceph_databuf_page() behave if it's not a memory page?
Are there other sorts of pages?
> Maybe we need to hide kunmap_local() inside something like
> kunmap_ceph_databuf_page()?
Actually, probably better to rename kmap_ceph_databuf_page() to
kmap_local_ceph_databuf().
> Maybe it makes sense to call something like ceph_databuf_length()
> instead of low-level access to dbuf->nr_bvec?
Sounds reasonable. Better to hide the internal workings.
> > + if (as_ctx->dbuf) {
> > + req->r_dbuf = as_ctx->dbuf;
> > + as_ctx->dbuf = NULL;
>
> Maybe we need something like a swap() method? :)
I could point out that you were complaining about ceph_databuf_get()
returning a pointer rather than void. ;-)
> > + dbuf = ceph_databuf_req_alloc(2, 0, GFP_KERNEL);
>
> So, do we allocate 2 items of zero length here?
You don't. One is the bvec[] count (2) and the other is the amount of
memory to preallocate (0) and attach to that bvec[].
Now, it may make sense to split the API calls to handle a number of different
scenarios, e.g.: request with just protocol, no pages; request with just
pages; request with both protocol bits and page list.
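E.g. (purely illustrative names):

	dbuf = ceph_databuf_proto_alloc(PAGE_SIZE, GFP_KERNEL);	/* protocol only */
	dbuf = ceph_databuf_data_alloc(npages, GFP_KERNEL);	/* page list only */
	dbuf = ceph_databuf_mixed_alloc(PAGE_SIZE, npages, GFP_KERNEL); /* both */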
> > + if (ceph_databuf_insert_frag(dbuf, 0, sizeof(*header),
> > GFP_KERNEL) < 0)
> > + goto out;
> > + if (ceph_databuf_insert_frag(dbuf, 1, PAGE_SIZE, GFP_KERNEL)
> > < 0)
> > goto out;
> >
> > + iov_iter_bvec(&iter, ITER_DEST, &dbuf->bvec[1], 1, len);
>
> Is &dbuf->bvec[1] correct? Why do we work with item #1? I think it
> looks confusing.
Because you have a protocol element (in dbuf->bvec[0]) and a buffer (in
dbuf->bvec[1]).
An iterator is attached to the buffer and the iterator then conveys it to
__ceph_sync_read() as the destination.
If you look a few lines further on in the patch, you can see the first
fragment being accessed:
> + header = kmap_ceph_databuf_page(dbuf, 0);
> +
Note that, because the read buffer is very likely a whole page, I split them
into separate sections rather than trying to allocate an order-1 page as that
would be more likely to fail.
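Putting that together, the buffer is built like this (using only the calls
that appear in the patch; error handling elided):

	dbuf = ceph_databuf_req_alloc(2, 0, GFP_KERNEL);	/* 2 slots, no prealloc */
	ceph_databuf_insert_frag(dbuf, 0, sizeof(*header), GFP_KERNEL); /* protocol */
	ceph_databuf_insert_frag(dbuf, 1, PAGE_SIZE, GFP_KERNEL);	/* read buffer */

	/* The read lands in slot 1 only... */
	iov_iter_bvec(&iter, ITER_DEST, &dbuf->bvec[1], 1, len);

	/* ...whilst slot 0 is filled by direct access. */
	header = kmap_ceph_databuf_page(dbuf, 0);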
> > - header.data_len = cpu_to_le32(8 + 8 + 4);
> > - header.file_offset = 0;
> > + header->data_len = cpu_to_le32(8 + 8 + 4);
>
> I have the same comprehension problem here. What does this hardcoded
> 8 + 8 + 4 value mean? :)
You need to ask a ceph expert. This is nothing specifically to do with my
changes. However, I suspect it's the size of the message element.
> > - memset(iov.iov_base + boff, 0, PAGE_SIZE - boff);
> > + p = kmap_ceph_databuf_page(dbuf, 1);
>
> Maybe we need to introduce some constants to address pages #0 and #1?
> Because #0 is the header and I assume #1 is some content.
Whilst that might be useful, I don't know that the 0 and 1 being header
and content respectively always holds. I haven't checked, but there could
even be a protocol trailer in some cases as well.
> > - err = ceph_pagelist_reserve(pagelist,
> > - 4 * 2 + name_len + as_ctx-
> > >lsmctx.len);
> > + err = ceph_databuf_reserve(dbuf, 4 * 2 + name_len + as_ctx-
> > >lsmctx.len,
> > + GFP_KERNEL);
>
> The 4 * 2 + name_len + as_ctx->lsmctx.len looks unclear to me. It will
> be good to have some well-defined constants here.
Again, nothing specifically to do with my changes.
David
* Re: [RFC PATCH 06/35] rbd: Use ceph_databuf for rbd_obj_read_sync()
2025-03-13 23:32 ` [RFC PATCH 06/35] rbd: Use ceph_databuf for rbd_obj_read_sync() David Howells
@ 2025-03-17 19:08 ` Viacheslav Dubeyko
2025-04-11 13:48 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-17 19:08 UTC (permalink / raw)
To: Alex Markuze, slava@dubeyko.com, David Howells
Cc: linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
On Thu, 2025-03-13 at 23:32 +0000, David Howells wrote:
> Make rbd_obj_read_sync() allocate and use a ceph_databuf object to convey
> the data into the operation. This has some space preallocated and this is
> allocated by alloc_pages() and accessed with kmap_local rather than being
> kmalloc'd. This allows MSG_SPLICE_PAGES to be used.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Viacheslav Dubeyko <slava@dubeyko.com>
> cc: Alex Markuze <amarkuze@redhat.com>
> cc: Ilya Dryomov <idryomov@gmail.com>
> cc: ceph-devel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> ---
> drivers/block/rbd.c | 45 ++++++++++++++++++++-------------------------
> 1 file changed, 20 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index faafd7ff43d6..bb953634c7cb 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -4822,13 +4822,10 @@ static void rbd_free_disk(struct rbd_device *rbd_dev)
> static int rbd_obj_read_sync(struct rbd_device *rbd_dev,
> struct ceph_object_id *oid,
> struct ceph_object_locator *oloc,
> - void *buf, int buf_len)
> -
> + struct ceph_databuf *dbuf, int len)
> {
> struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
> struct ceph_osd_request *req;
> - struct page **pages;
> - int num_pages = calc_pages_for(0, buf_len);
> int ret;
>
> req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_KERNEL);
> @@ -4839,15 +4836,8 @@ static int rbd_obj_read_sync(struct rbd_device *rbd_dev,
> ceph_oloc_copy(&req->r_base_oloc, oloc);
> req->r_flags = CEPH_OSD_FLAG_READ;
>
> - pages = ceph_alloc_page_vector(num_pages, GFP_KERNEL);
> - if (IS_ERR(pages)) {
> - ret = PTR_ERR(pages);
> - goto out_req;
> - }
> -
> - osd_req_op_extent_init(req, 0, CEPH_OSD_OP_READ, 0, buf_len, 0, 0);
> - osd_req_op_extent_osd_data_pages(req, 0, pages, buf_len, 0, false,
> - true);
> + osd_req_op_extent_init(req, 0, CEPH_OSD_OP_READ, 0, len, 0, 0);
> + osd_req_op_extent_osd_databuf(req, 0, dbuf);
>
> ret = ceph_osdc_alloc_messages(req, GFP_KERNEL);
> if (ret)
> @@ -4855,9 +4845,6 @@ static int rbd_obj_read_sync(struct rbd_device *rbd_dev,
>
> ceph_osdc_start_request(osdc, req);
> ret = ceph_osdc_wait_request(osdc, req);
> - if (ret >= 0)
> - ceph_copy_from_page_vector(pages, buf, 0, ret);
> -
> out_req:
> ceph_osdc_put_request(req);
> return ret;
> @@ -4872,12 +4859,18 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev,
> struct rbd_image_header *header,
> bool first_time)
> {
> - struct rbd_image_header_ondisk *ondisk = NULL;
> + struct rbd_image_header_ondisk *ondisk;
> + struct ceph_databuf *dbuf = NULL;
> u32 snap_count = 0;
> u64 names_size = 0;
> u32 want_count;
> int ret;
>
> + dbuf = ceph_databuf_req_alloc(1, sizeof(*ondisk), GFP_KERNEL);
I am slightly worried about this use of the ondisk variable. At this point
the ondisk pointer still holds garbage, and the apparent dereference in
sizeof(*ondisk) could look confusing. Also, compiler and static analysis
tools could potentially complain. I don't see an actual problem here, but
it still worries me. :)
Thanks,
Slava.
> + if (!dbuf)
> + return -ENOMEM;
> + ondisk = kmap_ceph_databuf_page(dbuf, 0);
> +
> /*
> * The complete header will include an array of its 64-bit
> * snapshot ids, followed by the names of those snapshots as
> @@ -4888,17 +4881,18 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev,
> do {
> size_t size;
>
> - kfree(ondisk);
> -
> size = sizeof (*ondisk);
> size += snap_count * sizeof (struct rbd_image_snap_ondisk);
> size += names_size;
> - ondisk = kmalloc(size, GFP_KERNEL);
> - if (!ondisk)
> - return -ENOMEM;
> +
> + ret = -ENOMEM;
> + if (size > dbuf->limit &&
> + ceph_databuf_reserve(dbuf, size - dbuf->limit,
> + GFP_KERNEL) < 0)
> + goto out;
>
> ret = rbd_obj_read_sync(rbd_dev, &rbd_dev->header_oid,
> - &rbd_dev->header_oloc, ondisk, size);
> + &rbd_dev->header_oloc, dbuf, size);
> if (ret < 0)
> goto out;
> if ((size_t)ret < size) {
> @@ -4907,6 +4901,7 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev,
> size, ret);
> goto out;
> }
> +
> if (!rbd_dev_ondisk_valid(ondisk)) {
> ret = -ENXIO;
> rbd_warn(rbd_dev, "invalid header");
> @@ -4920,8 +4915,8 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev,
>
> ret = rbd_header_from_disk(header, ondisk, first_time);
> out:
> - kfree(ondisk);
> -
> + kunmap_local(ondisk);
> + ceph_databuf_release(dbuf);
> return ret;
> }
>
>
>
* Re: [RFC PATCH 07/35] libceph: Change ceph_osdc_call()'s reply to a ceph_databuf
2025-03-13 23:32 ` [RFC PATCH 07/35] libceph: Change ceph_osdc_call()'s reply to a ceph_databuf David Howells
@ 2025-03-17 19:41 ` Viacheslav Dubeyko
2025-03-17 22:12 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-17 19:41 UTC (permalink / raw)
To: Alex Markuze, slava@dubeyko.com, David Howells
Cc: linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
On Thu, 2025-03-13 at 23:32 +0000, David Howells wrote:
> Change the type of ceph_osdc_call()'s reply to a ceph_databuf struct rather
> than a list of pages and access it with kmap_local rather than
> page_address().
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Viacheslav Dubeyko <slava@dubeyko.com>
> cc: Alex Markuze <amarkuze@redhat.com>
> cc: Ilya Dryomov <idryomov@gmail.com>
> cc: ceph-devel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> ---
> drivers/block/rbd.c | 135 ++++++++++++++++++--------------
> include/linux/ceph/osd_client.h | 2 +-
> net/ceph/cls_lock_client.c | 41 +++++-----
> net/ceph/osd_client.c | 16 ++--
> 4 files changed, 109 insertions(+), 85 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index bb953634c7cb..073e80d2d966 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -1826,9 +1826,8 @@ static int __rbd_object_map_load(struct rbd_device *rbd_dev)
> {
> struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
> CEPH_DEFINE_OID_ONSTACK(oid);
> - struct page **pages;
> - void *p, *end;
> - size_t reply_len;
> + struct ceph_databuf *reply;
> + void *p, *q, *end;
If I understood the logic correctly, q represents a pointer to the current
position. So maybe it makes sense to rename p to something like "begin"? In
that case we would have a begin pointer and an end pointer, and p could be
used as the name of the pointer to the current position.
> u64 num_objects;
> u64 object_map_bytes;
> u64 object_map_size;
> @@ -1842,48 +1841,57 @@ static int __rbd_object_map_load(struct rbd_device *rbd_dev)
> object_map_bytes = DIV_ROUND_UP_ULL(num_objects * BITS_PER_OBJ,
> BITS_PER_BYTE);
> num_pages = calc_pages_for(0, object_map_bytes) + 1;
> - pages = ceph_alloc_page_vector(num_pages, GFP_KERNEL);
> - if (IS_ERR(pages))
> - return PTR_ERR(pages);
>
> - reply_len = num_pages * PAGE_SIZE;
> + reply = ceph_databuf_reply_alloc(num_pages, num_pages * PAGE_SIZE,
> + GFP_KERNEL);
> + if (!reply)
> + return -ENOMEM;
> +
> rbd_object_map_name(rbd_dev, rbd_dev->spec->snap_id, &oid);
> ret = ceph_osdc_call(osdc, &oid, &rbd_dev->header_oloc,
> "rbd", "object_map_load", CEPH_OSD_FLAG_READ,
> - NULL, 0, pages, &reply_len);
> + NULL, 0, reply);
> if (ret)
> goto out;
>
> - p = page_address(pages[0]);
> - end = p + min(reply_len, (size_t)PAGE_SIZE);
> - ret = decode_object_map_header(&p, end, &object_map_size);
> + p = kmap_ceph_databuf_page(reply, 0);
> + end = p + min(ceph_databuf_len(reply), (size_t)PAGE_SIZE);
> + q = p;
> + ret = decode_object_map_header(&q, end, &object_map_size);
> if (ret)
> - goto out;
> + goto out_unmap;
>
> if (object_map_size != num_objects) {
> rbd_warn(rbd_dev, "object map size mismatch: %llu vs %llu",
> object_map_size, num_objects);
> ret = -EINVAL;
> - goto out;
> + goto out_unmap;
> }
> + iov_iter_advance(&reply->iter, q - p);
>
> - if (offset_in_page(p) + object_map_bytes > reply_len) {
> + if (object_map_bytes > ceph_databuf_len(reply)) {
Does it mean that we had a bug here before? Because it was
offset_in_page(p) + object_map_bytes before.
> ret = -EINVAL;
> - goto out;
> + goto out_unmap;
> }
>
> rbd_dev->object_map = kvmalloc(object_map_bytes, GFP_KERNEL);
> if (!rbd_dev->object_map) {
> ret = -ENOMEM;
> - goto out;
> + goto out_unmap;
> }
>
> rbd_dev->object_map_size = object_map_size;
Why do we have object_map_size and object_map_bytes at the same time? It
could be confusing, for my taste. Maybe we need to rename object_map_size
to object_map_num_objects?
> - ceph_copy_from_page_vector(pages, rbd_dev->object_map,
> - offset_in_page(p), object_map_bytes);
>
> + ret = -EIO;
> + if (copy_from_iter(rbd_dev->object_map, object_map_bytes,
> + &reply->iter) != object_map_bytes)
> + goto out_unmap;
> +
> + ret = 0;
> +out_unmap:
> + kunmap_local(p);
> out:
> - ceph_release_page_vector(pages, num_pages);
> + ceph_databuf_release(reply);
> return ret;
> }
>
> @@ -1952,6 +1960,7 @@ static int rbd_object_map_update_finish(struct rbd_obj_request *obj_req,
> {
> struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev;
> struct ceph_osd_data *osd_data;
> + struct ceph_databuf *dbuf;
> u64 objno;
> u8 state, new_state, current_state;
> bool has_current_state;
> @@ -1971,9 +1980,10 @@ static int rbd_object_map_update_finish(struct rbd_obj_request *obj_req,
> */
> rbd_assert(osd_req->r_num_ops == 2);
> osd_data = osd_req_op_data(osd_req, 1, cls, request_data);
> - rbd_assert(osd_data->type == CEPH_OSD_DATA_TYPE_PAGES);
> + rbd_assert(osd_data->type == CEPH_OSD_DATA_TYPE_DATABUF);
> + dbuf = osd_data->dbuf;
>
> - p = page_address(osd_data->pages[0]);
> + p = kmap_ceph_databuf_page(dbuf, 0);
> objno = ceph_decode_64(&p);
> rbd_assert(objno == obj_req->ex.oe_objno);
> rbd_assert(ceph_decode_64(&p) == objno + 1);
> @@ -1981,6 +1991,7 @@ static int rbd_object_map_update_finish(struct rbd_obj_request *obj_req,
> has_current_state = ceph_decode_8(&p);
> if (has_current_state)
> current_state = ceph_decode_8(&p);
> + kunmap_local(p);
>
> spin_lock(&rbd_dev->object_map_lock);
> state = __rbd_object_map_get(rbd_dev, objno);
> @@ -2020,7 +2031,7 @@ static int rbd_cls_object_map_update(struct ceph_osd_request *req,
> int which, u64 objno, u8 new_state,
> const u8 *current_state)
> {
> - struct page **pages;
> + struct ceph_databuf *dbuf;
> void *p, *start;
> int ret;
>
> @@ -2028,11 +2039,11 @@ static int rbd_cls_object_map_update(struct ceph_osd_request *req,
> if (ret)
> return ret;
>
> - pages = ceph_alloc_page_vector(1, GFP_NOIO);
> - if (IS_ERR(pages))
> - return PTR_ERR(pages);
> + dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO);
> + if (!dbuf)
> + return -ENOMEM;
>
> - p = start = page_address(pages[0]);
> + p = start = kmap_ceph_databuf_page(dbuf, 0);
> ceph_encode_64(&p, objno);
> ceph_encode_64(&p, objno + 1);
> ceph_encode_8(&p, new_state);
> @@ -2042,9 +2053,10 @@ static int rbd_cls_object_map_update(struct ceph_osd_request *req,
> } else {
> ceph_encode_8(&p, 0);
> }
> + kunmap_local(p);
> + ceph_databuf_added_data(dbuf, p - start);
>
> - osd_req_op_cls_request_data_pages(req, which, pages, p - start, 0,
> - false, true);
> + osd_req_op_cls_request_databuf(req, which, dbuf);
> return 0;
> }
>
> @@ -4673,8 +4685,8 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev,
> size_t inbound_size)
> {
> struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
> + struct ceph_databuf *reply;
> struct page *req_page = NULL;
> - struct page *reply_page;
> int ret;
>
> /*
> @@ -4695,8 +4707,8 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev,
> memcpy(page_address(req_page), outbound, outbound_size);
> }
>
> - reply_page = alloc_page(GFP_KERNEL);
> - if (!reply_page) {
> + reply = ceph_databuf_reply_alloc(1, inbound_size, GFP_KERNEL);
Interesting... We allocated a single memory page before. Now we allocate
memory of the inbound size. Potentially, that could be any size, starting
from zero bytes up to several memory pages. Could we have an issue here?
Thanks,
Slava.
> + if (!reply) {
> if (req_page)
> __free_page(req_page);
> return -ENOMEM;
> @@ -4704,15 +4716,16 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev,
>
> ret = ceph_osdc_call(osdc, oid, oloc, RBD_DRV_NAME, method_name,
> CEPH_OSD_FLAG_READ, req_page, outbound_size,
> - &reply_page, &inbound_size);
> + reply);
> if (!ret) {
> - memcpy(inbound, page_address(reply_page), inbound_size);
> - ret = inbound_size;
> + ret = ceph_databuf_len(reply);
> + if (copy_from_iter(inbound, ret, &reply->iter) != ret)
> + ret = -EIO;
> }
>
> if (req_page)
> __free_page(req_page);
> - __free_page(reply_page);
> + ceph_databuf_release(reply);
> return ret;
> }
>
> @@ -5633,7 +5646,7 @@ static int decode_parent_image_spec(void **p, void *end,
>
> static int __get_parent_info(struct rbd_device *rbd_dev,
> struct page *req_page,
> - struct page *reply_page,
> + struct ceph_databuf *reply,
> struct parent_image_info *pii)
> {
> struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
> @@ -5643,27 +5656,31 @@ static int __get_parent_info(struct rbd_device *rbd_dev,
>
> ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc,
> "rbd", "parent_get", CEPH_OSD_FLAG_READ,
> - req_page, sizeof(u64), &reply_page, &reply_len);
> + req_page, sizeof(u64), reply);
> if (ret)
> return ret == -EOPNOTSUPP ? 1 : ret;
>
> - p = page_address(reply_page);
> + p = kmap_ceph_databuf_page(reply, 0);
> end = p + reply_len;
> ret = decode_parent_image_spec(&p, end, pii);
> + kunmap_local(p);
> if (ret)
> return ret;
>
> + ceph_databuf_reset_reply(reply);
> +
> ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc,
> "rbd", "parent_overlap_get", CEPH_OSD_FLAG_READ,
> - req_page, sizeof(u64), &reply_page, &reply_len);
> + req_page, sizeof(u64), reply);
> if (ret)
> return ret;
>
> - p = page_address(reply_page);
> + p = kmap_ceph_databuf_page(reply, 0);
> end = p + reply_len;
> ceph_decode_8_safe(&p, end, pii->has_overlap, e_inval);
> if (pii->has_overlap)
> ceph_decode_64_safe(&p, end, pii->overlap, e_inval);
> + kunmap_local(p);
>
> dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n",
> __func__, pii->pool_id, pii->pool_ns, pii->image_id, pii->snap_id,
> @@ -5679,25 +5696,25 @@ static int __get_parent_info(struct rbd_device *rbd_dev,
> */
> static int __get_parent_info_legacy(struct rbd_device *rbd_dev,
> struct page *req_page,
> - struct page *reply_page,
> + struct ceph_databuf *reply,
> struct parent_image_info *pii)
> {
> struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
> - size_t reply_len = PAGE_SIZE;
> void *p, *end;
> int ret;
>
> ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc,
> "rbd", "get_parent", CEPH_OSD_FLAG_READ,
> - req_page, sizeof(u64), &reply_page, &reply_len);
> + req_page, sizeof(u64), reply);
> if (ret)
> return ret;
>
> - p = page_address(reply_page);
> - end = p + reply_len;
> + p = kmap_ceph_databuf_page(reply, 0);
> + end = p + ceph_databuf_len(reply);
> ceph_decode_64_safe(&p, end, pii->pool_id, e_inval);
> pii->image_id = ceph_extract_encoded_string(&p, end, NULL, GFP_KERNEL);
> if (IS_ERR(pii->image_id)) {
> + kunmap_local(p);
> ret = PTR_ERR(pii->image_id);
> pii->image_id = NULL;
> return ret;
> @@ -5705,6 +5722,7 @@ static int __get_parent_info_legacy(struct rbd_device *rbd_dev,
> ceph_decode_64_safe(&p, end, pii->snap_id, e_inval);
> pii->has_overlap = true;
> ceph_decode_64_safe(&p, end, pii->overlap, e_inval);
> + kunmap_local(p);
>
> dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n",
> __func__, pii->pool_id, pii->pool_ns, pii->image_id, pii->snap_id,
> @@ -5718,29 +5736,30 @@ static int __get_parent_info_legacy(struct rbd_device *rbd_dev,
> static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev,
> struct parent_image_info *pii)
> {
> - struct page *req_page, *reply_page;
> + struct ceph_databuf *reply;
> + struct page *req_page;
> void *p;
> - int ret;
> + int ret = -ENOMEM;
>
> req_page = alloc_page(GFP_KERNEL);
> if (!req_page)
> - return -ENOMEM;
> + goto out;
>
> - reply_page = alloc_page(GFP_KERNEL);
> - if (!reply_page) {
> - __free_page(req_page);
> - return -ENOMEM;
> - }
> + reply = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_KERNEL);
> + if (!reply)
> + goto out_free;
>
> - p = page_address(req_page);
> + p = kmap_local_page(req_page);
> ceph_encode_64(&p, rbd_dev->spec->snap_id);
> - ret = __get_parent_info(rbd_dev, req_page, reply_page, pii);
> + kunmap_local(p);
> + ret = __get_parent_info(rbd_dev, req_page, reply, pii);
> if (ret > 0)
> - ret = __get_parent_info_legacy(rbd_dev, req_page, reply_page,
> - pii);
> + ret = __get_parent_info_legacy(rbd_dev, req_page, reply, pii);
>
> + ceph_databuf_release(reply);
> +out_free:
> __free_page(req_page);
> - __free_page(reply_page);
> +out:
> return ret;
> }
>
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 172ee515a0f3..57b8aff53f28 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -610,7 +610,7 @@ int ceph_osdc_call(struct ceph_osd_client *osdc,
> const char *class, const char *method,
> unsigned int flags,
> struct page *req_page, size_t req_len,
> - struct page **resp_pages, size_t *resp_len);
> + struct ceph_databuf *response);
>
> /* watch/notify */
> struct ceph_osd_linger_request *
> diff --git a/net/ceph/cls_lock_client.c b/net/ceph/cls_lock_client.c
> index 66136a4c1ce7..37bb8708e8bb 100644
> --- a/net/ceph/cls_lock_client.c
> +++ b/net/ceph/cls_lock_client.c
> @@ -74,7 +74,7 @@ int ceph_cls_lock(struct ceph_osd_client *osdc,
> __func__, lock_name, type, cookie, tag, desc, flags);
> ret = ceph_osdc_call(osdc, oid, oloc, "lock", "lock",
> CEPH_OSD_FLAG_WRITE, lock_op_page,
> - lock_op_buf_size, NULL, NULL);
> + lock_op_buf_size, NULL);
>
> dout("%s: status %d\n", __func__, ret);
> __free_page(lock_op_page);
> @@ -124,7 +124,7 @@ int ceph_cls_unlock(struct ceph_osd_client *osdc,
> dout("%s lock_name %s cookie %s\n", __func__, lock_name, cookie);
> ret = ceph_osdc_call(osdc, oid, oloc, "lock", "unlock",
> CEPH_OSD_FLAG_WRITE, unlock_op_page,
> - unlock_op_buf_size, NULL, NULL);
> + unlock_op_buf_size, NULL);
>
> dout("%s: status %d\n", __func__, ret);
> __free_page(unlock_op_page);
> @@ -179,7 +179,7 @@ int ceph_cls_break_lock(struct ceph_osd_client *osdc,
> cookie, ENTITY_NAME(*locker));
> ret = ceph_osdc_call(osdc, oid, oloc, "lock", "break_lock",
> CEPH_OSD_FLAG_WRITE, break_op_page,
> - break_op_buf_size, NULL, NULL);
> + break_op_buf_size, NULL);
>
> dout("%s: status %d\n", __func__, ret);
> __free_page(break_op_page);
> @@ -230,7 +230,7 @@ int ceph_cls_set_cookie(struct ceph_osd_client *osdc,
> __func__, lock_name, type, old_cookie, tag, new_cookie);
> ret = ceph_osdc_call(osdc, oid, oloc, "lock", "set_cookie",
> CEPH_OSD_FLAG_WRITE, cookie_op_page,
> - cookie_op_buf_size, NULL, NULL);
> + cookie_op_buf_size, NULL);
>
> dout("%s: status %d\n", __func__, ret);
> __free_page(cookie_op_page);
> @@ -337,10 +337,10 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc,
> char *lock_name, u8 *type, char **tag,
> struct ceph_locker **lockers, u32 *num_lockers)
> {
> + struct ceph_databuf *reply;
> int get_info_op_buf_size;
> int name_len = strlen(lock_name);
> - struct page *get_info_op_page, *reply_page;
> - size_t reply_len = PAGE_SIZE;
> + struct page *get_info_op_page;
> void *p, *end;
> int ret;
>
> @@ -353,8 +353,8 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc,
> if (!get_info_op_page)
> return -ENOMEM;
>
> - reply_page = alloc_page(GFP_NOIO);
> - if (!reply_page) {
> + reply = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_NOIO);
> + if (!reply) {
> __free_page(get_info_op_page);
> return -ENOMEM;
> }
> @@ -370,18 +370,19 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc,
> dout("%s lock_name %s\n", __func__, lock_name);
> ret = ceph_osdc_call(osdc, oid, oloc, "lock", "get_info",
> CEPH_OSD_FLAG_READ, get_info_op_page,
> - get_info_op_buf_size, &reply_page, &reply_len);
> + get_info_op_buf_size, reply);
>
> dout("%s: status %d\n", __func__, ret);
> if (ret >= 0) {
> - p = page_address(reply_page);
> - end = p + reply_len;
> + p = kmap_ceph_databuf_page(reply, 0);
> + end = p + ceph_databuf_len(reply);
>
> ret = decode_lockers(&p, end, type, tag, lockers, num_lockers);
> + kunmap_local(p);
> }
>
> __free_page(get_info_op_page);
> - __free_page(reply_page);
> + ceph_databuf_release(reply);
> return ret;
> }
> EXPORT_SYMBOL(ceph_cls_lock_info);
> @@ -389,11 +390,11 @@ EXPORT_SYMBOL(ceph_cls_lock_info);
> int ceph_cls_assert_locked(struct ceph_osd_request *req, int which,
> char *lock_name, u8 type, char *cookie, char *tag)
> {
> + struct ceph_databuf *dbuf;
> int assert_op_buf_size;
> int name_len = strlen(lock_name);
> int cookie_len = strlen(cookie);
> int tag_len = strlen(tag);
> - struct page **pages;
> void *p, *end;
> int ret;
>
> @@ -408,11 +409,11 @@ int ceph_cls_assert_locked(struct ceph_osd_request *req, int which,
> if (ret)
> return ret;
>
> - pages = ceph_alloc_page_vector(1, GFP_NOIO);
> - if (IS_ERR(pages))
> - return PTR_ERR(pages);
> + dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO);
> + if (!dbuf)
> + return -ENOMEM;
>
> - p = page_address(pages[0]);
> + p = kmap_ceph_databuf_page(dbuf, 0);
> end = p + assert_op_buf_size;
>
> /* encode cls_lock_assert_op struct */
> @@ -422,10 +423,12 @@ int ceph_cls_assert_locked(struct ceph_osd_request *req, int which,
> ceph_encode_8(&p, type);
> ceph_encode_string(&p, end, cookie, cookie_len);
> ceph_encode_string(&p, end, tag, tag_len);
> + kunmap_local(p);
> WARN_ON(p != end);
> + ceph_databuf_added_data(dbuf, assert_op_buf_size);
>
> - osd_req_op_cls_request_data_pages(req, which, pages, assert_op_buf_size,
> - 0, false, true);
> + osd_req_op_cls_request_databuf(req, which, dbuf);
> return 0;
> }
> EXPORT_SYMBOL(ceph_cls_assert_locked);
> +
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 720d8a605fc4..b6cf875d3de4 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -5195,7 +5195,10 @@ EXPORT_SYMBOL(ceph_osdc_maybe_request_map);
> * Execute an OSD class method on an object.
> *
> * @flags: CEPH_OSD_FLAG_*
> - * @resp_len: in/out param for reply length
> + * @response: Pointer to the storage descriptor for the reply or NULL.
> + *
> + * The size of the response buffer is set by the caller in @response->limit and
> + * the size of the response obtained is set in @response->iter.
> */
> int ceph_osdc_call(struct ceph_osd_client *osdc,
> struct ceph_object_id *oid,
> @@ -5203,7 +5206,7 @@ int ceph_osdc_call(struct ceph_osd_client *osdc,
> const char *class, const char *method,
> unsigned int flags,
> struct page *req_page, size_t req_len,
> - struct page **resp_pages, size_t *resp_len)
> + struct ceph_databuf *response)
> {
> struct ceph_osd_request *req;
> int ret;
> @@ -5226,9 +5229,8 @@ int ceph_osdc_call(struct ceph_osd_client *osdc,
> if (req_page)
> osd_req_op_cls_request_data_pages(req, 0, &req_page, req_len,
> 0, false, false);
> - if (resp_pages)
> - osd_req_op_cls_response_data_pages(req, 0, resp_pages,
> - *resp_len, 0, false, false);
> + if (response)
> + osd_req_op_cls_response_databuf(req, 0, response);
>
> ret = ceph_osdc_alloc_messages(req, GFP_NOIO);
> if (ret)
> @@ -5238,8 +5240,8 @@ int ceph_osdc_call(struct ceph_osd_client *osdc,
> ret = ceph_osdc_wait_request(osdc, req);
> if (ret >= 0) {
> ret = req->r_ops[0].rval;
> - if (resp_pages)
> - *resp_len = req->r_ops[0].outdata_len;
> + if (response)
> + ceph_databuf_reply_ready(response, req->r_ops[0].outdata_len);
> }
>
> out_put_req:
>
>
* Re: [RFC PATCH 11/35] ceph: Use ceph_databuf in DIO
2025-03-13 23:33 ` [RFC PATCH 11/35] ceph: Use ceph_databuf in DIO David Howells
@ 2025-03-17 20:03 ` Viacheslav Dubeyko
2025-03-17 22:26 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-17 20:03 UTC (permalink / raw)
To: Alex Markuze, slava@dubeyko.com, David Howells
Cc: linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
On Thu, 2025-03-13 at 23:33 +0000, David Howells wrote:
> Stash the list of pages to be read into/written from during a ceph fs
> direct read/write in a ceph_databuf struct rather than using a bvec array.
> Eventually this will be replaced with just an iterator supplied by
> netfslib.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Viacheslav Dubeyko <slava@dubeyko.com>
> cc: Alex Markuze <amarkuze@redhat.com>
> cc: Ilya Dryomov <idryomov@gmail.com>
> cc: ceph-devel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> ---
> fs/ceph/file.c | 110 +++++++++++++++++++++----------------------------
> 1 file changed, 47 insertions(+), 63 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 9de2960748b9..fb4024bc8274 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -82,11 +82,10 @@ static __le32 ceph_flags_sys2wire(struct ceph_mds_client *mdsc, u32 flags)
> */
> #define ITER_GET_BVECS_PAGES 64
>
> -static ssize_t __iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
> - struct bio_vec *bvecs)
> +static int __iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
> + struct ceph_databuf *dbuf)
> {
> size_t size = 0;
> - int bvec_idx = 0;
>
> if (maxsize > iov_iter_count(iter))
> maxsize = iov_iter_count(iter);
> @@ -98,22 +97,24 @@ static ssize_t __iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
> int idx = 0;
>
> bytes = iov_iter_get_pages2(iter, pages, maxsize - size,
> - ITER_GET_BVECS_PAGES, &start);
> - if (bytes < 0)
> - return size ?: bytes;
> -
> - size += bytes;
> + ITER_GET_BVECS_PAGES, &start);
> + if (bytes < 0) {
> + if (size == 0)
> + return bytes;
> + break;
I am slightly confused by the 'break;' here. Do we have a loop around this?
> + }
>
> - for ( ; bytes; idx++, bvec_idx++) {
> + while (bytes) {
> int len = min_t(int, bytes, PAGE_SIZE - start);
>
> - bvec_set_page(&bvecs[bvec_idx], pages[idx], len, start);
> + ceph_databuf_append_page(dbuf, pages[idx++], start, len);
> bytes -= len;
> + size += len;
> start = 0;
> }
> }
>
> - return size;
> + return 0;
Do we really need to return zero here? It looks to me like we calculated
the size in order to return it. Am I wrong?
> }
>
> /*
> @@ -124,52 +125,44 @@ static ssize_t __iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
> * Attempt to get up to @maxsize bytes worth of pages from @iter.
> * Return the number of bytes in the created bio_vec array, or an error.
> */
> -static ssize_t iter_get_bvecs_alloc(struct iov_iter *iter, size_t maxsize,
> - struct bio_vec **bvecs, int *num_bvecs)
> +static struct ceph_databuf *iter_get_bvecs_alloc(struct iov_iter *iter,
> + size_t maxsize, bool write)
> {
> - struct bio_vec *bv;
> + struct ceph_databuf *dbuf;
> size_t orig_count = iov_iter_count(iter);
> - ssize_t bytes;
> - int npages;
> + int npages, ret;
>
> iov_iter_truncate(iter, maxsize);
> npages = iov_iter_npages(iter, INT_MAX);
> iov_iter_reexpand(iter, orig_count);
>
> - /*
> - * __iter_get_bvecs() may populate only part of the array -- zero it
> - * out.
> - */
> - bv = kvmalloc_array(npages, sizeof(*bv), GFP_KERNEL | __GFP_ZERO);
> - if (!bv)
> - return -ENOMEM;
> + if (write)
> + dbuf = ceph_databuf_req_alloc(npages, 0, GFP_KERNEL);
I am still feeling confused by allocating npages of zero size. :)
> + else
> + dbuf = ceph_databuf_reply_alloc(npages, 0, GFP_KERNEL);
> + if (!dbuf)
> + return ERR_PTR(-ENOMEM);
>
> - bytes = __iter_get_bvecs(iter, maxsize, bv);
> - if (bytes < 0) {
> + ret = __iter_get_bvecs(iter, maxsize, dbuf);
> + if (ret < 0) {
> /*
> * No pages were pinned -- just free the array.
> */
> - kvfree(bv);
> - return bytes;
> + ceph_databuf_release(dbuf);
> + return ERR_PTR(ret);
> }
>
> - *bvecs = bv;
> - *num_bvecs = npages;
> - return bytes;
> + return dbuf;
> }
>
> -static void put_bvecs(struct bio_vec *bvecs, int num_bvecs, bool should_dirty)
> +static void ceph_dirty_pages(struct ceph_databuf *dbuf)
Does it mean that we never used the should_dirty argument with a false
value? Or is the main goal of this method to always mark the pages dirty?
> {
> + struct bio_vec *bvec = dbuf->bvec;
> int i;
>
> - for (i = 0; i < num_bvecs; i++) {
> - if (bvecs[i].bv_page) {
> - if (should_dirty)
> - set_page_dirty_lock(bvecs[i].bv_page);
> - put_page(bvecs[i].bv_page);
So, which code will call put_page() now?
Thanks,
Slava.
> - }
> - }
> - kvfree(bvecs);
> + for (i = 0; i < dbuf->nr_bvec; i++)
> + if (bvec[i].bv_page)
> + set_page_dirty_lock(bvec[i].bv_page);
> }
>
> /*
> @@ -1338,14 +1331,11 @@ static void ceph_aio_complete_req(struct ceph_osd_request *req)
> struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0);
> struct ceph_osd_req_op *op = &req->r_ops[0];
> struct ceph_client_metric *metric = &ceph_sb_to_mdsc(inode->i_sb)->metric;
> - unsigned int len = osd_data->bvec_pos.iter.bi_size;
> + size_t len = osd_data->iter.count;
> bool sparse = (op->op == CEPH_OSD_OP_SPARSE_READ);
> struct ceph_client *cl = ceph_inode_to_client(inode);
>
> - BUG_ON(osd_data->type != CEPH_OSD_DATA_TYPE_BVECS);
> - BUG_ON(!osd_data->num_bvecs);
> -
> - doutc(cl, "req %p inode %p %llx.%llx, rc %d bytes %u\n", req,
> + doutc(cl, "req %p inode %p %llx.%llx, rc %d bytes %zu\n", req,
> inode, ceph_vinop(inode), rc, len);
>
> if (rc == -EOLDSNAPC) {
> @@ -1367,7 +1357,6 @@ static void ceph_aio_complete_req(struct ceph_osd_request *req)
> if (rc == -ENOENT)
> rc = 0;
> if (rc >= 0 && len > rc) {
> - struct iov_iter i;
> int zlen = len - rc;
>
> /*
> @@ -1384,10 +1373,8 @@ static void ceph_aio_complete_req(struct ceph_osd_request *req)
> aio_req->total_len = rc + zlen;
> }
>
> - iov_iter_bvec(&i, ITER_DEST, osd_data->bvec_pos.bvecs,
> - osd_data->num_bvecs, len);
> - iov_iter_advance(&i, rc);
> - iov_iter_zero(zlen, &i);
> + iov_iter_advance(&osd_data->iter, rc);
> + iov_iter_zero(zlen, &osd_data->iter);
> }
> }
>
> @@ -1401,8 +1388,8 @@ static void ceph_aio_complete_req(struct ceph_osd_request *req)
> req->r_end_latency, len, rc);
> }
>
> - put_bvecs(osd_data->bvec_pos.bvecs, osd_data->num_bvecs,
> - aio_req->should_dirty);
> + if (aio_req->should_dirty)
> + ceph_dirty_pages(osd_data->dbuf);
> ceph_osdc_put_request(req);
>
> if (rc < 0)
> @@ -1491,9 +1478,8 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
> struct ceph_client_metric *metric = &fsc->mdsc->metric;
> struct ceph_vino vino;
> struct ceph_osd_request *req;
> - struct bio_vec *bvecs;
> struct ceph_aio_request *aio_req = NULL;
> - int num_pages = 0;
> + struct ceph_databuf *dbuf = NULL;
> int flags;
> int ret = 0;
> struct timespec64 mtime = current_time(inode);
> @@ -1529,8 +1515,8 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
>
> while (iov_iter_count(iter) > 0) {
> u64 size = iov_iter_count(iter);
> - ssize_t len;
> struct ceph_osd_req_op *op;
> + size_t len;
> int readop = sparse ? CEPH_OSD_OP_SPARSE_READ : CEPH_OSD_OP_READ;
> int extent_cnt;
>
> @@ -1563,16 +1549,17 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
> }
> }
>
> - len = iter_get_bvecs_alloc(iter, size, &bvecs, &num_pages);
> - if (len < 0) {
> + dbuf = iter_get_bvecs_alloc(iter, size, write);
> + if (IS_ERR(dbuf)) {
> ceph_osdc_put_request(req);
> - ret = len;
> + ret = PTR_ERR(dbuf);
> break;
> }
> + len = ceph_databuf_len(dbuf);
> if (len != size)
> osd_req_op_extent_update(req, 0, len);
>
> - osd_req_op_extent_osd_data_bvecs(req, 0, bvecs, num_pages, len);
> + osd_req_op_extent_osd_databuf(req, 0, dbuf);
>
> /*
> * To simplify error handling, allow AIO when IO within i_size
> @@ -1637,20 +1624,17 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
> ret = 0;
>
> if (ret >= 0 && ret < len && pos + ret < size) {
> - struct iov_iter i;
> int zlen = min_t(size_t, len - ret,
> size - pos - ret);
>
> - iov_iter_bvec(&i, ITER_DEST, bvecs, num_pages, len);
> - iov_iter_advance(&i, ret);
> - iov_iter_zero(zlen, &i);
> + iov_iter_advance(&dbuf->iter, ret);
> + iov_iter_zero(zlen, &dbuf->iter);
> ret += zlen;
> }
> if (ret >= 0)
> len = ret;
> }
>
> - put_bvecs(bvecs, num_pages, should_dirty);
> ceph_osdc_put_request(req);
> if (ret < 0)
> break;
>
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 07/35] libceph: Change ceph_osdc_call()'s reply to a ceph_databuf
2025-03-13 23:32 ` [RFC PATCH 07/35] libceph: Change ceph_osdc_call()'s reply to a ceph_databuf David Howells
2025-03-17 19:41 ` Viacheslav Dubeyko
@ 2025-03-17 22:12 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-17 22:12 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: dhowells, Alex Markuze, slava@dubeyko.com,
linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > + struct ceph_databuf *reply;
> > + void *p, *q, *end;
>
> If I understood the logic correctly, q represents a pointer to the current
> position. So, maybe, it makes sense to rename p to something like
> "begin"? In this case, we would have a begin pointer and an end pointer, and
> p could be used as the name of the pointer to the current position.
"hdr" might be a better choice.
> > + iov_iter_advance(&reply->iter, q - p);
> >
> > - if (offset_in_page(p) + object_map_bytes > reply_len) {
> > + if (object_map_bytes > ceph_databuf_len(reply)) {
>
> Does it mean that we had a bug here before? Because it was offset_in_page(p) +
> object_map_bytes before.
No. The iov_iter_advance() call advances the iterator over the header which
renders the subtraction unnecessary.
> > rbd_dev->object_map_size = object_map_size;
>
> Why do we have both object_map_size and object_map_bytes at the same time? It
> is confusing for my taste. Maybe we need to rename object_map_size to
> object_map_num_objects?
Those names preexist.
> > + reply = ceph_databuf_reply_alloc(1, inbound_size, GFP_KERNEL);
>
> Interesting... We allocated a single memory page before. Now we allocate
> memory of the inbound size. Potentially, it could be any size, starting from
> zero bytes up to several memory pages. Could we have an issue here?
Shouldn't do. ceph_databuf_reply_alloc() will expand databuf's bvec[] as
necessary to accommodate sufficient pages for the requested amount of
bufferage.
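For illustration, a minimal sketch (assuming the databuf API as posted in this
series, where the allocator derives the number of backing pages from the
requested length):

	/* inbound_size may be anything from zero bytes up to several
	 * pages; the allocator sizes the bvec[] to cover it. */
	reply = ceph_databuf_reply_alloc(1, inbound_size, GFP_KERNEL);
	if (!reply)
		return -ENOMEM;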
David
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 11/35] ceph: Use ceph_databuf in DIO
2025-03-13 23:33 ` [RFC PATCH 11/35] ceph: Use ceph_databuf in DIO David Howells
2025-03-17 20:03 ` Viacheslav Dubeyko
@ 2025-03-17 22:26 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-17 22:26 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: dhowells, Alex Markuze, slava@dubeyko.com,
linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > + ITER_GET_BVECS_PAGES, &start);
> > + if (bytes < 0) {
> > + if (size == 0)
> > + return bytes;
> > + break;
>
> I am slightly confused by 'break;' here. Do we have a loop around?
Yes. You need to look at the original code as the while-directive didn't make
it into the patch context ;-).
> > - return size;
> > + return 0;
>
> Do we really need to return zero here? It looks to me that we calculated the
> size to return here. Am I wrong?
The only caller just cares whether an error is returned. It doesn't actually
care about the size; the size is stored in the databuf anyway.
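A sketch of the caller's side as posted in this series (names taken from the
patch context quoted earlier):

	dbuf = iter_get_bvecs_alloc(iter, size, write);
	if (IS_ERR(dbuf))
		return PTR_ERR(dbuf);
	/* The helper's return value only flags errors; the accumulated
	 * length is read back from the databuf itself. */
	len = ceph_databuf_len(dbuf);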
> > + dbuf = ceph_databuf_req_alloc(npages, 0, GFP_KERNEL);
>
> I am still feeling confused by allocating npages of zero size. :)
That's not what it's saying. It's allocating npages' worth of bio_vec[] and
not creating any bufferage. The bio_vecs will be loaded from a DIO request.
As mentioned in a previous reply, it might be worth creating a separate
databuf API call for this case.
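Purely as a sketch of such an API call (the wrapper and its name below are
hypothetical, not part of this series):

	/* Allocate npages' worth of bio_vec[] slots with no backing
	 * pages; the bio_vecs get loaded from a DIO request later. */
	static struct ceph_databuf *ceph_databuf_dio_alloc(size_t npages,
							   gfp_t gfp)
	{
		return ceph_databuf_req_alloc(npages, 0, gfp);
	}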
> > -static void put_bvecs(struct bio_vec *bvecs, int num_bvecs, bool should_dirty)
> > +static void ceph_dirty_pages(struct ceph_databuf *dbuf)
>
> Does it mean that we never used the should_dirty argument with a false value?
> Or is the main goal of this method to always make the pages dirty?
>
> > {
> > + struct bio_vec *bvec = dbuf->bvec;
> > int i;
> >
> > - for (i = 0; i < num_bvecs; i++) {
> > - if (bvecs[i].bv_page) {
> > - if (should_dirty)
> > - set_page_dirty_lock(bvecs[i].bv_page);
> > - put_page(bvecs[i].bv_page);
>
> So, which code will call put_page() now?
The dirtying of pages is split from the putting of those pages. The databuf
releaser puts the pages, but doesn't dirty them. ceph_aio_complete_req()
needs to do that itself. Netfslib does this on behalf of the filesystem and
switching to that will delegate the responsibility.
Also in future, netfslib will handle putting the page refs or unpinning the
pages as appropriate - and ceph should not then take refs on those pages
(indeed, as struct page is disintegrated into different types such as folios,
there may not even *be* a ref counter on some of the pages).
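To sketch the resulting split (ceph_dirty_pages() is from the patch; the
release-side loop is conceptual, standing in for what the databuf releaser
does):

	/* Completion path: dirty the pages only when the AIO asked for it. */
	if (aio_req->should_dirty)
		ceph_dirty_pages(osd_data->dbuf);

	/* Databuf releaser, conceptually: drop the page refs, no dirtying. */
	for (i = 0; i < dbuf->nr_bvec; i++)
		if (dbuf->bvec[i].bv_page)
			put_page(dbuf->bvec[i].bv_page);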
David
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 13/35] rbd: Switch from using bvec_iter to iov_iter
2025-03-13 23:33 ` [RFC PATCH 13/35] rbd: Switch from using bvec_iter to iov_iter David Howells
@ 2025-03-18 19:38 ` Viacheslav Dubeyko
2025-03-18 22:13 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-18 19:38 UTC (permalink / raw)
To: Alex Markuze, slava@dubeyko.com, David Howells
Cc: dongsheng.yang@easystack.cn, Xiubo Li,
linux-fsdevel@vger.kernel.org, ceph-devel@vger.kernel.org,
linux-kernel@vger.kernel.org, jlayton@kernel.org,
idryomov@gmail.com, linux-block@vger.kernel.org
On Thu, 2025-03-13 at 23:33 +0000, David Howells wrote:
> Switch from using a ceph_bio_iter/ceph_bvec_iter for iterating over the
> bio_vecs attached to the request to using a ceph_databuf with the bio_vecs
> transcribed from the bio list. This allows the entire bio bvec[] set to
> be passed down to the socket (if unencrypted).
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Viacheslav Dubeyko <slava@dubeyko.com>
> cc: Alex Markuze <amarkuze@redhat.com>
> cc: Ilya Dryomov <idryomov@gmail.com>
> cc: Xiubo Li <xiubli@redhat.com>
> cc: linux-fsdevel@vger.kernel.org
> ---
> drivers/block/rbd.c | 642 ++++++++++++++---------------------
> include/linux/ceph/databuf.h | 22 ++
> include/linux/ceph/striper.h | 58 +++-
> net/ceph/striper.c | 53 ---
> 4 files changed, 331 insertions(+), 444 deletions(-)
>
>
<skipped>
> +
> #endif /* __FS_CEPH_DATABUF_H */
> diff --git a/include/linux/ceph/striper.h b/include/linux/ceph/striper.h
> index 3486636c0e6e..50bc1b88c5c4 100644
> --- a/include/linux/ceph/striper.h
> +++ b/include/linux/ceph/striper.h
> @@ -4,6 +4,7 @@
>
> #include <linux/list.h>
> #include <linux/types.h>
> +#include <linux/bug.h>
>
> struct ceph_file_layout;
>
> @@ -39,10 +40,6 @@ int ceph_file_to_extents(struct ceph_file_layout *l, u64 off, u64 len,
> void *alloc_arg,
> ceph_object_extent_fn_t action_fn,
> void *action_arg);
> -int ceph_iterate_extents(struct ceph_file_layout *l, u64 off, u64 len,
> - struct list_head *object_extents,
> - ceph_object_extent_fn_t action_fn,
> - void *action_arg);
>
> struct ceph_file_extent {
> u64 fe_off;
> @@ -68,4 +65,57 @@ int ceph_extent_to_file(struct ceph_file_layout *l,
>
> u64 ceph_get_num_objects(struct ceph_file_layout *l, u64 size);
>
> +static __always_inline
> +struct ceph_object_extent *ceph_lookup_containing(struct list_head *object_extents,
> + u64 objno, u64 objoff, u32 xlen)
> +{
> + struct ceph_object_extent *ex;
> +
> + list_for_each_entry(ex, object_extents, oe_item) {
> + if (ex->oe_objno == objno &&
OK. I see the point that objno should be the same.
> + ex->oe_off <= objoff &&
But why could ex->oe_off be less than objoff? Could objoff be not exactly
the same?
> + ex->oe_off + ex->oe_len >= objoff + xlen) /* paranoia */
Do we really need this comment? :)
I am still guessing why ex->oe_off + ex->oe_len could be bigger than objoff +
xlen. Is it possible that the object size or offset could be bigger?
Thanks,
Slava.
> + return ex;
> +
> + if (ex->oe_objno > objno)
> + break;
> + }
> +
> + return NULL;
> +}
> +
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 17/35] libceph, rbd: Use ceph_databuf encoding start/stop
2025-03-13 23:33 ` [RFC PATCH 17/35] libceph, rbd: Use ceph_databuf encoding start/stop David Howells
@ 2025-03-18 19:59 ` Viacheslav Dubeyko
2025-03-18 22:19 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-18 19:59 UTC (permalink / raw)
To: Alex Markuze, slava@dubeyko.com, David Howells
Cc: linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
On Thu, 2025-03-13 at 23:33 +0000, David Howells wrote:
> Use ceph_databuf_enc_start() and ceph_databuf_enc_stop() to encode RPC
> parameter data where possible. The start function maps the buffer and
> returns a pointer to the point to start writing at; the stop function
> updates the buffer size.
>
> The code is also made a bit more consistent in the use of size_t for length
> variables and using 'request' for a pointer to the request buffer.
>
> The end pointer is dropped from ceph_encode_string() as we shouldn't
> overrun with the string length being included in the buffer size
> precalculation. The final pointer is checked by ceph_databuf_enc_stop().
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Viacheslav Dubeyko <slava@dubeyko.com>
> cc: Alex Markuze <amarkuze@redhat.com>
> cc: Ilya Dryomov <idryomov@gmail.com>
> cc: ceph-devel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> ---
> drivers/block/rbd.c | 3 +-
> include/linux/ceph/decode.h | 4 +-
> net/ceph/cls_lock_client.c | 195 +++++++++++++++++-------------------
> net/ceph/mon_client.c | 10 +-
> net/ceph/osd_client.c | 26 +++--
> 5 files changed, 112 insertions(+), 126 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index ec09d578b0b0..078bb1e3e1da 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -5762,8 +5762,7 @@ static char *rbd_dev_image_name(struct rbd_device *rbd_dev)
> return NULL;
>
> p = image_id;
> - end = image_id + image_id_size;
> - ceph_encode_string(&p, end, rbd_dev->spec->image_id, (u32)len);
> + ceph_encode_string(&p, rbd_dev->spec->image_id, len);
>
> size = sizeof (__le32) + RBD_IMAGE_NAME_LEN_MAX;
> reply_buf = kmalloc(size, GFP_KERNEL);
> diff --git a/include/linux/ceph/decode.h b/include/linux/ceph/decode.h
> index 8fc1aed64113..e2726c3152db 100644
> --- a/include/linux/ceph/decode.h
> +++ b/include/linux/ceph/decode.h
> @@ -292,10 +292,8 @@ static inline void ceph_encode_filepath(void **p, void *end,
> *p += len;
> }
>
> -static inline void ceph_encode_string(void **p, void *end,
> - const char *s, u32 len)
> +static inline void ceph_encode_string(void **p, const char *s, u32 len)
> {
> - BUG_ON(*p + sizeof(len) + len > end);
> ceph_encode_32(p, len);
> if (len)
> memcpy(*p, s, len);
> diff --git a/net/ceph/cls_lock_client.c b/net/ceph/cls_lock_client.c
> index 6c8608aabe5f..c91259ff8557 100644
> --- a/net/ceph/cls_lock_client.c
> +++ b/net/ceph/cls_lock_client.c
> @@ -28,14 +28,14 @@ int ceph_cls_lock(struct ceph_osd_client *osdc,
> char *lock_name, u8 type, char *cookie,
> char *tag, char *desc, u8 flags)
> {
> - int lock_op_buf_size;
> - int name_len = strlen(lock_name);
> - int cookie_len = strlen(cookie);
> - int tag_len = strlen(tag);
> - int desc_len = strlen(desc);
> - void *p, *end;
> - struct ceph_databuf *lock_op_req;
> + struct ceph_databuf *request;
> struct timespec64 mtime;
> + size_t lock_op_buf_size;
> + size_t name_len = strlen(lock_name);
> + size_t cookie_len = strlen(cookie);
> + size_t tag_len = strlen(tag);
> + size_t desc_len = strlen(desc);
> + void *p;
> int ret;
>
> lock_op_buf_size = name_len + sizeof(__le32) +
> @@ -49,36 +49,34 @@ int ceph_cls_lock(struct ceph_osd_client *osdc,
> if (lock_op_buf_size > PAGE_SIZE)
> return -E2BIG;
>
> - lock_op_req = ceph_databuf_req_alloc(0, lock_op_buf_size, GFP_NOIO);
> - if (!lock_op_req)
> + request = ceph_databuf_req_alloc(1, lock_op_buf_size, GFP_NOIO);
> + if (!request)
> return -ENOMEM;
>
> - p = kmap_ceph_databuf_page(lock_op_req, 0);
> - end = p + lock_op_buf_size;
> + p = ceph_databuf_enc_start(request);
>
> /* encode cls_lock_lock_op struct */
> ceph_start_encoding(&p, 1, 1,
> lock_op_buf_size - CEPH_ENCODING_START_BLK_LEN);
> - ceph_encode_string(&p, end, lock_name, name_len);
> + ceph_encode_string(&p, lock_name, name_len);
> ceph_encode_8(&p, type);
> - ceph_encode_string(&p, end, cookie, cookie_len);
> - ceph_encode_string(&p, end, tag, tag_len);
> - ceph_encode_string(&p, end, desc, desc_len);
> + ceph_encode_string(&p, cookie, cookie_len);
> + ceph_encode_string(&p, tag, tag_len);
> + ceph_encode_string(&p, desc, desc_len);
> /* only support infinite duration */
> memset(&mtime, 0, sizeof(mtime));
> ceph_encode_timespec64(p, &mtime);
> p += sizeof(struct ceph_timespec);
> ceph_encode_8(&p, flags);
> - kunmap_local(p);
> - ceph_databuf_added_data(lock_op_req, lock_op_buf_size);
> + ceph_databuf_enc_stop(request, p);
>
> dout("%s lock_name %s type %d cookie %s tag %s desc %s flags 0x%x\n",
> __func__, lock_name, type, cookie, tag, desc, flags);
> ret = ceph_osdc_call(osdc, oid, oloc, "lock", "lock",
> - CEPH_OSD_FLAG_WRITE, lock_op_req, NULL);
> + CEPH_OSD_FLAG_WRITE, request, NULL);
>
> dout("%s: status %d\n", __func__, ret);
> - ceph_databuf_release(lock_op_req);
> + ceph_databuf_release(request);
> return ret;
> }
> EXPORT_SYMBOL(ceph_cls_lock);
> @@ -96,11 +94,11 @@ int ceph_cls_unlock(struct ceph_osd_client *osdc,
> struct ceph_object_locator *oloc,
> char *lock_name, char *cookie)
> {
> - int unlock_op_buf_size;
> - int name_len = strlen(lock_name);
> - int cookie_len = strlen(cookie);
> - void *p, *end;
> - struct ceph_databuf *unlock_op_req;
> + struct ceph_databuf *request;
> + size_t unlock_op_buf_size;
> + size_t name_len = strlen(lock_name);
> + size_t cookie_len = strlen(cookie);
> + void *p;
> int ret;
>
> unlock_op_buf_size = name_len + sizeof(__le32) +
> @@ -109,27 +107,25 @@ int ceph_cls_unlock(struct ceph_osd_client *osdc,
> if (unlock_op_buf_size > PAGE_SIZE)
> return -E2BIG;
>
> - unlock_op_req = ceph_databuf_req_alloc(0, unlock_op_buf_size, GFP_NOIO);
> - if (!unlock_op_req)
> + request = ceph_databuf_req_alloc(1, unlock_op_buf_size, GFP_NOIO);
> + if (!request)
> return -ENOMEM;
>
> - p = kmap_ceph_databuf_page(unlock_op_req, 0);
> - end = p + unlock_op_buf_size;
> + p = ceph_databuf_enc_start(request);
>
> /* encode cls_lock_unlock_op struct */
> ceph_start_encoding(&p, 1, 1,
> unlock_op_buf_size - CEPH_ENCODING_START_BLK_LEN);
> - ceph_encode_string(&p, end, lock_name, name_len);
> - ceph_encode_string(&p, end, cookie, cookie_len);
> - kunmap_local(p);
> - ceph_databuf_added_data(unlock_op_req, unlock_op_buf_size);
> + ceph_encode_string(&p, lock_name, name_len);
> + ceph_encode_string(&p, cookie, cookie_len);
> + ceph_databuf_enc_stop(request, p);
>
> dout("%s lock_name %s cookie %s\n", __func__, lock_name, cookie);
> ret = ceph_osdc_call(osdc, oid, oloc, "lock", "unlock",
> - CEPH_OSD_FLAG_WRITE, unlock_op_req, NULL);
> + CEPH_OSD_FLAG_WRITE, request, NULL);
>
> dout("%s: status %d\n", __func__, ret);
> - ceph_databuf_release(unlock_op_req);
> + ceph_databuf_release(request);
> return ret;
> }
> EXPORT_SYMBOL(ceph_cls_unlock);
> @@ -149,11 +145,11 @@ int ceph_cls_break_lock(struct ceph_osd_client *osdc,
> char *lock_name, char *cookie,
> struct ceph_entity_name *locker)
> {
> - int break_op_buf_size;
> - int name_len = strlen(lock_name);
> - int cookie_len = strlen(cookie);
> - struct ceph_databuf *break_op_req;
> - void *p, *end;
> + struct ceph_databuf *request;
> + size_t break_op_buf_size;
> + size_t name_len = strlen(lock_name);
> + size_t cookie_len = strlen(cookie);
> + void *p;
> int ret;
>
> break_op_buf_size = name_len + sizeof(__le32) +
> @@ -163,29 +159,27 @@ int ceph_cls_break_lock(struct ceph_osd_client *osdc,
> if (break_op_buf_size > PAGE_SIZE)
> return -E2BIG;
>
> - break_op_req = ceph_databuf_req_alloc(0, break_op_buf_size, GFP_NOIO);
> - if (!break_op_req)
> + request = ceph_databuf_req_alloc(1, break_op_buf_size, GFP_NOIO);
> + if (!request)
> return -ENOMEM;
>
> - p = kmap_ceph_databuf_page(break_op_req, 0);
> - end = p + break_op_buf_size;
> + p = ceph_databuf_enc_start(request);
>
> /* encode cls_lock_break_op struct */
> ceph_start_encoding(&p, 1, 1,
> break_op_buf_size - CEPH_ENCODING_START_BLK_LEN);
> - ceph_encode_string(&p, end, lock_name, name_len);
> + ceph_encode_string(&p, lock_name, name_len);
> ceph_encode_copy(&p, locker, sizeof(*locker));
> - ceph_encode_string(&p, end, cookie, cookie_len);
> - kunmap_local(p);
> - ceph_databuf_added_data(break_op_req, break_op_buf_size);
> + ceph_encode_string(&p, cookie, cookie_len);
> + ceph_databuf_enc_stop(request, p);
>
> dout("%s lock_name %s cookie %s locker %s%llu\n", __func__, lock_name,
> cookie, ENTITY_NAME(*locker));
> ret = ceph_osdc_call(osdc, oid, oloc, "lock", "break_lock",
> - CEPH_OSD_FLAG_WRITE, break_op_req, NULL);
> + CEPH_OSD_FLAG_WRITE, request, NULL);
>
> dout("%s: status %d\n", __func__, ret);
> - ceph_databuf_release(break_op_req);
> + ceph_databuf_release(request);
> return ret;
> }
> EXPORT_SYMBOL(ceph_cls_break_lock);
> @@ -196,13 +190,13 @@ int ceph_cls_set_cookie(struct ceph_osd_client *osdc,
> char *lock_name, u8 type, char *old_cookie,
> char *tag, char *new_cookie)
> {
> - int cookie_op_buf_size;
> - int name_len = strlen(lock_name);
> - int old_cookie_len = strlen(old_cookie);
> - int tag_len = strlen(tag);
> - int new_cookie_len = strlen(new_cookie);
> - void *p, *end;
> - struct ceph_databuf *cookie_op_req;
> + struct ceph_databuf *request;
> + size_t cookie_op_buf_size;
> + size_t name_len = strlen(lock_name);
> + size_t old_cookie_len = strlen(old_cookie);
> + size_t tag_len = strlen(tag);
> + size_t new_cookie_len = strlen(new_cookie);
> + void *p;
> int ret;
>
> cookie_op_buf_size = name_len + sizeof(__le32) +
> @@ -213,31 +207,29 @@ int ceph_cls_set_cookie(struct ceph_osd_client *osdc,
> if (cookie_op_buf_size > PAGE_SIZE)
> return -E2BIG;
>
> - cookie_op_req = ceph_databuf_req_alloc(0, cookie_op_buf_size, GFP_NOIO);
> - if (!cookie_op_req)
> + request = ceph_databuf_req_alloc(1, cookie_op_buf_size, GFP_NOIO);
> + if (!request)
> return -ENOMEM;
>
> - p = kmap_ceph_databuf_page(cookie_op_req, 0);
> - end = p + cookie_op_buf_size;
> + p = ceph_databuf_enc_start(request);
>
> /* encode cls_lock_set_cookie_op struct */
> ceph_start_encoding(&p, 1, 1,
> cookie_op_buf_size - CEPH_ENCODING_START_BLK_LEN);
> - ceph_encode_string(&p, end, lock_name, name_len);
> + ceph_encode_string(&p, lock_name, name_len);
> ceph_encode_8(&p, type);
> - ceph_encode_string(&p, end, old_cookie, old_cookie_len);
> - ceph_encode_string(&p, end, tag, tag_len);
> - ceph_encode_string(&p, end, new_cookie, new_cookie_len);
> - kunmap_local(p);
> - ceph_databuf_added_data(cookie_op_req, cookie_op_buf_size);
> + ceph_encode_string(&p, old_cookie, old_cookie_len);
> + ceph_encode_string(&p, tag, tag_len);
> + ceph_encode_string(&p, new_cookie, new_cookie_len);
> + ceph_databuf_enc_stop(request, p);
>
> dout("%s lock_name %s type %d old_cookie %s tag %s new_cookie %s\n",
> __func__, lock_name, type, old_cookie, tag, new_cookie);
> ret = ceph_osdc_call(osdc, oid, oloc, "lock", "set_cookie",
> - CEPH_OSD_FLAG_WRITE, cookie_op_req, NULL);
> + CEPH_OSD_FLAG_WRITE, request, NULL);
>
> dout("%s: status %d\n", __func__, ret);
> - ceph_databuf_release(cookie_op_req);
> + ceph_databuf_release(request);
> return ret;
> }
> EXPORT_SYMBOL(ceph_cls_set_cookie);
> @@ -289,9 +281,10 @@ static int decode_locker(void **p, void *end, struct ceph_locker *locker)
> return 0;
> }
>
> -static int decode_lockers(void **p, void *end, u8 *type, char **tag,
> +static int decode_lockers(void **p, size_t size, u8 *type, char **tag,
> struct ceph_locker **lockers, u32 *num_lockers)
> {
> + void *end = *p + size;
> u8 struct_v;
> u32 struct_len;
> char *s;
> @@ -341,11 +334,10 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc,
> char *lock_name, u8 *type, char **tag,
> struct ceph_locker **lockers, u32 *num_lockers)
> {
> - struct ceph_databuf *reply;
> - int get_info_op_buf_size;
> - int name_len = strlen(lock_name);
> - struct ceph_databuf *get_info_op_req;
> - void *p, *end;
> + struct ceph_databuf *request, *reply;
> + size_t get_info_op_buf_size;
> + size_t name_len = strlen(lock_name);
> + void *p;
> int ret;
>
> get_info_op_buf_size = name_len + sizeof(__le32) +
> @@ -353,42 +345,39 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc,
> if (get_info_op_buf_size > PAGE_SIZE)
> return -E2BIG;
>
> - get_info_op_req = ceph_databuf_req_alloc(0, get_info_op_buf_size,
> - GFP_NOIO);
> - if (!get_info_op_req)
> + request = ceph_databuf_req_alloc(1, get_info_op_buf_size, GFP_NOIO);
> + if (!request)
> return -ENOMEM;
>
> reply = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_NOIO);
> if (!reply) {
> - ceph_databuf_release(get_info_op_req);
> + ceph_databuf_release(request);
> return -ENOMEM;
> }
>
> - p = kmap_ceph_databuf_page(get_info_op_req, 0);
> - end = p + get_info_op_buf_size;
> + p = ceph_databuf_enc_start(request);
>
> /* encode cls_lock_get_info_op struct */
> ceph_start_encoding(&p, 1, 1,
> get_info_op_buf_size - CEPH_ENCODING_START_BLK_LEN);
> - ceph_encode_string(&p, end, lock_name, name_len);
> - kunmap_local(p);
> - ceph_databuf_added_data(get_info_op_req, get_info_op_buf_size);
> + ceph_encode_string(&p, lock_name, name_len);
> + ceph_databuf_enc_stop(request, p);
>
> dout("%s lock_name %s\n", __func__, lock_name);
> ret = ceph_osdc_call(osdc, oid, oloc, "lock", "get_info",
> - CEPH_OSD_FLAG_READ, get_info_op_req, reply);
> + CEPH_OSD_FLAG_READ, request, reply);
>
> dout("%s: status %d\n", __func__, ret);
> if (ret >= 0) {
> p = kmap_ceph_databuf_page(reply, 0);
> - end = p + ceph_databuf_len(reply);
>
> - ret = decode_lockers(&p, end, type, tag, lockers, num_lockers);
> + ret = decode_lockers(&p, ceph_databuf_len(reply),
> + type, tag, lockers, num_lockers);
> kunmap_local(p);
> }
>
> ceph_databuf_release(reply);
> - ceph_databuf_release(get_info_op_req);
> + ceph_databuf_release(request);
> return ret;
> }
> EXPORT_SYMBOL(ceph_cls_lock_info);
> @@ -396,12 +385,12 @@ EXPORT_SYMBOL(ceph_cls_lock_info);
> int ceph_cls_assert_locked(struct ceph_osd_request *req, int which,
> char *lock_name, u8 type, char *cookie, char *tag)
> {
> - struct ceph_databuf *dbuf;
> - int assert_op_buf_size;
> - int name_len = strlen(lock_name);
> - int cookie_len = strlen(cookie);
> - int tag_len = strlen(tag);
> - void *p, *end;
> + struct ceph_databuf *request;
> + size_t assert_op_buf_size;
> + size_t name_len = strlen(lock_name);
> + size_t cookie_len = strlen(cookie);
> + size_t tag_len = strlen(tag);
> + void *p;
> int ret;
>
> assert_op_buf_size = name_len + sizeof(__le32) +
> @@ -415,25 +404,23 @@ int ceph_cls_assert_locked(struct ceph_osd_request *req, int which,
> if (ret)
> return ret;
>
> - dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO);
> - if (!dbuf)
> + request = ceph_databuf_req_alloc(1, assert_op_buf_size, GFP_NOIO);
> + if (!request)
> return -ENOMEM;
>
> - p = kmap_ceph_databuf_page(dbuf, 0);
> - end = p + assert_op_buf_size;
> + p = ceph_databuf_enc_start(request);
>
> /* encode cls_lock_assert_op struct */
> ceph_start_encoding(&p, 1, 1,
> assert_op_buf_size - CEPH_ENCODING_START_BLK_LEN);
> - ceph_encode_string(&p, end, lock_name, name_len);
> + ceph_encode_string(&p, lock_name, name_len);
> ceph_encode_8(&p, type);
> - ceph_encode_string(&p, end, cookie, cookie_len);
> - ceph_encode_string(&p, end, tag, tag_len);
> - kunmap(p);
> - WARN_ON(p != end);
> - ceph_databuf_added_data(dbuf, assert_op_buf_size);
> + ceph_encode_string(&p, cookie, cookie_len);
> + ceph_encode_string(&p, tag, tag_len);
> + ceph_databuf_enc_stop(request, p);
> + WARN_ON(ceph_databuf_len(request) != assert_op_buf_size);
>
> - osd_req_op_cls_request_databuf(req, which, dbuf);
> + osd_req_op_cls_request_databuf(req, which, request);
> return 0;
> }
> EXPORT_SYMBOL(ceph_cls_assert_locked);
> diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
> index ab66b599ac47..39103e4bb07d 100644
> --- a/net/ceph/mon_client.c
> +++ b/net/ceph/mon_client.c
> @@ -367,7 +367,8 @@ static void __send_subscribe(struct ceph_mon_client *monc)
> dout("%s %s start %llu flags 0x%x\n", __func__, buf,
> le64_to_cpu(monc->subs[i].item.start),
> monc->subs[i].item.flags);
> - ceph_encode_string(&p, end, buf, len);
> + BUG_ON(p + sizeof(__le32) + len > end);
Frankly speaking, it's hard to follow why sizeof(__le32) should be in the
equation. Maybe it makes sense to introduce some constant? The name of the
constant would make this calculation easier to understand.
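Something like the following, say (the constant name is hypothetical):

	#define CEPH_STRING_LEN_PREFIX	sizeof(__le32)	/* on-wire length prefix */

	BUG_ON(p + CEPH_STRING_LEN_PREFIX + len > end);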
> + ceph_encode_string(&p, buf, len);
> memcpy(p, &monc->subs[i].item, sizeof(monc->subs[i].item));
> p += sizeof(monc->subs[i].item);
> }
> @@ -854,13 +855,14 @@ __ceph_monc_get_version(struct ceph_mon_client *monc, const char *what,
> ceph_monc_callback_t cb, u64 private_data)
> {
> struct ceph_mon_generic_request *req;
> + size_t wsize = strlen(what);
>
> req = alloc_generic_request(monc, GFP_NOIO);
> if (!req)
> goto err_put_req;
>
> req->request = ceph_msg_new(CEPH_MSG_MON_GET_VERSION,
> - sizeof(u64) + sizeof(u32) + strlen(what),
> + sizeof(u64) + sizeof(u32) + wsize,
Yeah, this abundance of sizeof(u64) and sizeof(u32) makes this calculation
really unclear. :)
> GFP_NOIO, true);
> if (!req->request)
> goto err_put_req;
> @@ -873,6 +875,8 @@ __ceph_monc_get_version(struct ceph_mon_client *monc, const char *what,
> req->complete_cb = cb;
> req->private_data = private_data;
>
> + BUG_ON(sizeof(__le64) + sizeof(__le32) + wsize > req->request->front_alloc_len);
The same problem is here. It's hard to follow this check with sizeof(__le64)
and sizeof(__le32) in the calculation. What do these numbers mean here?
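For reference, the encoding just below suggests these are the 64-bit handle
and the string length prefix. A sketch with hypothetical naming:

	const size_t front_len = sizeof(__le64)	/* handle (req->tid) */
			       + sizeof(__le32)	/* string length prefix */
			       + wsize;		/* the "what" string */

	BUG_ON(front_len > req->request->front_alloc_len);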
Thanks,
Slava.
> +
> mutex_lock(&monc->mutex);
> register_generic_request(req);
> {
> @@ -880,7 +884,7 @@ __ceph_monc_get_version(struct ceph_mon_client *monc, const char *what,
> void *const end = p + req->request->front_alloc_len;
>
> ceph_encode_64(&p, req->tid); /* handle */
> - ceph_encode_string(&p, end, what, strlen(what));
> + ceph_encode_string(&p, what, wsize);
> WARN_ON(p != end);
> }
> send_generic_request(monc, req);
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index c4525feb8e26..b4adb299f9cd 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -1831,15 +1831,15 @@ static int hoid_encoding_size(const struct ceph_hobject_id *hoid)
> 4 + hoid->key_len + 4 + hoid->oid_len + 4 + hoid->nspace_len;
> }
>
> -static void encode_hoid(void **p, void *end, const struct ceph_hobject_id *hoid)
> +static void encode_hoid(void **p, const struct ceph_hobject_id *hoid)
> {
> ceph_start_encoding(p, 4, 3, hoid_encoding_size(hoid));
> - ceph_encode_string(p, end, hoid->key, hoid->key_len);
> - ceph_encode_string(p, end, hoid->oid, hoid->oid_len);
> + ceph_encode_string(p, hoid->key, hoid->key_len);
> + ceph_encode_string(p, hoid->oid, hoid->oid_len);
> ceph_encode_64(p, hoid->snapid);
> ceph_encode_32(p, hoid->hash);
> ceph_encode_8(p, hoid->is_max);
> - ceph_encode_string(p, end, hoid->nspace, hoid->nspace_len);
> + ceph_encode_string(p, hoid->nspace, hoid->nspace_len);
> ceph_encode_64(p, hoid->pool);
> }
>
> @@ -2072,16 +2072,14 @@ static void encode_spgid(void **p, const struct ceph_spg *spgid)
> ceph_encode_8(p, spgid->shard);
> }
>
> -static void encode_oloc(void **p, void *end,
> - const struct ceph_object_locator *oloc)
> +static void encode_oloc(void **p, const struct ceph_object_locator *oloc)
> {
> ceph_start_encoding(p, 5, 4, ceph_oloc_encoding_size(oloc));
> ceph_encode_64(p, oloc->pool);
> ceph_encode_32(p, -1); /* preferred */
> ceph_encode_32(p, 0); /* key len */
> if (oloc->pool_ns)
> - ceph_encode_string(p, end, oloc->pool_ns->str,
> - oloc->pool_ns->len);
> + ceph_encode_string(p, oloc->pool_ns->str, oloc->pool_ns->len);
> else
> ceph_encode_32(p, 0);
> }
> @@ -2122,8 +2120,8 @@ static void encode_request_partial(struct ceph_osd_request *req,
> ceph_encode_timespec64(p, &req->r_mtime);
> p += sizeof(struct ceph_timespec);
>
> - encode_oloc(&p, end, &req->r_t.target_oloc);
> - ceph_encode_string(&p, end, req->r_t.target_oid.name,
> + encode_oloc(&p, &req->r_t.target_oloc);
> + ceph_encode_string(&p, req->r_t.target_oid.name,
> req->r_t.target_oid.name_len);
>
> /* ops, can imply data */
> @@ -4329,8 +4327,8 @@ static struct ceph_msg *create_backoff_message(
> ceph_encode_32(&p, map_epoch);
> ceph_encode_8(&p, CEPH_OSD_BACKOFF_OP_ACK_BLOCK);
> ceph_encode_64(&p, backoff->id);
> - encode_hoid(&p, end, backoff->begin);
> - encode_hoid(&p, end, backoff->end);
> + encode_hoid(&p, backoff->begin);
> + encode_hoid(&p, backoff->end);
> BUG_ON(p != end);
>
> msg->front.iov_len = p - msg->front.iov_base;
> @@ -5264,8 +5262,8 @@ int osd_req_op_copy_from_init(struct ceph_osd_request *req,
>
> p = page_address(pages[0]);
> end = p + PAGE_SIZE;
> - ceph_encode_string(&p, end, src_oid->name, src_oid->name_len);
> - encode_oloc(&p, end, src_oloc);
> + ceph_encode_string(&p, src_oid->name, src_oid->name_len);
> + encode_oloc(&p, src_oloc);
> ceph_encode_32(&p, truncate_seq);
> ceph_encode_64(&p, truncate_size);
> op->indata_len = PAGE_SIZE - (end - p);
>
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 18/35] libceph, rbd: Convert some page arrays to ceph_databuf
2025-03-13 23:33 ` [RFC PATCH 18/35] libceph, rbd: Convert some page arrays to ceph_databuf David Howells
@ 2025-03-18 20:02 ` Viacheslav Dubeyko
2025-03-18 22:25 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-18 20:02 UTC (permalink / raw)
To: Alex Markuze, slava@dubeyko.com, David Howells
Cc: linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
On Thu, 2025-03-13 at 23:33 +0000, David Howells wrote:
> Convert some miscellaneous page arrays to ceph_databuf containers.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Viacheslav Dubeyko <slava@dubeyko.com>
> cc: Alex Markuze <amarkuze@redhat.com>
> cc: Ilya Dryomov <idryomov@gmail.com>
> cc: ceph-devel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> ---
> drivers/block/rbd.c | 12 ++++-----
> include/linux/ceph/osd_client.h | 3 +++
> net/ceph/osd_client.c | 43 +++++++++++++++++++++------------
> 3 files changed, 36 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 078bb1e3e1da..eea12c7ab2a0 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -2108,7 +2108,7 @@ static int rbd_obj_calc_img_extents(struct rbd_obj_request *obj_req,
>
> static int rbd_osd_setup_stat(struct ceph_osd_request *osd_req, int which)
> {
> - struct page **pages;
> + struct ceph_databuf *dbuf;
>
> /*
> * The response data for a STAT call consists of:
> @@ -2118,14 +2118,12 @@ static int rbd_osd_setup_stat(struct ceph_osd_request *osd_req, int which)
> * le32 tv_nsec;
> * } mtime;
> */
> - pages = ceph_alloc_page_vector(1, GFP_NOIO);
> - if (IS_ERR(pages))
> - return PTR_ERR(pages);
> + dbuf = ceph_databuf_reply_alloc(1, 8 + sizeof(struct ceph_timespec), GFP_NOIO);
What does this 8 + sizeof(struct ceph_timespec) mean? Why do we use 8 here? :)
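Judging by the on-wire layout in the comment just above, the 8 appears to be
the le64 length field. A sketch:

	size_t stat_reply_len = sizeof(__le64)			/* length */
			      + sizeof(struct ceph_timespec);	/* mtime */

	dbuf = ceph_databuf_reply_alloc(1, stat_reply_len, GFP_NOIO);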
Thanks,
Slava.
> + if (!dbuf)
> + return -ENOMEM;
>
> osd_req_op_init(osd_req, which, CEPH_OSD_OP_STAT, 0);
> - osd_req_op_raw_data_in_pages(osd_req, which, pages,
> - 8 + sizeof(struct ceph_timespec),
> - 0, false, true);
> + osd_req_op_raw_data_in_databuf(osd_req, which, dbuf);
> return 0;
> }
>
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index d31e59bd128c..6e126e212271 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -482,6 +482,9 @@ extern void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *,
> struct page **pages, u64 length,
> u32 offset, bool pages_from_pool,
> bool own_pages);
> +void osd_req_op_raw_data_in_databuf(struct ceph_osd_request *osd_req,
> + unsigned int which,
> + struct ceph_databuf *databuf);
> extern void osd_req_op_extent_osd_data_pagelist(struct ceph_osd_request *,
> unsigned int which,
> struct ceph_pagelist *pagelist);
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index b4adb299f9cd..64a06267e7b3 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -182,6 +182,17 @@ osd_req_op_extent_osd_data(struct ceph_osd_request *osd_req,
> }
> EXPORT_SYMBOL(osd_req_op_extent_osd_data);
>
> +void osd_req_op_raw_data_in_databuf(struct ceph_osd_request *osd_req,
> + unsigned int which,
> + struct ceph_databuf *dbuf)
> +{
> + struct ceph_osd_data *osd_data;
> +
> + osd_data = osd_req_op_raw_data_in(osd_req, which);
> + ceph_osd_databuf_init(osd_data, dbuf);
> +}
> +EXPORT_SYMBOL(osd_req_op_raw_data_in_databuf);
> +
> void osd_req_op_raw_data_in_pages(struct ceph_osd_request *osd_req,
> unsigned int which, struct page **pages,
> u64 length, u32 offset,
> @@ -5000,7 +5011,7 @@ int ceph_osdc_list_watchers(struct ceph_osd_client *osdc,
> u32 *num_watchers)
> {
> struct ceph_osd_request *req;
> - struct page **pages;
> + struct ceph_databuf *dbuf;
> int ret;
>
> req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_NOIO);
> @@ -5011,16 +5022,16 @@ int ceph_osdc_list_watchers(struct ceph_osd_client *osdc,
> ceph_oloc_copy(&req->r_base_oloc, oloc);
> req->r_flags = CEPH_OSD_FLAG_READ;
>
> - pages = ceph_alloc_page_vector(1, GFP_NOIO);
> - if (IS_ERR(pages)) {
> - ret = PTR_ERR(pages);
> + dbuf = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_NOIO);
> + if (!dbuf) {
> + ret = -ENOMEM;
> goto out_put_req;
> }
>
> osd_req_op_init(req, 0, CEPH_OSD_OP_LIST_WATCHERS, 0);
> - ceph_osd_data_pages_init(osd_req_op_data(req, 0, list_watchers,
> - response_data),
> - pages, PAGE_SIZE, 0, false, true);
> + ceph_osd_databuf_init(osd_req_op_data(req, 0, list_watchers,
> + response_data),
> + dbuf);
>
> ret = ceph_osdc_alloc_messages(req, GFP_NOIO);
> if (ret)
> @@ -5029,10 +5040,11 @@ int ceph_osdc_list_watchers(struct ceph_osd_client *osdc,
> ceph_osdc_start_request(osdc, req);
> ret = ceph_osdc_wait_request(osdc, req);
> if (ret >= 0) {
> - void *p = page_address(pages[0]);
> + void *p = kmap_ceph_databuf_page(dbuf, 0);
> void *const end = p + req->r_ops[0].outdata_len;
>
> ret = decode_watchers(&p, end, watchers, num_watchers);
> + kunmap(p);
> }
>
> out_put_req:
> @@ -5246,12 +5258,12 @@ int osd_req_op_copy_from_init(struct ceph_osd_request *req,
> u8 copy_from_flags)
> {
> struct ceph_osd_req_op *op;
> - struct page **pages;
> + struct ceph_databuf *dbuf;
> void *p, *end;
>
> - pages = ceph_alloc_page_vector(1, GFP_KERNEL);
> - if (IS_ERR(pages))
> - return PTR_ERR(pages);
> + dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_KERNEL);
> + if (!dbuf)
> + return -ENOMEM;
>
> op = osd_req_op_init(req, 0, CEPH_OSD_OP_COPY_FROM2,
> dst_fadvise_flags);
> @@ -5260,16 +5272,17 @@ int osd_req_op_copy_from_init(struct ceph_osd_request *req,
> op->copy_from.flags = copy_from_flags;
> op->copy_from.src_fadvise_flags = src_fadvise_flags;
>
> - p = page_address(pages[0]);
> + p = kmap_ceph_databuf_page(dbuf, 0);
> end = p + PAGE_SIZE;
> ceph_encode_string(&p, src_oid->name, src_oid->name_len);
> encode_oloc(&p, src_oloc);
> ceph_encode_32(&p, truncate_seq);
> ceph_encode_64(&p, truncate_size);
> op->indata_len = PAGE_SIZE - (end - p);
> + ceph_databuf_added_data(dbuf, op->indata_len);
> + kunmap_local(p);
>
> - ceph_osd_data_pages_init(&op->copy_from.osd_data, pages,
> - op->indata_len, 0, false, true);
> + ceph_osd_databuf_init(&op->copy_from.osd_data, dbuf);
> return 0;
> }
> EXPORT_SYMBOL(osd_req_op_copy_from_init);
>
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 19/35] libceph, ceph: Convert users of ceph_pagelist to ceph_databuf
2025-03-13 23:33 ` [RFC PATCH 19/35] libceph, ceph: Convert users of ceph_pagelist " David Howells
@ 2025-03-18 20:09 ` Viacheslav Dubeyko
2025-03-18 22:27 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-18 20:09 UTC (permalink / raw)
To: Alex Markuze, slava@dubeyko.com, David Howells
Cc: linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
On Thu, 2025-03-13 at 23:33 +0000, David Howells wrote:
> Convert users of ceph_pagelist to use ceph_databuf instead. ceph_pagelist
> is then unused and can be removed.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Viacheslav Dubeyko <slava@dubeyko.com>
> cc: Alex Markuze <amarkuze@redhat.com>
> cc: Ilya Dryomov <idryomov@gmail.com>
> cc: ceph-devel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> ---
> fs/ceph/locks.c | 22 +++---
> fs/ceph/mds_client.c | 122 +++++++++++++++-----------------
> fs/ceph/super.h | 6 +-
> include/linux/ceph/osd_client.h | 2 +-
> net/ceph/osd_client.c | 61 ++++++++--------
> 5 files changed, 104 insertions(+), 109 deletions(-)
>
> diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
> index ebf4ac0055dd..32c7b0f0d61f 100644
> --- a/fs/ceph/locks.c
> +++ b/fs/ceph/locks.c
> @@ -371,8 +371,8 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
> }
>
> /*
> - * Fills in the passed counter variables, so you can prepare pagelist metadata
> - * before calling ceph_encode_locks.
> + * Fills in the passed counter variables, so you can prepare metadata before
> + * calling ceph_encode_locks.
> */
> void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count)
> {
> @@ -483,38 +483,38 @@ int ceph_encode_locks_to_buffer(struct inode *inode,
> }
>
> /*
> - * Copy the encoded flock and fcntl locks into the pagelist.
> + * Copy the encoded flock and fcntl locks into the data buffer.
> * Format is: #fcntl locks, sequential fcntl locks, #flock locks,
> * sequential flock locks.
> * Returns zero on success.
> */
> -int ceph_locks_to_pagelist(struct ceph_filelock *flocks,
> - struct ceph_pagelist *pagelist,
> +int ceph_locks_to_databuf(struct ceph_filelock *flocks,
> + struct ceph_databuf *dbuf,
> int num_fcntl_locks, int num_flock_locks)
> {
> int err = 0;
> __le32 nlocks;
>
> nlocks = cpu_to_le32(num_fcntl_locks);
> - err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks));
> + err = ceph_databuf_append(dbuf, &nlocks, sizeof(nlocks));
> if (err)
> goto out_fail;
>
> if (num_fcntl_locks > 0) {
> - err = ceph_pagelist_append(pagelist, flocks,
> - num_fcntl_locks * sizeof(*flocks));
> + err = ceph_databuf_append(dbuf, flocks,
> + num_fcntl_locks * sizeof(*flocks));
> if (err)
> goto out_fail;
> }
>
> nlocks = cpu_to_le32(num_flock_locks);
> - err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks));
> + err = ceph_databuf_append(dbuf, &nlocks, sizeof(nlocks));
> if (err)
> goto out_fail;
>
> if (num_flock_locks > 0) {
> - err = ceph_pagelist_append(pagelist, &flocks[num_fcntl_locks],
> - num_flock_locks * sizeof(*flocks));
> + err = ceph_databuf_append(dbuf, &flocks[num_fcntl_locks],
> + num_flock_locks * sizeof(*flocks));
> }
> out_fail:
> return err;
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 09661a34f287..f1c6d0ebf548 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -55,7 +55,7 @@
> struct ceph_reconnect_state {
> struct ceph_mds_session *session;
> int nr_caps, nr_realms;
> - struct ceph_pagelist *pagelist;
> + struct ceph_databuf *dbuf;
> unsigned msg_version;
> bool allow_multi;
> };
> @@ -4456,8 +4456,7 @@ static void replay_unsafe_requests(struct ceph_mds_client *mdsc,
> static int send_reconnect_partial(struct ceph_reconnect_state *recon_state)
> {
> struct ceph_msg *reply;
> - struct ceph_pagelist *_pagelist;
> - struct page *page;
> + struct ceph_databuf *_dbuf;
> __le32 *addr;
> int err = -ENOMEM;
>
> @@ -4467,9 +4466,9 @@ static int send_reconnect_partial(struct ceph_reconnect_state *recon_state)
> /* can't handle message that contains both caps and realm */
> BUG_ON(!recon_state->nr_caps == !recon_state->nr_realms);
>
> - /* pre-allocate new pagelist */
> - _pagelist = ceph_pagelist_alloc(GFP_NOFS);
> - if (!_pagelist)
> + /* pre-allocate new databuf */
> + _dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOFS);
> + if (!_dbuf)
> return -ENOMEM;
>
> reply = ceph_msg_new2(CEPH_MSG_CLIENT_RECONNECT, 0, 1, GFP_NOFS, false);
> @@ -4477,28 +4476,27 @@ static int send_reconnect_partial(struct ceph_reconnect_state *recon_state)
> goto fail_msg;
>
> /* placeholder for nr_caps */
> - err = ceph_pagelist_encode_32(_pagelist, 0);
> + err = ceph_databuf_encode_32(_dbuf, 0);
> if (err < 0)
> goto fail;
>
> if (recon_state->nr_caps) {
> /* currently encoding caps */
> - err = ceph_pagelist_encode_32(recon_state->pagelist, 0);
> + err = ceph_databuf_encode_32(recon_state->dbuf, 0);
> if (err)
> goto fail;
> } else {
> /* placeholder for nr_realms (currently encoding relams) */
> - err = ceph_pagelist_encode_32(_pagelist, 0);
> + err = ceph_databuf_encode_32(_dbuf, 0);
> if (err < 0)
> goto fail;
> }
>
> - err = ceph_pagelist_encode_8(recon_state->pagelist, 1);
> + err = ceph_databuf_encode_8(recon_state->dbuf, 1);
> if (err)
> goto fail;
>
> - page = list_first_entry(&recon_state->pagelist->head, struct page, lru);
> - addr = kmap_atomic(page);
> + addr = kmap_ceph_databuf_page(recon_state->dbuf, 0);
> if (recon_state->nr_caps) {
> /* currently encoding caps */
> *addr = cpu_to_le32(recon_state->nr_caps);
> @@ -4506,18 +4504,18 @@ static int send_reconnect_partial(struct ceph_reconnect_state *recon_state)
> /* currently encoding relams */
> *(addr + 1) = cpu_to_le32(recon_state->nr_realms);
> }
> - kunmap_atomic(addr);
> + kunmap_local(addr);
>
> reply->hdr.version = cpu_to_le16(5);
> reply->hdr.compat_version = cpu_to_le16(4);
>
> - reply->hdr.data_len = cpu_to_le32(recon_state->pagelist->length);
> - ceph_msg_data_add_pagelist(reply, recon_state->pagelist);
> + reply->hdr.data_len = cpu_to_le32(ceph_databuf_len(recon_state->dbuf));
> + ceph_msg_data_add_databuf(reply, recon_state->dbuf);
>
> ceph_con_send(&recon_state->session->s_con, reply);
> - ceph_pagelist_release(recon_state->pagelist);
> + ceph_databuf_release(recon_state->dbuf);
>
> - recon_state->pagelist = _pagelist;
> + recon_state->dbuf = _dbuf;
> recon_state->nr_caps = 0;
> recon_state->nr_realms = 0;
> recon_state->msg_version = 5;
> @@ -4525,7 +4523,7 @@ static int send_reconnect_partial(struct ceph_reconnect_state *recon_state)
> fail:
> ceph_msg_put(reply);
> fail_msg:
> - ceph_pagelist_release(_pagelist);
> + ceph_databuf_release(_dbuf);
> return err;
> }
>
> @@ -4575,7 +4573,7 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
> } rec;
> struct ceph_inode_info *ci = ceph_inode(inode);
> struct ceph_reconnect_state *recon_state = arg;
> - struct ceph_pagelist *pagelist = recon_state->pagelist;
> + struct ceph_databuf *dbuf = recon_state->dbuf;
> struct dentry *dentry;
> struct ceph_cap *cap;
> char *path;
> @@ -4698,7 +4696,7 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
> struct_v = 2;
> }
> /*
> - * number of encoded locks is stable, so copy to pagelist
> + * number of encoded locks is stable, so copy to databuf
> */
> struct_len = 2 * sizeof(u32) +
> (num_fcntl_locks + num_flock_locks) *
I think we have too many mysterious equations in the CephFS code. :)
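For instance, naming the pieces might help (a sketch; the tail of the
expression is cut off above, but ceph_locks_to_databuf() suggests it is the
filelock array size):

	/* Two __le32 lock counts (#fcntl, #flock) precede the lock arrays. */
	struct_len = 2 * sizeof(__le32) +
		     (num_fcntl_locks + num_flock_locks) *
		     sizeof(struct ceph_filelock);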
> @@ -4712,41 +4710,42 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
>
> total_len += struct_len;
>
> - if (pagelist->length + total_len > RECONNECT_MAX_SIZE) {
> + if (ceph_databuf_len(dbuf) + total_len > RECONNECT_MAX_SIZE) {
> err = send_reconnect_partial(recon_state);
> if (err)
> goto out_freeflocks;
> - pagelist = recon_state->pagelist;
> + dbuf = recon_state->dbuf;
> }
>
> - err = ceph_pagelist_reserve(pagelist, total_len);
> + err = ceph_databuf_reserve(dbuf, total_len, GFP_NOFS);
> if (err)
> goto out_freeflocks;
>
> - ceph_pagelist_encode_64(pagelist, ceph_ino(inode));
> + ceph_databuf_encode_64(dbuf, ceph_ino(inode));
> if (recon_state->msg_version >= 3) {
> - ceph_pagelist_encode_8(pagelist, struct_v);
> - ceph_pagelist_encode_8(pagelist, 1);
> - ceph_pagelist_encode_32(pagelist, struct_len);
> + ceph_databuf_encode_8(dbuf, struct_v);
> + ceph_databuf_encode_8(dbuf, 1);
> + ceph_databuf_encode_32(dbuf, struct_len);
> }
> - ceph_pagelist_encode_string(pagelist, path, pathlen);
> - ceph_pagelist_append(pagelist, &rec, sizeof(rec.v2));
> - ceph_locks_to_pagelist(flocks, pagelist,
> - num_fcntl_locks, num_flock_locks);
> + ceph_databuf_encode_string(dbuf, path, pathlen);
> + ceph_databuf_append(dbuf, &rec, sizeof(rec.v2));
> + ceph_locks_to_databuf(flocks, dbuf,
> + num_fcntl_locks, num_flock_locks);
> if (struct_v >= 2)
> - ceph_pagelist_encode_64(pagelist, snap_follows);
> + ceph_databuf_encode_64(dbuf, snap_follows);
> out_freeflocks:
> kfree(flocks);
> } else {
> - err = ceph_pagelist_reserve(pagelist,
> - sizeof(u64) + sizeof(u32) +
> - pathlen + sizeof(rec.v1));
> + err = ceph_databuf_reserve(dbuf,
> + sizeof(u64) + sizeof(u32) +
> + pathlen + sizeof(rec.v1),
> + GFP_NOFS);
Yeah, another mysterious calculation. Why do we add sizeof(u64) and sizeof(u32)
here?
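From the encoding a few lines below, these look like the inode number and the
path-string length prefix. A sketch:

	err = ceph_databuf_reserve(dbuf,
				   sizeof(__le64) +	/* ceph_ino(inode) */
				   sizeof(__le32) +	/* path length prefix */
				   pathlen + sizeof(rec.v1),
				   GFP_NOFS);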
Thanks,
Slava.
> if (err)
> goto out_err;
>
> - ceph_pagelist_encode_64(pagelist, ceph_ino(inode));
> - ceph_pagelist_encode_string(pagelist, path, pathlen);
> - ceph_pagelist_append(pagelist, &rec, sizeof(rec.v1));
> + ceph_databuf_encode_64(dbuf, ceph_ino(inode));
> + ceph_databuf_encode_string(dbuf, path, pathlen);
> + ceph_databuf_append(dbuf, &rec, sizeof(rec.v1));
> }
>
> out_err:
> @@ -4760,12 +4759,12 @@ static int encode_snap_realms(struct ceph_mds_client *mdsc,
> struct ceph_reconnect_state *recon_state)
> {
> struct rb_node *p;
> - struct ceph_pagelist *pagelist = recon_state->pagelist;
> struct ceph_client *cl = mdsc->fsc->client;
> + struct ceph_databuf *dbuf = recon_state->dbuf;
> int err = 0;
>
> if (recon_state->msg_version >= 4) {
> - err = ceph_pagelist_encode_32(pagelist, mdsc->num_snap_realms);
> + err = ceph_databuf_encode_32(dbuf, mdsc->num_snap_realms);
> if (err < 0)
> goto fail;
> }
> @@ -4784,20 +4783,20 @@ static int encode_snap_realms(struct ceph_mds_client *mdsc,
> size_t need = sizeof(u8) * 2 + sizeof(u32) +
> sizeof(sr_rec);
>
> - if (pagelist->length + need > RECONNECT_MAX_SIZE) {
> + if (ceph_databuf_len(dbuf) + need > RECONNECT_MAX_SIZE) {
> err = send_reconnect_partial(recon_state);
> if (err)
> goto fail;
> - pagelist = recon_state->pagelist;
> + dbuf = recon_state->dbuf;
> }
>
> - err = ceph_pagelist_reserve(pagelist, need);
> + err = ceph_databuf_reserve(dbuf, need, GFP_NOFS);
> if (err)
> goto fail;
>
> - ceph_pagelist_encode_8(pagelist, 1);
> - ceph_pagelist_encode_8(pagelist, 1);
> - ceph_pagelist_encode_32(pagelist, sizeof(sr_rec));
> + ceph_databuf_encode_8(dbuf, 1);
> + ceph_databuf_encode_8(dbuf, 1);
> + ceph_databuf_encode_32(dbuf, sizeof(sr_rec));
> }
>
> doutc(cl, " adding snap realm %llx seq %lld parent %llx\n",
> @@ -4806,7 +4805,7 @@ static int encode_snap_realms(struct ceph_mds_client *mdsc,
> sr_rec.seq = cpu_to_le64(realm->seq);
> sr_rec.parent = cpu_to_le64(realm->parent_ino);
>
> - err = ceph_pagelist_append(pagelist, &sr_rec, sizeof(sr_rec));
> + err = ceph_databuf_append(dbuf, &sr_rec, sizeof(sr_rec));
> if (err)
> goto fail;
>
> @@ -4841,9 +4840,9 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
>
> pr_info_client(cl, "mds%d reconnect start\n", mds);
>
> - recon_state.pagelist = ceph_pagelist_alloc(GFP_NOFS);
> - if (!recon_state.pagelist)
> - goto fail_nopagelist;
> + recon_state.dbuf = ceph_databuf_req_alloc(1, 0, GFP_NOFS);
> + if (!recon_state.dbuf)
> + goto fail_nodatabuf;
>
> reply = ceph_msg_new2(CEPH_MSG_CLIENT_RECONNECT, 0, 1, GFP_NOFS, false);
> if (!reply)
> @@ -4891,7 +4890,7 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
> down_read(&mdsc->snap_rwsem);
>
> /* placeholder for nr_caps */
> - err = ceph_pagelist_encode_32(recon_state.pagelist, 0);
> + err = ceph_databuf_encode_32(recon_state.dbuf, 0);
> if (err)
> goto fail;
>
> @@ -4916,7 +4915,7 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
> /* check if all realms can be encoded into current message */
> if (mdsc->num_snap_realms) {
> size_t total_len =
> - recon_state.pagelist->length +
> + ceph_databuf_len(recon_state.dbuf) +
> mdsc->num_snap_realms *
> sizeof(struct ceph_mds_snaprealm_reconnect);
> if (recon_state.msg_version >= 4) {
> @@ -4945,31 +4944,28 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
> goto fail;
>
> if (recon_state.msg_version >= 5) {
> - err = ceph_pagelist_encode_8(recon_state.pagelist, 0);
> + err = ceph_databuf_encode_8(recon_state.dbuf, 0);
> if (err < 0)
> goto fail;
> }
>
> if (recon_state.nr_caps || recon_state.nr_realms) {
> - struct page *page =
> - list_first_entry(&recon_state.pagelist->head,
> - struct page, lru);
> - __le32 *addr = kmap_atomic(page);
> + __le32 *addr = kmap_ceph_databuf_page(recon_state.dbuf, 0);
> if (recon_state.nr_caps) {
> WARN_ON(recon_state.nr_realms != mdsc->num_snap_realms);
> *addr = cpu_to_le32(recon_state.nr_caps);
> } else if (recon_state.msg_version >= 4) {
> *(addr + 1) = cpu_to_le32(recon_state.nr_realms);
> }
> - kunmap_atomic(addr);
> + kunmap_local(addr);
> }
>
> reply->hdr.version = cpu_to_le16(recon_state.msg_version);
> if (recon_state.msg_version >= 4)
> reply->hdr.compat_version = cpu_to_le16(4);
>
> - reply->hdr.data_len = cpu_to_le32(recon_state.pagelist->length);
> - ceph_msg_data_add_pagelist(reply, recon_state.pagelist);
> + reply->hdr.data_len = cpu_to_le32(ceph_databuf_len(recon_state.dbuf));
> + ceph_msg_data_add_databuf(reply, recon_state.dbuf);
>
> ceph_con_send(&session->s_con, reply);
>
> @@ -4980,7 +4976,7 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
> mutex_unlock(&mdsc->mutex);
>
> up_read(&mdsc->snap_rwsem);
> - ceph_pagelist_release(recon_state.pagelist);
> + ceph_databuf_release(recon_state.dbuf);
> return;
>
> fail:
> @@ -4988,8 +4984,8 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
> up_read(&mdsc->snap_rwsem);
> mutex_unlock(&session->s_mutex);
> fail_nomsg:
> - ceph_pagelist_release(recon_state.pagelist);
> -fail_nopagelist:
> + ceph_databuf_release(recon_state.dbuf);
> +fail_nodatabuf:
> pr_err_client(cl, "error %d preparing reconnect for mds%d\n",
> err, mds);
> return;
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index 984a6d2a5378..b072572e2cf4 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -1351,9 +1351,9 @@ extern int ceph_encode_locks_to_buffer(struct inode *inode,
> struct ceph_filelock *flocks,
> int num_fcntl_locks,
> int num_flock_locks);
> -extern int ceph_locks_to_pagelist(struct ceph_filelock *flocks,
> - struct ceph_pagelist *pagelist,
> - int num_fcntl_locks, int num_flock_locks);
> +extern int ceph_locks_to_databuf(struct ceph_filelock *flocks,
> + struct ceph_databuf *dbuf,
> + int num_fcntl_locks, int num_flock_locks);
>
> /* debugfs.c */
> extern void ceph_fs_debugfs_init(struct ceph_fs_client *client);
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 6e126e212271..ce04205b8143 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -334,7 +334,7 @@ struct ceph_osd_linger_request {
> rados_watcherrcb_t errcb;
> void *data;
>
> - struct ceph_pagelist *request_pl;
> + struct ceph_databuf *request_pl;
> struct ceph_databuf *notify_id_buf;
>
> struct page ***preply_pages;
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 64a06267e7b3..a967309d01a7 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -810,37 +810,37 @@ int osd_req_op_xattr_init(struct ceph_osd_request *osd_req, unsigned int which,
> {
> struct ceph_osd_req_op *op = osd_req_op_init(osd_req, which,
> opcode, 0);
> - struct ceph_pagelist *pagelist;
> + struct ceph_databuf *dbuf;
> size_t payload_len;
> int ret;
>
> BUG_ON(opcode != CEPH_OSD_OP_SETXATTR && opcode != CEPH_OSD_OP_CMPXATTR);
>
> - pagelist = ceph_pagelist_alloc(GFP_NOFS);
> - if (!pagelist)
> + dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOFS);
> + if (!dbuf)
> return -ENOMEM;
>
> payload_len = strlen(name);
> op->xattr.name_len = payload_len;
> - ret = ceph_pagelist_append(pagelist, name, payload_len);
> + ret = ceph_databuf_append(dbuf, name, payload_len);
> if (ret)
> - goto err_pagelist_free;
> + goto err_databuf_free;
>
> op->xattr.value_len = size;
> - ret = ceph_pagelist_append(pagelist, value, size);
> + ret = ceph_databuf_append(dbuf, value, size);
> if (ret)
> - goto err_pagelist_free;
> + goto err_databuf_free;
> payload_len += size;
>
> op->xattr.cmp_op = cmp_op;
> op->xattr.cmp_mode = cmp_mode;
>
> - ceph_osd_data_pagelist_init(&op->xattr.osd_data, pagelist);
> + ceph_osd_databuf_init(&op->xattr.osd_data, dbuf);
> op->indata_len = payload_len;
> return 0;
>
> -err_pagelist_free:
> - ceph_pagelist_release(pagelist);
> +err_databuf_free:
> + ceph_databuf_release(dbuf);
> return ret;
> }
> EXPORT_SYMBOL(osd_req_op_xattr_init);
> @@ -864,15 +864,15 @@ static void osd_req_op_watch_init(struct ceph_osd_request *req, int which,
> * encoded in @request_pl
> */
> static void osd_req_op_notify_init(struct ceph_osd_request *req, int which,
> - u64 cookie, struct ceph_pagelist *request_pl)
> + u64 cookie, struct ceph_databuf *request_pl)
> {
> struct ceph_osd_req_op *op;
>
> op = osd_req_op_init(req, which, CEPH_OSD_OP_NOTIFY, 0);
> op->notify.cookie = cookie;
>
> - ceph_osd_data_pagelist_init(&op->notify.request_data, request_pl);
> - op->indata_len = request_pl->length;
> + ceph_osd_databuf_init(&op->notify.request_data, request_pl);
> + op->indata_len = ceph_databuf_len(request_pl);
> }
>
> /*
> @@ -2730,8 +2730,7 @@ static void linger_release(struct kref *kref)
> WARN_ON(!list_empty(&lreq->pending_lworks));
> WARN_ON(lreq->osd);
>
> - if (lreq->request_pl)
> - ceph_pagelist_release(lreq->request_pl);
> + ceph_databuf_release(lreq->request_pl);
> ceph_databuf_release(lreq->notify_id_buf);
> ceph_osdc_put_request(lreq->reg_req);
> ceph_osdc_put_request(lreq->ping_req);
> @@ -4800,30 +4799,30 @@ static int osd_req_op_notify_ack_init(struct ceph_osd_request *req, int which,
> u32 payload_len)
> {
> struct ceph_osd_req_op *op;
> - struct ceph_pagelist *pl;
> + struct ceph_databuf *dbuf;
> int ret;
>
> op = osd_req_op_init(req, which, CEPH_OSD_OP_NOTIFY_ACK, 0);
>
> - pl = ceph_pagelist_alloc(GFP_NOIO);
> - if (!pl)
> + dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO);
> + if (!dbuf)
> return -ENOMEM;
>
> - ret = ceph_pagelist_encode_64(pl, notify_id);
> - ret |= ceph_pagelist_encode_64(pl, cookie);
> + ret = ceph_databuf_encode_64(dbuf, notify_id);
> + ret |= ceph_databuf_encode_64(dbuf, cookie);
> if (payload) {
> - ret |= ceph_pagelist_encode_32(pl, payload_len);
> - ret |= ceph_pagelist_append(pl, payload, payload_len);
> + ret |= ceph_databuf_encode_32(dbuf, payload_len);
> + ret |= ceph_databuf_append(dbuf, payload, payload_len);
> } else {
> - ret |= ceph_pagelist_encode_32(pl, 0);
> + ret |= ceph_databuf_encode_32(dbuf, 0);
> }
> if (ret) {
> - ceph_pagelist_release(pl);
> + ceph_databuf_release(dbuf);
> return -ENOMEM;
> }
>
> - ceph_osd_data_pagelist_init(&op->notify_ack.request_data, pl);
> - op->indata_len = pl->length;
> + ceph_osd_databuf_init(&op->notify_ack.request_data, dbuf);
> + op->indata_len = ceph_databuf_len(dbuf);
> return 0;
> }
>
> @@ -4894,16 +4893,16 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc,
> if (!lreq)
> return -ENOMEM;
>
> - lreq->request_pl = ceph_pagelist_alloc(GFP_NOIO);
> + lreq->request_pl = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO);
> if (!lreq->request_pl) {
> ret = -ENOMEM;
> goto out_put_lreq;
> }
>
> - ret = ceph_pagelist_encode_32(lreq->request_pl, 1); /* prot_ver */
> - ret |= ceph_pagelist_encode_32(lreq->request_pl, timeout);
> - ret |= ceph_pagelist_encode_32(lreq->request_pl, payload_len);
> - ret |= ceph_pagelist_append(lreq->request_pl, payload, payload_len);
> + ret = ceph_databuf_encode_32(lreq->request_pl, 1); /* prot_ver */
> + ret |= ceph_databuf_encode_32(lreq->request_pl, timeout);
> + ret |= ceph_databuf_encode_32(lreq->request_pl, payload_len);
> + ret |= ceph_databuf_append(lreq->request_pl, payload, payload_len);
> if (ret) {
> ret = -ENOMEM;
> goto out_put_lreq;
>
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 21/35] libceph: Make notify code use ceph_databuf_enc_start/stop
2025-03-13 23:33 ` [RFC PATCH 21/35] libceph: Make notify code use ceph_databuf_enc_start/stop David Howells
@ 2025-03-18 20:12 ` Viacheslav Dubeyko
2025-03-18 22:36 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-18 20:12 UTC (permalink / raw)
To: Alex Markuze, slava@dubeyko.com, David Howells
Cc: linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
On Thu, 2025-03-13 at 23:33 +0000, David Howells wrote:
> Make the ceph_osdc_notify*() functions use ceph_databuf_enc_start() and
> ceph_databuf_enc_stop() when filling out the request data. Also use
> ceph_encode_*() rather than ceph_databuf_encode_*() as the latter will do
> an iterator copy to deal with page crossing and misalignment (the latter
> being something that the CPU will handle on some arches).
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Viacheslav Dubeyko <slava@dubeyko.com>
> cc: Alex Markuze <amarkuze@redhat.com>
> cc: Ilya Dryomov <idryomov@gmail.com>
> cc: ceph-devel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> ---
> net/ceph/osd_client.c | 55 +++++++++++++++++++++----------------------
> 1 file changed, 27 insertions(+), 28 deletions(-)
>
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 0ac439e7e730..1a0cb2cdcc52 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -4759,7 +4759,10 @@ static int osd_req_op_notify_ack_init(struct ceph_osd_request *req, int which,
> {
> struct ceph_osd_req_op *op;
> struct ceph_databuf *dbuf;
> - int ret;
> + void *p;
> +
> + if (!payload)
> + payload_len = 0;
>
> op = osd_req_op_init(req, which, CEPH_OSD_OP_NOTIFY_ACK, 0);
>
> @@ -4767,18 +4770,13 @@ static int osd_req_op_notify_ack_init(struct ceph_osd_request *req, int which,
> if (!dbuf)
> return -ENOMEM;
>
> - ret = ceph_databuf_encode_64(dbuf, notify_id);
> - ret |= ceph_databuf_encode_64(dbuf, cookie);
> - if (payload) {
> - ret |= ceph_databuf_encode_32(dbuf, payload_len);
> - ret |= ceph_databuf_append(dbuf, payload, payload_len);
> - } else {
> - ret |= ceph_databuf_encode_32(dbuf, 0);
> - }
> - if (ret) {
> - ceph_databuf_release(dbuf);
> - return -ENOMEM;
> - }
> + p = ceph_databuf_enc_start(dbuf);
> + ceph_encode_64(&p, notify_id);
> + ceph_encode_64(&p, cookie);
> + ceph_encode_32(&p, payload_len);
> + if (payload)
> + ceph_encode_copy(&p, payload, payload_len);
> + ceph_databuf_enc_stop(dbuf, p);
>
> ceph_osd_databuf_init(&op->notify_ack.request_data, dbuf);
> op->indata_len = ceph_databuf_len(dbuf);
> @@ -4840,8 +4838,12 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc,
> size_t *preply_len)
> {
> struct ceph_osd_linger_request *lreq;
> + void *p;
> int ret;
>
> + if (WARN_ON_ONCE(payload_len > PAGE_SIZE - 3 * 4))
Why PAGE_SIZE - 3 * 4? Could you make this clearer here?
> + return -EIO;
> +
> WARN_ON(!timeout);
> if (preply_pages) {
> *preply_pages = NULL;
> @@ -4852,20 +4854,19 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc,
> if (!lreq)
> return -ENOMEM;
>
> - lreq->request_pl = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO);
> + lreq->request_pl = ceph_databuf_req_alloc(0, 3 * 4 + payload_len,
The same question... :)
Thanks,
Slava.
> + GFP_NOIO);
> if (!lreq->request_pl) {
> ret = -ENOMEM;
> goto out_put_lreq;
> }
>
> - ret = ceph_databuf_encode_32(lreq->request_pl, 1); /* prot_ver */
> - ret |= ceph_databuf_encode_32(lreq->request_pl, timeout);
> - ret |= ceph_databuf_encode_32(lreq->request_pl, payload_len);
> - ret |= ceph_databuf_append(lreq->request_pl, payload, payload_len);
> - if (ret) {
> - ret = -ENOMEM;
> - goto out_put_lreq;
> - }
> + p = ceph_databuf_enc_start(lreq->request_pl);
> + ceph_encode_32(&p, 1); /* prot_ver */
> + ceph_encode_32(&p, timeout);
> + ceph_encode_32(&p, payload_len);
> + ceph_encode_copy(&p, payload, payload_len);
> + ceph_databuf_enc_stop(lreq->request_pl, p);
>
> /* for notify_id */
> lreq->notify_id_buf = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_NOIO);
> @@ -5217,7 +5218,7 @@ int osd_req_op_copy_from_init(struct ceph_osd_request *req,
> {
> struct ceph_osd_req_op *op;
> struct ceph_databuf *dbuf;
> - void *p, *end;
> + void *p;
>
> dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_KERNEL);
> if (!dbuf)
> @@ -5230,15 +5231,13 @@ int osd_req_op_copy_from_init(struct ceph_osd_request *req,
> op->copy_from.flags = copy_from_flags;
> op->copy_from.src_fadvise_flags = src_fadvise_flags;
>
> - p = kmap_ceph_databuf_page(dbuf, 0);
> - end = p + PAGE_SIZE;
> + p = ceph_databuf_enc_start(dbuf);
> ceph_encode_string(&p, src_oid->name, src_oid->name_len);
> encode_oloc(&p, src_oloc);
> ceph_encode_32(&p, truncate_seq);
> ceph_encode_64(&p, truncate_size);
> - op->indata_len = PAGE_SIZE - (end - p);
> - ceph_databuf_added_data(dbuf, op->indata_len);
> - kunmap_local(p);
> + ceph_databuf_enc_stop(dbuf, p);
> + op->indata_len = ceph_databuf_len(dbuf);
>
> ceph_osd_databuf_init(&op->copy_from.osd_data, dbuf);
> return 0;
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 13/35] rbd: Switch from using bvec_iter to iov_iter
2025-03-13 23:33 ` [RFC PATCH 13/35] rbd: Switch from using bvec_iter to iov_iter David Howells
2025-03-18 19:38 ` Viacheslav Dubeyko
@ 2025-03-18 22:13 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-18 22:13 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: dhowells, Alex Markuze, slava@dubeyko.com,
dongsheng.yang@easystack.cn, Xiubo Li,
linux-fsdevel@vger.kernel.org, ceph-devel@vger.kernel.org,
linux-kernel@vger.kernel.org, jlayton@kernel.org,
idryomov@gmail.com, linux-block@vger.kernel.org
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > + list_for_each_entry(ex, object_extents, oe_item) {
> > + if (ex->oe_objno == objno &&
>
> OK. I see the point that objno should be the same.
>
> > + ex->oe_off <= objoff &&
>
> But why could ex->oe_off be less than objoff? Could objoff be not exactly
> the same?
>
> > + ex->oe_off + ex->oe_len >= objoff + xlen) /* paranoia */
>
> Do we really need this comment? :)
>
> I am still wondering why ex->oe_off + ex->oe_len could be bigger than
> objoff + xlen. Is it possible that the object size or offset could be bigger?
Look further on in the patch. The code is preexisting, just moved a bit.
My guess is that we're looking at data from the server so it *has* to be
sanity checked before we can trust it.
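
For illustration, the containment invariant that the quoted condition
enforces can be written as a standalone predicate (a sketch only, reusing
the field names from the hunk; the helper name is made up):

#include <linux/types.h>

/* Does the cached extent fully cover [objoff, objoff + xlen)? */
static bool extent_covers(u64 oe_off, u64 oe_len, u64 objoff, u64 xlen)
{
	/* The extent must start at or before the requested range... */
	if (oe_off > objoff)
		return false;
	/* ...and end at or after it, so no byte of the range falls outside. */
	return oe_off + oe_len >= objoff + xlen;
}

If the server handed back an extent that only partially covers the
requested range, the predicate fails and the entry cannot be reused.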
David
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 17/35] libceph, rbd: Use ceph_databuf encoding start/stop
2025-03-13 23:33 ` [RFC PATCH 17/35] libceph, rbd: Use ceph_databuf encoding start/stop David Howells
2025-03-18 19:59 ` Viacheslav Dubeyko
@ 2025-03-18 22:19 ` David Howells
2025-03-20 21:45 ` Viacheslav Dubeyko
1 sibling, 1 reply; 72+ messages in thread
From: David Howells @ 2025-03-18 22:19 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: dhowells, Alex Markuze, slava@dubeyko.com,
linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > - ceph_encode_string(&p, end, buf, len);
> > + BUG_ON(p + sizeof(__le32) + len > end);
>
> Frankly speaking, it's hard to follow why sizeof(__le32) should be in the
> equation. Maybe it makes sense to introduce some constant? A named
> constant would make this calculation easier to understand.
Look through the patch. It's done all over the place, even on parts I haven't
touched. However, it's probably because of the way the string is encoded
(4-byte LE length followed by the characters).
It probably would make sense to use a calculation wrapper for this. I have
this in fs/afs/yfsclient.c for example:
static size_t xdr_strlen(unsigned int len)
{
return sizeof(__be32) + round_up(len, sizeof(__be32));
}
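
A hypothetical ceph counterpart (not an existing kernel API) would be even
simpler, since a ceph string is marshalled as a 4-byte LE length followed
by the unpadded characters:

#include <linux/types.h>

/* Wire size of a ceph-encoded string: length word plus raw characters. */
static inline size_t ceph_encoded_strlen(unsigned int len)
{
	return sizeof(__le32) + len;
}

With such a helper the quoted check would read
BUG_ON(p + ceph_encoded_strlen(len) > end), documenting the layout at the
point of use.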
> > + BUG_ON(sizeof(__le64) + sizeof(__le32) + wsize > req->request->front_alloc_len);
>
> The same problem is here. It's hard to follow this check with
> sizeof(__le64) and sizeof(__le32) involved in the calculation. What do
> these numbers mean here?
Presumably the sizes of the protocol elements in the marshalled data. If you
want to clean all those up in some way, I can add your patch into my
series;-).
David
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 18/35] libceph, rbd: Convert some page arrays to ceph_databuf
2025-03-13 23:33 ` [RFC PATCH 18/35] libceph, rbd: Convert some page arrays to ceph_databuf David Howells
2025-03-18 20:02 ` Viacheslav Dubeyko
@ 2025-03-18 22:25 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-18 22:25 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: dhowells, Alex Markuze, slava@dubeyko.com,
linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > /*
> > * The response data for a STAT call consists of:
> > @@ -2118,14 +2118,12 @@ static int rbd_osd_setup_stat(struct ceph_osd_request *osd_req, int which)
> > * le32 tv_nsec;
> > * } mtime;
> > */
> > - pages = ceph_alloc_page_vector(1, GFP_NOIO);
> > - if (IS_ERR(pages))
> > - return PTR_ERR(pages);
> > + dbuf = ceph_databuf_reply_alloc(1, 8 + sizeof(struct ceph_timespec), GFP_NOIO);
>
> What does this 8 + sizeof(struct ceph_timespec) mean? Why do we use 8 here? :)
See the comment that's partially obscured by the patch hunk line:
/*
* The response data for a STAT call consists of:
* le64 length;
* struct {
* le32 tv_sec;
* le32 tv_nsec;
* } mtime;
*/
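
Spelled out as a struct, the arithmetic looks like this (the reply struct
name is hypothetical; ceph_timespec mirrors the kernel's definition):

#include <linux/types.h>

/* Mirrors the kernel's struct ceph_timespec: two packed LE32 fields. */
struct ceph_timespec {
	__le32 tv_sec;
	__le32 tv_nsec;
} __attribute__((packed));

/* Hypothetical layout of the STAT reply described in the comment. */
struct rbd_stat_reply {
	__le64 length;			/* the "8" in the allocation */
	struct ceph_timespec mtime;	/* le32 tv_sec + le32 tv_nsec */
} __attribute__((packed));

/* sizeof(struct rbd_stat_reply) == 8 + sizeof(struct ceph_timespec) == 16 */

so the reply buffer is sized to hold exactly one such record.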
If you want to clean up and formalise all of these sorts of things, you might
need to invest in an rpcgen-like tool. I've occasionally toyed with the idea
for afs in the kernel (I've hand-written all the marshalling/unmarshalling
code in fs/afs/fsclient.c, fs/afs/yfsclient.c and fs/afs/vlclient.c, but there
are some not-so-simple RPC calls to handle - FetchData and StoreData for
example).
David
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 19/35] libceph, ceph: Convert users of ceph_pagelist to ceph_databuf
2025-03-13 23:33 ` [RFC PATCH 19/35] libceph, ceph: Convert users of ceph_pagelist " David Howells
2025-03-18 20:09 ` Viacheslav Dubeyko
@ 2025-03-18 22:27 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-18 22:27 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: dhowells, Alex Markuze, slava@dubeyko.com,
linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > /*
> > - * number of encoded locks is stable, so copy to pagelist
> > + * number of encoded locks is stable, so copy to databuf
> > */
> > struct_len = 2 * sizeof(u32) +
> > (num_fcntl_locks + num_flock_locks) *
>
> I think we have too many mysterious equations in CephFS code. :)
That's not particularly a function of these patches.
> > - err = ceph_pagelist_reserve(pagelist,
> > - sizeof(u64) + sizeof(u32) +
> > - pathlen + sizeof(rec.v1));
> > + err = ceph_databuf_reserve(dbuf,
> > + sizeof(u64) + sizeof(u32) +
> > + pathlen + sizeof(rec.v1),
> > + GFP_NOFS);
>
> Yeah, another mysterious calculation. Why do we add sizeof(u64) and
> sizeof(u32) here?
Protocol element space.
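
Expanded as a sketch (the helper is made up and the field naming is my
presumption about the layout, not the kernel code itself):

#include <linux/types.h>

static inline size_t reconnect_v1_reserve(size_t pathlen, size_t rec_v1_len)
{
	/*
	 * Presumed layout: one 64-bit protocol field, the 32-bit length
	 * word that prefixes the path string, the path bytes themselves,
	 * and then the v1 record.
	 */
	return sizeof(u64) + sizeof(u32) + pathlen + rec_v1_len;
}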
David
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 21/35] libceph: Make notify code use ceph_databuf_enc_start/stop
2025-03-13 23:33 ` [RFC PATCH 21/35] libceph: Make notify code use ceph_databuf_enc_start/stop David Howells
2025-03-18 20:12 ` Viacheslav Dubeyko
@ 2025-03-18 22:36 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-18 22:36 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: dhowells, Alex Markuze, slava@dubeyko.com,
linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > @@ -4852,20 +4854,19 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc,
> > if (!lreq)
> > return -ENOMEM;
> >
> > - lreq->request_pl = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO);
> > + lreq->request_pl = ceph_databuf_req_alloc(0, 3 * 4 + payload_len,
>
> The same question... :)
>
> Thanks,
> Slava.
>
> > + GFP_NOIO);
> > if (!lreq->request_pl) {
> > ret = -ENOMEM;
> > goto out_put_lreq;
> > }
> >
> > - ret = ceph_databuf_encode_32(lreq->request_pl, 1); /* prot_ver */
> > - ret |= ceph_databuf_encode_32(lreq->request_pl, timeout);
> > - ret |= ceph_databuf_encode_32(lreq->request_pl, payload_len);
> > - ret |= ceph_databuf_append(lreq->request_pl, payload, payload_len);
> > - if (ret) {
> > - ret = -ENOMEM;
> > - goto out_put_lreq;
> > - }
> > + p = ceph_databuf_enc_start(lreq->request_pl);
> > + ceph_encode_32(&p, 1); /* prot_ver */
> > + ceph_encode_32(&p, timeout);
> > + ceph_encode_32(&p, payload_len);
> > + ceph_encode_copy(&p, payload, payload_len);
> > + ceph_databuf_enc_stop(lreq->request_pl, p);
I think the answer is obvious from that. You have 3 protocol LE32 words plus
the payload. Previously, ceph just allocated a whole page, whether or not we
needed anywhere near that much, and then would dynamically add pages as it
went along if one was insufficient. By allocating up front, we get to make
use of the bulk allocator.
However, if we don't need all that much space, it affords us the opportunity
to make use of a page fragment allocator.
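
Written out as a sketch (the names here are illustrative, not kernel API),
the sizing is:

#include <linux/types.h>

/* prot_ver + timeout + payload_len, each a 32-bit LE protocol word. */
#define NOTIFY_REQ_HDR_LEN	(3 * sizeof(__le32))

static inline size_t notify_request_size(u32 payload_len)
{
	return NOTIFY_REQ_HDR_LEN + payload_len;
}

which is where both the "3 * 4 + payload_len" allocation and the
"PAGE_SIZE - 3 * 4" limit in the quoted hunks come from.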
David
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 22/35] libceph, rbd: Convert ceph_osdc_notify() reply to ceph_databuf
2025-03-13 23:33 ` [RFC PATCH 22/35] libceph, rbd: Convert ceph_osdc_notify() reply to ceph_databuf David Howells
@ 2025-03-19 0:08 ` Viacheslav Dubeyko
2025-03-20 14:44 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-19 0:08 UTC (permalink / raw)
To: Alex Markuze, slava@dubeyko.com, David Howells
Cc: linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
On Thu, 2025-03-13 at 23:33 +0000, David Howells wrote:
> Convert the reply buffer of ceph_osdc_notify() to ceph_databuf rather than
> an array of pages.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Viacheslav Dubeyko <slava@dubeyko.com>
> cc: Alex Markuze <amarkuze@redhat.com>
> cc: Ilya Dryomov <idryomov@gmail.com>
> cc: ceph-devel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> ---
> drivers/block/rbd.c | 36 +++++++++++++++++----------
> include/linux/ceph/databuf.h | 16 ++++++++++++
> include/linux/ceph/osd_client.h | 7 ++----
> net/ceph/osd_client.c | 44 +++++++++++----------------------
> 4 files changed, 55 insertions(+), 48 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index eea12c7ab2a0..a2674077edea 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -3585,8 +3585,7 @@ static void rbd_unlock(struct rbd_device *rbd_dev)
>
> static int __rbd_notify_op_lock(struct rbd_device *rbd_dev,
> enum rbd_notify_op notify_op,
> - struct page ***preply_pages,
> - size_t *preply_len)
> + struct ceph_databuf *reply)
> {
> struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
> struct rbd_client_id cid = rbd_get_cid(rbd_dev);
> @@ -3604,13 +3603,13 @@ static int __rbd_notify_op_lock(struct rbd_device *rbd_dev,
>
> return ceph_osdc_notify(osdc, &rbd_dev->header_oid,
> &rbd_dev->header_oloc, buf, buf_size,
> - RBD_NOTIFY_TIMEOUT, preply_pages, preply_len);
> + RBD_NOTIFY_TIMEOUT, reply);
> }
>
> static void rbd_notify_op_lock(struct rbd_device *rbd_dev,
> enum rbd_notify_op notify_op)
> {
> - __rbd_notify_op_lock(rbd_dev, notify_op, NULL, NULL);
> + __rbd_notify_op_lock(rbd_dev, notify_op, NULL);
> }
>
> static void rbd_notify_acquired_lock(struct work_struct *work)
> @@ -3631,23 +3630,29 @@ static void rbd_notify_released_lock(struct work_struct *work)
>
> static int rbd_request_lock(struct rbd_device *rbd_dev)
> {
> - struct page **reply_pages;
> - size_t reply_len;
> + struct ceph_databuf *reply;
> bool lock_owner_responded = false;
> int ret;
>
> dout("%s rbd_dev %p\n", __func__, rbd_dev);
>
> - ret = __rbd_notify_op_lock(rbd_dev, RBD_NOTIFY_OP_REQUEST_LOCK,
> - &reply_pages, &reply_len);
> + /* The actual reply pages will be allocated in the read path and then
> + * pasted in in handle_watch_notify().
> + */
> + reply = ceph_databuf_reply_alloc(0, 0, GFP_KERNEL);
> + if (!reply)
> + return -ENOMEM;
> +
> + ret = __rbd_notify_op_lock(rbd_dev, RBD_NOTIFY_OP_REQUEST_LOCK, reply);
> if (ret && ret != -ETIMEDOUT) {
> rbd_warn(rbd_dev, "failed to request lock: %d", ret);
> goto out;
> }
>
> - if (reply_len > 0 && reply_len <= PAGE_SIZE) {
> - void *p = page_address(reply_pages[0]);
> - void *const end = p + reply_len;
> + if (ceph_databuf_len(reply) > 0 && ceph_databuf_len(reply) <= PAGE_SIZE) {
> + void *s = kmap_ceph_databuf_page(reply, 0);
Maybe call it start instead of s?
> + void *p = s;
> + void *const end = p + ceph_databuf_len(reply);
> u32 n;
>
> ceph_decode_32_safe(&p, end, n, e_inval); /* num_acks */
> @@ -3659,10 +3664,12 @@ static int rbd_request_lock(struct rbd_device *rbd_dev)
> p += 8 + 8; /* skip gid and cookie */
>
> ceph_decode_32_safe(&p, end, len, e_inval);
> - if (!len)
> + if (!len) {
> continue;
> + }
>
> if (lock_owner_responded) {
> + kunmap_local(s);
> rbd_warn(rbd_dev,
> "duplicate lock owners detected");
> ret = -EIO;
> @@ -3673,6 +3680,7 @@ static int rbd_request_lock(struct rbd_device *rbd_dev)
> ret = ceph_start_decoding(&p, end, 1, "ResponseMessage",
> &struct_v, &len);
> if (ret) {
> + kunmap_local(s);
Is it possible to have kunmap_local() in only one place and to use a goto to
jump there?
> rbd_warn(rbd_dev,
> "failed to decode ResponseMessage: %d",
> ret);
> @@ -3681,6 +3689,8 @@ static int rbd_request_lock(struct rbd_device *rbd_dev)
>
> ret = ceph_decode_32(&p);
> }
> +
> + kunmap_local(s);
> }
>
> if (!lock_owner_responded) {
> @@ -3689,7 +3699,7 @@ static int rbd_request_lock(struct rbd_device *rbd_dev)
> }
>
> out:
> - ceph_release_page_vector(reply_pages, calc_pages_for(0, reply_len));
> + ceph_databuf_release(reply);
> return ret;
>
> e_inval:
> diff --git a/include/linux/ceph/databuf.h b/include/linux/ceph/databuf.h
> index 54b76d0c91a0..25154b3d08fa 100644
> --- a/include/linux/ceph/databuf.h
> +++ b/include/linux/ceph/databuf.h
> @@ -150,4 +150,20 @@ static inline bool ceph_databuf_is_all_zero(struct ceph_databuf *dbuf, size_t co
> ceph_databuf_scan_for_nonzero) == count;
> }
>
> +static inline void ceph_databuf_transfer(struct ceph_databuf *to,
> + struct ceph_databuf *from)
> +{
> + BUG_ON(to->nr_bvec || to->bvec);
> + to->bvec = from->bvec;
> + to->nr_bvec = from->nr_bvec;
> + to->max_bvec = from->max_bvec;
> + to->limit = from->limit;
> + to->iter = from->iter;
> +
> + from->bvec = NULL;
> + from->nr_bvec = from->max_bvec = 0;
> + from->limit = 0;
> + iov_iter_discard(&from->iter, ITER_DEST, 0);
> +}
> +
> #endif /* __FS_CEPH_DATABUF_H */
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 5a1ee66ca216..7eff589711cc 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -333,9 +333,7 @@ struct ceph_osd_linger_request {
>
> struct ceph_databuf *request_pl;
> struct ceph_databuf *notify_id_buf;
> -
> - struct page ***preply_pages;
Really!!! We had a pointer to a pointer to a pointer... :) Damn, I have never
seen anything like this.
> - size_t *preply_len;
> + struct ceph_databuf *reply;
> };
>
> struct ceph_watch_item {
> @@ -589,8 +587,7 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc,
> void *payload,
> u32 payload_len,
> u32 timeout,
> - struct page ***preply_pages,
> - size_t *preply_len);
> + struct ceph_databuf *reply);
> int ceph_osdc_list_watchers(struct ceph_osd_client *osdc,
> struct ceph_object_id *oid,
> struct ceph_object_locator *oloc,
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 1a0cb2cdcc52..92aaa5ed9145 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -4523,17 +4523,11 @@ static void handle_watch_notify(struct ceph_osd_client *osdc,
> dout("lreq %p notify_id %llu != %llu, ignoring\n", lreq,
> lreq->notify_id, notify_id);
> } else if (!completion_done(&lreq->notify_finish_wait)) {
> - struct ceph_msg_data *data =
> - msg->num_data_items ? &msg->data[0] : NULL;
> -
> - if (data) {
> - if (lreq->preply_pages) {
> - WARN_ON(data->type !=
> - CEPH_MSG_DATA_PAGES);
> - *lreq->preply_pages = data->pages;
> - *lreq->preply_len = data->length;
> - data->own_pages = false;
> - }
> + if (msg->num_data_items && lreq->reply) {
> + struct ceph_msg_data *data = &msg->data[0];
This low-level access slightly worries me. I don't see any real problem here.
But maybe we need to hide this access in some iterator-like function? However,
that may not be feasible within the scope of this patchset.
Thanks,
Slava.
> +
> + WARN_ON(data->type != CEPH_MSG_DATA_DATABUF);
> + ceph_databuf_transfer(lreq->reply, data->dbuf);
> }
> lreq->notify_finish_error = return_code;
> complete_all(&lreq->notify_finish_wait);
> @@ -4823,10 +4817,7 @@ EXPORT_SYMBOL(ceph_osdc_notify_ack);
> /*
> * @timeout: in seconds
> *
> - * @preply_{pages,len} are initialized both on success and error.
> - * The caller is responsible for:
> - *
> - * ceph_release_page_vector(reply_pages, calc_pages_for(0, reply_len))
> + * @reply should be an empty ceph_databuf.
> */
> int ceph_osdc_notify(struct ceph_osd_client *osdc,
> struct ceph_object_id *oid,
> @@ -4834,8 +4825,7 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc,
> void *payload,
> u32 payload_len,
> u32 timeout,
> - struct page ***preply_pages,
> - size_t *preply_len)
> + struct ceph_databuf *reply)
> {
> struct ceph_osd_linger_request *lreq;
> void *p;
> @@ -4845,10 +4835,6 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc,
> return -EIO;
>
> WARN_ON(!timeout);
> - if (preply_pages) {
> - *preply_pages = NULL;
> - *preply_len = 0;
> - }
>
> lreq = linger_alloc(osdc);
> if (!lreq)
> @@ -4875,8 +4861,7 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc,
> goto out_put_lreq;
> }
>
> - lreq->preply_pages = preply_pages;
> - lreq->preply_len = preply_len;
> + lreq->reply = reply;
>
> ceph_oid_copy(&lreq->t.base_oid, oid);
> ceph_oloc_copy(&lreq->t.base_oloc, oloc);
> @@ -5383,7 +5368,7 @@ static struct ceph_msg *get_reply(struct ceph_connection *con,
> return m;
> }
>
> -static struct ceph_msg *alloc_msg_with_page_vector(struct ceph_msg_header *hdr)
> +static struct ceph_msg *alloc_msg_with_data_buffer(struct ceph_msg_header *hdr)
> {
> struct ceph_msg *m;
> int type = le16_to_cpu(hdr->type);
> @@ -5395,16 +5380,15 @@ static struct ceph_msg *alloc_msg_with_page_vector(struct ceph_msg_header *hdr)
> return NULL;
>
> if (data_len) {
> - struct page **pages;
> + struct ceph_databuf *dbuf;
>
> - pages = ceph_alloc_page_vector(calc_pages_for(0, data_len),
> - GFP_NOIO);
> - if (IS_ERR(pages)) {
> + dbuf = ceph_databuf_reply_alloc(0, data_len, GFP_NOIO);
> + if (!dbuf) {
> ceph_msg_put(m);
> return NULL;
> }
>
> - ceph_msg_data_add_pages(m, pages, data_len, 0, true);
> + ceph_msg_data_add_databuf(m, dbuf);
> }
>
> return m;
> @@ -5422,7 +5406,7 @@ static struct ceph_msg *osd_alloc_msg(struct ceph_connection *con,
> case CEPH_MSG_OSD_MAP:
> case CEPH_MSG_OSD_BACKOFF:
> case CEPH_MSG_WATCH_NOTIFY:
> - return alloc_msg_with_page_vector(hdr);
> + return alloc_msg_with_data_buffer(hdr);
> case CEPH_MSG_OSD_OPREPLY:
> return get_reply(con, hdr, skip);
> default:
>
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 23/35] rbd: Use ceph_databuf_enc_start/stop()
2025-03-13 23:33 ` [RFC PATCH 23/35] rbd: Use ceph_databuf_enc_start/stop() David Howells
@ 2025-03-19 0:32 ` Viacheslav Dubeyko
2025-03-20 14:59 ` Why use plain numbers and totals rather than predef'd constants for RPC sizes? David Howells
1 sibling, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-19 0:32 UTC (permalink / raw)
To: Alex Markuze, slava@dubeyko.com, David Howells
Cc: linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
On Thu, 2025-03-13 at 23:33 +0000, David Howells wrote:
> Make rbd use ceph_databuf_enc_start() and ceph_databuf_enc_stop() when
> filling out the request data. Also use ceph_encode_*() rather than
> ceph_databuf_encode_*() as the latter will do an iterator copy to deal with
> page crossing and misalignment (the latter being something that the CPU
> will handle on some arches).
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Viacheslav Dubeyko <slava@dubeyko.com>
> cc: Alex Markuze <amarkuze@redhat.com>
> cc: Ilya Dryomov <idryomov@gmail.com>
> cc: ceph-devel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> ---
> drivers/block/rbd.c | 64 ++++++++++++++++++++++-----------------------
> 1 file changed, 31 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index a2674077edea..956fc4a8f1da 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -1970,19 +1970,19 @@ static int rbd_cls_object_map_update(struct ceph_osd_request *req,
> int which, u64 objno, u8 new_state,
> const u8 *current_state)
> {
> - struct ceph_databuf *dbuf;
> - void *p, *start;
> + struct ceph_databuf *request;
> + void *p;
> int ret;
>
> ret = osd_req_op_cls_init(req, which, "rbd", "object_map_update");
> if (ret)
> return ret;
>
> - dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO);
> - if (!dbuf)
> + request = ceph_databuf_req_alloc(1, 8 * 2 + 3 * 1, GFP_NOIO);
This 8 * 2 + 3 * 1 is too unclear for me. :) Could we introduce named constants
here?
> + if (!request)
> return -ENOMEM;
>
> - p = start = kmap_ceph_databuf_page(dbuf, 0);
> + p = ceph_databuf_enc_start(request);
> ceph_encode_64(&p, objno);
> ceph_encode_64(&p, objno + 1);
> ceph_encode_8(&p, new_state);
> @@ -1992,10 +1992,9 @@ static int rbd_cls_object_map_update(struct ceph_osd_request *req,
> } else {
> ceph_encode_8(&p, 0);
> }
> - kunmap_local(p);
> - ceph_databuf_added_data(dbuf, p - start);
> + ceph_databuf_enc_stop(request, p);
>
> - osd_req_op_cls_request_databuf(req, which, dbuf);
> + osd_req_op_cls_request_databuf(req, which, request);
> return 0;
> }
>
> @@ -2108,7 +2107,7 @@ static int rbd_obj_calc_img_extents(struct rbd_obj_request *obj_req,
>
> static int rbd_osd_setup_stat(struct ceph_osd_request *osd_req, int which)
> {
> - struct ceph_databuf *dbuf;
> + struct ceph_databuf *request;
>
> /*
> * The response data for a STAT call consists of:
> @@ -2118,12 +2117,12 @@ static int rbd_osd_setup_stat(struct ceph_osd_request *osd_req, int which)
> * le32 tv_nsec;
> * } mtime;
> */
> - dbuf = ceph_databuf_reply_alloc(1, 8 + sizeof(struct ceph_timespec), GFP_NOIO);
> - if (!dbuf)
> + request = ceph_databuf_reply_alloc(1, 8 + sizeof(struct ceph_timespec), GFP_NOIO);
Ditto. Why do we have 8 + sizeof(struct ceph_timespec) here?
Thanks,
Slava.
> + if (!request)
> return -ENOMEM;
>
> osd_req_op_init(osd_req, which, CEPH_OSD_OP_STAT, 0);
> - osd_req_op_raw_data_in_databuf(osd_req, which, dbuf);
> + osd_req_op_raw_data_in_databuf(osd_req, which, request);
> return 0;
> }
>
> @@ -2964,16 +2963,16 @@ static int rbd_obj_copyup_current_snapc(struct rbd_obj_request *obj_req,
>
> static int setup_copyup_buf(struct rbd_obj_request *obj_req, u64 obj_overlap)
> {
> - struct ceph_databuf *dbuf;
> + struct ceph_databuf *request;
>
> rbd_assert(!obj_req->copyup_buf);
>
> - dbuf = ceph_databuf_req_alloc(calc_pages_for(0, obj_overlap),
> + request = ceph_databuf_req_alloc(calc_pages_for(0, obj_overlap),
> obj_overlap, GFP_NOIO);
> - if (!dbuf)
> + if (!request)
> return -ENOMEM;
>
> - obj_req->copyup_buf = dbuf;
> + obj_req->copyup_buf = request;
> return 0;
> }
>
> @@ -4580,10 +4579,9 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev,
> if (!request)
> return -ENOMEM;
>
> - p = kmap_ceph_databuf_page(request, 0);
> - memcpy(p, outbound, outbound_size);
> - kunmap_local(p);
> - ceph_databuf_added_data(request, outbound_size);
> + p = ceph_databuf_enc_start(request);
> + ceph_encode_copy(&p, outbound, outbound_size);
> + ceph_databuf_enc_stop(request, p);
> }
>
> reply = ceph_databuf_reply_alloc(1, inbound_size, GFP_KERNEL);
> @@ -4712,7 +4710,7 @@ static void rbd_free_disk(struct rbd_device *rbd_dev)
> static int rbd_obj_read_sync(struct rbd_device *rbd_dev,
> struct ceph_object_id *oid,
> struct ceph_object_locator *oloc,
> - struct ceph_databuf *dbuf, int len)
> + struct ceph_databuf *request, int len)
> {
> struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
> struct ceph_osd_request *req;
> @@ -4727,7 +4725,7 @@ static int rbd_obj_read_sync(struct rbd_device *rbd_dev,
> req->r_flags = CEPH_OSD_FLAG_READ;
>
> osd_req_op_extent_init(req, 0, CEPH_OSD_OP_READ, 0, len, 0, 0);
> - osd_req_op_extent_osd_databuf(req, 0, dbuf);
> + osd_req_op_extent_osd_databuf(req, 0, request);
>
> ret = ceph_osdc_alloc_messages(req, GFP_KERNEL);
> if (ret)
> @@ -4750,16 +4748,16 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev,
> bool first_time)
> {
> struct rbd_image_header_ondisk *ondisk;
> - struct ceph_databuf *dbuf = NULL;
> + struct ceph_databuf *request = NULL;
> u32 snap_count = 0;
> u64 names_size = 0;
> u32 want_count;
> int ret;
>
> - dbuf = ceph_databuf_req_alloc(1, sizeof(*ondisk), GFP_KERNEL);
> - if (!dbuf)
> + request = ceph_databuf_req_alloc(1, sizeof(*ondisk), GFP_KERNEL);
> + if (!request)
> return -ENOMEM;
> - ondisk = kmap_ceph_databuf_page(dbuf, 0);
> + ondisk = kmap_ceph_databuf_page(request, 0);
>
> /*
> * The complete header will include an array of its 64-bit
> @@ -4776,13 +4774,13 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev,
> size += names_size;
>
> ret = -ENOMEM;
> - if (size > dbuf->limit &&
> - ceph_databuf_reserve(dbuf, size - dbuf->limit,
> + if (size > request->limit &&
> + ceph_databuf_reserve(request, size - request->limit,
> GFP_KERNEL) < 0)
> goto out;
>
> ret = rbd_obj_read_sync(rbd_dev, &rbd_dev->header_oid,
> - &rbd_dev->header_oloc, dbuf, size);
> + &rbd_dev->header_oloc, request, size);
> if (ret < 0)
> goto out;
> if ((size_t)ret < size) {
> @@ -4806,7 +4804,7 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev,
> ret = rbd_header_from_disk(header, ondisk, first_time);
> out:
> kunmap_local(ondisk);
> - ceph_databuf_release(dbuf);
> + ceph_databuf_release(request);
> return ret;
> }
>
> @@ -5625,10 +5623,10 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev,
> if (!reply)
> goto out_free;
>
> - p = kmap_ceph_databuf_page(request, 0);
> + p = ceph_databuf_enc_start(request);
> ceph_encode_64(&p, rbd_dev->spec->snap_id);
> - kunmap_local(p);
> - ceph_databuf_added_data(request, sizeof(__le64));
> + ceph_databuf_enc_stop(request, p);
> +
> ret = __get_parent_info(rbd_dev, request, reply, pii);
> if (ret > 0)
> ret = __get_parent_info_legacy(rbd_dev, request, reply, pii);
>
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 28/35] netfs: Adjust group handling
2025-03-13 23:33 ` [RFC PATCH 28/35] netfs: Adjust group handling David Howells
@ 2025-03-19 18:57 ` Viacheslav Dubeyko
2025-03-20 15:22 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-19 18:57 UTC (permalink / raw)
To: Alex Markuze, slava@dubeyko.com, David Howells
Cc: linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
On Thu, 2025-03-13 at 23:33 +0000, David Howells wrote:
> Make some adjustments to the handling of netfs groups so that ceph can use
> them for snap contexts:
>
> - Move netfs_get_group(), netfs_put_group() and netfs_put_group_many() to
> linux/netfs.h so that ceph can build its snap context on netfs groups.
>
> - Move netfs_set_group() and __netfs_set_group() to linux/netfs.h so that
> ceph_dirty_folio() can call them from inside of the locked section in
> which it finds the snap context to attach.
>
> - Provide a netfs_writepages_group() that takes a group as a parameter and
> attaches it to the request and make netfs_free_request() drop the ref on
> it. netfs_writepages() then becomes a wrapper that passes in a NULL
> group.
>
> - In netfs_perform_write(), only consider a folio to have a conflicting
> group if the folio's group pointer isn't NULL and if the folio is dirty.
>
> - In netfs_perform_write(), interject a small 10ms sleep after every 16
> attempts to flush a folio within a single call.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Jeff Layton <jlayton@kernel.org>
> cc: Viacheslav Dubeyko <slava@dubeyko.com>
> cc: Alex Markuze <amarkuze@redhat.com>
> cc: Ilya Dryomov <idryomov@gmail.com>
> cc: ceph-devel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> ---
> fs/netfs/buffered_write.c | 25 ++++-------------
> fs/netfs/internal.h | 32 ---------------------
> fs/netfs/objects.c | 1 +
> fs/netfs/write_issue.c | 38 +++++++++++++++++++++----
> include/linux/netfs.h | 59 +++++++++++++++++++++++++++++++++++++++
> 5 files changed, 98 insertions(+), 57 deletions(-)
>
> diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
> index 0245449b93e3..12ddbe9bc78b 100644
> --- a/fs/netfs/buffered_write.c
> +++ b/fs/netfs/buffered_write.c
> @@ -11,26 +11,9 @@
> #include <linux/pagemap.h>
> #include <linux/slab.h>
> #include <linux/pagevec.h>
> +#include <linux/delay.h>
> #include "internal.h"
>
> -static void __netfs_set_group(struct folio *folio, struct netfs_group *netfs_group)
> -{
> - if (netfs_group)
> - folio_attach_private(folio, netfs_get_group(netfs_group));
> -}
> -
> -static void netfs_set_group(struct folio *folio, struct netfs_group *netfs_group)
> -{
> - void *priv = folio_get_private(folio);
> -
> - if (unlikely(priv != netfs_group)) {
> - if (netfs_group && (!priv || priv == NETFS_FOLIO_COPY_TO_CACHE))
> - folio_attach_private(folio, netfs_get_group(netfs_group));
> - else if (!netfs_group && priv == NETFS_FOLIO_COPY_TO_CACHE)
> - folio_detach_private(folio);
> - }
> -}
> -
> /*
> * Grab a folio for writing and lock it. Attempt to allocate as large a folio
> * as possible to hold as much of the remaining length as possible in one go.
> @@ -113,6 +96,7 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
> };
> struct netfs_io_request *wreq = NULL;
> struct folio *folio = NULL, *writethrough = NULL;
> + unsigned int flush_counter = 0;
> unsigned int bdp_flags = (iocb->ki_flags & IOCB_NOWAIT) ? BDP_ASYNC : 0;
> ssize_t written = 0, ret, ret2;
> loff_t i_size, pos = iocb->ki_pos;
> @@ -208,7 +192,8 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
> group = netfs_folio_group(folio);
>
> if (unlikely(group != netfs_group) &&
> - group != NETFS_FOLIO_COPY_TO_CACHE)
> + group != NETFS_FOLIO_COPY_TO_CACHE &&
> + (group || folio_test_dirty(folio)))
I am trying to follow this complex condition. Is it a possible case that the
folio is dirty but we don't flush the content?
> goto flush_content;
>
> if (folio_test_uptodate(folio)) {
> @@ -341,6 +326,8 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
> trace_netfs_folio(folio, netfs_flush_content);
> folio_unlock(folio);
> folio_put(folio);
> + if ((++flush_counter & 0xf) == 0xf)
> + msleep(10);
Do we really need to use a sleep? And why is it 10 ms? Even if we would like
to use a sleep, it would be better to introduce a named constant. And what is
the justification for 10 ms?
> ret = filemap_write_and_wait_range(mapping, fpos, fpos + flen - 1);
> if (ret < 0)
> goto error_folio_unlock;
> diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
> index eebb4f0f660e..2a6123c4da35 100644
> --- a/fs/netfs/internal.h
> +++ b/fs/netfs/internal.h
> @@ -261,38 +261,6 @@ static inline bool netfs_is_cache_enabled(struct netfs_inode *ctx)
> #endif
> }
>
> -/*
> - * Get a ref on a netfs group attached to a dirty page (e.g. a ceph snap).
> - */
> -static inline struct netfs_group *netfs_get_group(struct netfs_group *netfs_group)
> -{
> - if (netfs_group && netfs_group != NETFS_FOLIO_COPY_TO_CACHE)
> - refcount_inc(&netfs_group->ref);
> - return netfs_group;
> -}
> -
> -/*
> - * Dispose of a netfs group attached to a dirty page (e.g. a ceph snap).
> - */
> -static inline void netfs_put_group(struct netfs_group *netfs_group)
> -{
> - if (netfs_group &&
> - netfs_group != NETFS_FOLIO_COPY_TO_CACHE &&
> - refcount_dec_and_test(&netfs_group->ref))
> - netfs_group->free(netfs_group);
> -}
> -
> -/*
> - * Dispose of a netfs group attached to a dirty page (e.g. a ceph snap).
> - */
> -static inline void netfs_put_group_many(struct netfs_group *netfs_group, int nr)
> -{
> - if (netfs_group &&
> - netfs_group != NETFS_FOLIO_COPY_TO_CACHE &&
> - refcount_sub_and_test(nr, &netfs_group->ref))
> - netfs_group->free(netfs_group);
> -}
> -
> /*
> * Check to see if a buffer aligns with the crypto block size. If it doesn't
> * the crypto layer is going to copy all the data - in which case relying on
> diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
> index 52d6fce70837..7fdbaa5c5cab 100644
> --- a/fs/netfs/objects.c
> +++ b/fs/netfs/objects.c
> @@ -153,6 +153,7 @@ static void netfs_free_request(struct work_struct *work)
> kvfree(rreq->direct_bv);
> }
>
> + netfs_put_group(rreq->group);
> rolling_buffer_clear(&rreq->buffer);
> rolling_buffer_clear(&rreq->bounce);
> if (test_bit(NETFS_RREQ_PUT_RMW_TAIL, &rreq->flags))
> diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
> index 93601033ba08..3921fcf4f859 100644
> --- a/fs/netfs/write_issue.c
> +++ b/fs/netfs/write_issue.c
> @@ -418,7 +418,7 @@ static int netfs_write_folio(struct netfs_io_request *wreq,
> netfs_issue_write(wreq, upload);
> } else if (fgroup != wreq->group) {
> /* We can't write this page to the server yet. */
> - kdebug("wrong group");
> + kdebug("wrong group %px != %px", fgroup, wreq->group);
I believe using %px is not very good practice. Do we really need to show
the real pointer?
> folio_redirty_for_writepage(wbc, folio);
> folio_unlock(folio);
> netfs_issue_write(wreq, upload);
> @@ -593,11 +593,19 @@ static void netfs_end_issue_write(struct netfs_io_request *wreq)
> netfs_wake_write_collector(wreq, false);
> }
>
> -/*
> - * Write some of the pending data back to the server
> +/**
> + * netfs_writepages_group - Flush data from the pagecache for a file
> + * @mapping: The file to flush from
> + * @wbc: Details of what should be flushed
> + * @group: The write grouping to flush (or NULL)
> + *
> + * Start asynchronous write back operations to flush dirty data belonging to a
> + * particular group in a file's pagecache back to the server and to the local
> + * cache.
> */
> -int netfs_writepages(struct address_space *mapping,
> - struct writeback_control *wbc)
> +int netfs_writepages_group(struct address_space *mapping,
> + struct writeback_control *wbc,
> + struct netfs_group *group)
> {
> struct netfs_inode *ictx = netfs_inode(mapping->host);
> struct netfs_io_request *wreq = NULL;
> @@ -618,12 +626,15 @@ int netfs_writepages(struct address_space *mapping,
> if (!folio)
> goto out;
>
> - wreq = netfs_create_write_req(mapping, NULL, folio_pos(folio), NETFS_WRITEBACK);
> + wreq = netfs_create_write_req(mapping, NULL, folio_pos(folio),
> + NETFS_WRITEBACK);
> if (IS_ERR(wreq)) {
> error = PTR_ERR(wreq);
> goto couldnt_start;
> }
>
> + wreq->group = netfs_get_group(group);
> +
> trace_netfs_write(wreq, netfs_write_trace_writeback);
> netfs_stat(&netfs_n_wh_writepages);
>
> @@ -659,6 +670,21 @@ int netfs_writepages(struct address_space *mapping,
> _leave(" = %d", error);
> return error;
> }
> +EXPORT_SYMBOL(netfs_writepages_group);
> +
> +/**
> + * netfs_writepages - Flush data from the pagecache for a file
> + * @mapping: The file to flush from
> + * @wbc: Details of what should be flushed
> + *
> + * Start asynchronous write back operations to flush dirty data in a file's
> + * pagecache back to the server and to the local cache.
> + */
> +int netfs_writepages(struct address_space *mapping,
> + struct writeback_control *wbc)
> +{
> + return netfs_writepages_group(mapping, wbc, NULL);
> +}
> EXPORT_SYMBOL(netfs_writepages);
>
> /*
> diff --git a/include/linux/netfs.h b/include/linux/netfs.h
> index a67297de8a20..69052ac47ab1 100644
> --- a/include/linux/netfs.h
> +++ b/include/linux/netfs.h
> @@ -457,6 +457,9 @@ int netfs_read_folio(struct file *, struct folio *);
> int netfs_write_begin(struct netfs_inode *, struct file *,
> struct address_space *, loff_t pos, unsigned int len,
> struct folio **, void **fsdata);
> +int netfs_writepages_group(struct address_space *mapping,
> + struct writeback_control *wbc,
> + struct netfs_group *group);
> int netfs_writepages(struct address_space *mapping,
> struct writeback_control *wbc);
> bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio);
> @@ -597,4 +600,60 @@ static inline void netfs_wait_for_outstanding_io(struct inode *inode)
> wait_var_event(&ictx->io_count, atomic_read(&ictx->io_count) == 0);
> }
>
> +/*
> + * Get a ref on a netfs group attached to a dirty page (e.g. a ceph snap).
> + */
> +static inline struct netfs_group *netfs_get_group(struct netfs_group *netfs_group)
> +{
> + if (netfs_group && netfs_group != NETFS_FOLIO_COPY_TO_CACHE)
The netfs_group is a pointer. Is it correct to compare a pointer with the
NETFS_FOLIO_COPY_TO_CACHE constant?
> + refcount_inc(&netfs_group->ref);
> + return netfs_group;
> +}
> +
> +/*
> + * Dispose of a netfs group attached to a dirty page (e.g. a ceph snap).
> + */
> +static inline void netfs_put_group(struct netfs_group *netfs_group)
> +{
> + if (netfs_group &&
> + netfs_group != NETFS_FOLIO_COPY_TO_CACHE &&
Ditto. The same question here.
> + refcount_dec_and_test(&netfs_group->ref))
> + netfs_group->free(netfs_group);
> +}
> +
> +/*
> + * Dispose of a netfs group attached to a dirty page (e.g. a ceph snap).
> + */
> +static inline void netfs_put_group_many(struct netfs_group *netfs_group, int nr)
> +{
> + if (netfs_group &&
> + netfs_group != NETFS_FOLIO_COPY_TO_CACHE &&
Ditto.
Thanks,
Slava.
> + refcount_sub_and_test(nr, &netfs_group->ref))
> + netfs_group->free(netfs_group);
> +}
> +
> +/*
> + * Set the group pointer directly on a folio.
> + */
> +static inline void __netfs_set_group(struct folio *folio, struct netfs_group *netfs_group)
> +{
> + if (netfs_group)
> + folio_attach_private(folio, netfs_get_group(netfs_group));
> +}
> +
> +/*
> + * Set the group pointer on a folio or the folio info record.
> + */
> +static inline void netfs_set_group(struct folio *folio, struct netfs_group *netfs_group)
> +{
> + void *priv = folio_get_private(folio);
> +
> + if (unlikely(priv != netfs_group)) {
> + if (netfs_group && (!priv || priv == NETFS_FOLIO_COPY_TO_CACHE))
> + folio_attach_private(folio, netfs_get_group(netfs_group));
> + else if (!netfs_group && priv == NETFS_FOLIO_COPY_TO_CACHE)
> + folio_detach_private(folio);
> + }
> +}
> +
> #endif /* _LINUX_NETFS_H */
>
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 32/35] netfs: Add some more RMW support for ceph
2025-03-13 23:33 ` [RFC PATCH 32/35] netfs: Add some more RMW support for ceph David Howells
@ 2025-03-19 19:14 ` Viacheslav Dubeyko
2025-03-20 15:25 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-19 19:14 UTC (permalink / raw)
To: Alex Markuze, slava@dubeyko.com, David Howells
Cc: linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
On Thu, 2025-03-13 at 23:33 +0000, David Howells wrote:
> Add some support for RMW in ceph:
>
> (1) Add netfs_unbuffered_read_from_inode() to allow reading from an inode
> without having a file pointer so that truncate can modify a
> now-partial tail block of a content-encrypted file.
>
> This takes an additional argument to cause it to fail or give a short
> read if a hole is encountered. This is noted on the request with
> NETFS_RREQ_NO_READ_HOLE for the filesystem to pick up.
>
> (2) Set NETFS_RREQ_RMW when doing an RMW as part of a request.
>
> (3) Provide a ->rmw_read_done() op for netfslib to tell the filesystem
> that it has completed the read required for RMW.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Jeff Layton <jlayton@kernel.org>
> cc: Viacheslav Dubeyko <slava@dubeyko.com>
> cc: Alex Markuze <amarkuze@redhat.com>
> cc: Ilya Dryomov <idryomov@gmail.com>
> cc: ceph-devel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> ---
> fs/netfs/direct_read.c | 75 ++++++++++++++++++++++++++++++++++++
> fs/netfs/direct_write.c | 1 +
> fs/netfs/main.c | 1 +
> fs/netfs/objects.c | 1 +
> fs/netfs/read_collect.c | 2 +
> fs/netfs/write_retry.c | 3 ++
> include/linux/netfs.h | 7 ++++
> include/trace/events/netfs.h | 3 ++
> 8 files changed, 93 insertions(+)
>
> diff --git a/fs/netfs/direct_read.c b/fs/netfs/direct_read.c
> index 5e4bd1e5a378..4061f934dfe6 100644
> --- a/fs/netfs/direct_read.c
> +++ b/fs/netfs/direct_read.c
> @@ -373,3 +373,78 @@ ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> return ret;
> }
> EXPORT_SYMBOL(netfs_unbuffered_read_iter);
> +
> +/**
> + * netfs_unbuffered_read_from_inode - Perform an unbuffered sync I/O read
> + * @inode: The inode being accessed
> + * @pos: The file position to read from
> + * @iter: The output buffer (also specifies read length)
> + * @nohole: True to return short/ENODATA if hole encountered
> + *
> + * Perform a synchronous unbuffered I/O from the inode to the output buffer.
> + * No use is made of the pagecache. The output buffer must be suitably aligned
> + * if content encryption is to be used. If @nohole is true then the read will
> + * stop short if a hole is encountered and return -ENODATA if the read begins
> + * with a hole.
> + *
> + * The caller must hold any appropriate locks.
> + */
> +ssize_t netfs_unbuffered_read_from_inode(struct inode *inode, loff_t pos,
> + struct iov_iter *iter, bool nohole)
> +{
> + struct netfs_io_request *rreq;
> + ssize_t ret;
> + size_t orig_count = iov_iter_count(iter);
> +
> + _enter("");
> +
> + if (WARN_ON(user_backed_iter(iter)))
> + return -EIO;
> +
> + if (!orig_count)
> + return 0; /* Don't update atime */
> +
> + ret = filemap_write_and_wait_range(inode->i_mapping, pos, orig_count);
> + if (ret < 0)
> + return ret;
> + inode_update_time(inode, S_ATIME);
> +
> + rreq = netfs_alloc_request(inode->i_mapping, NULL, pos, orig_count,
> + NULL, NETFS_UNBUFFERED_READ);
> + if (IS_ERR(rreq))
> + return PTR_ERR(rreq);
> +
> + ret = -EIO;
> + if (test_bit(NETFS_RREQ_CONTENT_ENCRYPTION, &rreq->flags) &&
> + WARN_ON(!netfs_is_crypto_aligned(rreq, iter)))
> + goto out;
> +
> + netfs_stat(&netfs_n_rh_dio_read);
> + trace_netfs_read(rreq, rreq->start, rreq->len,
> + netfs_read_trace_unbuffered_read_from_inode);
> +
> + rreq->buffer.iter = *iter;
The struct iov_iter structure is complex enough, and we assign it by value to
rreq->buffer.iter, so the original iterator will not receive any changes. Is
that the desired behavior here?
Thanks,
Slava.
> + rreq->len = orig_count;
> + rreq->direct_bv_unpin = false;
> + iov_iter_advance(iter, orig_count);
> +
> + if (nohole)
> + __set_bit(NETFS_RREQ_NO_READ_HOLE, &rreq->flags);
> +
> + /* We're going to do the crypto in place in the destination buffer. */
> + if (test_bit(NETFS_RREQ_CONTENT_ENCRYPTION, &rreq->flags))
> + __set_bit(NETFS_RREQ_CRYPT_IN_PLACE, &rreq->flags);
> +
> + ret = netfs_dispatch_unbuffered_reads(rreq);
> +
> + if (!rreq->submitted) {
> + netfs_put_request(rreq, false, netfs_rreq_trace_put_no_submit);
> + goto out;
> + }
> +
> + ret = netfs_wait_for_read(rreq);
> +out:
> + netfs_put_request(rreq, false, netfs_rreq_trace_put_return);
> + return ret;
> +}
> +EXPORT_SYMBOL(netfs_unbuffered_read_from_inode);
> diff --git a/fs/netfs/direct_write.c b/fs/netfs/direct_write.c
> index 83c5c06c4710..a99722f90c71 100644
> --- a/fs/netfs/direct_write.c
> +++ b/fs/netfs/direct_write.c
> @@ -145,6 +145,7 @@ static ssize_t netfs_write_through_bounce_buffer(struct netfs_io_request *wreq,
> wreq->start = gstart;
> wreq->len = gend - gstart;
>
> + __set_bit(NETFS_RREQ_RMW, &ictx->flags);
> if (gstart >= end) {
> /* At or after EOF, nothing to read. */
> } else {
> diff --git a/fs/netfs/main.c b/fs/netfs/main.c
> index 07f8cffbda8c..0900dea53e4a 100644
> --- a/fs/netfs/main.c
> +++ b/fs/netfs/main.c
> @@ -39,6 +39,7 @@ static const char *netfs_origins[nr__netfs_io_origin] = {
> [NETFS_READ_GAPS] = "RG",
> [NETFS_READ_SINGLE] = "R1",
> [NETFS_READ_FOR_WRITE] = "RW",
> + [NETFS_UNBUFFERED_READ] = "UR",
> [NETFS_DIO_READ] = "DR",
> [NETFS_WRITEBACK] = "WB",
> [NETFS_WRITEBACK_SINGLE] = "W1",
> diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
> index 4606e830c116..958c4d460d07 100644
> --- a/fs/netfs/objects.c
> +++ b/fs/netfs/objects.c
> @@ -60,6 +60,7 @@ struct netfs_io_request *netfs_alloc_request(struct address_space *mapping,
> origin == NETFS_READ_GAPS ||
> origin == NETFS_READ_SINGLE ||
> origin == NETFS_READ_FOR_WRITE ||
> + origin == NETFS_UNBUFFERED_READ ||
> origin == NETFS_DIO_READ) {
> INIT_WORK(&rreq->work, netfs_read_collection_worker);
> rreq->io_streams[0].avail = true;
> diff --git a/fs/netfs/read_collect.c b/fs/netfs/read_collect.c
> index 0a0bff90ca9e..013a90738dcd 100644
> --- a/fs/netfs/read_collect.c
> +++ b/fs/netfs/read_collect.c
> @@ -462,6 +462,7 @@ static void netfs_read_collection(struct netfs_io_request *rreq)
> //netfs_rreq_is_still_valid(rreq);
>
> switch (rreq->origin) {
> + case NETFS_UNBUFFERED_READ:
> case NETFS_DIO_READ:
> case NETFS_READ_GAPS:
> case NETFS_RMW_READ:
> @@ -681,6 +682,7 @@ ssize_t netfs_wait_for_read(struct netfs_io_request *rreq)
> if (ret == 0) {
> ret = rreq->transferred;
> switch (rreq->origin) {
> + case NETFS_UNBUFFERED_READ:
> case NETFS_DIO_READ:
> case NETFS_READ_SINGLE:
> ret = rreq->transferred;
> diff --git a/fs/netfs/write_retry.c b/fs/netfs/write_retry.c
> index f727b48e2bfe..9e4e79d5a403 100644
> --- a/fs/netfs/write_retry.c
> +++ b/fs/netfs/write_retry.c
> @@ -386,6 +386,9 @@ ssize_t netfs_rmw_read(struct netfs_io_request *wreq, struct file *file,
> ret = 0;
> }
>
> + if (ret == 0 && rreq->netfs_ops->rmw_read_done)
> + rreq->netfs_ops->rmw_read_done(wreq, rreq);
> +
> error:
> netfs_put_request(rreq, false, netfs_rreq_trace_put_return);
> return ret;
> diff --git a/include/linux/netfs.h b/include/linux/netfs.h
> index 9d17d4bd9753..4049c985b9b4 100644
> --- a/include/linux/netfs.h
> +++ b/include/linux/netfs.h
> @@ -220,6 +220,7 @@ enum netfs_io_origin {
> NETFS_READ_GAPS, /* This read is a synchronous read to fill gaps */
> NETFS_READ_SINGLE, /* This read should be treated as a single object */
> NETFS_READ_FOR_WRITE, /* This read is to prepare a write */
> + NETFS_UNBUFFERED_READ, /* This is an unbuffered I/O read */
> NETFS_DIO_READ, /* This is a direct I/O read */
> NETFS_WRITEBACK, /* This write was triggered by writepages */
> NETFS_WRITEBACK_SINGLE, /* This monolithic write was triggered by writepages */
> @@ -308,6 +309,9 @@ struct netfs_io_request {
> #define NETFS_RREQ_CONTENT_ENCRYPTION 16 /* Content encryption is in use */
> #define NETFS_RREQ_CRYPT_IN_PLACE 17 /* Do decryption in place */
> #define NETFS_RREQ_PUT_RMW_TAIL 18 /* Need to put ->rmw_tail */
> +#define NETFS_RREQ_RMW 19 /* Performing RMW cycle */
> +#define NETFS_RREQ_REPEAT_RMW 20 /* Need to perform an RMW cycle */
> +#define NETFS_RREQ_NO_READ_HOLE 21 /* Give short read/error if hole encountered */
> #define NETFS_RREQ_USE_PGPRIV2 31 /* [DEPRECATED] Use PG_private_2 to mark
> * write to cache on read */
> const struct netfs_request_ops *netfs_ops;
> @@ -336,6 +340,7 @@ struct netfs_request_ops {
> /* Modification handling */
> void (*update_i_size)(struct inode *inode, loff_t i_size);
> void (*post_modify)(struct inode *inode, void *fs_priv);
> + void (*rmw_read_done)(struct netfs_io_request *wreq, struct netfs_io_request *rreq);
>
> /* Write request handling */
> void (*begin_writeback)(struct netfs_io_request *wreq);
> @@ -432,6 +437,8 @@ ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *i
> ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
> ssize_t netfs_buffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
> ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter);
> +ssize_t netfs_unbuffered_read_from_inode(struct inode *inode, loff_t pos,
> + struct iov_iter *iter, bool nohole);
>
> /* High-level write API */
> ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
> diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
> index 74af82d773bd..9254c6f0e604 100644
> --- a/include/trace/events/netfs.h
> +++ b/include/trace/events/netfs.h
> @@ -23,6 +23,7 @@
> EM(netfs_read_trace_read_gaps, "READ-GAPS") \
> EM(netfs_read_trace_read_single, "READ-SNGL") \
> EM(netfs_read_trace_prefetch_for_write, "PREFETCHW") \
> + EM(netfs_read_trace_unbuffered_read_from_inode, "READ-INOD") \
> E_(netfs_read_trace_write_begin, "WRITEBEGN")
>
> #define netfs_write_traces \
> @@ -38,6 +39,7 @@
> EM(NETFS_READ_GAPS, "RG") \
> EM(NETFS_READ_SINGLE, "R1") \
> EM(NETFS_READ_FOR_WRITE, "RW") \
> + EM(NETFS_UNBUFFERED_READ, "UR") \
> EM(NETFS_DIO_READ, "DR") \
> EM(NETFS_WRITEBACK, "WB") \
> EM(NETFS_WRITEBACK_SINGLE, "W1") \
> @@ -104,6 +106,7 @@
> EM(netfs_sreq_trace_io_progress, "IO ") \
> EM(netfs_sreq_trace_limited, "LIMIT") \
> EM(netfs_sreq_trace_need_clear, "N-CLR") \
> + EM(netfs_sreq_trace_need_rmw, "N-RMW") \
> EM(netfs_sreq_trace_partial_read, "PARTR") \
> EM(netfs_sreq_trace_need_retry, "ND-RT") \
> EM(netfs_sreq_trace_pending, "PEND ") \
>
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 33/35] ceph: Use netfslib [INCOMPLETE]
2025-03-13 23:33 ` [RFC PATCH 33/35] ceph: Use netfslib [INCOMPLETE] David Howells
@ 2025-03-19 19:54 ` Viacheslav Dubeyko
2025-03-20 15:38 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-19 19:54 UTC (permalink / raw)
To: Alex Markuze, slava@dubeyko.com, David Howells
Cc: linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
On Thu, 2025-03-13 at 23:33 +0000, David Howells wrote:
> Implement netfslib support for ceph.
>
> Note that I've put the new code into its own file for now rather than
> attempting to modify the old code or putting it into an existing file. The
> old code is just #if'd out for removal in a subsequent patch to make this
> patch easier to review.
>
> Note also that this is incomplete as sparse map support and content crypto
> support are currently non-functional - but plain I/O should work.
>
> There may also be an inode ref leak due to the way the ceph sometimes takes
> and holds on to an extra inode ref under some circumstances. I'm not sure
> these are actually necessary. For instance, ceph_dirty_folio() will ihold
> the inode if ci->i_wrbuffer_ref is 0
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Viacheslav Dubeyko <slava@dubeyko.com>
> cc: Alex Markuze <amarkuze@redhat.com>
> cc: Ilya Dryomov <idryomov@gmail.com>
> cc: ceph-devel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> ---
> drivers/block/rbd.c | 2 +-
> fs/ceph/Makefile | 2 +-
> fs/ceph/addr.c | 46 +-
> fs/ceph/cache.h | 5 +
> fs/ceph/caps.c | 2 +-
> fs/ceph/crypto.c | 54 ++
> fs/ceph/file.c | 15 +-
> fs/ceph/inode.c | 30 +-
> fs/ceph/rdwr.c | 1006 +++++++++++++++++++++++++++++++
> fs/ceph/super.h | 39 +-
> fs/netfs/internal.h | 6 +-
> fs/netfs/main.c | 4 +-
> fs/netfs/write_issue.c | 6 +-
> include/linux/ceph/libceph.h | 3 +-
> include/linux/ceph/osd_client.h | 1 +
> include/linux/netfs.h | 13 +-
> net/ceph/snapshot.c | 20 +-
> 17 files changed, 1190 insertions(+), 64 deletions(-)
> create mode 100644 fs/ceph/rdwr.c
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 956fc4a8f1da..94bb29c95b0d 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -468,7 +468,7 @@ static DEFINE_IDA(rbd_dev_id_ida);
> static struct workqueue_struct *rbd_wq;
>
> static struct ceph_snap_context rbd_empty_snapc = {
> - .nref = REFCOUNT_INIT(1),
> + .group.ref = REFCOUNT_INIT(1),
> };
>
> /*
> diff --git a/fs/ceph/Makefile b/fs/ceph/Makefile
> index 1f77ca04c426..e4d3c2d6e9c2 100644
> --- a/fs/ceph/Makefile
> +++ b/fs/ceph/Makefile
> @@ -5,7 +5,7 @@
>
> obj-$(CONFIG_CEPH_FS) += ceph.o
>
> -ceph-y := super.o inode.o dir.o file.o locks.o addr.o ioctl.o \
> +ceph-y := super.o inode.o dir.o file.o locks.o addr.o rdwr.o ioctl.o \
> export.o caps.o snap.o xattr.o quota.o io.o \
> mds_client.o mdsmap.o strings.o ceph_frag.o \
> debugfs.o util.o metric.o
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 27f27ab24446..325fbbce1eaa 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -64,27 +64,30 @@
> (CONGESTION_ON_THRESH(congestion_kb) - \
> (CONGESTION_ON_THRESH(congestion_kb) >> 2))
>
> +#if 0 // TODO: Remove after netfs conversion
> static int ceph_netfs_check_write_begin(struct file *file, loff_t pos, unsigned int len,
> struct folio **foliop, void **_fsdata);
>
> -static inline struct ceph_snap_context *page_snap_context(struct page *page)
> +static struct ceph_snap_context *page_snap_context(struct page *page)
> {
> if (PagePrivate(page))
> return (void *)page->private;
> return NULL;
> }
> +#endif // TODO: Remove after netfs conversion
>
> /*
> * Dirty a page. Optimistically adjust accounting, on the assumption
> * that we won't race with invalidate. If we do, readjust.
> */
> -static bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio)
> +bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio)
> {
> struct inode *inode = mapping->host;
> struct ceph_client *cl = ceph_inode_to_client(inode);
> struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
> struct ceph_inode_info *ci;
> struct ceph_snap_context *snapc;
> + struct netfs_group *group;
>
> if (folio_test_dirty(folio)) {
> doutc(cl, "%llx.%llx %p idx %lu -- already dirty\n",
> @@ -101,16 +104,28 @@ static bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio)
> spin_lock(&ci->i_ceph_lock);
> if (__ceph_have_pending_cap_snap(ci)) {
> struct ceph_cap_snap *capsnap =
> - list_last_entry(&ci->i_cap_snaps,
> - struct ceph_cap_snap,
> - ci_item);
> - snapc = ceph_get_snap_context(capsnap->context);
> + list_last_entry(&ci->i_cap_snaps,
> + struct ceph_cap_snap,
> + ci_item);
> + snapc = capsnap->context;
> capsnap->dirty_pages++;
> } else {
> - BUG_ON(!ci->i_head_snapc);
> - snapc = ceph_get_snap_context(ci->i_head_snapc);
> + snapc = ci->i_head_snapc;
> + BUG_ON(!snapc);
> ++ci->i_wrbuffer_ref_head;
> }
> +
> + /* Attach a reference to the snap/group to the folio. */
> + group = netfs_folio_group(folio);
> + if (group != &snapc->group) {
> + netfs_set_group(folio, &snapc->group);
> + if (group) {
> + doutc(cl, "Different group %px != %px\n",
Do we really need to use %px?
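As far as I understand, %p prints a hashed pointer value, which should still
be enough to tell whether the two group pointers differ, while %px exposes the
raw kernel address in the log.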
> + group, &snapc->group);
> + netfs_put_group(group);
> + }
> + }
> +
> if (ci->i_wrbuffer_ref == 0)
> ihold(inode);
> ++ci->i_wrbuffer_ref;
> @@ -122,16 +137,10 @@ static bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio)
> snapc, snapc->seq, snapc->num_snaps);
> spin_unlock(&ci->i_ceph_lock);
>
> - /*
> - * Reference snap context in folio->private. Also set
> - * PagePrivate so that we get invalidate_folio callback.
> - */
> - VM_WARN_ON_FOLIO(folio->private, folio);
> - folio_attach_private(folio, snapc);
> -
> - return ceph_fscache_dirty_folio(mapping, folio);
> + return netfs_dirty_folio(mapping, folio);
> }
>
> +#if 0 // TODO: Remove after netfs conversion
> /*
> * If we are truncating the full folio (i.e. offset == 0), adjust the
> * dirty folio counters appropriately. Only called if there is private
> @@ -1236,6 +1245,7 @@ bool is_num_ops_too_big(struct ceph_writeback_ctl *ceph_wbc)
> return ceph_wbc->num_ops >=
> (ceph_wbc->from_pool ? CEPH_OSD_SLAB_OPS : CEPH_OSD_MAX_OPS);
> }
> +#endif // TODO: Remove after netfs conversion
>
> static inline
> bool is_write_congestion_happened(struct ceph_fs_client *fsc)
> @@ -1244,6 +1254,7 @@ bool is_write_congestion_happened(struct ceph_fs_client *fsc)
> CONGESTION_ON_THRESH(fsc->mount_options->congestion_kb);
> }
>
> +#if 0 // TODO: Remove after netfs conversion
> static inline int move_dirty_folio_in_page_array(struct address_space *mapping,
> struct writeback_control *wbc,
> struct ceph_writeback_ctl *ceph_wbc, struct folio *folio)
> @@ -1930,6 +1941,7 @@ const struct address_space_operations ceph_aops = {
> .direct_IO = noop_direct_IO,
> .migrate_folio = filemap_migrate_folio,
> };
> +#endif // TODO: Remove after netfs conversion
>
> static void ceph_block_sigs(sigset_t *oldset)
> {
> @@ -2034,6 +2046,7 @@ static vm_fault_t ceph_filemap_fault(struct vm_fault *vmf)
> return ret;
> }
>
> +#if 0 // TODO: Remove after netfs conversion
> static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> @@ -2137,6 +2150,7 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
> ret = vmf_error(err);
> return ret;
> }
> +#endif // TODO: Remove after netfs conversion
>
> void ceph_fill_inline_data(struct inode *inode, struct page *locked_page,
> char *data, size_t len)
> diff --git a/fs/ceph/cache.h b/fs/ceph/cache.h
> index 20efac020394..d6afca292f08 100644
> --- a/fs/ceph/cache.h
> +++ b/fs/ceph/cache.h
> @@ -43,6 +43,8 @@ static inline void ceph_fscache_resize(struct inode *inode, loff_t to)
> }
> }
>
> +#if 0 // TODO: Remove after netfs conversion
> +
> static inline int ceph_fscache_unpin_writeback(struct inode *inode,
> struct writeback_control *wbc)
> {
> @@ -50,6 +52,7 @@ static inline int ceph_fscache_unpin_writeback(struct inode *inode,
> }
>
> #define ceph_fscache_dirty_folio netfs_dirty_folio
> +#endif // TODO: Remove after netfs conversion
>
> static inline bool ceph_is_cache_enabled(struct inode *inode)
> {
> @@ -100,6 +103,7 @@ static inline void ceph_fscache_resize(struct inode *inode, loff_t to)
> {
> }
>
> +#if 0 // TODO: Remove after netfs conversion
> static inline int ceph_fscache_unpin_writeback(struct inode *inode,
> struct writeback_control *wbc)
> {
> @@ -107,6 +111,7 @@ static inline int ceph_fscache_unpin_writeback(struct inode *inode,
> }
>
> #define ceph_fscache_dirty_folio filemap_dirty_folio
> +#endif // TODO: Remove after netfs conversion
>
> static inline bool ceph_is_cache_enabled(struct inode *inode)
> {
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index a8d8b56cf9d2..53f23f351003 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -2536,7 +2536,7 @@ int ceph_write_inode(struct inode *inode, struct writeback_control *wbc)
> int wait = (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync);
>
> doutc(cl, "%p %llx.%llx wait=%d\n", inode, ceph_vinop(inode), wait);
> - ceph_fscache_unpin_writeback(inode, wbc);
> + netfs_unpin_writeback(inode, wbc);
> if (wait) {
> err = ceph_wait_on_async_create(inode);
> if (err)
> diff --git a/fs/ceph/crypto.c b/fs/ceph/crypto.c
> index a28dea74ca6f..8d4e908da7d8 100644
> --- a/fs/ceph/crypto.c
> +++ b/fs/ceph/crypto.c
> @@ -636,6 +636,60 @@ int ceph_fscrypt_decrypt_extents(struct inode *inode, struct page **page,
> return ret;
> }
>
> +#if 0
> +int ceph_decrypt_block(struct netfs_io_request *rreq, loff_t pos, size_t len,
> + struct scatterlist *source_sg, unsigned int n_source,
> + struct scatterlist *dest_sg, unsigned int n_dest)
> +{
> + struct ceph_sparse_extent *map = op->extent.sparse_ext;
> + struct ceph_inode_info *ci = ceph_inode(inode);
> + size_t xlen;
> + u64 objno, objoff;
> + u32 ext_cnt = op->extent.sparse_ext_cnt;
> + int i, ret = 0;
> +
> + /* Nothing to do for empty array */
> + if (ext_cnt == 0) {
> + dout("%s: empty array, ret 0\n", __func__);
Yeah, I would always like to see the function name when debugging the code.
Maybe we should change dout() itself to show the function name?
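For example, something like this might work (an untested sketch; the macro
name is invented, and IIRC dynamic debug can already prepend the function name
via the +f flag):

	/* Hypothetical wrapper that always prefixes the function name. */
	#define dout_fn(fmt, ...) \
		dout("%s: " fmt, __func__, ##__VA_ARGS__)

Then call sites wouldn't need to pass __func__ by hand.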
> + return 0;
> + }
> +
> + ceph_calc_file_object_mapping(&ci->i_layout, pos, map[0].len,
> + &objno, &objoff, &xlen);
> +
> + for (i = 0; i < ext_cnt; ++i) {
> + struct ceph_sparse_extent *ext = &map[i];
> + int pgsoff = ext->off - objoff;
> + int pgidx = pgsoff >> PAGE_SHIFT;
> + int fret;
> +
> + if ((ext->off | ext->len) & ~CEPH_FSCRYPT_BLOCK_MASK) {
> + pr_warn("%s: bad encrypted sparse extent idx %d off %llx len %llx\n",
> + __func__, i, ext->off, ext->len);
> + return -EIO;
> + }
> + fret = ceph_fscrypt_decrypt_pages(inode, &page[pgidx],
> + off + pgsoff, ext->len);
> + dout("%s: [%d] 0x%llx~0x%llx fret %d\n", __func__, i,
> + ext->off, ext->len, fret);
> + if (fret < 0) {
Possibly I am missing some logic here, but do we really need to introduce fret?
Why can't we use ret here?
> + if (ret == 0)
> + ret = fret;
> + break;
> + }
> + ret = pgsoff + fret;
> + }
> + dout("%s: ret %d\n", __func__, ret);
> + return ret;
> +}
> +
> +int ceph_encrypt_block(struct netfs_io_request *wreq, loff_t pos, size_t len,
> + struct scatterlist *source_sg, unsigned int n_source,
> + struct scatterlist *dest_sg, unsigned int n_dest)
> +{
> +}
> +#endif
> +
> /**
> * ceph_fscrypt_encrypt_pages - encrypt an array of pages
> * @inode: pointer to inode associated with these pages
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 4512215cccc6..94b91b5bc843 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -77,6 +77,7 @@ static __le32 ceph_flags_sys2wire(struct ceph_mds_client *mdsc, u32 flags)
> * need to wait for MDS acknowledgement.
> */
>
> +#if 0 // TODO: Remove after netfs conversion
> /*
> * How many pages to get in one call to iov_iter_get_pages(). This
> * determines the size of the on-stack array used as a buffer.
> @@ -165,6 +166,7 @@ static void ceph_dirty_pages(struct ceph_databuf *dbuf)
> if (bvec[i].bv_page)
> set_page_dirty_lock(bvec[i].bv_page);
> }
> +#endif // TODO: Remove after netfs conversion
>
> /*
> * Prepare an open request. Preallocate ceph_cap to avoid an
> @@ -1021,6 +1023,7 @@ int ceph_release(struct inode *inode, struct file *file)
> return 0;
> }
>
> +#if 0 // TODO: Remove after netfs conversion
> enum {
> HAVE_RETRIED = 1,
> CHECK_EOF = 2,
> @@ -2234,6 +2237,7 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to)
>
> return ret;
> }
> +#endif // TODO: Remove after netfs conversion
>
> /*
> * Wrap filemap_splice_read with checks for cap bits on the inode.
> @@ -2294,6 +2298,7 @@ static ssize_t ceph_splice_read(struct file *in, loff_t *ppos,
> return ret;
> }
>
> +#if 0 // TODO: Remove after netfs conversion
> /*
> * Take cap references to avoid releasing caps to MDS mid-write.
> *
> @@ -2488,6 +2493,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
> ceph_free_cap_flush(prealloc_cf);
> return written ? written : err;
> }
> +#endif // TODO: Remove after netfs conversion
>
> /*
> * llseek. be sure to verify file size on SEEK_END.
> @@ -3160,6 +3166,10 @@ static int ceph_fadvise(struct file *file, loff_t offset, loff_t len, int advice
> if (fi->fmode & CEPH_FILE_MODE_LAZY)
> return -EACCES;
>
> + ret = netfs_start_io_read(inode);
> + if (ret < 0)
> + return ret;
> +
> ret = ceph_get_caps(file, CEPH_CAP_FILE_RD, want, -1, &got);
> if (ret < 0) {
> doutc(cl, "%llx.%llx, error getting cap\n", ceph_vinop(inode));
> @@ -3180,6 +3190,7 @@ static int ceph_fadvise(struct file *file, loff_t offset, loff_t len, int advice
> inode, ceph_vinop(inode), ceph_cap_string(got), ret);
> ceph_put_cap_refs(ceph_inode(inode), got);
> out:
> + netfs_end_io_read(inode);
> return ret;
> }
>
> @@ -3187,8 +3198,8 @@ const struct file_operations ceph_file_fops = {
> .open = ceph_open,
> .release = ceph_release,
> .llseek = ceph_llseek,
> - .read_iter = ceph_read_iter,
> - .write_iter = ceph_write_iter,
> + .read_iter = ceph_netfs_read_iter,
> + .write_iter = ceph_netfs_write_iter,
> .mmap = ceph_mmap,
> .fsync = ceph_fsync,
> .lock = ceph_lock,
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index ec9b80fec7be..8f73f3a55a3e 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -2345,11 +2345,9 @@ static int fill_fscrypt_truncate(struct inode *inode,
> struct iov_iter iter;
> struct ceph_fscrypt_truncate_size_header *header;
> void *p;
> - int retry_op = 0;
> int len = CEPH_FSCRYPT_BLOCK_SIZE;
> loff_t i_size = i_size_read(inode);
> int got, ret, issued;
> - u64 objver;
>
> ret = __ceph_get_caps(inode, NULL, CEPH_CAP_FILE_RD, 0, -1, &got);
> if (ret < 0)
> @@ -2361,16 +2359,6 @@ static int fill_fscrypt_truncate(struct inode *inode,
> i_size, attr->ia_size, ceph_cap_string(got),
> ceph_cap_string(issued));
>
> - /* Try to writeback the dirty pagecaches */
> - if (issued & (CEPH_CAP_FILE_BUFFER)) {
> - loff_t lend = orig_pos + CEPH_FSCRYPT_BLOCK_SIZE - 1;
> -
> - ret = filemap_write_and_wait_range(inode->i_mapping,
> - orig_pos, lend);
> - if (ret < 0)
> - goto out;
> - }
> -
> ret = -ENOMEM;
> dbuf = ceph_databuf_req_alloc(2, 0, GFP_KERNEL);
> if (!dbuf)
> @@ -2382,10 +2370,8 @@ static int fill_fscrypt_truncate(struct inode *inode,
> goto out;
>
> iov_iter_bvec(&iter, ITER_DEST, &dbuf->bvec[1], 1, len);
> -
> - pos = orig_pos;
> - ret = __ceph_sync_read(inode, &pos, &iter, &retry_op, &objver);
> - if (ret < 0)
> + ret = netfs_unbuffered_read_from_inode(inode, orig_pos, &iter, true);
> + if (ret < 0 && ret != -ENODATA)
> goto out;
>
> header = kmap_ceph_databuf_page(dbuf, 0);
> @@ -2402,16 +2388,14 @@ static int fill_fscrypt_truncate(struct inode *inode,
> header->block_size = cpu_to_le32(CEPH_FSCRYPT_BLOCK_SIZE);
>
> /*
> - * If we hit a hole here, we should just skip filling
> - * the fscrypt for the request, because once the fscrypt
> - * is enabled, the file will be split into many blocks
> - * with the size of CEPH_FSCRYPT_BLOCK_SIZE, if there
> - * has a hole, the hole size should be multiple of block
> - * size.
> + * If we hit a hole here, we should just skip filling the fscrypt for
> + * the request, because once the fscrypt is enabled, the file will be
> + * split into many blocks with the size of CEPH_FSCRYPT_BLOCK_SIZE. If
> + * there was a hole, the hole size should be multiple of block size.
> *
> * If the Rados object doesn't exist, it will be set to 0.
> */
> - if (!objver) {
> + if (ret != -ENODATA) {
> doutc(cl, "hit hole, ppos %lld < size %lld\n", pos, i_size);
>
> header->data_len = cpu_to_le32(8 + 8 + 4);
> diff --git a/fs/ceph/rdwr.c b/fs/ceph/rdwr.c
> new file mode 100644
> index 000000000000..952c36be2cd9
> --- /dev/null
> +++ b/fs/ceph/rdwr.c
> @@ -0,0 +1,1006 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Ceph netfs-based file read-write operations.
> + *
> + * There are a few funny things going on here.
> + *
> + * The page->private field is used to reference a struct ceph_snap_context for
> + * _every_ dirty page. This indicates which snapshot the page was logically
> + * dirtied in, and thus which snap context needs to be associated with the osd
> + * write during writeback.
> + *
> + * Similarly, struct ceph_inode_info maintains a set of counters to count dirty
> + * pages on the inode. In the absence of snapshots, i_wrbuffer_ref ==
> + * i_wrbuffer_ref_head == the dirty page count.
> + *
> + * When a snapshot is taken (that is, when the client receives notification
> + * that a snapshot was taken), each inode with caps and with dirty pages (dirty
> + * pages implies there is a cap) gets a new ceph_cap_snap in the i_cap_snaps
> + * list (which is sorted in ascending order, new snaps go to the tail). The
> + * i_wrbuffer_ref_head count is moved to capsnap->dirty. (Unless a sync write
> + * is currently in progress. In that case, the capsnap is said to be
> + * "pending", new writes cannot start, and the capsnap isn't "finalized" until
> + * the write completes (or fails) and a final size/mtime for the inode for that
> + * snap can be settled upon.) i_wrbuffer_ref_head is reset to 0.
> + *
> + * On writeback, we must submit writes to the osd IN SNAP ORDER. So, we look
> + * for the first capsnap in i_cap_snaps and write out pages in that snap
> + * context _only_. Then we move on to the next capsnap, eventually reaching
> + * the "live" or "head" context (i.e., pages that are not yet snapped) and are
> + * writing the most recently dirtied pages.
> + *
> + * Invalidate and so forth must take care to ensure the dirty page accounting
> + * is preserved.
> + *
> + * Copyright (C) 2025 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + */
> +#include <linux/ceph/ceph_debug.h>
> +
> +#include <linux/backing-dev.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/swap.h>
> +#include <linux/pagemap.h>
> +#include <linux/slab.h>
> +#include <linux/pagevec.h>
> +#include <linux/task_io_accounting_ops.h>
> +#include <linux/signal.h>
> +#include <linux/iversion.h>
> +#include <linux/ktime.h>
> +#include <linux/netfs.h>
> +#include <trace/events/netfs.h>
> +
> +#include "super.h"
> +#include "mds_client.h"
> +#include "cache.h"
> +#include "metric.h"
> +#include "crypto.h"
> +#include <linux/ceph/osd_client.h>
> +#include <linux/ceph/striper.h>
> +
> +struct ceph_writeback_ctl
> +{
> + loff_t i_size;
> + u64 truncate_size;
> + u32 truncate_seq;
> + bool size_stable;
> + bool head_snapc;
> +};
> +
> +struct kmem_cache *ceph_io_request_cachep;
> +struct kmem_cache *ceph_io_subrequest_cachep;
> +
> +static struct ceph_io_subrequest *ceph_sreq2io(struct netfs_io_subrequest *subreq)
> +{
> + BUILD_BUG_ON(sizeof(struct ceph_io_request) > NETFS_DEF_IO_REQUEST_SIZE);
> + BUILD_BUG_ON(sizeof(struct ceph_io_subrequest) > NETFS_DEF_IO_SUBREQUEST_SIZE);
> +
> + return container_of(subreq, struct ceph_io_subrequest, sreq);
> +}
> +
> +/*
> + * Get the snapc from the group attached to a request
> + */
> +static struct ceph_snap_context *ceph_wreq_snapc(struct netfs_io_request *wreq)
> +{
> + struct ceph_snap_context *snapc =
> + container_of(wreq->group, struct ceph_snap_context, group);
> + return snapc;
> +}
> +
> +#if 0
> +static void ceph_put_many_snap_context(struct ceph_snap_context *sc, unsigned int nr)
> +{
> + if (sc)
> + netfs_put_group_many(&sc->group, nr);
> +}
> +#endif
> +
> +/*
> + * Handle the termination of a write to the server.
> + */
> +static void ceph_netfs_write_callback(struct ceph_osd_request *req)
> +{
> + struct netfs_io_subrequest *subreq = req->r_subreq;
> + struct ceph_io_subrequest *csub = ceph_sreq2io(subreq);
> + struct ceph_io_request *creq = csub->creq;
> + struct inode *inode = creq->rreq.inode;
> + struct ceph_inode_info *ci = ceph_inode(inode);
> + struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> + struct ceph_client *cl = ceph_inode_to_client(inode);
> + size_t wrote = req->r_result ? 0 : subreq->len;
> + int err = req->r_result;
> +
> + trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress);
> +
> + ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency,
> + req->r_end_latency, wrote, err);
> +
> + if (err) {
> + doutc(cl, "sync_write osd write returned %d\n", err);
> + /* Version changed! Must re-do the rmw cycle */
> + if ((creq->rmw_assert_version && (err == -ERANGE || err == -EOVERFLOW)) ||
> + (!creq->rmw_assert_version && err == -EEXIST)) {
> + /* We should only ever see this on a rmw */
> + WARN_ON_ONCE(!test_bit(NETFS_RREQ_RMW, &ci->netfs.flags));
> +
> + /* The version should never go backward */
> + WARN_ON_ONCE(err == -EOVERFLOW);
> +
> + /* FIXME: limit number of times we loop? */
> + set_bit(NETFS_RREQ_REPEAT_RMW, &creq->rreq.flags);
> + trace_netfs_sreq(subreq, netfs_sreq_trace_need_rmw);
> + }
> + ceph_set_error_write(ci);
> + } else {
> + ceph_clear_error_write(ci);
> + }
> +
> + csub->req = NULL;
> + ceph_osdc_put_request(req);
> + netfs_write_subrequest_terminated(subreq, err ?: wrote, true);
> +}
> +
> +/*
> + * Issue a subrequest to upload to the server.
> + */
> +static void ceph_issue_write(struct netfs_io_subrequest *subreq)
> +{
> + struct ceph_io_subrequest *csub = ceph_sreq2io(subreq);
> + struct ceph_snap_context *snapc = ceph_wreq_snapc(subreq->rreq);
> + struct ceph_osd_request *req;
> + struct ceph_io_request *creq = csub->creq;
> + struct ceph_fs_client *fsc = ceph_inode_to_fs_client(subreq->rreq->inode);
> + struct ceph_osd_client *osdc = &fsc->client->osdc;
> + struct inode *inode = subreq->rreq->inode;
> + struct ceph_inode_info *ci = ceph_inode(inode);
> + struct ceph_client *cl = ceph_inode_to_client(inode);
> + unsigned long long len;
> + unsigned int rmw = test_bit(NETFS_RREQ_RMW, &ci->netfs.flags) ? 1 : 0;
> +
> + doutc(cl, "issue_write R=%08x[%x] ino %llx %lld~%zu -- %srmw\n",
> + subreq->rreq->debug_id, subreq->debug_index, ci->i_vino.ino,
> + subreq->start, subreq->len,
> + rmw ? "" : "no ");
> +
> + len = subreq->len;
> + req = ceph_osdc_new_request(osdc, &ci->i_layout, ci->i_vino,
> + subreq->start, &len,
> + rmw, /* which: 0 or 1 */
> + rmw + 1, /* num_ops: 1 or 2 */
> + CEPH_OSD_OP_WRITE,
> + CEPH_OSD_FLAG_WRITE,
> + snapc,
> + ci->i_truncate_seq,
> + ci->i_truncate_size, false);
> + if (IS_ERR(req)) {
> + netfs_write_subrequest_terminated(subreq, PTR_ERR(req), false);
> + return netfs_prepare_write_failed(subreq);
> + }
> +
> + subreq->len = len;
> + doutc(cl, "write op %lld~%zu\n", subreq->start, subreq->len);
> + iov_iter_truncate(&subreq->io_iter, len);
> + osd_req_op_extent_osd_iter(req, 0, &subreq->io_iter);
> + req->r_inode = inode;
> + req->r_mtime = current_time(inode);
> + req->r_callback = ceph_netfs_write_callback;
> + req->r_subreq = subreq;
> + csub->req = req;
> +
> + /*
> + * If we're doing an RMW cycle, set up an assertion that the remote
> + * data hasn't changed. If we don't have a version number, then the
> + * object doesn't exist yet. Use an exclusive create instead of a
> + * version assertion in that case.
> + */
> + if (rmw) {
> + if (creq->rmw_assert_version) {
> + osd_req_op_init(req, 0, CEPH_OSD_OP_ASSERT_VER, 0);
> + req->r_ops[0].assert_ver.ver = creq->rmw_assert_version;
> + } else {
> + osd_req_op_init(req, 0, CEPH_OSD_OP_CREATE,
> + CEPH_OSD_OP_FLAG_EXCL);
> + }
> + }
> +
> + trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
> + ceph_osdc_start_request(osdc, req);
> +}
> +
> +/*
> + * Prepare a subrequest to upload to the server.
> + */
> +static void ceph_prepare_write(struct netfs_io_subrequest *subreq)
> +{
> + struct ceph_inode_info *ci = ceph_inode(subreq->rreq->inode);
> + struct ceph_fs_client *fsc = ceph_inode_to_fs_client(subreq->rreq->inode);
> + u64 objnum, objoff;
> +
> + /* Clamp the length to the next object boundary. */
> + ceph_calc_file_object_mapping(&ci->i_layout, subreq->start,
> + fsc->mount_options->wsize,
> + &objnum, &objoff,
> + &subreq->rreq->io_streams[0].sreq_max_len);
> +}
> +
> +/*
> + * Mark the caps as dirty
> + */
> +static void ceph_netfs_post_modify(struct inode *inode, void *fs_priv)
> +{
> + struct ceph_inode_info *ci = ceph_inode(inode);
> + struct ceph_cap_flush **prealloc_cf = fs_priv;
> + int dirty;
> +
> + spin_lock(&ci->i_ceph_lock);
> + dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR, prealloc_cf);
> + spin_unlock(&ci->i_ceph_lock);
> + if (dirty)
> + __mark_inode_dirty(inode, dirty);
> +}
> +
> +static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq)
> +{
> + struct inode *inode = rreq->inode;
> + struct ceph_inode_info *ci = ceph_inode(inode);
> + struct ceph_file_layout *lo = &ci->i_layout;
> + unsigned long max_pages = inode->i_sb->s_bdi->ra_pages;
> + loff_t end = rreq->start + rreq->len, new_end;
> + struct ceph_io_request *priv = container_of(rreq, struct ceph_io_request, rreq);
> + unsigned long max_len;
> + u32 blockoff;
> +
> + if (priv) {
> + /* Readahead is disabled by posix_fadvise POSIX_FADV_RANDOM */
> + if (priv->file_ra_disabled)
> + max_pages = 0;
> + else
> + max_pages = priv->file_ra_pages;
> +
> + }
> +
> + /* Readahead is disabled */
> + if (!max_pages)
> + return;
> +
> + max_len = max_pages << PAGE_SHIFT;
> +
> + /*
> + * Try to expand the length forward by rounding it up to the next
> + * block, but do not exceed the file size, unless the original
> + * request already exceeds it.
> + */
> + new_end = umin(round_up(end, lo->stripe_unit), rreq->i_size);
> + if (new_end > end && new_end <= rreq->start + max_len)
> + rreq->len = new_end - rreq->start;
> +
> + /* Try to expand the start downward */
> + div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
> + if (rreq->len + blockoff <= max_len) {
> + rreq->start -= blockoff;
> + rreq->len += blockoff;
> + }
> +}
> +
> +static int ceph_netfs_prepare_read(struct netfs_io_subrequest *subreq)
> +{
> + struct netfs_io_request *rreq = subreq->rreq;
> + struct ceph_inode_info *ci = ceph_inode(rreq->inode);
> + struct ceph_fs_client *fsc = ceph_inode_to_fs_client(rreq->inode);
> + size_t xlen;
> + u64 objno, objoff;
> +
> + /* Truncate the extent at the end of the current block */
> + ceph_calc_file_object_mapping(&ci->i_layout, subreq->start, subreq->len,
> + &objno, &objoff, &xlen);
> + rreq->io_streams[0].sreq_max_len = umin(xlen, fsc->mount_options->rsize);
> + return 0;
> +}
> +
> +static void ceph_netfs_read_callback(struct ceph_osd_request *req)
> +{
> + struct inode *inode = req->r_inode;
> + struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> + struct ceph_client *cl = fsc->client;
> + struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0);
> + struct netfs_io_subrequest *subreq = req->r_priv;
> + struct ceph_osd_req_op *op = &req->r_ops[0];
> + bool sparse = (op->op == CEPH_OSD_OP_SPARSE_READ);
> + int err = req->r_result;
> +
> + ceph_update_read_metrics(&fsc->mdsc->metric, req->r_start_latency,
> + req->r_end_latency, osd_data->iter.count, err);
> +
> + doutc(cl, "result %d subreq->len=%zu i_size=%lld\n", req->r_result,
> + subreq->len, i_size_read(req->r_inode));
> +
> + /* no object means success but no data */
> + if (err == -ENOENT)
> + err = 0;
> + else if (err == -EBLOCKLISTED)
> + fsc->blocklisted = true;
> +
> + if (err >= 0) {
Maybe we shouldn't use err here. It looks really confusing in the case of a
positive value. I assume that a positive value of req->r_result is not an
error code.
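Something like the following (just an untested sketch) would keep the byte
count and the error code in separate variables:

	ssize_t transferred = 0;
	int err = req->r_result;

	if (err >= 0) {
		transferred = err;	/* r_result is a byte count on success */
		err = 0;
	}
	...
	subreq->transferred = transferred;
	subreq->error = err;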
> + if (sparse && err > 0)
> + err = ceph_sparse_ext_map_end(op);
> + if (err < subreq->len &&
> + subreq->rreq->origin != NETFS_DIO_READ)
> + __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
> + if (IS_ENCRYPTED(inode) && err > 0) {
> +#if 0
> + err = ceph_fscrypt_decrypt_extents(inode, osd_data->dbuf,
> + subreq->start,
> + op->extent.sparse_ext,
> + op->extent.sparse_ext_cnt);
> + if (err > subreq->len)
> + err = subreq->len;
> +#else
> + pr_err("TODO: Content-decrypt currently disabled\n");
> + err = -EOPNOTSUPP;
> +#endif
> + }
> + }
> +
> + if (err > 0) {
> + subreq->transferred = err;
> + err = 0;
> + }
> +
> + subreq->error = err;
So, is err an error code or not? :)
> + trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress);
> + ceph_dec_osd_stopping_blocker(fsc->mdsc);
> + netfs_read_subreq_terminated(subreq);
> +}
> +
> +static void ceph_rmw_read_done(struct netfs_io_request *wreq, struct netfs_io_request *rreq)
> +{
> + struct ceph_io_request *cwreq = container_of(wreq, struct ceph_io_request, rreq);
> + struct ceph_io_request *crreq = container_of(rreq, struct ceph_io_request, rreq);
> +
> + cwreq->rmw_assert_version = crreq->rmw_assert_version;
> +}
> +
> +static bool ceph_netfs_issue_read_inline(struct netfs_io_subrequest *subreq)
> +{
> + struct netfs_io_request *rreq = subreq->rreq;
> + struct inode *inode = rreq->inode;
> + struct ceph_mds_reply_info_parsed *rinfo;
> + struct ceph_mds_reply_info_in *iinfo;
> + struct ceph_mds_request *req;
> + struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
> + struct ceph_inode_info *ci = ceph_inode(inode);
> + ssize_t err = 0;
> + size_t len, copied;
> + int mode;
> +
> + __clear_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags);
> +
> + if (subreq->start >= inode->i_size)
Maybe i_size_read(inode)?
> + goto out;
> +
> + /* We need to fetch the inline data. */
> + mode = ceph_try_to_choose_auth_mds(inode, CEPH_STAT_CAP_INLINE_DATA);
> + req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_GETATTR, mode);
> + if (IS_ERR(req)) {
> + err = PTR_ERR(req);
> + goto out;
> + }
> + req->r_ino1 = ci->i_vino;
> + req->r_args.getattr.mask = cpu_to_le32(CEPH_STAT_CAP_INLINE_DATA);
> + req->r_num_caps = 2;
> +
> + trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
> + err = ceph_mdsc_do_request(mdsc, NULL, req);
> + if (err < 0)
> + goto out;
> +
> + rinfo = &req->r_reply_info;
> + iinfo = &rinfo->targeti;
> + if (iinfo->inline_version == CEPH_INLINE_NONE) {
> + /* The data got uninlined */
> + ceph_mdsc_put_request(req);
> + return false;
> + }
> +
> + len = umin(iinfo->inline_len - subreq->start, subreq->len);
> + copied = copy_to_iter(iinfo->inline_data + subreq->start, len, &subreq->io_iter);
> + if (copied) {
> + subreq->transferred += copied;
> + if (copied == len)
> + __set_bit(NETFS_SREQ_HIT_EOF, &subreq->flags);
> + subreq->error = 0;
> + } else {
> + subreq->error = -EFAULT;
> + }
> +
> + ceph_mdsc_put_request(req);
> +out:
> + netfs_read_subreq_terminated(subreq);
> + return true;
> +}
> +
> +static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq)
> +{
> + struct netfs_io_request *rreq = subreq->rreq;
> + struct inode *inode = rreq->inode;
> + struct ceph_inode_info *ci = ceph_inode(inode);
> + struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> + struct ceph_client *cl = fsc->client;
> + struct ceph_osd_request *req = NULL;
> + struct ceph_vino vino = ceph_vino(inode);
> + int extent_cnt;
> + bool sparse = IS_ENCRYPTED(inode) || ceph_test_mount_opt(fsc, SPARSEREAD);
> + u64 off = subreq->start, len = subreq->len;
> + int err = 0;
> +
> + if (ceph_inode_is_shutdown(inode)) {
> + err = -EIO;
> + goto out;
> + }
> +
> + if (ceph_has_inline_data(ci) && ceph_netfs_issue_read_inline(subreq))
> + return;
> +
> + req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout, vino,
> + off, &len, 0, 1,
> + sparse ? CEPH_OSD_OP_SPARSE_READ : CEPH_OSD_OP_READ,
> + CEPH_OSD_FLAG_READ, /* read_from_replica will be or'd in */
> + NULL, ci->i_truncate_seq, ci->i_truncate_size, false);
> + if (IS_ERR(req)) {
> + err = PTR_ERR(req);
> + req = NULL;
> + goto out;
> + }
> +
> + if (sparse) {
> + extent_cnt = __ceph_sparse_read_ext_count(inode, len);
> + err = ceph_alloc_sparse_ext_map(&req->r_ops[0], extent_cnt);
> + if (err)
> + goto out;
> + }
> +
> + doutc(cl, "%llx.%llx pos=%llu orig_len=%zu len=%llu\n",
> + ceph_vinop(inode), subreq->start, subreq->len, len);
> +
> + osd_req_op_extent_osd_iter(req, 0, &subreq->io_iter);
> + if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
> + err = -EIO;
> + goto out;
> + }
> + req->r_callback = ceph_netfs_read_callback;
> + req->r_priv = subreq;
> + req->r_inode = inode;
> +
> + trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
> + ceph_osdc_start_request(req->r_osdc, req);
> +out:
> + ceph_osdc_put_request(req);
> + doutc(cl, "%llx.%llx result %d\n", ceph_vinop(inode), err);
> + if (err) {
> + subreq->error = err;
> + netfs_read_subreq_terminated(subreq);
> + }
> +}
> +
> +static int ceph_init_request(struct netfs_io_request *rreq, struct file *file)
> +{
> + struct ceph_io_request *priv = container_of(rreq, struct ceph_io_request, rreq);
> + struct inode *inode = rreq->inode;
> + struct ceph_client *cl = ceph_inode_to_client(inode);
> + struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> + int got = 0, want = CEPH_CAP_FILE_CACHE;
> + int ret = 0;
> +
> + rreq->rsize = 1024 * 1024;
Why do we hardcode the rreq->rsize value?
struct ceph_mount_options {
	unsigned int flags;

	unsigned int wsize;            /* max write size */
	unsigned int rsize;            /* max read size */
	unsigned int rasize;           /* max readahead */
	unsigned int congestion_kb;    /* max writeback in flight */
	unsigned int caps_wanted_delay_min, caps_wanted_delay_max;
	int caps_max;
	unsigned int max_readdir;       /* max readdir result (entries) */
	unsigned int max_readdir_bytes; /* max readdir result (bytes) */
	bool new_dev_syntax;

	/*
	 * everything above this point can be memcmp'd; everything below
	 * is handled in compare_mount_options()
	 */

	char *snapdir_name;   /* default ".snap" */
	char *mds_namespace;  /* default NULL */
	char *server_path;    /* default NULL (means "/") */
	char *fscache_uniq;   /* default NULL */
	char *mon_addr;
	struct fscrypt_dummy_policy dummy_enc_policy;
};
Why don't we use fsc->mount_options->rsize?
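I.e. something like (assuming the mount option has already been validated at
mount time):

	rreq->rsize = fsc->mount_options->rsize;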
> + rreq->wsize = umin(i_blocksize(inode), fsc->mount_options->wsize);
> +
> + switch (rreq->origin) {
> + case NETFS_READAHEAD:
> + goto init_readahead;
> + case NETFS_WRITEBACK:
> + case NETFS_WRITETHROUGH:
> + case NETFS_UNBUFFERED_WRITE:
> + case NETFS_DIO_WRITE:
> + if (S_ISREG(rreq->inode->i_mode))
> + rreq->io_streams[0].avail = true;
> + return 0;
> + default:
> + return 0;
> + }
> +
> +init_readahead:
> + /*
> + * If we are doing readahead triggered by a read, fault-in or
> + * MADV/FADV_WILLNEED, someone higher up the stack must be holding the
> + * FILE_CACHE and/or LAZYIO caps.
> + */
> + if (file) {
> + priv->file_ra_pages = file->f_ra.ra_pages;
> + priv->file_ra_disabled = file->f_mode & FMODE_RANDOM;
> + rreq->netfs_priv = priv;
> + return 0;
> + }
> +
> + /*
> + * readahead callers do not necessarily hold Fcb caps
> + * (e.g. fadvise, madvise).
> + */
> + ret = ceph_try_get_caps(inode, CEPH_CAP_FILE_RD, want, true, &got);
> + if (ret < 0) {
> + doutc(cl, "%llx.%llx, error getting cap\n", ceph_vinop(inode));
> + goto out;
> + }
> +
> + if (!(got & want)) {
> + doutc(cl, "%llx.%llx, no cache cap\n", ceph_vinop(inode));
> + ret = -EACCES;
> + goto out;
> + }
> + if (ret > 0)
> + priv->caps = got;
> + else
> + ret = -EACCES;
> +
> + rreq->io_streams[0].sreq_max_len = fsc->mount_options->rsize;
> +out:
> + return ret;
> +}
> +
> +static void ceph_netfs_free_request(struct netfs_io_request *rreq)
> +{
> + struct ceph_io_request *creq = container_of(rreq, struct ceph_io_request, rreq);
> +
> + if (creq->caps)
> + ceph_put_cap_refs(ceph_inode(rreq->inode), creq->caps);
> +}
> +
> +const struct netfs_request_ops ceph_netfs_ops = {
> + .init_request = ceph_init_request,
> + .free_request = ceph_netfs_free_request,
> + .expand_readahead = ceph_netfs_expand_readahead,
> + .prepare_read = ceph_netfs_prepare_read,
> + .issue_read = ceph_netfs_issue_read,
> + .rmw_read_done = ceph_rmw_read_done,
> + .post_modify = ceph_netfs_post_modify,
> + .prepare_write = ceph_prepare_write,
> + .issue_write = ceph_issue_write,
> +};
> +
> +/*
> + * Get ref for the oldest snapc for an inode with dirty data... that is, the
> + * only snap context we are allowed to write back.
> + */
> +static struct ceph_snap_context *
> +ceph_get_oldest_context(struct inode *inode, struct ceph_writeback_ctl *ctl,
> + struct ceph_snap_context *folio_snapc)
> +{
> + struct ceph_snap_context *snapc = NULL;
> + struct ceph_inode_info *ci = ceph_inode(inode);
> + struct ceph_cap_snap *capsnap = NULL;
> + struct ceph_client *cl = ceph_inode_to_client(inode);
> +
> + spin_lock(&ci->i_ceph_lock);
> + list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) {
> + doutc(cl, " capsnap %p snapc %p has %d dirty pages\n",
> + capsnap, capsnap->context, capsnap->dirty_pages);
> + if (!capsnap->dirty_pages)
> + continue;
> +
> + /* get i_size, truncate_{seq,size} for folio_snapc? */
> + if (snapc && capsnap->context != folio_snapc)
> + continue;
> +
> + if (ctl) {
> + if (capsnap->writing) {
> + ctl->i_size = i_size_read(inode);
> + ctl->size_stable = false;
> + } else {
> + ctl->i_size = capsnap->size;
> + ctl->size_stable = true;
> + }
> + ctl->truncate_size = capsnap->truncate_size;
> + ctl->truncate_seq = capsnap->truncate_seq;
> + ctl->head_snapc = false;
> + }
> +
> + if (snapc)
> + break;
> +
> + snapc = ceph_get_snap_context(capsnap->context);
> + if (!folio_snapc ||
> + folio_snapc == snapc ||
> + folio_snapc->seq > snapc->seq)
> + break;
> + }
> + if (!snapc && ci->i_wrbuffer_ref_head) {
> + snapc = ceph_get_snap_context(ci->i_head_snapc);
> + doutc(cl, " head snapc %p has %d dirty pages\n", snapc,
> + ci->i_wrbuffer_ref_head);
> + if (ctl) {
> + ctl->i_size = i_size_read(inode);
> + ctl->truncate_size = ci->i_truncate_size;
> + ctl->truncate_seq = ci->i_truncate_seq;
> + ctl->size_stable = false;
> + ctl->head_snapc = true;
> + }
> + }
> + spin_unlock(&ci->i_ceph_lock);
> + return snapc;
> +}
> +
> +/*
> + * Flush dirty data. We have to start with the oldest snap as that's the only
> + * one we're allowed to write back.
> + */
> +static int ceph_writepages(struct address_space *mapping,
> + struct writeback_control *wbc)
> +{
> + struct ceph_writeback_ctl ceph_wbc;
> + struct ceph_snap_context *snapc;
> + struct ceph_inode_info *ci = ceph_inode(mapping->host);
> + loff_t actual_start = wbc->range_start, actual_end = wbc->range_end;
> + int ret;
> +
> + do {
> + snapc = ceph_get_oldest_context(mapping->host, &ceph_wbc, NULL);
> + if (snapc == ci->i_head_snapc) {
> + wbc->range_start = actual_start;
> + wbc->range_end = actual_end;
> + } else {
> + /* Do not respect wbc->range_{start,end}. Dirty pages
> + * in that range can be associated with a newer snapc.
> + * They are not writeable until all dirty pages
> + * associated with an older snapc have been written.
> + */
> + wbc->range_start = 0;
> + wbc->range_end = LLONG_MAX;
> + }
> +
> + ret = netfs_writepages_group(mapping, wbc, &snapc->group, &ceph_wbc);
> + ceph_put_snap_context(snapc);
> + if (snapc == ci->i_head_snapc)
> + break;
> + } while (ret == 0 && wbc->nr_to_write > 0);
> +
> + return ret;
> +}
> +
> +const struct address_space_operations ceph_aops = {
> + .read_folio = netfs_read_folio,
> + .readahead = netfs_readahead,
> + .writepages = ceph_writepages,
> + .dirty_folio = ceph_dirty_folio,
> + .invalidate_folio = netfs_invalidate_folio,
> + .release_folio = netfs_release_folio,
> + .direct_IO = noop_direct_IO,
> + .migrate_folio = filemap_migrate_folio,
> +};
> +
> +/*
> + * Wrap generic_file_aio_read with checks for cap bits on the inode.
> + * Atomically grab references, so that those bits are not released
> + * back to the MDS mid-read.
> + *
> + * Hmm, the sync read case isn't actually async... should it be?
> + */
> +ssize_t ceph_netfs_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> + struct file *filp = iocb->ki_filp;
> + struct inode *inode = file_inode(filp);
> + struct ceph_inode_info *ci = ceph_inode(inode);
> + struct ceph_file_info *fi = filp->private_data;
> + struct ceph_client *cl = ceph_inode_to_client(inode);
> + ssize_t ret;
> + size_t len = iov_iter_count(to);
> + bool dio = iocb->ki_flags & IOCB_DIRECT;
> + int want = 0, got = 0;
> +
> + doutc(cl, "%llu~%zu trying to get caps on %p %llx.%llx\n",
> + iocb->ki_pos, len, inode, ceph_vinop(inode));
> +
> + if (ceph_inode_is_shutdown(inode))
> + return -ESTALE;
> +
> + if (dio)
> + ret = netfs_start_io_direct(inode);
> + else
> + ret = netfs_start_io_read(inode);
> + if (ret < 0)
> + return ret;
> +
> + if (!(fi->flags & CEPH_F_SYNC) && !dio)
> + want |= CEPH_CAP_FILE_CACHE;
> + if (fi->fmode & CEPH_FILE_MODE_LAZY)
> + want |= CEPH_CAP_FILE_LAZYIO;
> +
> + ret = ceph_get_caps(filp, CEPH_CAP_FILE_RD, want, -1, &got);
> + if (ret < 0)
> + goto out;
> +
> + if ((got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) == 0 ||
> + dio ||
> + (fi->flags & CEPH_F_SYNC)) {
> + doutc(cl, "sync %p %llx.%llx %llu~%zu got cap refs on %s\n",
> + inode, ceph_vinop(inode), iocb->ki_pos, len,
> + ceph_cap_string(got));
> +
> + ret = netfs_unbuffered_read_iter(iocb, to);
> + } else {
> + doutc(cl, "async %p %llx.%llx %llu~%zu got cap refs on %s\n",
> + inode, ceph_vinop(inode), iocb->ki_pos, len,
> + ceph_cap_string(got));
> + ret = filemap_read(iocb, to, 0);
> + }
> +
> + doutc(cl, "%p %llx.%llx dropping cap refs on %s = %zd\n",
> + inode, ceph_vinop(inode), ceph_cap_string(got), ret);
> + ceph_put_cap_refs(ci, got);
> +
> +out:
> + if (dio)
> + netfs_end_io_direct(inode);
> + else
> + netfs_end_io_read(inode);
> + return ret;
> +}
> +
> +/*
> + * Get the most recent snap context in the list to which the inode subscribes.
> + * This is the only one we are allowed to modify. If a folio points to an
> + * earlier snapshot, it must be flushed first.
> + */
> +static struct ceph_snap_context *ceph_get_most_recent_snapc(struct inode *inode)
> +{
> + struct ceph_snap_context *snapc;
> + struct ceph_inode_info *ci = ceph_inode(inode);
> +
> + /* Get the snap this write is going to belong to. */
> + spin_lock(&ci->i_ceph_lock);
> + if (__ceph_have_pending_cap_snap(ci)) {
> + struct ceph_cap_snap *capsnap =
> + list_last_entry(&ci->i_cap_snaps,
> + struct ceph_cap_snap, ci_item);
> +
> + snapc = ceph_get_snap_context(capsnap->context);
> + } else {
> + BUG_ON(!ci->i_head_snapc);
> + snapc = ceph_get_snap_context(ci->i_head_snapc);
> + }
> + spin_unlock(&ci->i_ceph_lock);
> +
> + return snapc;
> +}
> +
> +/*
> + * Take cap references to avoid releasing caps to MDS mid-write.
> + *
> + * If we are synchronous, and write with an old snap context, the OSD
> + * may return EOLDSNAPC. In that case, retry the write... _after_
> + * dropping our cap refs and allowing the pending snap to logically
> + * complete _before_ this write occurs.
> + *
> + * If we are near ENOSPC, write synchronously.
> + */
> +ssize_t ceph_netfs_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> + struct file *file = iocb->ki_filp;
> + struct inode *inode = file_inode(file);
> + struct ceph_snap_context *snapc;
> + struct ceph_inode_info *ci = ceph_inode(inode);
> + struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> + struct ceph_file_info *fi = file->private_data;
> + struct ceph_osd_client *osdc = &fsc->client->osdc;
> + struct ceph_cap_flush *prealloc_cf;
> + struct ceph_client *cl = fsc->client;
> + ssize_t count, written = 0;
> + loff_t limit = max(i_size_read(inode), fsc->max_file_size);
Do we need to take into account the quota max bytes here?
struct ceph_inode_info {
	<skipped>
	/* quotas */
	u64 i_max_bytes, i_max_files;
	<skipped>
};
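For example, something like this (an untested sketch that reads i_max_bytes
directly and ignores the quota-realm traversal a real version would need):

	if (ci->i_max_bytes)
		limit = min_t(loff_t, limit, ci->i_max_bytes);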
> + loff_t pos;
> + bool direct_lock = false;
> + u64 pool_flags;
> + u32 map_flags;
> + int err, want = 0, got;
> +
> + if (ceph_inode_is_shutdown(inode))
> + return -ESTALE;
> +
> + if (ceph_snap(inode) != CEPH_NOSNAP)
> + return -EROFS;
> +
> + prealloc_cf = ceph_alloc_cap_flush();
> + if (!prealloc_cf)
> + return -ENOMEM;
> +
> + if ((iocb->ki_flags & (IOCB_DIRECT | IOCB_APPEND)) == IOCB_DIRECT)
> + direct_lock = true;
> +
> +retry_snap:
> + if (direct_lock)
> + netfs_start_io_direct(inode);
> + else
> + netfs_start_io_write(inode);
> +
> + if (iocb->ki_flags & IOCB_APPEND) {
> + err = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE, false);
> + if (err < 0)
> + goto out;
> + }
> +
> + err = generic_write_checks(iocb, from);
> + if (err <= 0)
> + goto out;
> +
> + pos = iocb->ki_pos;
> + if (unlikely(pos >= limit)) {
> + err = -EFBIG;
> + goto out;
> + } else {
> + iov_iter_truncate(from, limit - pos);
> + }
> +
> + count = iov_iter_count(from);
> + if (ceph_quota_is_max_bytes_exceeded(inode, pos + count)) {
> + err = -EDQUOT;
> + goto out;
> + }
> +
> + down_read(&osdc->lock);
> + map_flags = osdc->osdmap->flags;
> + pool_flags = ceph_pg_pool_flags(osdc->osdmap, ci->i_layout.pool_id);
> + up_read(&osdc->lock);
> + if ((map_flags & CEPH_OSDMAP_FULL) ||
> + (pool_flags & CEPH_POOL_FLAG_FULL)) {
> + err = -ENOSPC;
> + goto out;
> + }
> +
> + err = file_remove_privs(file);
> + if (err)
> + goto out;
> +
> + doutc(cl, "%p %llx.%llx %llu~%zd getting caps. i_size %llu\n",
> + inode, ceph_vinop(inode), pos, count,
> + i_size_read(inode));
> + if (!(fi->flags & CEPH_F_SYNC) && !direct_lock)
> + want |= CEPH_CAP_FILE_BUFFER;
> + if (fi->fmode & CEPH_FILE_MODE_LAZY)
> + want |= CEPH_CAP_FILE_LAZYIO;
> + got = 0;
> + err = ceph_get_caps(file, CEPH_CAP_FILE_WR, want, pos + count, &got);
> + if (err < 0)
> + goto out;
> +
> + err = file_update_time(file);
> + if (err)
> + goto out_caps;
> +
> + inode_inc_iversion_raw(inode);
> +
> + doutc(cl, "%p %llx.%llx %llu~%zd got cap refs on %s\n",
> + inode, ceph_vinop(inode), pos, count, ceph_cap_string(got));
> +
> + /* Get the snap this write is going to belong to. */
> + snapc = ceph_get_most_recent_snapc(inode);
> +
> + if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) == 0 ||
> + (iocb->ki_flags & IOCB_DIRECT) || (fi->flags & CEPH_F_SYNC) ||
> + (ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) {
> + struct iov_iter data;
> +
> + /* we might need to revert back to that point */
> + data = *from;
> + written = netfs_unbuffered_write_iter_locked(iocb, &data, &snapc->group);
> + if (direct_lock)
> + netfs_end_io_direct(inode);
> + else
> + netfs_end_io_write(inode);
> + if (written > 0)
> + iov_iter_advance(from, written);
> + ceph_put_snap_context(snapc);
> + } else {
> + /*
> + * No need to acquire the i_truncate_mutex, because the MDS
> + * revokes Fwb caps before sending a truncate message to us. We
> + * can't get the Fwb cap while there is a pending vmtruncate, so
> + * write and vmtruncate cannot run at the same time.
> + */
> + written = netfs_perform_write(iocb, from, &snapc->group, &prealloc_cf);
> + netfs_end_io_write(inode);
> + }
> +
> + if (written >= 0) {
> + int dirty;
> +
> + spin_lock(&ci->i_ceph_lock);
> + dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR,
> + &prealloc_cf);
> + spin_unlock(&ci->i_ceph_lock);
> + if (dirty)
> + __mark_inode_dirty(inode, dirty);
> + if (ceph_quota_is_max_bytes_approaching(inode, iocb->ki_pos))
> + ceph_check_caps(ci, CHECK_CAPS_FLUSH);
> + }
> +
> + doutc(cl, "%p %llx.%llx %llu~%u dropping cap refs on %s\n",
> + inode, ceph_vinop(inode), pos, (unsigned)count,
> + ceph_cap_string(got));
> + ceph_put_cap_refs(ci, got);
> +
> + if (written == -EOLDSNAPC) {
> + doutc(cl, "%p %llx.%llx %llu~%u" "got EOLDSNAPC, retrying\n",
> + inode, ceph_vinop(inode), pos, (unsigned)count);
> + goto retry_snap;
> + }
> +
> + if (written >= 0) {
> + if ((map_flags & CEPH_OSDMAP_NEARFULL) ||
> + (pool_flags & CEPH_POOL_FLAG_NEARFULL))
> + iocb->ki_flags |= IOCB_DSYNC;
> + written = generic_write_sync(iocb, written);
> + }
> +
> + goto out_unlocked;
> +out_caps:
> + ceph_put_cap_refs(ci, got);
> +out:
> + if (direct_lock)
> + netfs_end_io_direct(inode);
> + else
> + netfs_end_io_write(inode);
> +out_unlocked:
> + ceph_free_cap_flush(prealloc_cf);
> + return written ? written : err;
> +}
> +
> +vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
> +{
> + struct ceph_snap_context *snapc;
> + struct vm_area_struct *vma = vmf->vma;
> + struct inode *inode = file_inode(vma->vm_file);
> + struct ceph_client *cl = ceph_inode_to_client(inode);
> + struct ceph_inode_info *ci = ceph_inode(inode);
> + struct ceph_file_info *fi = vma->vm_file->private_data;
> + struct ceph_cap_flush *prealloc_cf;
> + struct folio *folio = page_folio(vmf->page);
> + loff_t size = i_size_read(inode);
> + loff_t off = folio_pos(folio);
> + size_t len = folio_size(folio);
> + int want, got, err;
> + vm_fault_t ret = VM_FAULT_SIGBUS;
> +
> + if (ceph_inode_is_shutdown(inode))
> + return ret;
> +
> + prealloc_cf = ceph_alloc_cap_flush();
> + if (!prealloc_cf)
> + return -ENOMEM;
> +
> + doutc(cl, "%llx.%llx %llu~%zd getting caps i_size %llu\n",
> + ceph_vinop(inode), off, len, size);
> + if (fi->fmode & CEPH_FILE_MODE_LAZY)
> + want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
> + else
> + want = CEPH_CAP_FILE_BUFFER;
> +
> + got = 0;
> + err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, off + len, &got);
> + if (err < 0)
> + goto out_free;
> +
> + doutc(cl, "%llx.%llx %llu~%zd got cap refs on %s\n", ceph_vinop(inode),
> + off, len, ceph_cap_string(got));
> +
> + /* Get the snap this write is going to belong to. */
> + snapc = ceph_get_most_recent_snapc(inode);
> +
> + ret = netfs_page_mkwrite(vmf, &snapc->group, &prealloc_cf);
> +
> + doutc(cl, "%llx.%llx %llu~%zd dropping cap refs on %s ret %x\n",
> + ceph_vinop(inode), off, len, ceph_cap_string(got), ret);
> + ceph_put_cap_refs_async(ci, got);
> +out_free:
> + ceph_free_cap_flush(prealloc_cf);
> + if (err < 0)
> + ret = vmf_error(err);
> + return ret;
> +}
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index 14784ad86670..acd5c4821ded 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -470,7 +470,7 @@ struct ceph_inode_info {
> #endif
> };
>
> -struct ceph_netfs_request_data {
> +struct ceph_netfs_request_data { // TODO: Remove
> int caps;
>
> /*
> @@ -483,6 +483,29 @@ struct ceph_netfs_request_data {
> bool file_ra_disabled;
> };
>
> +struct ceph_io_request {
> + struct netfs_io_request rreq;
> + u64 rmw_assert_version;
> + int caps;
> +
> + /*
> + * Maximum size of a file readahead request.
> + * The fadvise could update the bdi's default ra_pages.
> + */
> + unsigned int file_ra_pages;
> +
> + /* Set it if fadvise disables file readahead entirely */
> + bool file_ra_disabled;
> +};
> +
> +struct ceph_io_subrequest {
> + union {
> + struct netfs_io_subrequest sreq;
> + struct ceph_io_request *creq;
> + };
> + struct ceph_osd_request *req;
> +};
> +
> static inline struct ceph_inode_info *
> ceph_inode(const struct inode *inode)
> {
> @@ -1237,8 +1260,10 @@ extern void __ceph_touch_fmode(struct ceph_inode_info *ci,
> struct ceph_mds_client *mdsc, int fmode);
>
> /* addr.c */
> -extern const struct address_space_operations ceph_aops;
> +#if 0 // TODO: Remove after netfs conversion
> extern const struct netfs_request_ops ceph_netfs_ops;
> +#endif // TODO: Remove after netfs conversion
> +bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio);
> extern int ceph_mmap(struct file *file, struct vm_area_struct *vma);
> extern int ceph_uninline_data(struct file *file);
> extern int ceph_pool_perm_check(struct inode *inode, int need);
> @@ -1253,6 +1278,14 @@ static inline bool ceph_has_inline_data(struct ceph_inode_info *ci)
> return true;
> }
>
> +/* rdwr.c */
> +extern const struct netfs_request_ops ceph_netfs_ops;
> +extern const struct address_space_operations ceph_aops;
> +
> +ssize_t ceph_netfs_read_iter(struct kiocb *iocb, struct iov_iter *to);
> +ssize_t ceph_netfs_write_iter(struct kiocb *iocb, struct iov_iter *from);
> +vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf);
> +
> /* file.c */
> extern const struct file_operations ceph_file_fops;
>
> @@ -1260,9 +1293,11 @@ extern int ceph_renew_caps(struct inode *inode, int fmode);
> extern int ceph_open(struct inode *inode, struct file *file);
> extern int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> struct file *file, unsigned flags, umode_t mode);
> +#if 0 // TODO: Remove after netfs conversion
> extern ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
> struct iov_iter *to, int *retry_op,
> u64 *last_objver);
> +#endif
> extern int ceph_release(struct inode *inode, struct file *filp);
> extern void ceph_fill_inline_data(struct inode *inode, struct page *locked_page,
> char *data, size_t len);
> diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
> index 9724d5a1ddc7..a82eb3be9737 100644
> --- a/fs/netfs/internal.h
> +++ b/fs/netfs/internal.h
> @@ -264,9 +264,9 @@ static inline bool netfs_is_cache_enabled(struct netfs_inode *ctx)
> }
>
> /*
> - * Check to see if a buffer aligns with the crypto block size. If it doesn't
> - * the crypto layer is going to copy all the data - in which case relying on
> - * the crypto op for a free copy is pointless.
> + * Check to see if a buffer aligns with the crypto unit block size. If it
> + * doesn't the crypto layer is going to copy all the data - in which case
> + * relying on the crypto op for a free copy is pointless.
> */
> static inline bool netfs_is_crypto_aligned(struct netfs_io_request *rreq,
> struct iov_iter *iter)
> diff --git a/fs/netfs/main.c b/fs/netfs/main.c
> index 0900dea53e4a..d431ba261920 100644
> --- a/fs/netfs/main.c
> +++ b/fs/netfs/main.c
> @@ -139,7 +139,7 @@ static int __init netfs_init(void)
> goto error_folio_pool;
>
> netfs_request_slab = kmem_cache_create("netfs_request",
> - sizeof(struct netfs_io_request), 0,
> + NETFS_DEF_IO_REQUEST_SIZE, 0,
> SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
> NULL);
> if (!netfs_request_slab)
> @@ -149,7 +149,7 @@ static int __init netfs_init(void)
> goto error_reqpool;
>
> netfs_subrequest_slab = kmem_cache_create("netfs_subrequest",
> - sizeof(struct netfs_io_subrequest) + 16, 0,
> + NETFS_DEF_IO_SUBREQUEST_SIZE, 0,
> SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
> NULL);
> if (!netfs_subrequest_slab)
> diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
> index 9b8d99477405..091328596533 100644
> --- a/fs/netfs/write_issue.c
> +++ b/fs/netfs/write_issue.c
> @@ -652,7 +652,8 @@ int netfs_writepages_group(struct address_space *mapping,
> if (netfs_folio_group(folio) != NETFS_FOLIO_COPY_TO_CACHE &&
> unlikely(!test_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags))) {
> set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags);
> - wreq->netfs_ops->begin_writeback(wreq);
> + if (wreq->netfs_ops->begin_writeback)
> + wreq->netfs_ops->begin_writeback(wreq);
> }
>
> error = netfs_write_folio(wreq, wbc, folio);
> @@ -967,7 +968,8 @@ int netfs_writeback_single(struct address_space *mapping,
> trace_netfs_write(wreq, netfs_write_trace_writeback);
> netfs_stat(&netfs_n_wh_writepages);
>
> - if (__test_and_set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags))
> + if (__test_and_set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags) &&
> + wreq->netfs_ops->begin_writeback)
> wreq->netfs_ops->begin_writeback(wreq);
>
> for (fq = (struct folio_queue *)iter->folioq; fq; fq = fq->next) {
> diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
> index 733e7f93db66..0c626a7d32f4 100644
> --- a/include/linux/ceph/libceph.h
> +++ b/include/linux/ceph/libceph.h
> @@ -16,6 +16,7 @@
> #include <linux/writeback.h>
> #include <linux/slab.h>
> #include <linux/refcount.h>
> +#include <linux/netfs.h>
>
> #include <linux/ceph/types.h>
> #include <linux/ceph/messenger.h>
> @@ -161,7 +162,7 @@ static inline bool ceph_msgr2(struct ceph_client *client)
> * dirtied.
> */
> struct ceph_snap_context {
> - refcount_t nref;
> + struct netfs_group group;
> u64 seq;
> u32 num_snaps;
> u64 snaps[];
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 7eff589711cc..7f8d28b2c41b 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -246,6 +246,7 @@ struct ceph_osd_request {
> struct completion r_completion; /* private to osd_client.c */
> ceph_osdc_callback_t r_callback;
>
> + struct netfs_io_subrequest *r_subreq;
> struct inode *r_inode; /* for use by callbacks */
> struct list_head r_private_item; /* ditto */
> void *r_priv; /* ditto */
> diff --git a/include/linux/netfs.h b/include/linux/netfs.h
> index 4049c985b9b4..3253352fcbfa 100644
> --- a/include/linux/netfs.h
> +++ b/include/linux/netfs.h
> @@ -26,6 +26,14 @@ enum netfs_sreq_ref_trace;
> typedef struct mempool_s mempool_t;
> struct folio_queue;
>
> +/*
> + * Size of allocations for default netfs_io_(sub)request object slabs and
> + * mempools. If a filesystem's request and subrequest objects fit within this
> + * size, they can use these otherwise they must provide their own.
> + */
> +#define NETFS_DEF_IO_REQUEST_SIZE (sizeof(struct netfs_io_request) + 24)
Why do we hardcode 24 here? What about a named constant? And why 24 specifically?
> +#define NETFS_DEF_IO_SUBREQUEST_SIZE (sizeof(struct netfs_io_subrequest) + 16)
The same question about 16.
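E.g. something like (the constant names are made up):

	/* Extra space reserved for filesystem-private data in the default
	 * request/subrequest slabs.
	 */
	#define NETFS_IO_REQUEST_FS_PRIV_SIZE		24
	#define NETFS_IO_SUBREQUEST_FS_PRIV_SIZE	16

	#define NETFS_DEF_IO_REQUEST_SIZE \
		(sizeof(struct netfs_io_request) + NETFS_IO_REQUEST_FS_PRIV_SIZE)
	#define NETFS_DEF_IO_SUBREQUEST_SIZE \
		(sizeof(struct netfs_io_subrequest) + NETFS_IO_SUBREQUEST_FS_PRIV_SIZE)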
Thanks,
Slava.
> +
> /**
> * folio_start_private_2 - Start an fscache write on a folio. [DEPRECATED]
> * @folio: The folio.
> @@ -184,7 +192,10 @@ struct netfs_io_subrequest {
> struct list_head rreq_link; /* Link in req/stream::subrequests */
> struct list_head ioq_link; /* Link in io_stream::io_queue */
> union {
> - struct iov_iter io_iter; /* Iterator for this subrequest */
> + struct {
> + struct iov_iter io_iter; /* Iterator for this subrequest */
> + void *fs_private; /* Filesystem specific */
> + };
> struct {
> struct scatterlist src_sg; /* Source for crypto subreq */
> struct scatterlist dst_sg; /* Dest for crypto subreq */
> diff --git a/net/ceph/snapshot.c b/net/ceph/snapshot.c
> index e24315937c45..92f63cbca183 100644
> --- a/net/ceph/snapshot.c
> +++ b/net/ceph/snapshot.c
> @@ -17,6 +17,11 @@
> * the entire structure is freed.
> */
>
> +static void ceph_snap_context_kfree(struct netfs_group *group)
> +{
> + kfree(group);
> +}
> +
> /*
> * Create a new ceph snapshot context large enough to hold the
> * indicated number of snapshot ids (which can be 0). Caller has
> @@ -36,8 +41,9 @@ struct ceph_snap_context *ceph_create_snap_context(u32 snap_count,
> if (!snapc)
> return NULL;
>
> - refcount_set(&snapc->nref, 1);
> - snapc->num_snaps = snap_count;
> + refcount_set(&snapc->group.ref, 1);
> + snapc->group.free = ceph_snap_context_kfree;
> + snapc->num_snaps = snap_count;
>
> return snapc;
> }
> @@ -46,18 +52,14 @@ EXPORT_SYMBOL(ceph_create_snap_context);
> struct ceph_snap_context *ceph_get_snap_context(struct ceph_snap_context *sc)
> {
> if (sc)
> - refcount_inc(&sc->nref);
> + netfs_get_group(&sc->group);
> return sc;
> }
> EXPORT_SYMBOL(ceph_get_snap_context);
>
> void ceph_put_snap_context(struct ceph_snap_context *sc)
> {
> - if (!sc)
> - return;
> - if (refcount_dec_and_test(&sc->nref)) {
> - /*printk(" deleting snap_context %p\n", sc);*/
> - kfree(sc);
> - }
> + if (sc)
> + netfs_put_group(&sc->group);
> }
> EXPORT_SYMBOL(ceph_put_snap_context);
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 22/35] libceph, rbd: Convert ceph_osdc_notify() reply to ceph_databuf
2025-03-13 23:33 ` [RFC PATCH 22/35] libceph, rbd: Convert ceph_osdc_notify() reply to ceph_databuf David Howells
2025-03-19 0:08 ` Viacheslav Dubeyko
@ 2025-03-20 14:44 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-20 14:44 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: dhowells, Alex Markuze, slava@dubeyko.com,
linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > } else if (!completion_done(&lreq->notify_finish_wait)) {
> > - struct ceph_msg_data *data =
> > - msg->num_data_items ? &msg->data[0] : NULL;
> > -
> > - if (data) {
> > - if (lreq->preply_pages) {
> > - WARN_ON(data->type !=
> > - CEPH_MSG_DATA_PAGES);
> > - *lreq->preply_pages = data->pages;
> > - *lreq->preply_len = data->length;
> > - data->own_pages = false;
> > - }
> > + if (msg->num_data_items && lreq->reply) {
> > + struct ceph_msg_data *data = &msg->data[0];
>
> This low-level access slightly worries me. I don't see any real problem
> here. But maybe we should hide this access behind some iterator-like
> function? However, that may not be feasible within the scope of this patchset.
Yeah. This is something that precedes my changes and I think it needs fixing
separately from this series.
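In the meantime, a trivial accessor along these lines (purely illustrative;
not part of this series) would at least centralise the check:

	/* Hypothetical helper wrapping the low-level access shown above. */
	static struct ceph_msg_data *ceph_msg_first_data(struct ceph_msg *msg)
	{
		return msg->num_data_items ? &msg->data[0] : NULL;
	}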
David
^ permalink raw reply [flat|nested] 72+ messages in thread
* Why use plain numbers and totals rather than predef'd constants for RPC sizes?
2025-03-13 23:33 ` [RFC PATCH 23/35] rbd: Use ceph_databuf_enc_start/stop() David Howells
2025-03-19 0:32 ` Viacheslav Dubeyko
@ 2025-03-20 14:59 ` David Howells
2025-03-20 21:48 ` Viacheslav Dubeyko
1 sibling, 1 reply; 72+ messages in thread
From: David Howells @ 2025-03-20 14:59 UTC (permalink / raw)
To: Viacheslav Dubeyko, Ilya Dryomov
Cc: dhowells, Alex Markuze, slava@dubeyko.com,
linux-block@vger.kernel.org, jlayton@kernel.org,
linux-fsdevel@vger.kernel.org, ceph-devel@vger.kernel.org,
dongsheng.yang@easystack.cn, linux-kernel@vger.kernel.org
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > - dbuf = ceph_databuf_reply_alloc(1, 8 + sizeof(struct ceph_timespec), GFP_NOIO);
> > - if (!dbuf)
> > + request = ceph_databuf_reply_alloc(1, 8 + sizeof(struct ceph_timespec), GFP_NOIO);
>
> Ditto. Why do we have 8 + sizeof(struct ceph_timespec) here?
Because that's the size of the composite protocol element.
As to why it uses totals of plain integers and sizeofs rather than constant
macros, Ilya is the person to ask, according to git blame ;-).
I would probably prefer sizeof(__le64) here over 8, but I didn't want to
change it too far from the existing code.
If you want macro constants for these sorts of things, someone else who knows
the protocol better needs to do that. You could probably write something to
generate them (akin to rpcgen).
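Purely as an illustration, generated constants for the element above might
look like this (all of these names are invented here):

	#define CEPH_ENC_LE64_LEN	sizeof(__le64)
	#define CEPH_ENC_TIMESPEC_LEN	sizeof(struct ceph_timespec)
	#define CEPH_NOTIFY_REPLY_LEN	(CEPH_ENC_LE64_LEN + CEPH_ENC_TIMESPEC_LEN)

	request = ceph_databuf_reply_alloc(1, CEPH_NOTIFY_REPLY_LEN, GFP_NOIO);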
David
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 28/35] netfs: Adjust group handling
2025-03-13 23:33 ` [RFC PATCH 28/35] netfs: Adjust group handling David Howells
2025-03-19 18:57 ` Viacheslav Dubeyko
@ 2025-03-20 15:22 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-20 15:22 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: dhowells, Alex Markuze, slava@dubeyko.com,
linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > if (unlikely(group != netfs_group) &&
> > - group != NETFS_FOLIO_COPY_TO_CACHE)
> > + group != NETFS_FOLIO_COPY_TO_CACHE &&
> > + (group || folio_test_dirty(folio)))
>
> I am trying to follow this complex condition. Is there a case where the
> folio is dirty but we don't flush the content?
It's slightly complicated by fscache.
The way I have made local caching work for things that use netfslib fully is
that the writeback code copies the data to the cache. We achieve this by
marking the pages dirty when we read them from the server.
However, so that we don't *also* write the clean data back to the server, the
writeback group[*] field is set to a special value (NETFS_FOLIO_COPY_TO_CACHE)
and we assume that the writeback group will only actually be used by the
filesystem if the page is modified - in which case the writeback group field
is overwritten.
[*] This is either folio->private or in a netfs_folio struct attached to
folio->private. Note that folio->private is set to be removed in the future.
In the event that a page is modified it will be written back to the server(s)
and the cache, assuming there is a cache. Also note the netfs_io_stream
struct. There are two in the netfs_io_request struct and these are used to
separately manage and divide up the writes to a server and to the cache. I've
also left the possibility open that we can have more than two streams in the
event that we need to write the data to multiple servers.
Further, another reason for making writeback write the data to both the cache
and the server is that if you are using content encryption, the data is
encrypted and then the ciphertext is written to both the server and the cache.
> Is there a case where the folio is dirty but we don't flush the content?
Anyway, to answer the question more specifically, yes. If the folio is dirty
and in the same writeback group (e.g. most recent ceph snap context), then we
can presumably keep modifying it.
And if the folio is marked dirty and is marked NETFS_FOLIO_COPY_TO_CACHE, then
we can just overwrite it, replace or clear the NETFS_FOLIO_COPY_TO_CACHE mark
and then it just becomes a regular dirty page. It will get written to fscache
either way.
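To put that policy in rough code form (a sketch only, not the actual netfslib
logic; 'group' is what's attached to folio->private):

	if (group == NETFS_FOLIO_COPY_TO_CACHE) {
		/* Clean data queued only for copying to the cache: safe to
		 * overwrite; replacing/clearing the mark turns it into an
		 * ordinary dirty folio. */
	} else if (group == netfs_group && folio_test_dirty(folio)) {
		/* Same writeback group (e.g. the most recent snap context):
		 * keep modifying the folio in place. */
	} else {
		/* Different group: the old data must be flushed first. */
	}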
> > + if ((++flush_counter & 0xf) == 0xf)
> > + msleep(10);
>
> Do we really need to sleep? And why 10 ms? Even if we do want to sleep,
> it would be better to introduce a named constant. And what is the
> justification for 10 ms?
At the moment, debugging and stopping it from running wild in a tight loop
when a mistake is made. Remember: at this point, this is a WIP.
But in reality, we might see this if we're indulging in cache ping-pong
between two clients. I'm not sure how this might be mitigated in the ceph
environment - if that's not already done.
> > - kdebug("wrong group");
> > + kdebug("wrong group %px != %px", fgroup, wreq->group);
>
> I believe using %px is not very good practice. Do we really need to show
> the real pointer?
At some point I need to test interference from someone cranking the snaps and
I'll probably need this then - though it might be better to make a tracepoint
for it.
> > +/*
> > + * Get a ref on a netfs group attached to a dirty page (e.g. a ceph snap).
> > + */
> > +static inline struct netfs_group *netfs_get_group(struct netfs_group *netfs_group)
> > +{
> > + if (netfs_group && netfs_group != NETFS_FOLIO_COPY_TO_CACHE)
>
> netfs_group is a pointer. Is it correct to compare a pointer with the
> NETFS_FOLIO_COPY_TO_CACHE constant?
This constant?
#define NETFS_FOLIO_COPY_TO_CACHE ((struct netfs_group *)0x356UL) /* Write to the cache only */
Yes. See explanation above.
David
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 32/35] netfs: Add some more RMW support for ceph
2025-03-13 23:33 ` [RFC PATCH 32/35] netfs: Add some more RMW support for ceph David Howells
2025-03-19 19:14 ` Viacheslav Dubeyko
@ 2025-03-20 15:25 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-20 15:25 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: dhowells, Alex Markuze, slava@dubeyko.com,
linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > + rreq->buffer.iter = *iter;
>
> The struct iov_iter structure is complex, and we assign it by value to
> rreq->buffer.iter, so the original iterator will not see any subsequent
> changes. Is that the desired behavior here?
Yes. The buffer described by the iterator is going to get partitioned across
a number of subrequests, each of which will get a copy of the iterator
suitably advanced and truncated. As they may run in parallel, there's no way
for them to share the original iterator.
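Roughly, each subrequest carves its slice from a by-value copy, something
like this (a sketch; netfslib field names assumed):

	struct iov_iter io_iter = rreq->buffer.iter;	/* copy by value */

	iov_iter_advance(&io_iter, subreq->start - rreq->start);
	iov_iter_truncate(&io_iter, subreq->len);
	subreq->io_iter = io_iter;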
David
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 33/35] ceph: Use netfslib [INCOMPLETE]
2025-03-13 23:33 ` [RFC PATCH 33/35] ceph: Use netfslib [INCOMPLETE] David Howells
2025-03-19 19:54 ` Viacheslav Dubeyko
@ 2025-03-20 15:38 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-20 15:38 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: dhowells, Alex Markuze, slava@dubeyko.com,
linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > + fret = ceph_fscrypt_decrypt_pages(inode, &page[pgidx],
> > + off + pgsoff, ext->len);
> > + dout("%s: [%d] 0x%llx~0x%llx fret %d\n", __func__, i,
> > + ext->off, ext->len, fret);
> > + if (fret < 0) {
>
> > Possibly I am missing some logic here, but do we really need to introduce
> > fret? Why can't we use ret here?
>
> > + if (ret == 0)
> > + ret = fret;
> > + break;
> > + }
> > + ret = pgsoff + fret;
Because ret holds the amount of data so far decrypted. We should only return
an error if we didn't decrypt any (ie. ret == 0 at the time of error).
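It's the usual partial-progress pattern, sketched here (not the exact ceph
code; decrypt_one_extent() and num_extents are made up for illustration):

	int i, ret = 0;

	for (i = 0; i < num_extents; i++) {
		int fret = decrypt_one_extent(i);

		if (fret < 0) {
			if (ret == 0)
				ret = fret;	/* no progress: return the error */
			break;			/* otherwise return bytes done */
		}
		ret += fret;
	}
	return ret;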
> > +static int ceph_init_request(struct netfs_io_request *rreq, struct file *file)
> > +{
> > + struct ceph_io_request *priv = container_of(rreq, struct ceph_io_request, rreq);
> > + struct inode *inode = rreq->inode;
> > + struct ceph_client *cl = ceph_inode_to_client(inode);
> > + struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> > + int got = 0, want = CEPH_CAP_FILE_CACHE;
> > + int ret = 0;
> > +
> > + rreq->rsize = 1024 * 1024;
>
> Why do we hardcode the rreq->rsize value?
>
> struct ceph_mount_options {
> unsigned int flags;
>
> unsigned int wsize; /* max write size */
> unsigned int rsize; /* max read size */
> unsigned int rasize; /* max readahead */
> unsigned int congestion_kb; /* max writeback in flight */
> unsigned int caps_wanted_delay_min, caps_wanted_delay_max;
> int caps_max;
> unsigned int max_readdir; /* max readdir result (entries) */
> unsigned int max_readdir_bytes; /* max readdir result (bytes) */
>
> bool new_dev_syntax;
>
> /*
> * everything above this point can be memcmp'd; everything below
> * is handled in compare_mount_options()
> */
>
> char *snapdir_name; /* default ".snap" */
> char *mds_namespace; /* default NULL */
> char *server_path; /* default NULL (means "/") */
> char *fscache_uniq; /* default NULL */
> char *mon_addr;
> struct fscrypt_dummy_policy dummy_enc_policy;
> };
>
> Why don't we use fsc->mount_options->rsize?
Actually, I should get rid of rreq->rsize since there's now a function,
->prepare_read(), to deal with this.
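Something like this, perhaps (just a sketch, assuming the obvious field
names):

	static int ceph_prepare_read(struct netfs_io_subrequest *subreq)
	{
		struct ceph_fs_client *fsc =
			ceph_inode_to_fs_client(subreq->rreq->inode);

		/* Cap the slice at the mount's max read size rather than
		 * hardcoding rreq->rsize. */
		subreq->len = min_t(size_t, subreq->len,
				    fsc->mount_options->rsize);
		return 0;
	}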
> > + loff_t limit = max(i_size_read(inode), fsc->max_file_size);
>
> Do we need to take into account the quota max bytes here?
Could be.
> > +/*
> > + * Size of allocations for default netfs_io_(sub)request object slabs and
> > + * mempools. If a filesystem's request and subrequest objects fit within this
> > + * size, they can use these otherwise they must provide their own.
> > + */
> > +#define NETFS_DEF_IO_REQUEST_SIZE (sizeof(struct netfs_io_request) + 24)
>
> Why do we hardcode 24 here? What about a named constant? And why 24 specifically?
>
> > +#define NETFS_DEF_IO_SUBREQUEST_SIZE (sizeof(struct netfs_io_subrequest) + 16)
>
> The same question applies to 16.
See the comment. 24 allows three extra words and 16 allows two. Actually, I
should probably express these as multiples of sizeof(long). But this allows
netfslib to allocate (sub)request structs for ceph from the default mempools
by providing a little bit of extra space.
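Expressed as multiples of sizeof(long), that would become (same sizes on
64-bit, where sizeof(long) == 8):

	#define NETFS_DEF_IO_REQUEST_SIZE \
		(sizeof(struct netfs_io_request) + 3 * sizeof(long))
	#define NETFS_DEF_IO_SUBREQUEST_SIZE \
		(sizeof(struct netfs_io_subrequest) + 2 * sizeof(long))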
David
^ permalink raw reply [flat|nested] 72+ messages in thread
* RE: [RFC PATCH 04/35] ceph: Convert ceph_mds_request::r_pagelist to a databuf
2025-03-17 11:52 ` David Howells
@ 2025-03-20 20:34 ` Viacheslav Dubeyko
2025-03-20 22:01 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-20 20:34 UTC (permalink / raw)
To: slava@dubeyko.com, David Howells
Cc: dongsheng.yang@easystack.cn, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, linux-kernel@vger.kernel.org,
Alex Markuze, jlayton@kernel.org, idryomov@gmail.com,
linux-block@vger.kernel.org
On Mon, 2025-03-17 at 11:52 +0000, David Howells wrote:
> slava@dubeyko.com wrote:
>
> > > - err = ceph_pagelist_reserve(pagelist, len + val_size1 + 8);
> > > + err = ceph_databuf_reserve(dbuf, len + val_size1 + 8,
> > > + GFP_KERNEL);
> >
> > I know it's a simple change, but this len + val_size1 + 8 looks
> > confusing anyway. What does this hardcoded 8 mean? :)
>
> You tell me. The '8' is pre-existing.
>
Yeah, I know. I am simply thinking aloud that we need to rework the CephFS
code somehow to make it clearer and easier to understand. But that has no
relation to your change.
> > > - if (req->r_pagelist) {
> > > - iinfo.xattr_len = req->r_pagelist->length;
> > > - iinfo.xattr_data = req->r_pagelist->mapped_tail;
> > > + if (req->r_dbuf) {
> > > + iinfo.xattr_len = ceph_databuf_len(req->r_dbuf);
> > > + iinfo.xattr_data = kmap_ceph_databuf_page(req->r_dbuf, 0);
> >
> > Possibly, it's in another patch. Have we removed req->r_pagelist from
> > the structure?
>
> See patch 20 "libceph: Remove ceph_pagelist".
>
> It cannot be removed here as the kernel must still compile and work at this
> point.
>
> > Do we always have memory pages in ceph_databuf? How will
> > kmap_ceph_databuf_page() behave if it's not a memory page?
>
> Are there other sorts of pages?
>
My point is simple. I assumed that if ceph_databuf can handle multiple types
of memory representation, then it might hold more than just memory pages.
Potentially, CXL memory could require some special management in the future
(or maybe not). :) But if we always use regular memory pages under the
ceph_databuf abstraction, then I don't see any problem here.
> > Maybe we should hide kunmap_local() behind something like
> > kunmap_ceph_databuf_page()?
>
> Actually, probably better to rename kmap_ceph_databuf_page() to
> kmap_local_ceph_databuf().
>
> > Maybe it makes sense to call something like ceph_databuf_length()
> > instead of low-level access to dbuf->nr_bvec?
>
> Sounds reasonable. Better to hide the internal workings.
>
> > > + if (as_ctx->dbuf) {
> > > + req->r_dbuf = as_ctx->dbuf;
> > > + as_ctx->dbuf = NULL;
> >
> > Maybe we need something like a swap() method? :)
>
> I could point out that you were complaining about ceph_databuf_get() returning
> a pointer rather than void ;-).
>
> > > + dbuf = ceph_databuf_req_alloc(2, 0, GFP_KERNEL);
> >
> > So, do we allocate 2 items of zero length here?
>
> You don't. One is the bvec[] count (2) and the other is the amount of memory
> to preallocate (0) and attach to that bvec[].
>
Aaah. I see now. Thanks.
> Now, it may make sense to split the API calls to handle a number of different
> scenarios, e.g.: request with just protocol, no pages; request with just
> pages; request with both protocol bits and page list.
>
> > > + if (ceph_databuf_insert_frag(dbuf, 0, sizeof(*header), GFP_KERNEL) < 0)
> > > + goto out;
> > > + if (ceph_databuf_insert_frag(dbuf, 1, PAGE_SIZE, GFP_KERNEL) < 0)
> > > goto out;
> > >
> > > + iov_iter_bvec(&iter, ITER_DEST, &dbuf->bvec[1], 1, len);
> >
> > Is &dbuf->bvec[1] correct? Why do we work with item #1? I think it
> > looks confusing.
>
> Because you have a protocol element (in dbuf->bvec[0]) and a buffer (in
> dbuf->bvec[1]).
It sounds to me like we need two declarations (something like this):
#define PROTOCOL_ELEMENT_INDEX 0
#define BUFFER_INDEX 1
>
> An iterator is attached to the buffer and the iterator then conveys it to
> __ceph_sync_read() as the destination.
>
> If you look a few lines further on in the patch, you can see the first
> fragment being accessed:
>
> > + header = kmap_ceph_databuf_page(dbuf, 0);
> > +
>
> Note that, because the read buffer is very likely a whole page, I split them
> into separate sections rather than trying to allocate an order-1 page as that
> would be more likely to fail.
>
> > > - header.data_len = cpu_to_le32(8 + 8 + 4);
> > > - header.file_offset = 0;
> > > + header->data_len = cpu_to_le32(8 + 8 + 4);
> >
> > I have the same problem understanding this. What does this hardcoded
> > 8 + 8 + 4 value mean? :)
>
> You need to ask a ceph expert. This is nothing specifically to do with my
> changes. However, I suspect it's the size of the message element.
>
Yeah, I see. :)
> > > - memset(iov.iov_base + boff, 0, PAGE_SIZE - boff);
> > > + p = kmap_ceph_databuf_page(dbuf, 1);
> >
> > Maybe we need to introduce some constants to address the #0 and #1 pages?
> > Because #0 is the header and I assume #1 is some content.
>
> Whilst that might be useful, I don't know that 0 and 1 being header and
> content respectively always holds. I haven't checked, but there could even be
> a protocol trailer in some cases as well.
>
> > > - err = ceph_pagelist_reserve(pagelist,
> > > - 4 * 2 + name_len + as_ctx->lsmctx.len);
> > > + err = ceph_databuf_reserve(dbuf, 4 * 2 + name_len + as_ctx->lsmctx.len,
> > > + GFP_KERNEL);
> >
> > The 4 * 2 + name_len + as_ctx->lsmctx.len looks unclear to me. It will
> > be good to have some well-defined constants here.
>
> Again, nothing specifically to do with my changes.
>
I completely agree.
Thanks,
Slava.
^ permalink raw reply [flat|nested] 72+ messages in thread
* RE: [RFC PATCH 17/35] libceph, rbd: Use ceph_databuf encoding start/stop
2025-03-18 22:19 ` David Howells
@ 2025-03-20 21:45 ` Viacheslav Dubeyko
0 siblings, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-20 21:45 UTC (permalink / raw)
To: David Howells
Cc: dongsheng.yang@easystack.cn, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, slava@dubeyko.com,
linux-kernel@vger.kernel.org, Alex Markuze, jlayton@kernel.org,
idryomov@gmail.com, linux-block@vger.kernel.org
On Tue, 2025-03-18 at 22:19 +0000, David Howells wrote:
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
> > > - ceph_encode_string(&p, end, buf, len);
> > > + BUG_ON(p + sizeof(__le32) + len > end);
> >
> > Frankly speaking, it's hard to follow why sizeof(__le32) should be in the
> > equation. Maybe it makes sense to introduce some constant? A named
> > constant would make this calculation easier to understand.
>
> Look through the patch. It's done all over the place, even on parts I haven't
> touched. However, it's probably because of the way the string is encoded
> (4-byte LE length followed by the characters).
>
> It probably would make sense to use a calculation wrapper for this. I have
> this in fs/afs/yfsclient.c for example:
>
> static size_t xdr_strlen(unsigned int len)
> {
> return sizeof(__be32) + round_up(len, sizeof(__be32));
> }
>
> > > + BUG_ON(sizeof(__le64) + sizeof(__le32) + wsize > req->request->front_alloc_len);
> >
> > The same problem here. It's hard to follow this check with sizeof(__le64)
> > and sizeof(__le32) in the calculation. What do these numbers mean here?
>
> Presumably the sizes of the protocol elements in the marshalled data. If you
> want to clean all those up in some way, I can add your patch into my
> series;-).
>
Yeah, I'm considering making a similar cleanup. :)
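For the string case, a wrapper in the same spirit as xdr_strlen() might be
(a sketch; ceph strings are a 4-byte LE length followed by the bytes, with
no padding):

	static size_t ceph_string_enc_len(unsigned int len)
	{
		return sizeof(__le32) + len;
	}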
Thanks,
Slava.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Why use plain numbers and totals rather than predef'd constants for RPC sizes?
2025-03-20 14:59 ` Why use plain numbers and totals rather than predef'd constants for RPC sizes? David Howells
@ 2025-03-20 21:48 ` Viacheslav Dubeyko
0 siblings, 0 replies; 72+ messages in thread
From: Viacheslav Dubeyko @ 2025-03-20 21:48 UTC (permalink / raw)
To: idryomov@gmail.com, David Howells
Cc: dongsheng.yang@easystack.cn, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, slava@dubeyko.com,
linux-kernel@vger.kernel.org, Alex Markuze, jlayton@kernel.org,
linux-block@vger.kernel.org
On Thu, 2025-03-20 at 14:59 +0000, David Howells wrote:
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
> > > - dbuf = ceph_databuf_reply_alloc(1, 8 + sizeof(struct ceph_timespec), GFP_NOIO);
> > > - if (!dbuf)
> > > + request = ceph_databuf_reply_alloc(1, 8 + sizeof(struct ceph_timespec), GFP_NOIO);
> >
> > Ditto. Why do we have 8 + sizeof(struct ceph_timespec) here?
>
> Because that's the size of the composite protocol element.
>
> As to why it's using a total of plain integers and sizeofs rather than
> constant macros, Ilya is the person to ask according to git blame;-).
>
> I would probably prefer sizeof(__le64) here over 8, but I didn't want to
> change it too far from the existing code.
>
> If you want macro constants for these sorts of things, someone else who knows
> the protocol better needs to do that. You could probably write something to
> generate them (akin to rpcgen).
>
Yes, make sense. I totally agree with you. :)
Thanks,
Slava.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 04/35] ceph: Convert ceph_mds_request::r_pagelist to a databuf
2025-03-17 11:52 ` David Howells
2025-03-20 20:34 ` Viacheslav Dubeyko
@ 2025-03-20 22:01 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: David Howells @ 2025-03-20 22:01 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: dhowells, slava@dubeyko.com, dongsheng.yang@easystack.cn,
linux-fsdevel@vger.kernel.org, ceph-devel@vger.kernel.org,
linux-kernel@vger.kernel.org, Alex Markuze, jlayton@kernel.org,
idryomov@gmail.com, linux-block@vger.kernel.org
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > > > + if (ceph_databuf_insert_frag(dbuf, 0, sizeof(*header), GFP_KERNEL) < 0)
> > > > + goto out;
> > > > + if (ceph_databuf_insert_frag(dbuf, 1, PAGE_SIZE, GFP_KERNEL) < 0)
> > > > goto out;
> > > >
> > > > + iov_iter_bvec(&iter, ITER_DEST, &dbuf->bvec[1], 1, len);
> > >
> > > Is &dbuf->bvec[1] correct? Why do we work with item #1? I think it
> > > looks confusing.
> >
> > Because you have a protocol element (in dbuf->bvec[0]) and a buffer (in
> > dbuf->bvec[1]).
>
> It sounds to me that we need to have two declarations (something like this):
>
> #define PROTOCOL_ELEMENT_INDEX 0
> #define BUFFER_INDEX 1
But that's specific to this particular usage. There may or may not be a
frag/page allocated to a protocol element, there may or may not be buffer
parts, and there may be multiple buffer parts. There could even be multiple
pages allocated to protocol elements.
David
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC PATCH 06/35] rbd: Use ceph_databuf for rbd_obj_read_sync()
2025-03-13 23:32 ` [RFC PATCH 06/35] rbd: Use ceph_databuf for rbd_obj_read_sync() David Howells
2025-03-17 19:08 ` Viacheslav Dubeyko
@ 2025-04-11 13:48 ` David Howells
1 sibling, 0 replies; 72+ messages in thread
From: David Howells @ 2025-04-11 13:48 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: dhowells, Alex Markuze, slava@dubeyko.com,
linux-block@vger.kernel.org, idryomov@gmail.com,
jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
ceph-devel@vger.kernel.org, dongsheng.yang@easystack.cn,
linux-kernel@vger.kernel.org
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > + dbuf = ceph_databuf_req_alloc(1, sizeof(*ondisk), GFP_KERNEL);
>
> I am slightly worried about this use of the ondisk variable. At this point
> the ondisk pointer still holds garbage, and dereferencing it could look
> confusing here. Also, compilers and static analysis tools could potentially
> complain. I don't see a problem here, but it still worries me. :)
It's a sizeof() construct, so the pointer is never actually dereferenced. We
do this all the time:
struct fred *p;
p = kmalloc(sizeof(*p), GFP_KERNEL);
David
^ permalink raw reply [flat|nested] 72+ messages in thread
end of thread
Thread overview: 72+ messages
2025-03-13 23:32 [RFC PATCH 00/35] ceph, rbd, netfs: Make ceph fully use netfslib David Howells
2025-03-13 23:32 ` [RFC PATCH 01/35] ceph: Fix incorrect flush end position calculation David Howells
2025-03-13 23:32 ` [RFC PATCH 02/35] libceph: Rename alignment to offset David Howells
2025-03-14 19:04 ` Viacheslav Dubeyko
2025-03-14 20:01 ` David Howells
2025-03-13 23:32 ` [RFC PATCH 03/35] libceph: Add a new data container type, ceph_databuf David Howells
2025-03-14 20:06 ` Viacheslav Dubeyko
2025-03-17 11:27 ` David Howells
2025-03-13 23:32 ` [RFC PATCH 04/35] ceph: Convert ceph_mds_request::r_pagelist to a databuf David Howells
2025-03-14 22:27 ` slava
2025-03-17 11:52 ` David Howells
2025-03-20 20:34 ` Viacheslav Dubeyko
2025-03-20 22:01 ` David Howells
2025-03-13 23:32 ` [RFC PATCH 05/35] libceph: Add functions to add ceph_databufs to requests David Howells
2025-03-13 23:32 ` [RFC PATCH 06/35] rbd: Use ceph_databuf for rbd_obj_read_sync() David Howells
2025-03-17 19:08 ` Viacheslav Dubeyko
2025-04-11 13:48 ` David Howells
2025-03-13 23:32 ` [RFC PATCH 07/35] libceph: Change ceph_osdc_call()'s reply to a ceph_databuf David Howells
2025-03-17 19:41 ` Viacheslav Dubeyko
2025-03-17 22:12 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 08/35] libceph: Unexport osd_req_op_cls_request_data_pages() David Howells
2025-03-13 23:33 ` [RFC PATCH 09/35] libceph: Remove osd_req_op_cls_response_data_pages() David Howells
2025-03-13 23:33 ` [RFC PATCH 10/35] libceph: Convert notify_id_pages to a ceph_databuf David Howells
2025-03-13 23:33 ` [RFC PATCH 11/35] ceph: Use ceph_databuf in DIO David Howells
2025-03-17 20:03 ` Viacheslav Dubeyko
2025-03-17 22:26 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 12/35] libceph: Bypass the messenger-v1 Tx loop for databuf/iter data blobs David Howells
2025-03-13 23:33 ` [RFC PATCH 13/35] rbd: Switch from using bvec_iter to iov_iter David Howells
2025-03-18 19:38 ` Viacheslav Dubeyko
2025-03-18 22:13 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 14/35] libceph: Remove bvec and bio data container types David Howells
2025-03-13 23:33 ` [RFC PATCH 15/35] libceph: Make osd_req_op_cls_init() use a ceph_databuf and map it David Howells
2025-03-13 23:33 ` [RFC PATCH 16/35] libceph: Convert req_page of ceph_osdc_call() to ceph_databuf David Howells
2025-03-13 23:33 ` [RFC PATCH 17/35] libceph, rbd: Use ceph_databuf encoding start/stop David Howells
2025-03-18 19:59 ` Viacheslav Dubeyko
2025-03-18 22:19 ` David Howells
2025-03-20 21:45 ` Viacheslav Dubeyko
2025-03-13 23:33 ` [RFC PATCH 18/35] libceph, rbd: Convert some page arrays to ceph_databuf David Howells
2025-03-18 20:02 ` Viacheslav Dubeyko
2025-03-18 22:25 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 19/35] libceph, ceph: Convert users of ceph_pagelist " David Howells
2025-03-18 20:09 ` Viacheslav Dubeyko
2025-03-18 22:27 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 20/35] libceph: Remove ceph_pagelist David Howells
2025-03-13 23:33 ` [RFC PATCH 21/35] libceph: Make notify code use ceph_databuf_enc_start/stop David Howells
2025-03-18 20:12 ` Viacheslav Dubeyko
2025-03-18 22:36 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 22/35] libceph, rbd: Convert ceph_osdc_notify() reply to ceph_databuf David Howells
2025-03-19 0:08 ` Viacheslav Dubeyko
2025-03-20 14:44 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 23/35] rbd: Use ceph_databuf_enc_start/stop() David Howells
2025-03-19 0:32 ` Viacheslav Dubeyko
2025-03-20 14:59 ` Why use plain numbers and totals rather than predef'd constants for RPC sizes? David Howells
2025-03-20 21:48 ` Viacheslav Dubeyko
2025-03-13 23:33 ` [RFC PATCH 24/35] ceph: Make ceph_calc_file_object_mapping() return size as size_t David Howells
2025-03-13 23:33 ` [RFC PATCH 25/35] ceph: Wrap POSIX_FADV_WILLNEED to get caps David Howells
2025-03-13 23:33 ` [RFC PATCH 26/35] ceph: Kill ceph_rw_context David Howells
2025-03-13 23:33 ` [RFC PATCH 27/35] netfs: Pass extra write context to write functions David Howells
2025-03-13 23:33 ` [RFC PATCH 28/35] netfs: Adjust group handling David Howells
2025-03-19 18:57 ` Viacheslav Dubeyko
2025-03-20 15:22 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 29/35] netfs: Allow fs-private data to be handed through to request alloc David Howells
2025-03-13 23:33 ` [RFC PATCH 30/35] netfs: Make netfs_page_mkwrite() use folio_mkwrite_check_truncate() David Howells
2025-03-13 23:33 ` [RFC PATCH 31/35] netfs: Fix netfs_unbuffered_read() to return ssize_t rather than int David Howells
2025-03-13 23:33 ` [RFC PATCH 32/35] netfs: Add some more RMW support for ceph David Howells
2025-03-19 19:14 ` Viacheslav Dubeyko
2025-03-20 15:25 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 33/35] ceph: Use netfslib [INCOMPLETE] David Howells
2025-03-19 19:54 ` Viacheslav Dubeyko
2025-03-20 15:38 ` David Howells
2025-03-13 23:33 ` [RFC PATCH 34/35] ceph: Enable multipage folios for ceph files David Howells
2025-03-13 23:33 ` [RFC PATCH 35/35] ceph: Remove old I/O API bits David Howells