From: David Howells <dhowells@redhat.com>
To: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Cc: dhowells@redhat.com, Alex Markuze <amarkuze@redhat.com>,
"slava@dubeyko.com" <slava@dubeyko.com>,
"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
"idryomov@gmail.com" <idryomov@gmail.com>,
"jlayton@kernel.org" <jlayton@kernel.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>,
"dongsheng.yang@easystack.cn" <dongsheng.yang@easystack.cn>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 03/35] libceph: Add a new data container type, ceph_databuf
Date: Mon, 17 Mar 2025 11:27:37 +0000
Message-ID: <2160750.1742210857@warthog.procyon.org.uk>
In-Reply-To: <1bab7ad752df6f2fa953fbf8eed8370e10344ff7.camel@ibm.com>
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > +struct ceph_databuf {
> > + struct bio_vec *bvec; /* List of pages */
>
> So, maybe we need to think about folios now?
Yeah, I know... but struct bio_vec has a page pointer and may point to
non-folio type pages. This stuff is still undergoing evolution as Willy works
on reducing struct page.
What I'm pondering is changing struct folio_queue to take a list of { folio,
offset, len } rather than using a folio_batch with a simple list of folios.
It doesn't necessarily help with DIO, though, but there we're given an
iterator we're required to use.
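Something like this, perhaps (purely a sketch with invented names, not what the
structure looks like today):

	struct folio_queue_slot {
		struct folio	*folio;		/* Folio holding the data */
		unsigned int	offset;		/* Start of data within the folio */
		unsigned int	len;		/* Amount of data */
	};

	struct folio_queue {
		struct folio_queue_slot	slots[PAGEVEC_SIZE]; /* Rather than a folio_batch */
		struct folio_queue	*next, *prev;	     /* Queue segment links */
	};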
One of the things I'd like to look at for ceph as well is using the page frag
allocator[*] to get small pieces of memory for stashing protocol data in
rather than allocating full-page buffers.
[*] Memory allocated from the page frag allocator can be used with
MSG_SPLICE_PAGES as its lifetime is controlled by the refcount. Now, we could
probably have a page frag allocator that uses folios rather than non-folio
pages for network filesystem use. That could be of use to afs and cifs also.
As I mentioned in a previous reply, how to better integrate folioq/bvec is
hopefully up for discussion at LSF/MM next week.
> > +static inline void ceph_databuf_append_page(struct ceph_databuf *dbuf,
> > + struct page *page,
> > + unsigned int offset,
> > + unsigned int len)
> > +{
> > + BUG_ON(dbuf->nr_bvec >= dbuf->max_bvec);
> > + bvec_set_page(&dbuf->bvec[dbuf->nr_bvec++], page, len, offset);
> > + dbuf->iter.count += len;
> > + dbuf->iter.nr_segs++;
>
> Why do we assign len to dbuf->iter.count but only increment
> dbuf->iter.nr_segs?
Um, because it doesn't? It adds len to dbuf->iter.count.
> > enum ceph_msg_data_type {
> > CEPH_MSG_DATA_NONE, /* message contains no data payload */
> > + CEPH_MSG_DATA_DATABUF, /* data source/destination is a data buffer */
> > CEPH_MSG_DATA_PAGES, /* data source/destination is a page array */
> > CEPH_MSG_DATA_PAGELIST, /* data source/destination is a pagelist */
>
> So, the final replacement on databuf will be in the future?
The result of each patch has to compile and work, right? But yes, several of
the patches in this series reduce the use of those other data types. I have
patches in progress to finally remove PAGES and PAGELIST, but they're not
quite compiling yet.
> > + dbuf = kzalloc(sizeof(*dbuf), gfp);
> > + if (!dbuf)
> > + return NULL;
>
> I am guessing... Should we return error code here?
The only error this function can return is ENOMEM, so it just returns NULL
like many other alloc functions.
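So a caller handles it the usual way, e.g. (sketch; the argument values here
are just for illustration):

	dbuf = ceph_databuf_alloc(1, PAGE_SIZE, GFP_NOFS);
	if (!dbuf)
		return -ENOMEM;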
> > + } else if (min_bvec) {
> > + min_bvec = umax(min_bvec, 16);
>
> Why 16 here? Maybe, do we need to introduce some well explained constant?
Fair point.
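Something like this, say (name made up on the spot):

	#define CEPH_DATABUF_MIN_BVECS 16	/* Smallest bvec[] we bother allocating */

		min_bvec = umax(min_bvec, CEPH_DATABUF_MIN_BVECS);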
> > + dbuf->max_bvec = min_bvec;
>
> Why do we assign min_bvec to max_bvec? I am simply slightly confused why
> argument of function is named as min_bvec, but finally we are saving min_bvec
> value into max_bvec.
The 'min_bvec' argument is the minimum number of bvecs that the caller needs
to be allocated. This may get rounded up so that we make use of the whole
piece of memory that the slab allocator is going to give us.
'dbuf->max_bvec' is the maximum number of entries that can be used in
dbuf->bvec[] and is a property of the databuf object.
> > +struct ceph_databuf *ceph_databuf_get(struct ceph_databuf *dbuf)
>
> I see the point here. But do we really need to return pointer? Why not simply:
>
> void ceph_databuf_get(struct ceph_databuf *dbuf)
It means I can do:
foo->databuf = ceph_databuf_get(dbuf);
rather than:
ceph_databuf_get(dbuf);
foo->databuf = dbuf;
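The implementation is just a ref-bump that hands the pointer back - something
like this, assuming the refcount is a kref (field name invented here):

	struct ceph_databuf *ceph_databuf_get(struct ceph_databuf *dbuf)
	{
		kref_get(&dbuf->kref);
		return dbuf;
	}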
> > +static int ceph_databuf_expand(struct ceph_databuf *dbuf, size_t req_bvec,
> > + gfp_t gfp)
> > +{
> > + struct bio_vec *bvec = dbuf->bvec, *old = bvec;
>
> I think that assigning (*old = bvec) looks confusing if we keep it on the same
> line as bvec declaration and initialization. Why do not declare and not
> initialize it on the next line?
>
> > + size_t size, max_bvec, off = dbuf->iter.bvec - old;
>
> I think it's too much declarations on the same line. Why not:
>
> size_t size, max_bvec;
> size_t off = dbuf->iter.bvec - old;
A matter of personal preference, I guess.
> > + bvec = dbuf->bvec;
> > + while (dbuf->nr_bvec < req_bvec) {
> > + struct page *pages[16];
>
> Why do we hardcoded 16 here but using some well defined constant?
Because this is only about stack usage. alloc_pages_bulk() takes a straight
array of page*; we have a bvec[], so we need an intermediate. Now, I could
actually just overlay the array over the tail of the bvec[] and do a single
bulk allocation since sizeof(struct page *) < sizeof(struct bio_vec).
> And, again, why not folio?
I don't think there's a bulk folio allocator. Quite possibly there *should*
be so that readahead can use it - one that allocates different sizes of folios
to fill the space required.
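Say, something along these lines (inventing an interface that doesn't exist,
purely for the sake of argument):

	/* Allocate a mix of folio sizes until 'space' bytes are covered,
	 * appending them to *batch; returns the number of bytes obtained.
	 */
	size_t alloc_folios_bulk(gfp_t gfp, size_t space, struct folio_batch *batch);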
> > + size_t want = min(req_bvec, ARRAY_SIZE(pages)), got;
> > +
> > + memset(pages, 0, sizeof(pages));
> > + got = alloc_pages_bulk(gfp, want, pages);
> > + if (!got)
> > + return -ENOMEM;
> > + for (i = 0; i < got; i++)
>
> Why do we use size_t for i and got? Why not int, for example?
alloc_pages_bulk() doesn't return an int. Now, one could legitimately argue
that I should use "unsigned long" rather than "size_t", but I wouldn't use int
here. int is smaller and signed. Granted, it's unlikely we'll be asked for >2G
pages, but if we're going to assign it down to an int, it probably needs to be
checked first.
> > + bvec_set_page(&bvec[dbuf->nr_bvec + i], pages[i],
> > + PAGE_SIZE, 0);
> > + dbuf->iter.nr_segs += got;
> > + dbuf->nr_bvec += got;
>
> If I understood correctly, the ceph_databuf_append_page() uses slightly
> different logic.
Can you elaborate?
> + dbuf->iter.count += len;
> + dbuf->iter.nr_segs++;
>
> But here we assign number of allocated pages to nr_segs. It is slightly
> confusing. I think I am missing something here.
Um - it's an increment?
I think part of the problem might be that we're mixing two things within the
same container: Partial pages that get kmapped and accessed directly
(e.g. protocol bits) and pages that get accessed indirectly (e.g. data
buffers). Maybe this needs to be made more explicit in the API.
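For example (function names purely hypothetical):

	/* Reserve a small, kmappable slice for protocol bits. */
	void *ceph_databuf_reserve_meta(struct ceph_databuf *dbuf, size_t len);

	/* Attach a page of I/O data, only ever accessed through the iterator. */
	void ceph_databuf_attach_data(struct ceph_databuf *dbuf, struct page *page,
				      unsigned int offset, unsigned int len);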
David