public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed
* Ceph and Netfslib
@ 2024-12-18 18:33 David Howells
  2024-12-18 18:47 ` Patrick Donnelly
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: David Howells @ 2024-12-18 18:33 UTC (permalink / raw)
  To: Alex Markuze, Viacheslav Dubeyko
  Cc: dhowells, Ilya Dryomov, Xiubo Li, Jeff Layton, ceph-devel, netfs,
	linux-fsdevel

Hi Alex, Slava,

I don't know whether you know, but I'm working on netfslib-ising ceph with an
eye to moving all the VFS/VM normal I/O interfaces to netfslib (->read_iter,
->write_iter, ->readahead, ->read_folio, ->page_mkwrite, ->writepages), though
with wrapping/methods by which each network filesystem can add its own
distinctive flavour.

Also, that would include doing things like content encryption, since that is
generally useful in filesystems and I have plans to support it in both AFS and
CIFS as well.

This means that fs/ceph/ will have practically nothing to do with page structs
or folio structs.  All that will be offloaded to netfslib and netfslib will
just hand iov_iters to the client filesystems, including ceph.

This will also allow me to massively simplify the networking code in
net/ceph/.  My aim is to replace the page array, page list, bio, etc. data
types in libceph with a single type that just conveys an iov_iter.  I have a
ceph_databuf type that holds a list of pages in the form of a bio_vec[], and
I can extract an iov_iter from that to pass to the networking.
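
To illustrate the shape of that with a sketch (this won't compile standalone, and the ceph_databuf field names here are illustrative guesses rather than the actual layout in my branch), the kernel's existing iov_iter_bvec() helper is all that's needed to hand the buffer to the networking as a single iterator:

```c
/* Sketch only: field names are illustrative, not the real layout. */
struct ceph_databuf {
	struct bio_vec	*bvec;		/* array of page fragments */
	unsigned int	nr_bvec;	/* number of elements in use */
	size_t		length;		/* total bytes described */
	struct iov_iter	iter;		/* iterator over the above */
};

/* Set up an iov_iter over the bio_vec[] for transmission or reception.
 * dir is ITER_SOURCE when sending and ITER_DEST when receiving.
 */
static void ceph_databuf_init_iter(struct ceph_databuf *dbuf, unsigned int dir)
{
	iov_iter_bvec(&dbuf->iter, dir, dbuf->bvec, dbuf->nr_bvec,
		      dbuf->length);
}
```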

Then, for the transmission side, the iov_iter will be passed to the TCP socket
with MSG_SPLICE_PAGES rather than iterating over the data type and passing a
page fragment at a time.  We fixed this up for nfsd and Chuck Lever reported an
improvement in throughput (15% if I remember correctly).

The patches I have so far can be found here:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=ceph-iter

Note that I have rbd working with the changes I've made to that point.


Anyway, ... I need to pick someone's brain about whether the per-page
tracking of snapshots within fs/ceph/ can be simplified.

Firstly, note that there may be a bug in ceph writeback cleanup as it stands.
It calls folio_detach_private() without holding the folio lock (it holds the
writeback lock, but that's not sufficient by MM rules).  This means you have a
race between { setting ->private, setting PG_private and inc refcount } on one
hand and { clearing ->private, clearing PG_private and dec refcount } on the
other.

Unfortunately, you cannot just take the page lock from writeback cleanup
without running the risk of deadlocking against ->writepages() wanting to take
PG_lock and then PG_writeback.  And you cannot drop PG_writeback first as the
moment you do that, the page can be deallocated.


Secondly, there's a counter, ci->i_wrbuffer_ref, that might actually be
redundant if we do it right as I_PINNING_NETFS_WB offers an alternative way we
might do things.  If we set this bit, ->write_inode() will be called with
wbc->unpinned_netfs_wb set when all currently dirty pages have been cleaned up
(see netfs_unpin_writeback()).  netfslib currently uses this to pin the
fscache objects but it could perhaps also be used to pin the writeback cap for
ceph.
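
As a rough sketch of how that might look (hypothetical code, not taken from any actual patch; it won't compile standalone, and it assumes ceph takes a cap reference when the inode first gains dirty pages):

```c
/* Sketch only: ->write_inode seeing the netfs "unpinned" notification. */
static int ceph_write_inode(struct inode *inode, struct writeback_control *wbc)
{
	struct ceph_inode_info *ci = ceph_inode(inode);

	if (wbc->unpinned_netfs_wb) {
		/* All currently dirty pages have been cleaned up, so drop
		 * the buffered-write cap reference that was taken when the
		 * inode first gained dirty pages.
		 */
		ceph_put_cap_refs(ci, CEPH_CAP_FILE_BUFFER);
	}
	return 0;
}
```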


Thirdly, I was under the impression that, for any given page/folio, only the
head snapshot could be altered - and that any older snapshot must be flushed
before we could allow that.


Fourthly, the ceph_snap_context struct holds a list of snaps.  Does it really
need to, or is just the most recent snap for which the folio holds changes
sufficient?

Thanks,
David


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Ceph and Netfslib
  2024-12-18 18:33 Ceph and Netfslib David Howells
@ 2024-12-18 18:47 ` Patrick Donnelly
  2024-12-18 19:36   ` David Howells
  2024-12-18 19:06 ` Viacheslav Dubeyko
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Patrick Donnelly @ 2024-12-18 18:47 UTC (permalink / raw)
  To: David Howells
  Cc: Alex Markuze, Viacheslav Dubeyko, Ilya Dryomov, Xiubo Li,
	Jeff Layton, ceph-devel, netfs, linux-fsdevel

Hello David,

On Wed, Dec 18, 2024 at 1:33 PM David Howells <dhowells@redhat.com> wrote:
> Also, that would include doing things like content encryption, since that is
> generally useful in filesystems and I have plans to support it in both AFS and
> CIFS as well.

Would this be done with fscrypt? Can you expand on this part?

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re:  Ceph and Netfslib
  2024-12-18 18:33 Ceph and Netfslib David Howells
  2024-12-18 18:47 ` Patrick Donnelly
@ 2024-12-18 19:06 ` Viacheslav Dubeyko
  2024-12-18 19:48   ` David Howells
  2024-12-18 19:43 ` David Howells
  2025-03-05 16:34 ` Is EOLDSNAPC actually generated? -- " David Howells
  3 siblings, 1 reply; 22+ messages in thread
From: Viacheslav Dubeyko @ 2024-12-18 19:06 UTC (permalink / raw)
  To: Alex Markuze, David Howells
  Cc: linux-fsdevel@vger.kernel.org, idryomov@gmail.com, Xiubo Li,
	jlayton@kernel.org, ceph-devel@vger.kernel.org,
	netfs@lists.linux.dev

Hi David,

On Wed, 2024-12-18 at 18:33 +0000, David Howells wrote:
> Hi Alex, Slava,
> 
> I don't know whether you know, but I'm working on netfslib-ising ceph with an
> eye to moving all the VFS/VM normal I/O interfaces to netfslib (->read_iter,
> ->write_iter, ->readahead, ->read_folio, ->page_mkwrite, ->writepages), though
> with wrapping/methods by which each network filesystem can add its own
> distinctive flavour.
> 
> Also, that would include doing things like content encryption, since that is
> generally useful in filesystems and I have plans to support it in both AFS and
> CIFS as well.
> 
> This means that fs/ceph/ will have practically nothing to do with page structs
> or folio structs.  All that will be offloaded to netfslib and netfslib will
> just hand iov_iters to the client filesystems, including ceph.
> 
> This will also allow me to massively simplify the networking code in
> net/ceph/.  My aim is to replace the page array, page list, bio, etc. data
> types in libceph with a single type that just conveys an iov_iter.  I have a
> ceph_databuf type that holds a list of pages in the form of a bio_vec[], and
> I can extract an iov_iter from that to pass to the networking.
> 

Sounds like a really good idea to me. Makes a lot of sense.

> Then, for the transmission side, the iov_iter will be passed to the TCP socket
> with MSG_SPLICE_PAGES rather than iterating over the data type and passing a
> page fragment at a time.  We fixed this up for nfsd and Chuck Lever reported an
> improvement in throughput (15% if I remember correctly).
> 
> The patches I have so far can be found here:
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=ceph-iter
> 
> Note that I have rbd working with the changes I've made to that point.
> 
> 
> Anyway, ... I need to pick someone's brain about whether the per-page
> tracking of snapshots within fs/ceph/ can be simplified.
> 
> Firstly, note that there may be a bug in ceph writeback cleanup as it stands.
> It calls folio_detach_private() without holding the folio lock (it holds the
> writeback lock, but that's not sufficient by MM rules).  This means you have a
> race between { setting ->private, setting PG_private and inc refcount } on one
> hand and { clearing ->private, clearing PG_private and dec refcount } on the
> other.
> 

I assume you mean the ceph_invalidate_folio() method. Am I correct here?

> Unfortunately, you cannot just take the page lock from writeback cleanup
> without running the risk of deadlocking against ->writepages() wanting to take
> PG_lock and then PG_writeback.  And you cannot drop PG_writeback first as the
> moment you do that, the page can be deallocated.
> 
> 
> Secondly, there's a counter, ci->i_wrbuffer_ref, that might actually be
> redundant if we do it right as I_PINNING_NETFS_WB offers an alternative way we
> might do things.  If we set this bit, ->write_inode() will be called with
> wbc->unpinned_netfs_wb set when all currently dirty pages have been cleaned up
> (see netfs_unpin_writeback()).  netfslib currently uses this to pin the
> fscache objects but it could perhaps also be used to pin the writeback cap for
> ceph.
> 

Yeah, ci->i_wrbuffer_ref doesn't look like a very reliable programming
pattern, and if we can do it another way, that would be a safer solution.
However, this counter is used in multiple places in the ceph code, so we
need to find a way to get rid of it that is safe and easy.

> 
> Thirdly, I was under the impression that, for any given page/folio, only the
> head snapshot could be altered - and that any older snapshot must be flushed
> before we could allow that.
> 
> 
> Fourthly, the ceph_snap_context struct holds a list of snaps.  Does it really
> need to, or is just the most recent snap for which the folio holds changes
> sufficient?
> 

Let me dive into the implementation details. Maybe, Alex can share more
details here.

Thanks,
Slava.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Ceph and Netfslib
  2024-12-18 18:47 ` Patrick Donnelly
@ 2024-12-18 19:36   ` David Howells
  0 siblings, 0 replies; 22+ messages in thread
From: David Howells @ 2024-12-18 19:36 UTC (permalink / raw)
  To: Patrick Donnelly
  Cc: dhowells, Alex Markuze, Viacheslav Dubeyko, Ilya Dryomov,
	Xiubo Li, Jeff Layton, ceph-devel, netfs, linux-fsdevel

Patrick Donnelly <pdonnell@redhat.com> wrote:

> On Wed, Dec 18, 2024 at 1:33 PM David Howells <dhowells@redhat.com> wrote:
> > Also, that would include doing things like content encryption, since that
> > is generally useful in filesystems and I have plans to support it in both
> > AFS and CIFS as well.
> 
> Would this be done with fscrypt? Can you expand on this part?

Since ceph already uses fscrypt, I would need to maintain that.

If you look here:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-crypt

I have patches to provide content crypto for netfslib (not upstream yet).  If
you look at the afs patch at the top, you can see my hacked up AFS testing
implementation - though it isn't a complete solution as I don't currently have
a way to store the actual EOF there[*].

I add two new methods, encrypt_block and decrypt_block, for netfslib to call
out to the filesystem to actually do the encryption.

On the write/encryption side, netfslib creates one or more subrequests to do
the encryption, allowing it to be done asynchronously (I have it offloading to
my Intel QAT card), collecting the completed encryption subreqs as they're
done and dispatching the write I/O subrequests to the server and/or the local
cache as the ciphertext becomes available.

[Note: one reason I'm doing this in netfslib is so that the same
content-encrypted data is written to both the server and the local cache]

Currently, I create one subreq per fscrypt block to be encrypted (e.g. 4KiB),
but that feels a bit on the heavy side and some throttling is probably
necessary.

Netfslib takes care of encrypting multipage folios that are spanned by
multiple crypto blocks, calling out to the filesystem for each.

How and where the filesystem does the crypto is up to the filesystem - it can
offload it to fscrypt.  It may be possible to have the filesystem load what it
needs into netfs_io_request for fscrypt to pick up and then make the method
pointers point directly to fscrypt.  I don't really want to make the netfslib
module directly dependent on fscrypt, though this could be done to improve
performance.


On the read/decryption side, netfslib currently expects the filesystem to
synchronously decrypt the data, though I should also look at making that
asynchronous now that I have patches to move read collection for a single
request into a single work item instead of a bunch of work items all
competing with each other.  One thing I want to avoid is scheduling async
decryption out of order just because the RPC ops finish out of order.

David

[*] Actually, I may have thought of a way to store the EOF in AFS today.  The
problem is that I have to round up the EOF to a full crypto block (e.g. 16 or
32 bytes for AES) but don't have anywhere to store the real EOF (no xattrs for
example).

So what I'm thinking is that I can just store a trailer of the crypto block
size of all zeros and then encrypt whatever part of it I need.  The trailer
will always be there (unless the file is zero-length), always be right after
the EOF and always be the same length, so the AFS fs can trivially calculate
the real EOF if it knows the algorithm used.

Further, AFS has a StoreData op that "atomically" does a truncate and write,
so maintaining a trailer is pretty straightforward.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Ceph and Netfslib
  2024-12-18 18:33 Ceph and Netfslib David Howells
  2024-12-18 18:47 ` Patrick Donnelly
  2024-12-18 19:06 ` Viacheslav Dubeyko
@ 2024-12-18 19:43 ` David Howells
  2025-03-05 16:34 ` Is EOLDSNAPC actually generated? -- " David Howells
  3 siblings, 0 replies; 22+ messages in thread
From: David Howells @ 2024-12-18 19:43 UTC (permalink / raw)
  Cc: dhowells, Alex Markuze, Viacheslav Dubeyko, Ilya Dryomov,
	Xiubo Li, Jeff Layton, ceph-devel, netfs, linux-fsdevel

David Howells <dhowells@redhat.com> wrote:

> 
> I don't know whether you know, but I'm working on netfslib-ising ceph with an
> eye to moving all the VFS/VM normal I/O interfaces to netfslib (->read_iter,
> ->write_iter, ->readahead, ->read_folio, ->page_mkwrite, ->writepages), though
> with wrapping/methods by which each network filesystem can add its own
> distinctive flavour.
> 
> Also, that would include doing things like content encryption, since that is
> generally useful in filesystems and I have plans to support it in both AFS and
> CIFS as well.

I should also mention that netfslib brings multipage folio support with it
too.  All the filesystem has to do is to flip the bit:

	mapping_set_large_folios(inode->i_mapping);

On reading, netfs will tile subrequests drawn from both the server and the
local cache across a sequence of multipage folios.  On writing, it will write
the data in parallel to both the server and the local cache, tiling
subrequests for both separately across the folios to be written.

Subrequests don't have to match folios in size or alignment and nor, when
writing, do subrequests going to the server have to match subrequests going to
the local cache.

Whilst on writing it only permits two parallel streams - one to the server and
one to the cache - this is just an array with a hard-coded size at the moment
and could be made expandable later to allow simultaneous writing to multiple
servers with different I/O characteristics.

David


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Ceph and Netfslib
  2024-12-18 19:06 ` Viacheslav Dubeyko
@ 2024-12-18 19:48   ` David Howells
  2024-12-23 23:13     ` Viacheslav Dubeyko
  0 siblings, 1 reply; 22+ messages in thread
From: David Howells @ 2024-12-18 19:48 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: dhowells, Alex Markuze, linux-fsdevel@vger.kernel.org,
	idryomov@gmail.com, Xiubo Li, jlayton@kernel.org,
	ceph-devel@vger.kernel.org, netfs@lists.linux.dev

Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:

> > Firstly, note that there may be a bug in ceph writeback cleanup as it stands.
> > It calls folio_detach_private() without holding the folio lock (it holds the
> > writeback lock, but that's not sufficient by MM rules).  This means you have a
> > race between { setting ->private, setting PG_private and inc refcount } on one
> > hand and { clearing ->private, clearing PG_private and dec refcount } on the
> > other.
> > 
> 
> I assume you mean the ceph_invalidate_folio() method. Am I correct here?

Actually, no, writepages_finish() is the culprit.

ceph_invalidate_folio() is called with the folio locked and can freely wrangle
folio->private.

> > Secondly, there's a counter, ci->i_wrbuffer_ref, that might actually be
> > redundant if we do it right as I_PINNING_NETFS_WB offers an alternative
> > way we might do things.  If we set this bit, ->write_inode() will be
> > called with wbc->unpinned_netfs_wb set when all currently dirty pages have
> > been cleaned up (see netfs_unpin_writeback()).  netfslib currently uses
> > this to pin the fscache objects but it could perhaps also be used to pin
> > the writeback cap for ceph.
> 
> Yeah, ci->i_wrbuffer_ref doesn't look like a very reliable programming
> pattern, and if we can do it another way, that would be a safer solution.
> However, this counter is used in multiple places in the ceph code, so we
> need to find a way to get rid of it that is safe and easy.
> 
> > 
> > Thirdly, I was under the impression that, for any given page/folio, only the
> > head snapshot could be altered - and that any older snapshot must be flushed
> > before we could allow that.
> > 
> > 
> > Fourthly, the ceph_snap_context struct holds a list of snaps.  Does it really
> > need to, or is just the most recent snap for which the folio holds changes
> > sufficient?
> > 
> 
> Let me dive into the implementation details. Maybe, Alex can share more
> details here.

Thanks.

David


^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: Ceph and Netfslib
  2024-12-18 19:48   ` David Howells
@ 2024-12-23 23:13     ` Viacheslav Dubeyko
  2024-12-24 12:56       ` Matthew Wilcox
  0 siblings, 1 reply; 22+ messages in thread
From: Viacheslav Dubeyko @ 2024-12-23 23:13 UTC (permalink / raw)
  To: David Howells
  Cc: Xiubo Li, ceph-devel@vger.kernel.org, Alex Markuze,
	linux-fsdevel@vger.kernel.org, jlayton@kernel.org,
	idryomov@gmail.com, netfs@lists.linux.dev

On Wed, 2024-12-18 at 19:48 +0000, David Howells wrote:
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > > 
<skipped>

> > > 
> > > Thirdly, I was under the impression that, for any given page/folio, only the
> > > head snapshot could be altered - and that any older snapshot must be flushed
> > > before we could allow that.
> > > 

As far as I can see, ceph_dirty_folio() attaches [1] a pointer to a
struct ceph_snap_context to folio->private. So it sounds like a folio
cannot have an associated snapshot context until it is marked dirty.

Conversely, ceph_invalidate_folio() detaches [2] the ceph_snap_context
from folio->private, writepage_nounlock() detaches [3] the
ceph_snap_context from the page, and writepages_finish() detaches [4]
the ceph_snap_context from the page. So, technically speaking, a
folio/page should have an associated snapshot context only while dirty.

The struct ceph_snap_context represents a set of existing snapshots:

struct ceph_snap_context {
	refcount_t nref;
	u64 seq;
	u32 num_snaps;
	u64 snaps[];
};

The snapshot context is prepared by build_snap_context() and the set of
existing snapshots includes: (1) the parent inode's snapshots [5], (2) the
inode's snapshots [6], and (3) prior parent snapshots [7].


 * When a snapshot is taken (that is, when the client receives
 * notification that a snapshot was taken), each inode with caps and
 * with dirty pages (dirty pages implies there is a cap) gets a new
 * ceph_cap_snap in the i_cap_snaps list (which is sorted in ascending
 * order, new snaps go to the tail).

So, ceph_dirty_folio() takes the latest ceph_cap_snap:

	if (__ceph_have_pending_cap_snap(ci)) {
		struct ceph_cap_snap *capsnap =
				list_last_entry(&ci->i_cap_snaps,
						struct ceph_cap_snap,
						ci_item);
		snapc = ceph_get_snap_context(capsnap->context);
		capsnap->dirty_pages++;
	} else {
		BUG_ON(!ci->i_head_snapc);
		snapc = ceph_get_snap_context(ci->i_head_snapc);
		++ci->i_wrbuffer_ref_head;
	}


 * On writeback, we must submit writes to the osd IN SNAP ORDER.  So,
 * we look for the first capsnap in i_cap_snaps and write out pages in
 * that snap context _only_.  Then we move on to the next capsnap,
 * eventually reaching the "live" or "head" context (i.e., pages that
 * are not yet snapped) and are writing the most recently dirtied
 * pages

For example, writepage_nounlock() executes such logic [8]:

	oldest = get_oldest_context(inode, &ceph_wbc, snapc);
	if (snapc->seq > oldest->seq) {
		doutc(cl, "%llx.%llx page %p snapc %p not writeable - noop\n",
		      ceph_vinop(inode), page, snapc);
		/* we should only noop if called by kswapd */
		WARN_ON(!(current->flags & PF_MEMALLOC));
		ceph_put_snap_context(oldest);
		redirty_page_for_writepage(wbc, page);
		return 0;
	}
	ceph_put_snap_context(oldest);

So, we should flush all dirty pages/folios in snapshot order. But I am
not sure that we modify a snapshot by making pages/folios dirty. I think
we simply add a capsnap to the list and build a new snapshot context when
a new snapshot is created.


> > > Fourthly, the ceph_snap_context struct holds a list of snaps.  Does it really
> > > need to, or is just the most recent snap for which the folio holds changes
> > > sufficient?
> > > 

As far as I can see, the main goal of ceph_snap_context is to account for
all the snapshots that cover a particular inode and all its parents. And
all of these could have dirty pages. So, the responsibility of
ceph_snap_context is to make sure that dirty folios/pages are flushed in
snapshot order for all inodes in the hierarchy.


I could be missing some details. :) But I hope the answer helps.

Thanks,
Slava.

[1] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/addr.c#L127
[2] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/addr.c#L157
[3] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/addr.c#L800
[4] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/addr.c#L911
[5] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/snap.c#L391
[6] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/snap.c#L399
[7] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/snap.c#L402
[8] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/addr.c#L695



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Ceph and Netfslib
  2024-12-23 23:13     ` Viacheslav Dubeyko
@ 2024-12-24 12:56       ` Matthew Wilcox
  2024-12-24 21:52         ` Viacheslav Dubeyko
  2025-01-09  0:53         ` Viacheslav Dubeyko
  0 siblings, 2 replies; 22+ messages in thread
From: Matthew Wilcox @ 2024-12-24 12:56 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: David Howells, Xiubo Li, ceph-devel@vger.kernel.org, Alex Markuze,
	linux-fsdevel@vger.kernel.org, jlayton@kernel.org,
	idryomov@gmail.com, netfs@lists.linux.dev

On Mon, Dec 23, 2024 at 11:13:47PM +0000, Viacheslav Dubeyko wrote:
>  * On writeback, we must submit writes to the osd IN SNAP ORDER.  So,
>  * we look for the first capsnap in i_cap_snaps and write out pages in
>  * that snap context _only_.  Then we move on to the next capsnap,
>  * eventually reaching the "live" or "head" context (i.e., pages that
>  * are not yet snapped) and are writing the most recently dirtied
>  * pages

Speaking of writeback, ceph doesn't need a writepage operation.  We're
removing ->writepage from filesystems in favour of using ->migrate_folio
for migration and ->writepages for writeback.  As far as I can tell,
filemap_migrate_folio() will be perfect for ceph (as the ceph_snap_context
contains no references to the address of the memory).  And ceph already
has a ->writepages.  So I think this patch should work.  Can you give it
a try?

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 85936f6d2bf7..5a5a870b6aee 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -810,32 +810,6 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
 	return err;
 }
 
-static int ceph_writepage(struct page *page, struct writeback_control *wbc)
-{
-	int err;
-	struct inode *inode = page->mapping->host;
-	BUG_ON(!inode);
-	ihold(inode);
-
-	if (wbc->sync_mode == WB_SYNC_NONE &&
-	    ceph_inode_to_fs_client(inode)->write_congested) {
-		redirty_page_for_writepage(wbc, page);
-		return AOP_WRITEPAGE_ACTIVATE;
-	}
-
-	folio_wait_private_2(page_folio(page)); /* [DEPRECATED] */
-
-	err = writepage_nounlock(page, wbc);
-	if (err == -ERESTARTSYS) {
-		/* direct memory reclaimer was killed by SIGKILL. return 0
-		 * to prevent caller from setting mapping/page error */
-		err = 0;
-	}
-	unlock_page(page);
-	iput(inode);
-	return err;
-}
-
 /*
  * async writeback completion handler.
  *
@@ -1584,7 +1558,6 @@ static int ceph_write_end(struct file *file, struct address_space *mapping,
 const struct address_space_operations ceph_aops = {
 	.read_folio = netfs_read_folio,
 	.readahead = netfs_readahead,
-	.writepage = ceph_writepage,
 	.writepages = ceph_writepages_start,
 	.write_begin = ceph_write_begin,
 	.write_end = ceph_write_end,
@@ -1592,6 +1565,7 @@ const struct address_space_operations ceph_aops = {
 	.invalidate_folio = ceph_invalidate_folio,
 	.release_folio = netfs_release_folio,
 	.direct_IO = noop_direct_IO,
+	.migrate_folio = filemap_migrate_folio,
 };
 
 static void ceph_block_sigs(sigset_t *oldset)

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* RE: Ceph and Netfslib
  2024-12-24 12:56       ` Matthew Wilcox
@ 2024-12-24 21:52         ` Viacheslav Dubeyko
  2025-01-09  0:53         ` Viacheslav Dubeyko
  1 sibling, 0 replies; 22+ messages in thread
From: Viacheslav Dubeyko @ 2024-12-24 21:52 UTC (permalink / raw)
  To: willy@infradead.org
  Cc: Xiubo Li, linux-fsdevel@vger.kernel.org,
	ceph-devel@vger.kernel.org, David Howells, netfs@lists.linux.dev,
	Alex Markuze, jlayton@kernel.org, idryomov@gmail.com

On Tue, 2024-12-24 at 12:56 +0000, Matthew Wilcox wrote:
> On Mon, Dec 23, 2024 at 11:13:47PM +0000, Viacheslav Dubeyko wrote:
> >  * On writeback, we must submit writes to the osd IN SNAP ORDER.  So,
> >  * we look for the first capsnap in i_cap_snaps and write out pages in
> >  * that snap context _only_.  Then we move on to the next capsnap,
> >  * eventually reaching the "live" or "head" context (i.e., pages that
> >  * are not yet snapped) and are writing the most recently dirtied
> >  * pages
> 
> Speaking of writeback, ceph doesn't need a writepage operation.  We're
> removing ->writepage from filesystems in favour of using ->migrate_folio
> for migration and ->writepages for writeback.  As far as I can tell,
> filemap_migrate_folio() will be perfect for ceph (as the ceph_snap_context
> contains no references to the address of the memory).  And ceph already
> has a ->writepages.  So I think this patch should work.  Can you give it
> a try?
> 

Sure. Let me spend some time for testing it.

Thanks,
Slava.

> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 85936f6d2bf7..5a5a870b6aee 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -810,32 +810,6 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
>  	return err;
>  }
>  
> -static int ceph_writepage(struct page *page, struct writeback_control *wbc)
> -{
> -	int err;
> -	struct inode *inode = page->mapping->host;
> -	BUG_ON(!inode);
> -	ihold(inode);
> -
> -	if (wbc->sync_mode == WB_SYNC_NONE &&
> -	    ceph_inode_to_fs_client(inode)->write_congested) {
> -		redirty_page_for_writepage(wbc, page);
> -		return AOP_WRITEPAGE_ACTIVATE;
> -	}
> -
> -	folio_wait_private_2(page_folio(page)); /* [DEPRECATED] */
> -
> -	err = writepage_nounlock(page, wbc);
> -	if (err == -ERESTARTSYS) {
> -		/* direct memory reclaimer was killed by SIGKILL. return 0
> -		 * to prevent caller from setting mapping/page error */
> -		err = 0;
> -	}
> -	unlock_page(page);
> -	iput(inode);
> -	return err;
> -}
> -
>  /*
>   * async writeback completion handler.
>   *
> @@ -1584,7 +1558,6 @@ static int ceph_write_end(struct file *file, struct address_space *mapping,
>  const struct address_space_operations ceph_aops = {
>  	.read_folio = netfs_read_folio,
>  	.readahead = netfs_readahead,
> -	.writepage = ceph_writepage,
>  	.writepages = ceph_writepages_start,
>  	.write_begin = ceph_write_begin,
>  	.write_end = ceph_write_end,
> @@ -1592,6 +1565,7 @@ const struct address_space_operations ceph_aops = {
>  	.invalidate_folio = ceph_invalidate_folio,
>  	.release_folio = netfs_release_folio,
>  	.direct_IO = noop_direct_IO,
> +	.migrate_folio = filemap_migrate_folio,
>  };
>  
>  static void ceph_block_sigs(sigset_t *oldset)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: Ceph and Netfslib
  2024-12-24 12:56       ` Matthew Wilcox
  2024-12-24 21:52         ` Viacheslav Dubeyko
@ 2025-01-09  0:53         ` Viacheslav Dubeyko
  1 sibling, 0 replies; 22+ messages in thread
From: Viacheslav Dubeyko @ 2025-01-09  0:53 UTC (permalink / raw)
  To: willy@infradead.org
  Cc: Xiubo Li, linux-fsdevel@vger.kernel.org,
	ceph-devel@vger.kernel.org, David Howells, netfs@lists.linux.dev,
	Alex Markuze, jlayton@kernel.org, idryomov@gmail.com

On Tue, 2024-12-24 at 12:56 +0000, Matthew Wilcox wrote:
> On Mon, Dec 23, 2024 at 11:13:47PM +0000, Viacheslav Dubeyko wrote:
> >  * On writeback, we must submit writes to the osd IN SNAP ORDER.  So,
> >  * we look for the first capsnap in i_cap_snaps and write out pages in
> >  * that snap context _only_.  Then we move on to the next capsnap,
> >  * eventually reaching the "live" or "head" context (i.e., pages that
> >  * are not yet snapped) and are writing the most recently dirtied
> >  * pages
> 
> Speaking of writeback, ceph doesn't need a writepage operation.  We're
> removing ->writepage from filesystems in favour of using ->migrate_folio
> for migration and ->writepages for writeback.  As far as I can tell,
> filemap_migrate_folio() will be perfect for ceph (as the ceph_snap_context
> contains no references to the address of the memory).  And ceph already
> has a ->writepages.  So I think this patch should work.  Can you give it
> a try?
> 

Sorry for some delay.

I did the testing of this modification. As far as I can see, both the
ceph-related and the generic xfstests go through ceph_writepages_start()
(writepages). Even if I create a small file (<= 4096 bytes), it is
processed by ceph_writepages_start(). So, as far as I can see, this
modification should be safe enough. Running xfstests didn't reveal any
critical issues related to the writepage family in Ceph. If I am missing
something, I am ready to run additional testing.

Thanks,
Slava.

> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 85936f6d2bf7..5a5a870b6aee 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -810,32 +810,6 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
>  	return err;
>  }
>  
> -static int ceph_writepage(struct page *page, struct writeback_control *wbc)
> -{
> -	int err;
> -	struct inode *inode = page->mapping->host;
> -	BUG_ON(!inode);
> -	ihold(inode);
> -
> -	if (wbc->sync_mode == WB_SYNC_NONE &&
> -	    ceph_inode_to_fs_client(inode)->write_congested) {
> -		redirty_page_for_writepage(wbc, page);
> -		return AOP_WRITEPAGE_ACTIVATE;
> -	}
> -
> -	folio_wait_private_2(page_folio(page)); /* [DEPRECATED] */
> -
> -	err = writepage_nounlock(page, wbc);
> -	if (err == -ERESTARTSYS) {
> -		/* direct memory reclaimer was killed by SIGKILL. return 0
> -		 * to prevent caller from setting mapping/page error */
> -		err = 0;
> -	}
> -	unlock_page(page);
> -	iput(inode);
> -	return err;
> -}
> -
>  /*
>   * async writeback completion handler.
>   *
> @@ -1584,7 +1558,6 @@ static int ceph_write_end(struct file *file, struct address_space *mapping,
>  const struct address_space_operations ceph_aops = {
>  	.read_folio = netfs_read_folio,
>  	.readahead = netfs_readahead,
> -	.writepage = ceph_writepage,
>  	.writepages = ceph_writepages_start,
>  	.write_begin = ceph_write_begin,
>  	.write_end = ceph_write_end,
> @@ -1592,6 +1565,7 @@ const struct address_space_operations ceph_aops = {
>  	.invalidate_folio = ceph_invalidate_folio,
>  	.release_folio = netfs_release_folio,
>  	.direct_IO = noop_direct_IO,
> +	.migrate_folio = filemap_migrate_folio,
>  };
>  
>  static void ceph_block_sigs(sigset_t *oldset)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Is EOLDSNAPC actually generated? -- Re: Ceph and Netfslib
  2024-12-18 18:33 Ceph and Netfslib David Howells
                   ` (2 preceding siblings ...)
  2024-12-18 19:43 ` David Howells
@ 2025-03-05 16:34 ` David Howells
  2025-03-05 19:23   ` Alex Markuze
  3 siblings, 1 reply; 22+ messages in thread
From: David Howells @ 2025-03-05 16:34 UTC (permalink / raw)
  To: Alex Markuze, Viacheslav Dubeyko, Ilya Dryomov
  Cc: dhowells, Xiubo Li, Jeff Layton, ceph-devel, netfs, linux-fsdevel

Hi Alex, Slava,

I'm looking at making a ceph_netfs_write_iter() to handle writing to ceph
files through netfs.  One thing I'm wondering about is the way
ceph_write_iter() handles EOLDSNAPC.  In this case, it goes back to
retry_snap and renegotiates the caps (amongst other things).  Firstly, does it
actually need to do this?  And, secondly, I can't seem to find anything that
actually generates EOLDSNAPC (or anything relevant that generates ERESTART).

Is it possible that we could get rid of the code that handles EOLDSNAPC?

David



* Re: Is EOLDSNAPC actually generated? -- Re: Ceph and Netfslib
  2025-03-05 16:34 ` Is EOLDSNAPC actually generated? -- " David Howells
@ 2025-03-05 19:23   ` Alex Markuze
  2025-03-05 20:22     ` David Howells
                       ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Alex Markuze @ 2025-03-05 19:23 UTC (permalink / raw)
  To: David Howells
  Cc: Viacheslav Dubeyko, Ilya Dryomov, Xiubo Li, Jeff Layton,
	ceph-devel, netfs, linux-fsdevel, Gregory Farnum, Venky Shankar,
	Patrick Donnelly

That's a good point. Though there is no code on the client that can
generate this error, I'm not convinced that it can't be received from
the OSD or the MDS. I would rather have some MDS experts chime in
before taking any drastic measures.

+ Greg, Venky, Patrick

On Wed, Mar 5, 2025 at 6:34 PM David Howells <dhowells@redhat.com> wrote:
>
> Hi Alex, Slava,
>
> I'm looking at making a ceph_netfs_write_iter() to handle writing to ceph
> files through netfs.  One thing I'm wondering about is the way
> ceph_write_iter() handles EOLDSNAPC.  In this case, it goes back to
> retry_snap and renegotiates the caps (amongst other things).  Firstly, does it
> actually need to do this?  And, secondly, I can't seem to find anything that
> actually generates EOLDSNAPC (or anything relevant that generates ERESTART).
>
> Is it possible that we could get rid of the code that handles EOLDSNAPC?
>
> David
>



* Re: Is EOLDSNAPC actually generated? -- Re: Ceph and Netfslib
  2025-03-05 19:23   ` Alex Markuze
@ 2025-03-05 20:22     ` David Howells
  2025-03-06 13:19       ` Alex Markuze
  2025-03-06 13:58     ` Venky Shankar
  2025-03-06 15:55     ` David Howells
  2 siblings, 1 reply; 22+ messages in thread
From: David Howells @ 2025-03-05 20:22 UTC (permalink / raw)
  To: Alex Markuze
  Cc: dhowells, Viacheslav Dubeyko, Ilya Dryomov, Xiubo Li, Jeff Layton,
	ceph-devel, netfs, linux-fsdevel, Gregory Farnum, Venky Shankar,
	Patrick Donnelly

Alex Markuze <amarkuze@redhat.com> wrote:

> That's a good point, though there is no code on the client that can
> generate this error, I'm not convinced that this error can't be
> received from the OSD or the MDS. I would rather some MDS experts
> chime in, before taking any drastic measures.
> 
> + Greg, Venky, Patrik

Note that the value of EOLDSNAPC is different on different arches, so it
probably can't be simply cast from a network integer.

David



* Re: Is EOLDSNAPC actually generated? -- Re: Ceph and Netfslib
  2025-03-05 20:22     ` David Howells
@ 2025-03-06 13:19       ` Alex Markuze
  2025-03-06 13:48         ` David Howells
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Markuze @ 2025-03-06 13:19 UTC (permalink / raw)
  To: David Howells
  Cc: Viacheslav Dubeyko, Ilya Dryomov, Xiubo Li, Jeff Layton,
	ceph-devel, netfs, linux-fsdevel, Gregory Farnum, Venky Shankar,
	Patrick Donnelly

Yes, that won't work on sparc/parisc/alpha and mips.
Both the object storage daemon (OSD) and the metadata server (MDS) may
return code 85 for a client's request.
Notice in this example that the rc value is taken from the request
struct, which in turn was deserialised from the network:

static void ceph_aio_complete_req(struct ceph_osd_request *req)
{
	int rc = req->r_result;

On Wed, Mar 5, 2025 at 10:22 PM David Howells <dhowells@redhat.com> wrote:
>
> Alex Markuze <amarkuze@redhat.com> wrote:
>
> > That's a good point, though there is no code on the client that can
> > generate this error, I'm not convinced that this error can't be
> > received from the OSD or the MDS. I would rather some MDS experts
> > chime in, before taking any drastic measures.
> >
> > + Greg, Venky, Patrik
>
> Note that the value of EOLDSNAPC is different on different arches, so it
> probably can't be simply cast from a network integer.
>
> David
>



* Re: Is EOLDSNAPC actually generated? -- Re: Ceph and Netfslib
  2025-03-06 13:19       ` Alex Markuze
@ 2025-03-06 13:48         ` David Howells
  2025-03-06 13:55           ` Alex Markuze
  0 siblings, 1 reply; 22+ messages in thread
From: David Howells @ 2025-03-06 13:48 UTC (permalink / raw)
  To: Alex Markuze
  Cc: dhowells, Viacheslav Dubeyko, Ilya Dryomov, Xiubo Li, Jeff Layton,
	ceph-devel, netfs, linux-fsdevel, Gregory Farnum, Venky Shankar,
	Patrick Donnelly

Alex Markuze <amarkuze@redhat.com> wrote:

> Yes, that won't work on sparc/parsic/alpha and mips.
> Both the Block device server and the meta data server may return a
> code 85 to a client's request.
> Notice in this example that the rc value is taken from the request
> struct which in turn was serialised from the network.
> 
> static void ceph_aio_complete_req(struct ceph_osd_request *req)
> {
> int rc = req->r_result;

The term "ewww" springs to mind :-)

Does that mean that EOLDSNAPC can arrive via this function?

David



* Re: Is EOLDSNAPC actually generated? -- Re: Ceph and Netfslib
  2025-03-06 13:48         ` David Howells
@ 2025-03-06 13:55           ` Alex Markuze
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Markuze @ 2025-03-06 13:55 UTC (permalink / raw)
  To: David Howells
  Cc: Viacheslav Dubeyko, Ilya Dryomov, Xiubo Li, Jeff Layton,
	ceph-devel, netfs, linux-fsdevel, Gregory Farnum, Venky Shankar,
	Patrick Donnelly

That's just one example; presumably any request that is deserialised
from the network may contain this error code.
Specifically, the ser/des of the messages happens in the net/ceph
(libceph) code, and fs/ceph handles the fs logic.

On Thu, Mar 6, 2025 at 3:48 PM David Howells <dhowells@redhat.com> wrote:
>
> Alex Markuze <amarkuze@redhat.com> wrote:
>
> > Yes, that won't work on sparc/parsic/alpha and mips.
> > Both the Block device server and the meta data server may return a
> > code 85 to a client's request.
> > Notice in this example that the rc value is taken from the request
> > struct which in turn was serialised from the network.
> >
> > static void ceph_aio_complete_req(struct ceph_osd_request *req)
> > {
> > int rc = req->r_result;
>
> The term "ewww" springs to mind :-)
>
> Does that mean that EOLDSNAPC can arrive by this function?
>
> David
>



* Re: Is EOLDSNAPC actually generated? -- Re: Ceph and Netfslib
  2025-03-05 19:23   ` Alex Markuze
  2025-03-05 20:22     ` David Howells
@ 2025-03-06 13:58     ` Venky Shankar
  2025-03-06 14:13       ` David Howells
  2025-03-06 15:55     ` David Howells
  2 siblings, 1 reply; 22+ messages in thread
From: Venky Shankar @ 2025-03-06 13:58 UTC (permalink / raw)
  To: Alex Markuze
  Cc: David Howells, Viacheslav Dubeyko, Ilya Dryomov, Xiubo Li,
	Jeff Layton, ceph-devel, netfs, linux-fsdevel, Gregory Farnum,
	Patrick Donnelly

On Thu, Mar 6, 2025 at 12:54 AM Alex Markuze <amarkuze@redhat.com> wrote:
>
> That's a good point, though there is no code on the client that can
> generate this error, I'm not convinced that this error can't be
> received from the OSD or the MDS. I would rather some MDS experts
> chime in, before taking any drastic measures.

The OSDs could possibly return this to the client, so I don't think it
can be done away with.

>
> + Greg, Venky, Patrik
>
> On Wed, Mar 5, 2025 at 6:34 PM David Howells <dhowells@redhat.com> wrote:
> >
> > Hi Alex, Slava,
> >
> > I'm looking at making a ceph_netfs_write_iter() to handle writing to ceph
> > files through netfs.  One thing I'm wondering about is the way
> > ceph_write_iter() handles EOLDSNAPC.  In this case, it goes back to
> > retry_snap and renegotiates the caps (amongst other things).  Firstly, does it
> > actually need to do this?  And, secondly, I can't seem to find anything that
> > actually generates EOLDSNAPC (or anything relevant that generates ERESTART).
> >
> > Is it possible that we could get rid of the code that handles EOLDSNAPC?
> >
> > David
> >
>


-- 
Cheers,
Venky



* Re: Is EOLDSNAPC actually generated? -- Re: Ceph and Netfslib
  2025-03-06 13:58     ` Venky Shankar
@ 2025-03-06 14:13       ` David Howells
  2025-03-06 14:23         ` Alex Markuze
  2025-03-06 16:21         ` Gregory Farnum
  0 siblings, 2 replies; 22+ messages in thread
From: David Howells @ 2025-03-06 14:13 UTC (permalink / raw)
  To: Venky Shankar
  Cc: dhowells, Alex Markuze, Viacheslav Dubeyko, Ilya Dryomov,
	Xiubo Li, Jeff Layton, ceph-devel, netfs, linux-fsdevel,
	Gregory Farnum, Patrick Donnelly

Venky Shankar <vshankar@redhat.com> wrote:

> > That's a good point, though there is no code on the client that can
> > generate this error, I'm not convinced that this error can't be
> > received from the OSD or the MDS. I would rather some MDS experts
> > chime in, before taking any drastic measures.
> 
> The OSDs could possibly return this to the client, so I don't think it
> can be done away with.

Okay... but then I think ceph has a bug in that it assumes the error
codes on the wire are consistent between arches, as discussed with
Alex.  I think you need to interpose a mapping table.

David



* Re: Is EOLDSNAPC actually generated? -- Re: Ceph and Netfslib
  2025-03-06 14:13       ` David Howells
@ 2025-03-06 14:23         ` Alex Markuze
  2025-03-06 16:21         ` Gregory Farnum
  1 sibling, 0 replies; 22+ messages in thread
From: Alex Markuze @ 2025-03-06 14:23 UTC (permalink / raw)
  To: David Howells
  Cc: Venky Shankar, Viacheslav Dubeyko, Ilya Dryomov, Xiubo Li,
	Jeff Layton, ceph-devel, netfs, linux-fsdevel, Gregory Farnum,
	Patrick Donnelly

In cases like these it's definitely best to rely on protocol-specific
error codes rather than system error codes.

I'll do a refactor.

On Thu, Mar 6, 2025 at 4:13 PM David Howells <dhowells@redhat.com> wrote:
>
> Venky Shankar <vshankar@redhat.com> wrote:
>
> > > That's a good point, though there is no code on the client that can
> > > generate this error, I'm not convinced that this error can't be
> > > received from the OSD or the MDS. I would rather some MDS experts
> > > chime in, before taking any drastic measures.
> >
> > The OSDs could possibly return this to the client, so I don't think it
> > can be done away with.
>
> Okay... but then I think ceph has a bug in that you're assuming that the error
> codes on the wire are consistent between arches as mentioned with Alex.  I
> think you need to interject a mapping table.
>
> David
>



* Re: Ceph and Netfslib
  2025-03-05 19:23   ` Alex Markuze
  2025-03-05 20:22     ` David Howells
  2025-03-06 13:58     ` Venky Shankar
@ 2025-03-06 15:55     ` David Howells
  2 siblings, 0 replies; 22+ messages in thread
From: David Howells @ 2025-03-06 15:55 UTC (permalink / raw)
  To: Alex Markuze
  Cc: dhowells, Viacheslav Dubeyko, Ilya Dryomov, Xiubo Li, Jeff Layton,
	ceph-devel, netfs, linux-fsdevel, Gregory Farnum, Venky Shankar,
	Patrick Donnelly

Does ceph_write_iter() actually need to drop the inode I/O lock in order to
handle EOLDSNAPC?  I'm wondering if I can deal with it in the netfs request
retry code - but that means dealing with it whilst the I/O lock is held.

David



* Re: Is EOLDSNAPC actually generated? -- Re: Ceph and Netfslib
  2025-03-06 14:13       ` David Howells
  2025-03-06 14:23         ` Alex Markuze
@ 2025-03-06 16:21         ` Gregory Farnum
  2025-03-06 17:18           ` Alex Markuze
  1 sibling, 1 reply; 22+ messages in thread
From: Gregory Farnum @ 2025-03-06 16:21 UTC (permalink / raw)
  To: David Howells
  Cc: Venky Shankar, Alex Markuze, Viacheslav Dubeyko, Ilya Dryomov,
	Xiubo Li, Jeff Layton, ceph-devel, netfs, linux-fsdevel,
	Patrick Donnelly

On Thu, Mar 6, 2025 at 6:13 AM David Howells <dhowells@redhat.com> wrote:
>
> Venky Shankar <vshankar@redhat.com> wrote:
>
> > > That's a good point, though there is no code on the client that can
> > > generate this error, I'm not convinced that this error can't be
> > > received from the OSD or the MDS. I would rather some MDS experts
> > > chime in, before taking any drastic measures.
> >
> > The OSDs could possibly return this to the client, so I don't think it
> > can be done away with.
>
> Okay... but then I think ceph has a bug in that you're assuming that the error
> codes on the wire are consistent between arches as mentioned with Alex.  I
> think you need to interject a mapping table.

Without looking at the kernel code: Ceph in general wraps all error
codes into a defined arch-neutral endianness for the wire protocol and
unwraps them into the architecture-native format when decoding. Is
that not happening here? It should happen transparently as part of the
network decoding; when I look in fs/ceph/file.c the usage seems fine to
me, and include/linux/ceph/decode.h is full of functions that specify
"le" and translate that to the cpu, so it seems fine.
And yes, the OSD can return EOLDSNAPC if the client is out of date
(and certain other conditions are true).
-Greg



* Re: Is EOLDSNAPC actually generated? -- Re: Ceph and Netfslib
  2025-03-06 16:21         ` Gregory Farnum
@ 2025-03-06 17:18           ` Alex Markuze
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Markuze @ 2025-03-06 17:18 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: David Howells, Venky Shankar, Viacheslav Dubeyko, Ilya Dryomov,
	Xiubo Li, Jeff Layton, ceph-devel, netfs, linux-fsdevel,
	Patrick Donnelly

It's not about endianness. It's just that some Linux arches define the
error codes behind EOLDSNAPC/ERESTART as different numbers.

On Thu, Mar 6, 2025 at 6:22 PM Gregory Farnum <gfarnum@redhat.com> wrote:
>
> On Thu, Mar 6, 2025 at 6:13 AM David Howells <dhowells@redhat.com> wrote:
> >
> > Venky Shankar <vshankar@redhat.com> wrote:
> >
> > > > That's a good point, though there is no code on the client that can
> > > > generate this error, I'm not convinced that this error can't be
> > > > received from the OSD or the MDS. I would rather some MDS experts
> > > > chime in, before taking any drastic measures.
> > >
> > > The OSDs could possibly return this to the client, so I don't think it
> > > can be done away with.
> >
> > Okay... but then I think ceph has a bug in that you're assuming that the error
> > codes on the wire are consistent between arches as mentioned with Alex.  I
> > think you need to interject a mapping table.
>
> Without looking at the kernel code, Ceph in general wraps all error
> codes to a defined arch-neutral endianness for the wire protocol and
> unwraps them into the architecture-native format when decoding. Is
> that not happening here? It should happen transparently as part of the
> network decoding, so when I look in fs/ceph/file.c the usage seems
> fine to me, and I see include/linux/ceph/decode.h is full of functions
> that specify "le" and translating that to the cpu, so it seems fine.
> And yes, the OSD can return EOLDSNAPC if the client is out of date
> (and certain other conditions are true).
> -Greg
>



end of thread, other threads:[~2025-03-06 17:19 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-18 18:33 Ceph and Netfslib David Howells
2024-12-18 18:47 ` Patrick Donnelly
2024-12-18 19:36   ` David Howells
2024-12-18 19:06 ` Viacheslav Dubeyko
2024-12-18 19:48   ` David Howells
2024-12-23 23:13     ` Viacheslav Dubeyko
2024-12-24 12:56       ` Matthew Wilcox
2024-12-24 21:52         ` Viacheslav Dubeyko
2025-01-09  0:53         ` Viacheslav Dubeyko
2024-12-18 19:43 ` David Howells
2025-03-05 16:34 ` Is EOLDSNAPC actually generated? -- " David Howells
2025-03-05 19:23   ` Alex Markuze
2025-03-05 20:22     ` David Howells
2025-03-06 13:19       ` Alex Markuze
2025-03-06 13:48         ` David Howells
2025-03-06 13:55           ` Alex Markuze
2025-03-06 13:58     ` Venky Shankar
2025-03-06 14:13       ` David Howells
2025-03-06 14:23         ` Alex Markuze
2025-03-06 16:21         ` Gregory Farnum
2025-03-06 17:18           ` Alex Markuze
2025-03-06 15:55     ` David Howells
