public inbox for linux-fsdevel@vger.kernel.org
From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
To: David Howells <dhowells@redhat.com>
Cc: Xiubo Li <xiubli@redhat.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>,
	Alex Markuze <amarkuze@redhat.com>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"jlayton@kernel.org" <jlayton@kernel.org>,
	"idryomov@gmail.com" <idryomov@gmail.com>,
	"netfs@lists.linux.dev" <netfs@lists.linux.dev>
Subject: RE: Ceph and Netfslib
Date: Mon, 23 Dec 2024 23:13:47 +0000
Message-ID: <690826facef0310d7f44cf522deeed979b6ff287.camel@ibm.com>
In-Reply-To: <3992139.1734551286@warthog.procyon.org.uk>

On Wed, 2024-12-18 at 19:48 +0000, David Howells wrote:
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > > 
<skipped>

> > >
> > > Thirdly, I was under the impression that, for any given page/folio,
> > > only the head snapshot could be altered - and that any older snapshot
> > > must be flushed before we could allow that.
> > >

As far as I can see, ceph_dirty_folio() attaches [1] a pointer to a
struct ceph_snap_context to folio->private. So it seems that a folio
has no associated snapshot context until it is marked dirty.

Conversely, ceph_invalidate_folio() detaches [2] the ceph_snap_context
from folio->private, and writepage_nounlock() [3] and
writepages_finish() [4] detach the ceph_snap_context from the page. So,
technically speaking, a folio/page should have an associated snapshot
context only while it is dirty.
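
To make this lifecycle concrete, below is a minimal userspace model
(not kernel code). It assumes a folio reduced to a dirty flag plus a
private pointer, and a snap context reduced to a plain int refcount and
a seq. model_folio, model_snapc, model_dirty_folio() and
model_clean_folio() are hypothetical names standing in for struct
folio, struct ceph_snap_context, ceph_dirty_folio() and the detach done
on invalidation or writeback completion. Locking and inode-level dirty
accounting are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct ceph_snap_context. */
struct model_snapc {
	int nref;          /* plain int instead of refcount_t */
	uint64_t seq;
};

/* Hypothetical, simplified stand-in for struct folio. */
struct model_folio {
	bool dirty;
	void *private;     /* holds a struct model_snapc * only while dirty */
};

struct model_snapc *model_snapc_get(struct model_snapc *sc)
{
	sc->nref++;
	return sc;
}

void model_snapc_put(struct model_snapc *sc)
{
	if (--sc->nref == 0)
		free(sc);
}

/* Dirtying: take a reference and attach the context to ->private. */
void model_dirty_folio(struct model_folio *folio, struct model_snapc *sc)
{
	if (folio->dirty)
		return;    /* already dirty: a context is already attached */
	folio->private = model_snapc_get(sc);
	folio->dirty = true;
}

/* Invalidate or writeback completion: detach and drop the reference. */
void model_clean_folio(struct model_folio *folio)
{
	struct model_snapc *sc = folio->private;

	if (!folio->dirty)
		return;
	folio->private = NULL;
	folio->dirty = false;
	model_snapc_put(sc);
}
```

Marking an already-dirty folio dirty again does not take a second
reference, which matches the early return for already-dirty folios in
ceph_dirty_folio().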

The struct ceph_snap_context represents a set of existing snapshots:

struct ceph_snap_context {
	refcount_t nref;
	u64 seq;
	u32 num_snaps;
	u64 snaps[];
};

The snapshot context is built by build_snap_context(), and the set of
existing snapshots includes: (1) the parent realm's snapshots [5], (2)
the snapshots of the inode's own realm [6], and (3) prior parent
snapshots [7].
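
A rough sketch of how the three sources could be merged into one
vector. It assumes (from my reading of build_snap_context()) that a
parent snap is kept only if it is not older than the point where the
parent became the parent (parent_since), that the result is sorted
newest-first, and that seq is the larger of the realm's and the
parent's seq. model_build_snapc() and its parameters are hypothetical;
the real function works on struct ceph_snap_realm and caches the
result.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Same shape as struct ceph_snap_context, with an int refcount. */
struct model_snapc {
	int nref;
	uint64_t seq;
	uint32_t num_snaps;
	uint64_t snaps[];          /* flexible array member */
};

/* Sort newest (largest) snap id first. */
int cmp_u64_rev(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x < y) - (x > y);
}

struct model_snapc *model_build_snapc(uint64_t realm_seq,
		const uint64_t *parent_snaps, uint32_t n_parent,
		uint64_t parent_seq, uint64_t parent_since,
		const uint64_t *own, uint32_t n_own,
		const uint64_t *prior, uint32_t n_prior)
{
	uint32_t max = n_parent + n_own + n_prior, num = 0;
	struct model_snapc *sc = malloc(sizeof(*sc) + max * sizeof(uint64_t));

	if (!sc)
		return NULL;
	sc->nref = 1;
	sc->seq = realm_seq;
	/* (1) parent snaps taken after the parent became our parent */
	for (uint32_t i = 0; i < n_parent; i++)
		if (parent_snaps[i] >= parent_since)
			sc->snaps[num++] = parent_snaps[i];
	if (parent_seq > sc->seq)
		sc->seq = parent_seq;
	/* (2) the realm's own snaps */
	if (n_own)
		memcpy(sc->snaps + num, own, n_own * sizeof(uint64_t));
	num += n_own;
	/* (3) prior parent snaps */
	if (n_prior)
		memcpy(sc->snaps + num, prior, n_prior * sizeof(uint64_t));
	num += n_prior;
	qsort(sc->snaps, num, sizeof(uint64_t), cmp_u64_rev);
	sc->num_snaps = num;
	return sc;
}
```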


 * When a snapshot is taken (that is, when the client receives
 * notification that a snapshot was taken), each inode with caps and
 * with dirty pages (dirty pages implies there is a cap) gets a new
 * ceph_cap_snap in the i_cap_snaps list (which is sorted in ascending
 * order, new snaps go to the tail).

So, ceph_dirty_folio() takes the context of the latest ceph_cap_snap if
a cap snap is pending, and the head snap context otherwise:

	if (__ceph_have_pending_cap_snap(ci)) {
		struct ceph_cap_snap *capsnap =
				list_last_entry(&ci->i_cap_snaps,
						struct ceph_cap_snap,
						ci_item);
		snapc = ceph_get_snap_context(capsnap->context);
		capsnap->dirty_pages++;
	} else {
		BUG_ON(!ci->i_head_snapc);
		snapc = ceph_get_snap_context(ci->i_head_snapc);
		++ci->i_wrbuffer_ref_head;
	}


 * On writeback, we must submit writes to the osd IN SNAP ORDER.  So,
 * we look for the first capsnap in i_cap_snaps and write out pages in
 * that snap context _only_.  Then we move on to the next capsnap,
 * eventually reaching the "live" or "head" context (i.e., pages that
 * are not yet snapped) and are writing the most recently dirtied
 * pages.

For example, writepage_nounlock() implements this logic [8]:

	oldest = get_oldest_context(inode, &ceph_wbc, snapc);
	if (snapc->seq > oldest->seq) {
		doutc(cl, "%llx.%llx page %p snapc %p not writeable - noop\n",
		      ceph_vinop(inode), page, snapc);
		/* we should only noop if called by kswapd */
		WARN_ON(!(current->flags & PF_MEMALLOC));
		ceph_put_snap_context(oldest);
		redirty_page_for_writepage(wbc, page);
		return 0;
	}
	ceph_put_snap_context(oldest);
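
The net effect of this check, repeated over writeback passes, can be
sketched as below. This is a simplified model, not the kernel
algorithm: contexts are identified only by seq, the oldest context is
taken to be the first one (in ascending order) that still has dirty
folios, and folios in newer contexts are simply left dirty for a later
pass, standing in for redirty_page_for_writepage(). model_oldest_seq()
and model_flush() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct model_folio {
	uint64_t snapc_seq;   /* seq of the context attached when dirtied */
	bool dirty;
};

/* Oldest context (ascending scan) that still has dirty folios; 0 if none. */
uint64_t model_oldest_seq(const uint64_t *ctx_seqs, size_t nctx,
			  const struct model_folio *folios, size_t n)
{
	for (size_t c = 0; c < nctx; c++)
		for (size_t i = 0; i < n; i++)
			if (folios[i].dirty && folios[i].snapc_seq == ctx_seqs[c])
				return ctx_seqs[c];
	return 0;
}

/*
 * Each pass writes only folios in the oldest context and skips newer
 * ones (the "snapc->seq > oldest->seq" case). Folio indices are stored
 * in out[] in the order they are written; returns how many were written.
 */
size_t model_flush(const uint64_t *ctx_seqs, size_t nctx,
		   struct model_folio *folios, size_t n, size_t *out)
{
	size_t written = 0;
	uint64_t oldest;

	while ((oldest = model_oldest_seq(ctx_seqs, nctx, folios, n)) != 0) {
		for (size_t i = 0; i < n; i++) {
			if (!folios[i].dirty)
				continue;
			if (folios[i].snapc_seq > oldest)
				continue;          /* not writeable yet: stays dirty */
			folios[i].dirty = false;   /* "submitted" to the OSD */
			out[written++] = i;
		}
	}
	return written;
}
```

With two capsnaps (seq 10 and 20) and the head (seq 30), folios dirtied
under seq 10 are written first, then those under 20, then the head
ones, regardless of their position in the file.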

So, we should flush all dirty pages/folios in snapshot order. But I am
not sure that we modify a snapshot by making pages/folios dirty. I
think we simply add a capsnap to the list and create a new snapshot
context when a new snapshot is created.
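
To illustrate that taking a snapshot adds a capsnap instead of
modifying existing snapshot contexts, here is a simplified model. It
assumes that the old head context is handed over to a new capsnap
appended at the tail, that the head dirty-page count moves with it, and
that the inode then gets a new head context; the kernel's actual
bookkeeping has more cases (for example, pending sync writes).
model_take_snapshot() and the model_* types are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_CAPSNAPS 8

struct model_snapc {
	int nref;
	uint64_t seq;
};

struct model_capsnap {
	struct model_snapc *context;
	int dirty_pages;
};

struct model_inode {
	/* ascending order, new snaps go to the tail */
	struct model_capsnap cap_snaps[MODEL_MAX_CAPSNAPS];
	int num_cap_snaps;
	struct model_snapc *head_snapc;
	int wrbuffer_ref_head;
};

/*
 * The old head context moves into a new capsnap at the tail together
 * with the head dirty count, and the inode gets a new head context.
 * Folios keep whatever context they already point to.
 */
int model_take_snapshot(struct model_inode *ci, struct model_snapc *new_head)
{
	struct model_capsnap *cs;

	if (ci->num_cap_snaps == MODEL_MAX_CAPSNAPS)
		return -1;                    /* model limit, not a kernel one */
	cs = &ci->cap_snaps[ci->num_cap_snaps++];
	cs->context = ci->head_snapc;         /* head's reference handed over */
	cs->dirty_pages = ci->wrbuffer_ref_head;
	ci->wrbuffer_ref_head = 0;
	ci->head_snapc = new_head;
	new_head->nref++;
	return 0;
}
```

Folios dirtied before the snapshot keep pointing at the old context, so
nothing in an existing snapshot context is rewritten.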


> > > Fourthly, the ceph_snap_context struct holds a list of snaps. Does
> > > it really need to, or is just the most recent snap for which the
> > > folio holds changes sufficient?
> > >

As far as I can see, the main goal of ceph_snap_context is to track all
the snapshots that apply to a particular inode and all its parents, and
all of these could have dirty pages. So, the responsibility of
ceph_snap_context is to make sure that dirty folios/pages are flushed
in snapshot order for all inodes in the hierarchy.


I could be missing some details. :) But I hope this answer helps.

Thanks,
Slava.

[1] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/addr.c#L127
[2] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/addr.c#L157
[3] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/addr.c#L800
[4] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/addr.c#L911
[5] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/snap.c#L391
[6] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/snap.c#L399
[7] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/snap.c#L402
[8] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/addr.c#L695


