From: Dan Williams <dan.j.williams@intel.com>
To: Alistair Popple <apopple@nvidia.com>, Dave Chinner <david@fromorbit.com>
Cc: Dan Williams <dan.j.williams@intel.com>, <linux-mm@kvack.org>,
<jhubbard@nvidia.com>, <rcampbell@nvidia.com>,
<willy@infradead.org>, <jgg@nvidia.com>,
<linux-fsdevel@vger.kernel.org>, <jack@suse.cz>,
<djwong@kernel.org>, <hch@lst.de>, <david@redhat.com>,
<ruansy.fnst@fujitsu.com>
Subject: Re: ZONE_DEVICE refcounting
Date: Thu, 21 Mar 2024 23:58:44 -0700 [thread overview]
Message-ID: <65fd2c2448dbf_2690d294fe@dwillia2-mobl3.amr.corp.intel.com.notmuch> (raw)
In-Reply-To: <87v85e6cj2.fsf@nvdebian.thelocal>
Alistair Popple wrote:
>
> Dave Chinner <david@fromorbit.com> writes:
>
> > On Fri, Mar 22, 2024 at 11:01:25AM +1100, Alistair Popple wrote:
> >>
> >> Dan Williams <dan.j.williams@intel.com> writes:
> >>
> >> > Alistair Popple wrote:
> >> >>
> >> >> Alistair Popple <apopple@nvidia.com> writes:
> >> >>
> >> >> > Dan Williams <dan.j.williams@intel.com> writes:
> >> >> >
> >> >> >> Alistair Popple wrote:
> >> >> >
> >> >> > I also noticed folio_anon() is not safe to call on a FS DAX page due to
> >> >> > sharing PAGE_MAPPING_DAX_SHARED.
> >> >>
> >> >> Also it feels like I could be missing something here. AFAICT the
> >> >> page->mapping and page->index fields can't actually be used outside of
> >> >> fs/dax because they are overloaded for the shared case. Therefore
> >> >> setting/clearing them could be skipped and the only reason for doing so
> >> >> is so dax_associate_entry()/dax_disassociate_entry() can generate
> >> >> warnings which should never occur anyway. So all that code is
> >> >> functionally unnecessary.
> >> >
> >> > What do you mean outside of fs/dax, do you literally mean outside of
> >> > fs/dax.c, or the devdax case (i.e. dax without fs-entanglements)?
> >>
> >> Only the cases where fs dax pages might need it, i.e. not devdax,
> >> which I haven't looked at closely yet.
> >>
> >> > Memory
> >> > failure needs ->mapping and ->index to rmap dax pages. See
> >> > mm/memory-failure.c::__add_to_kill() and
> >> > mm/memory-failure.c::__add_to_kill_fsdax() where that latter one is for
> >> > cases where the fs has signed up to react to dax page failure.
> >>
> >> How does that work for reflink/shared pages which overwrite
> >> page->mapping and page->index?
> >
> > Via reverse mapping in the *filesystem*, not the mm rmap stuff.
> >
> > pmem_pagemap_memory_failure()
> > dax_holder_notify_failure()
> > .notify_failure()
> > xfs_dax_notify_failure()
> > xfs_dax_notify_ddev_failure()
> > xfs_rmap_query_range(xfs_dax_failure_fn)
> > xfs_dax_failure_fn(rmap record)
> > <grabs inode from cache>
> > <converts range to file offset>
> > mf_dax_kill_procs(inode->mapping, pgoff)
> > collect_procs_fsdax(mapping, page)
> > add_to_kill_fsdax(task)
> > __add_to_kill(task)
> > unmap_and_kill_tasks()
> >
> > Remember: in FSDAX, the pages are the storage media physically owned
> > by the filesystem, not the mm subsystem. Hence answering questions
> > like "who owns this page" can only be answered correctly by asking
> > the filesystem.
>
> Thanks Dave for writing that up, it really helped solidify my
> understanding of how this is all supposed to work.
Yes, thanks Dave!
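To make the ownership split above concrete, here is a small userspace
sketch (not kernel code; every name in it is invented for illustration)
of the two lookup paths: a single-owner page can answer from its own
cached {mapping, offset} tuple, while a shared page has to fan out
through a filesystem-side reverse map, as in the
xfs_rmap_query_range() chain above:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: a "page" that either caches its one owner, or is shared
 * and must be resolved via a filesystem reverse map. All names are
 * invented for illustration; this is not the kernel's layout. */
struct owner { int inode; long pgoff; };

struct toy_page {
	int shared;           /* 0: single owner cached in the page */
	struct owner one;     /* valid only when !shared            */
};

/* Stand-in for xfs_rmap_query_range(): the fs knows every file
 * extent mapped over a given media block. */
static int fs_rmap_query(long blkno, struct owner *out, int max)
{
	/* Pretend block 7 is reflinked into two files. */
	if (blkno == 7 && max >= 2) {
		out[0] = (struct owner){ .inode = 100, .pgoff = 3 };
		out[1] = (struct owner){ .inode = 200, .pgoff = 9 };
		return 2;
	}
	return 0;
}

/* Collect every {mapping, offset} owner of a failed page. */
static int collect_owners(const struct toy_page *p, long blkno,
			  struct owner *out, int max)
{
	if (!p->shared) {     /* fast path: tuple cached in the page */
		out[0] = p->one;
		return 1;
	}
	return fs_rmap_query(blkno, out, max); /* ask the filesystem */
}
```

The point being that in the shared case the page itself carries no
usable owner tuple; only the filesystem can enumerate the owners.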
> > We shortcut that for pages that only have one owner - we just store
> > the owner information in the page as a {mapping, offset} tuple. But
> > when we have multiple owners, the only way to find all the {mapping,
> > offset} tuples is to ask the filesystem to find all the owners of
> > that page.
> >
> > Hence the special case values for page->mapping/page->index for
> > pages over shared filesystem extents. These shared extents are
> > communicated to the fsdax layer via the IOMAP_F_SHARED flag
> > in the iomaps returned by the filesystem. This flag is the trigger
> > for the special mapping share count behaviour to be used. e.g. see
> > dax_insert_entry(iomap_iter) -> dax_associate_entry(shared) ->
> > dax_page_share_get()....
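As I read the fs/dax.c scheme Dave describes (a simplified userspace
sketch, not the kernel's actual structures; PAGE_MAPPING_DAX_SHARED is
the real sentinel name, everything else here is illustrative): the
first owner's tuple lives in the page, and an association over an
IOMAP_F_SHARED extent repurposes the mapping field as a sentinel and
the index slot as a share count:

```c
#include <assert.h>

/* Sentinel mirroring PAGE_MAPPING_DAX_SHARED; value illustrative. */
#define DAX_SHARED ((void *)0x1)

struct toy_dax_page {
	void *mapping;       /* mapping pointer, or DAX_SHARED sentinel */
	union {
		long index;  /* file offset, valid for a single owner   */
		long share;  /* association count once shared           */
	};
};

/* Modeled loosely on dax_associate_entry()/dax_page_share_get():
 * a plain association records the {mapping, index} tuple; an
 * association over a shared (reflinked) extent flips the page into
 * shared mode, after which the tuple is gone and the fs reverse map
 * is the only way to find the owners. */
static void toy_associate(struct toy_dax_page *p, void *mapping,
			  long index, int iomap_shared)
{
	if (!iomap_shared) {
		p->mapping = mapping;
		p->index = index;
		return;
	}
	if (p->mapping != DAX_SHARED) {
		p->mapping = DAX_SHARED;  /* tuple lost from the page */
		p->share = 0;
	}
	p->share++;  /* count shared associations */
}
```

So once the sentinel goes in, page->mapping/page->index stop meaning
anything to generic mm code, which is exactly the overloading being
discussed here.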
> >
> >> Eg. in __add_to_kill() if *p is a shared fs
> >> dax page then p->mapping == PAGE_MAPPING_DAX_SHARED and
> >> page_address_in_vma(vma, p) will probably crash.
> >
> > As per above, we don't get the mapping from the page in those cases.
>
> Yep, that all makes sense and I see where I was getting confused. It was
> because in __add_to_kill() we do actually use page->mapping when
> page_address_in_vma(vma, p) is called. And because
> folio_test_anon(page_folio(p)) is also true for shared FS DAX pages
> (p->mapping == PAGE_MAPPING_DAX_SHARED) I
> thought things would go bad there.
>
> However after re-reading the page_address_in_vma() implementation I
> noticed that isn't the case, because folio_anon_vma(page_folio(p)) will
> still return NULL which was the detail I had missed.
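For anyone else who trips over the same detail: PAGE_MAPPING_DAX_SHARED
is, if I recall correctly, the bare value 0x1, which also happens to be
PAGE_MAPPING_ANON. So the anon bit test passes, but stripping the flag
bits back out leaves a NULL anon_vma, which is what saves
page_address_in_vma(). A userspace sketch of the pointer-tagging
arithmetic (constants mirror include/linux/page-flags.h; the helpers
are simplified stand-ins for folio_test_anon()/folio_anon_vma()):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_MAPPING_ANON        0x1UL
#define PAGE_MAPPING_FLAGS       0x3UL
#define PAGE_MAPPING_DAX_SHARED  ((uintptr_t)0x1)  /* bare sentinel */

/* Simplified folio_test_anon(): low bit set means "anonymous". */
static int test_anon(uintptr_t mapping)
{
	return (mapping & PAGE_MAPPING_ANON) != 0;
}

/* Simplified folio_anon_vma(): strip the tag bit to recover the
 * anon_vma pointer; for the bare DAX sentinel this yields NULL. */
static void *anon_vma_of(uintptr_t mapping)
{
	if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		return NULL;
	return (void *)(mapping - PAGE_MAPPING_ANON);
}
```

A shared fsdax page looks anon to the bit test, but the recovered
anon_vma is NULL, so the rmap walk bails out instead of crashing.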
>
> So to go back to my original point it seems page->mapping and
> page->index is largely unused on fs dax pages, even for memory
> failure. Because for memory failure the mapping and fsdax_pgoff comes
> from the filesystem not the page.
Only for XFS; all other filesystems depend on typical rmap, and I
believe XFS only uses that path when the fs is mounted with reflink
support.
You had asked about tests before, and I do have a regression test for
memory-failure in the ndctl regression suite that should be validating
this for xfs and ext4:
https://github.com/pmem/ndctl/blob/main/test/dax-poison.c#L47
meson test -C build --suite ndctl:ndctl
There are setup instructions here:
https://github.com/pmem/ndctl/blob/main/README.md
I typically run it in its own VM. This run_qemu script can build and
boot an environment to run it:
https://github.com/pmem/run_qemu
There is also kdevops, which includes it, but I have not tried that:
https://github.com/linux-kdevops/kdevops