linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: Alistair Popple <apopple@nvidia.com>,
	Dan Williams <dan.j.williams@intel.com>
Cc: <linux-mm@kvack.org>, <vishal.l.verma@intel.com>,
	<dave.jiang@intel.com>, <logang@deltatee.com>,
	<bhelgaas@google.com>, <jack@suse.cz>, <jgg@ziepe.ca>,
	<catalin.marinas@arm.com>, <will@kernel.org>,
	<mpe@ellerman.id.au>, <npiggin@gmail.com>,
	<dave.hansen@linux.intel.com>, <ira.weiny@intel.com>,
	<willy@infradead.org>, <djwong@kernel.org>, <tytso@mit.edu>,
	<linmiaohe@huawei.com>, <david@redhat.com>, <peterx@redhat.com>,
	<linux-doc@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<linux-arm-kernel@lists.infradead.org>,
	<linuxppc-dev@lists.ozlabs.org>, <nvdimm@lists.linux.dev>,
	<linux-cxl@vger.kernel.org>, <linux-fsdevel@vger.kernel.org>,
	<linux-ext4@vger.kernel.org>, <linux-xfs@vger.kernel.org>,
	<jhubbard@nvidia.com>, <hch@lst.de>, <david@fromorbit.com>
Subject: Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages
Date: Thu, 24 Oct 2024 21:35:02 -0700	[thread overview]
Message-ID: <671b1ff62c64d_10e5929465@dwillia2-xfh.jf.intel.com.notmuch> (raw)
In-Reply-To: <87seskvqt2.fsf@nvdebian.thelocal>

Alistair Popple wrote:
[..]
>> I'm not really following this scenario, or at least how it relates to
> >> the comment above. If the page is pinned for DMA it will have taken a
> >> refcount on it and so the page won't be considered free/idle per
> >> dax_wait_page_idle() or any of the other mm code.
> >
> > [ tl;dr: I think we're ok, analysis below, but I did talk myself into
> > the proposed dax_busy_page() changes indeed being broken and needing to
> > remain checking for refcount > 1, not > 0 ]
> >
> > It's not the mm code I am worried about. It's the filesystem block
> > allocator staying in-sync with the allocation state of the page.
> >
> > fs/dax.c is charged with converting idle storage blocks to pfns to
> > mapped folios. Once they are mapped, DMA can pin the folio, but nothing
> > in fs/dax.c pins the mapping. In the pagecache case the page reference
> > is sufficient to keep the DMA-busy page from being reused. In the dax
> > case something needs to arrange for DMA to be idle before
> > dax_delete_mapping_entry().
> 
> Ok. How does that work today? My current mental model is that something
> has to call dax_layout_busy_page() whilst holding the correct locks to
> prevent a new mapping being established prior to calling
> dax_delete_mapping_entry(). Is that correct?

Correct. dax_delete_mapping_entry() is invoked by the filesystem with
inode locks held. See xfs_file_fallocate() where it takes the lock,
calls xfs_break_layouts() and if that succeeds performs
xfs_file_free_space() with the lock held.

xfs_file_free_space() triggers dax_delete_mapping_entry() with knowledge
that the mapping cannot be re-established until the lock is dropped.

> > However, looking at XFS it indeed makes that guarantee. First it does
> > xfs_break_dax_layouts() then it does truncate_inode_pages() =>
> > dax_delete_mapping_entry().
> >
> > It follows that that the DMA-idle condition still needs to look for the
> > case where the refcount is > 1 rather than 0 since refcount == 1 is the
> > page-mapped-but-DMA-idle condition.
> 
> Sorry, but I'm still not following this line of reasoning. If the
> refcount == 1 the page is either mapped xor DMA-busy.

No, my expectation is the refcount is 1 while the page has a mapping
entry, analagous to an idle / allocated page cache page, and the
refcount is 2 or more for DMA, get_user_pages(), or any page walker that
takes a transient page pin.

> is enough to conclude that the page cannot be reused because it is
> either being accessed from userspace via a CPU mapping or from some
> device DMA or some other in kernel user.

Userspace access is not a problem, that access can always be safely
revoked by unmapping the page, and that's what dax_layout_busy_page()
does to force a fault and re-taking the inode + mmap locks so that the
truncate path knows it has temporary exclusive access to the page, pfn,
and storage-block association.

> The current proposal is that dax_busy_page() returns true if refcount >=
> 1, and dax_wait_page_idle() will wait until the refcount ==
> 0. dax_busy_page() will try and force the refcount == 0 by unmapping it,
> but obviously can't force other pinners to release their reference hence
> the need to wait. Callers should already be holding locks to ensure new
> mappings can't be established and hence can't become DMA-busy after the
> unmap.

Am I missing a page_ref_dec() somewhere? Are you saying that
dax_layout_busy_page() will find entries with ->mapping non-NULL and
refcount == 0?

[..]
> >> >> @@ -1684,14 +1663,21 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> >> >>  	if (dax_fault_is_synchronous(iter, vmf->vma))
> >> >>  		return dax_fault_synchronous_pfnp(pfnp, pfn);
> >> >>  
> >> >> -	/* insert PMD pfn */
> >> >> +	page = pfn_t_to_page(pfn);
> >> >
> >> > I think this is clearer if dax_insert_entry() returns folios with an
> >> > elevated refrence count that is dropped when the folio is invalidated
> >> > out of the mapping.
> >> 
> >> I presume this comment is for the next line:
> >> 
> >> +	page_ref_inc(page);
> >>  
> >> I can move that into dax_insert_entry(), but we would still need to
> >> drop it after calling vmf_insert_*() to ensure we get the 1 -> 0
> >> transition when the page is unmapped and therefore
> >> freed. Alternatively we can make it so vmf_insert_*() don't take
> >> references on the page, and instead ownership of the reference is
> >> transfered to the mapping. Personally I prefered having those
> >> functions take their own reference but let me know what you think.
> >
> > Oh, the model I was thinking was that until vmf_insert_XXX() succeeds
> > then the page was never allocated because it was never mapped. What
> > happens with the code as proposed is that put_page() triggers page-free
> > semantics on vmf_insert_XXX() failures, right?
> 
> Right. And actually that means I can't move the page_ref_inc(page) into
> what will be called dax_create_folio(), because an entry may have been
> created previously that had a failed vmf_insert_XXX() which will
> therefore have a zero refcount folio associated with it.

I would expect a full cleanup on on vmf_insert_XXX() failure, not
leaving a zero-referenced entry.

> But I think that model is wrong. I think the model needs to be the page
> gets allocated when the entry is first created (ie. when
> dax_create_folio() is called). A subsequent free (ether due to
> vmf_insert_XXX() failing or the page being unmapped or becoming
> DMA-idle) should then delete the entry.
>
> I think that makes the semantics around dax_busy_page() nicer as well -
> no need for the truncate to have a special path to call
> dax_delete_mapping_entry().

I agree it would be lovely if the final put could clean up the mapping
entry and not depend on truncate_inode_pages_range() to do that.

...but I do not immediately see how to get there when block, pfn, and
page are so tightly coupled with dax. That's a whole new project to
introduce that paradigm, no? The page cache case gets away with
it by safely disconnecting the pfn+page from the block and then letting
DMA final put_page() take its time.

> > There is no need to invoke the page-free / final-put path on
> > vmf_insert_XXX() error because the storage-block / pfn never actually
> > transitioned into a page / folio.
> 
> It's not mapping a page/folio that transitions a pfn into a page/folio
> it is the allocation of the folio that happens in dax_create_folio()
> (aka. dax_associate_new_entry()). So we need to delete the entry (as
> noted above I don't do that currently) if the insertion fails.

Yeah, deletion on insert failure makes sense.

[..]
> >> >> @@ -519,21 +529,3 @@ void zone_device_page_init(struct page *page)
> >> >>  	lock_page(page);
> >> >>  }
> >> >>  EXPORT_SYMBOL_GPL(zone_device_page_init);
> >> >> -
> >> >> -#ifdef CONFIG_FS_DAX
> >> >> -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
> >> >> -{
> >> >> -	if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
> >> >> -		return false;
> >> >> -
> >> >> -	/*
> >> >> -	 * fsdax page refcounts are 1-based, rather than 0-based: if
> >> >> -	 * refcount is 1, then the page is free and the refcount is
> >> >> -	 * stable because nobody holds a reference on the page.
> >> >> -	 */
> >> >> -	if (folio_ref_sub_return(folio, refs) == 1)
> >> >> -		wake_up_var(&folio->_refcount);
> >> >> -	return true;
> >> >
> >> > It follow from the refcount disvussion above that I think there is an
> >> > argument to still keep this wakeup based on the 2->1 transitition.
> >> > pagecache pages are refcount==1 when they are dma-idle but still
> >> > allocated. To keep the same semantics for dax a dax_folio would have an
> >> > elevated refcount whenever it is referenced by mapping entry.
> >> 
> >> I'm not sold on keeping it as it doesn't seem to offer any benefit
> >> IMHO. I know both Jason and Christoph were keen to see it go so it be
> >> good to get their feedback too. Also one of the primary goals of this
> >> series was to refcount the page normally so we could remove the whole
> >> "page is free with a refcount of 1" semantics.
> >
> > The page is still free at refcount 0, no argument there. But, by
> > introducing a new "page refcount is elevated while mapped" (as it
> > should), it follows that "page is DMA idle at refcount == 1", right?
> 
> No. The page is either mapped xor DMA-busy - ie. not free. If we want
> (need?) to tell the difference we can use folio_maybe_dma_pinned(),
> assuming the driver doing DMA has called pin_user_pages() as it should.
> 
> That said I'm not sure why we care about the distinction between
> DMA-idle and mapped? If the page is not free from the mm perspective the
> block can't be reallocated by the filesystem.

"can't be reallocated", what enforces that in your view? I am hoping it
is something I am overlooking.

In my view the filesystem has no idea of this page-to-block
relationship. All it knows is that when it wants to destroy the
page-to-block association, dax notices and says "uh, oh, this is my last
chance to make sure the block can go back into the fs allocation pool so
I need to wait for the mm to say that the page is exclusive to me (dax
core) before dax_delete_mapping_entry() destroys the page-to-block
association and the fs reclaims the allocation".

> > Otherwise, the current assumption that fileystems can have
> > dax_layout_busy_page_range() poll on the state of the pfn in the mapping
> > is broken because page refcount == 0 also means no page to mapping
> > association.
> 
> And also means nothing from the mm (userspace mapping, DMA-busy, etc.)
> is using the page so the page isn't busy and is free to be reallocated
> right?

Lets take the 'map => start dma => truncate => end dma' scenario.

At the 'end dma' step, how does the filesystem learn that the block that
it truncated, potentially hours ago, is now a free block? The filesystem
thought it reclaimed the block when truncate completed. I.e. dax says,
thou shalt 'end dma' => 'truncate' in all cases.

Note "dma" can be replaced with "any non dax core page_ref".

  reply	other threads:[~2024-10-25  4:35 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-10  4:14 [PATCH 00/12] fs/dax: Fix FS DAX page reference counts Alistair Popple
2024-09-10  4:14 ` [PATCH 01/12] mm/gup.c: Remove redundant check for PCI P2PDMA page Alistair Popple
2024-09-22  1:00   ` Dan Williams
2024-09-10  4:14 ` [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one Alistair Popple
2024-09-10 13:47   ` Bjorn Helgaas
2024-09-11  1:07     ` Alistair Popple
2024-09-11 13:51       ` Bjorn Helgaas
2024-09-11  0:48   ` Logan Gunthorpe
2024-10-11  0:20     ` Alistair Popple
2024-09-22  1:00   ` Dan Williams
2024-10-11  0:17     ` Alistair Popple
2024-09-10  4:14 ` [PATCH 03/12] fs/dax: Refactor wait for dax idle page Alistair Popple
2024-09-22  1:01   ` Dan Williams
2024-09-10  4:14 ` [PATCH 04/12] mm: Allow compound zone device pages Alistair Popple
2024-09-10  4:47   ` Matthew Wilcox
2024-09-10  6:57     ` Alistair Popple
2024-09-10 13:41       ` Matthew Wilcox
2024-09-12 12:44   ` kernel test robot
2024-09-12 12:44   ` kernel test robot
2024-09-22  1:01   ` Dan Williams
2024-09-10  4:14 ` [PATCH 05/12] mm/memory: Add dax_insert_pfn Alistair Popple
2024-09-22  1:41   ` Dan Williams
2024-10-01 10:43     ` Gerald Schaefer
2024-09-10  4:14 ` [PATCH 06/12] huge_memory: Allow mappings of PUD sized pages Alistair Popple
2024-09-22  2:07   ` Dan Williams
2024-10-14  6:33     ` Alistair Popple
2024-09-10  4:14 ` [PATCH 07/12] huge_memory: Allow mappings of PMD " Alistair Popple
2024-09-27  2:48   ` Dan Williams
2024-10-14  6:53     ` Alistair Popple
2024-10-23 23:14       ` Alistair Popple
2024-10-23 23:38         ` Dan Williams
2024-09-10  4:14 ` [PATCH 08/12] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
2024-09-25  0:17   ` Dan Williams
2024-09-27  2:52     ` Dan Williams
2024-10-14  7:03       ` Alistair Popple
2024-09-10  4:14 ` [PATCH 09/12] mm: Update vm_normal_page() callers to accept " Alistair Popple
2024-09-27  7:15   ` Dan Williams
2024-10-14  7:16     ` Alistair Popple
2024-09-10  4:14 ` [PATCH 10/12] fs/dax: Properly refcount fs dax pages Alistair Popple
2024-09-27  7:59   ` Dan Williams
2024-10-24  7:52     ` Alistair Popple
2024-10-24 23:52       ` Dan Williams
2024-10-25  2:46         ` Alistair Popple
2024-10-25  4:35           ` Dan Williams [this message]
2024-10-28  4:24             ` Alistair Popple
2024-10-29  2:03               ` Dan Williams
2024-10-30  5:57                 ` Alistair Popple
2024-09-10  4:14 ` [PATCH 11/12] mm: Remove pXX_devmap callers Alistair Popple
2024-09-27 12:29   ` Alexander Gordeev
2024-10-14  7:14     ` Alistair Popple
2024-09-10  4:14 ` [PATCH 12/12] mm: Remove devmap related functions and page table bits Alistair Popple
2024-09-11  7:47   ` Chunyan Zhang
2024-09-12 12:55   ` kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=671b1ff62c64d_10e5929465@dwillia2-xfh.jf.intel.com.notmuch \
    --to=dan.j.williams@intel.com \
    --cc=apopple@nvidia.com \
    --cc=bhelgaas@google.com \
    --cc=catalin.marinas@arm.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=dave.jiang@intel.com \
    --cc=david@fromorbit.com \
    --cc=david@redhat.com \
    --cc=djwong@kernel.org \
    --cc=hch@lst.de \
    --cc=ira.weiny@intel.com \
    --cc=jack@suse.cz \
    --cc=jgg@ziepe.ca \
    --cc=jhubbard@nvidia.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=logang@deltatee.com \
    --cc=mpe@ellerman.id.au \
    --cc=npiggin@gmail.com \
    --cc=nvdimm@lists.linux.dev \
    --cc=peterx@redhat.com \
    --cc=tytso@mit.edu \
    --cc=vishal.l.verma@intel.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).