From: Alistair Popple <apopple@nvidia.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org, vishal.l.verma@intel.com,
dave.jiang@intel.com, logang@deltatee.com, bhelgaas@google.com,
jack@suse.cz, jgg@ziepe.ca, catalin.marinas@arm.com,
will@kernel.org, mpe@ellerman.id.au, npiggin@gmail.com,
dave.hansen@linux.intel.com, ira.weiny@intel.com,
willy@infradead.org, djwong@kernel.org, tytso@mit.edu,
linmiaohe@huawei.com, david@redhat.com, peterx@redhat.com,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linuxppc-dev@lists.ozlabs.org, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org,
jhubbard@nvidia.com, hch@lst.de, david@fromorbit.com
Subject: Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages
Date: Wed, 30 Oct 2024 16:57:09 +1100 [thread overview]
Message-ID: <87zfmmp1z7.fsf@nvdebian.thelocal> (raw)
In-Reply-To: <6720428aa1fcc_bc6a929439@dwillia2-xfh.jf.intel.com.notmuch>
Dan Williams <dan.j.williams@intel.com> writes:
> Alistair Popple wrote:
> [..]
>
>> >> > It follows that that the DMA-idle condition still needs to look for the
>> >> > case where the refcount is > 1 rather than 0 since refcount == 1 is the
>> >> > page-mapped-but-DMA-idle condition.
>>
>> Because if the DAX page-cache holds a reference the refcount won't go to
>> zero until dax_delete_mapping_entry() is called. However this interface
>> seems really strange to me - filesystems call
>> dax_layout_busy_page()/dax_wait_page_idle() to make sure both user-space
>> and DMA[1] have finished with the page, but not the DAX code which still
>> has references in it's page-cache.
>
> First, I appreciate the clarification that I was mixing up "mapped"
> (elevated map count) with, for lack of a better term, "tracked" (mapping
> entry valid).
>
> So, to repeat back to you what I understand now, the proposal is to
> attempt to allow _count==0 as the DMA idle condition, but still have the
> final return of the block to the allocator (fs allocator) occur after
> dax_delete_mapping_entry().
Right, that is what I would like to achieve if possible. The outstanding
question I think is "should the DAX page-cache have a reference on the
page?". Or to use your terminology below "if a pfn is tracked should
pfn_to_page(pfn)->_refcount == 0 or 1?"
This version implements it as being zero because altering that requires
re-ordering all the existing filesystem and mm users of
dax_layout_busy_range() and dax_delete_mapping_entry(). Based on this
discussion though I'm beginning to think it probably should be one, but
I haven't been able to make that work yet.
>> Is there some reason for this? In order words why can't the interface to
>> the filesystem be something like calling dax_break_layouts() which
>> ensures everything, including core FS DAX code, has finished with the
>> page(s) in question? I don't see why that wouldn't work for at least
>> EXT4 and XFS (FUSE seemed a bit different but I haven't dug too deeply).
>>
>> If we could do that dax_break_layouts() would essentially:
>> 1. unmap userspace via eg. unmap_mapping_pages() to drive the refcount
>> down.
>
> Am I missing where unmap_mapping_pages() drops the _count? I can see
> where it drops _mapcount. I don't think that matters for the proposal,
> but that's my last gap in tracking the proposed refcount model.
It is suitably obtuse due to MMU_GATHER. unmap_mapping_pages() drops the
folio/page reference after flushing the TLB. Ie:
=> tlb_finish_mmu
=> tlb_flush_mmu
=> __tlb_batch_free_encoded_pages
=> free_pages_and_swap_cache
=> folios_put_refs
>> 2. delete the DAX page-cache entry to remove its refcount.
>> 3. wait for DMA to complete by waiting for the refcount to hit zero.
>>
>> The problem with the filesystem truncate code at the moment is steps 2
>> and 3 are reversed so step 3 has to wait for a refcount of 1 as you
>> pointed out previously. But does that matter? Are there ever cases when
>> a filesystem needs to wait for the page to be idle but maintain it's DAX
>> page-cache entry?
>
> No, not that I can think of. The filesystem just cares that the page was
> seen as part of the file at some point and that it is holding locks to
> keep the block associated with that page allocated to the file until it
> can complete its operation.
>
> I think what we are talking about is a pfn-state not a page state. If
> the block-pfn-page lifecycle from allocation to free is deconstructed as:
>
> block free
> block allocated
> pfn untracked
> pfn tracked
> page free
> page busy
> page free
> pfn untracked
> block free
>
> ...then I can indeed see cases where there is pfn metadata live even
> though the page is free.
>
> So I think I was playing victim to the current implementation that
> assumes that "pfn tracked" means the page is allocated and that
> pfn_to_folio(pfn)->mapping is valid and not NULL.
>
> All this to say I am at least on the same page as you that _count == 0
> can be used as the page free state even if the pfn tracking goes through
> delayed cleanup.
Great, and I like this terminology of pfn tracked, etc.
> However, if vmf_insert_XXX is increasing _count then, per my
> unmap_mapping_pages() question above, I think dax_wait_page_idle() needs
> to call try_to_unmap() to drop that _count, right?
At the moment filesystems open-code their own version of
XXXX_break_layouts() which typically calls dax_layout_busy_page()
followed by dax_wait_page_idle(). The former will call
unmap_mapping_range(), which for shared mappings I thought should be
sufficient to find and unmap all page table references (and therefore
folio/page _refcounts) based on the address space / index.
I think try_to_unmap() would only be neccessary if we only had the folio
and not the address space / index and therefore needed to find them from
the mm (not fs!) rmap.
> Similar observation for the memory_failure_dev_pagemap() path, I think
> that path only calls unmap_mapping_range() not try_to_unmap() and
> leaves _count elevated.
As noted above unmap_mapping_range() will drop the refcount whenever it
clears a pte/pmd mapping the folio and I think it should find all the
pte's mapping it.
> Lastly walking through the code again I think this fix is valid today:
>
> diff --git a/fs/dax.c b/fs/dax.c
> index fcbe62bde685..48f2c85690e1 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -660,7 +660,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
> pgoff_t end_idx;
> XA_STATE(xas, &mapping->i_pages, start_idx);
>
> - if (!dax_mapping(mapping) || !mapping_mapped(mapping))
> + if (!dax_mapping(mapping))
> return NULL;
>
> /* If end == LLONG_MAX, all pages from start to till end of file */
>
>
> ...because unmap_mapping_pages() will mark the mapping as unmapped even
> though there are "pfn tracked + page busy" entries to clean up.
Yep, I noticed this today when I was trying to figure out why my
re-ordering of the unmap/wait/untrack pfn wasn't working as expected. It
still isn't for some other reason, and I'm still figuring out if the
above is correct/valid, but it is on my list of things to look more
closely at.
> Appreciate you grappling this with me!
Not at all! And thank you as well ... I feel like this has helped me a
lot in getting a slightly better understanding of the problems. Also
unless you react violently to anything I've said here I think I have
enough material to post (and perhaps even explain!) the next version of
this series.
- Alistair
next prev parent reply other threads:[~2024-10-30 5:57 UTC|newest]
Thread overview: 53+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-10 4:14 [PATCH 00/12] fs/dax: Fix FS DAX page reference counts Alistair Popple
2024-09-10 4:14 ` [PATCH 01/12] mm/gup.c: Remove redundant check for PCI P2PDMA page Alistair Popple
2024-09-22 1:00 ` Dan Williams
2024-09-10 4:14 ` [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one Alistair Popple
2024-09-10 13:47 ` Bjorn Helgaas
2024-09-11 1:07 ` Alistair Popple
2024-09-11 13:51 ` Bjorn Helgaas
2024-09-11 0:48 ` Logan Gunthorpe
2024-10-11 0:20 ` Alistair Popple
2024-09-22 1:00 ` Dan Williams
2024-10-11 0:17 ` Alistair Popple
2024-09-10 4:14 ` [PATCH 03/12] fs/dax: Refactor wait for dax idle page Alistair Popple
2024-09-22 1:01 ` Dan Williams
2024-09-10 4:14 ` [PATCH 04/12] mm: Allow compound zone device pages Alistair Popple
2024-09-10 4:47 ` Matthew Wilcox
2024-09-10 6:57 ` Alistair Popple
2024-09-10 13:41 ` Matthew Wilcox
2024-09-12 12:44 ` kernel test robot
2024-09-12 12:44 ` kernel test robot
2024-09-22 1:01 ` Dan Williams
2024-09-10 4:14 ` [PATCH 05/12] mm/memory: Add dax_insert_pfn Alistair Popple
2024-09-22 1:41 ` Dan Williams
2024-10-01 10:43 ` Gerald Schaefer
2024-09-10 4:14 ` [PATCH 06/12] huge_memory: Allow mappings of PUD sized pages Alistair Popple
2024-09-22 2:07 ` Dan Williams
2024-10-14 6:33 ` Alistair Popple
2024-09-10 4:14 ` [PATCH 07/12] huge_memory: Allow mappings of PMD " Alistair Popple
2024-09-27 2:48 ` Dan Williams
2024-10-14 6:53 ` Alistair Popple
2024-10-23 23:14 ` Alistair Popple
2024-10-23 23:38 ` Dan Williams
2024-09-10 4:14 ` [PATCH 08/12] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
2024-09-25 0:17 ` Dan Williams
2024-09-27 2:52 ` Dan Williams
2024-10-14 7:03 ` Alistair Popple
2024-09-10 4:14 ` [PATCH 09/12] mm: Update vm_normal_page() callers to accept " Alistair Popple
2024-09-27 7:15 ` Dan Williams
2024-10-14 7:16 ` Alistair Popple
2024-09-10 4:14 ` [PATCH 10/12] fs/dax: Properly refcount fs dax pages Alistair Popple
2024-09-27 7:59 ` Dan Williams
2024-10-24 7:52 ` Alistair Popple
2024-10-24 23:52 ` Dan Williams
2024-10-25 2:46 ` Alistair Popple
2024-10-25 4:35 ` Dan Williams
2024-10-28 4:24 ` Alistair Popple
2024-10-29 2:03 ` Dan Williams
2024-10-30 5:57 ` Alistair Popple [this message]
2024-09-10 4:14 ` [PATCH 11/12] mm: Remove pXX_devmap callers Alistair Popple
2024-09-27 12:29 ` Alexander Gordeev
2024-10-14 7:14 ` Alistair Popple
2024-09-10 4:14 ` [PATCH 12/12] mm: Remove devmap related functions and page table bits Alistair Popple
2024-09-11 7:47 ` Chunyan Zhang
2024-09-12 12:55 ` kernel test robot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87zfmmp1z7.fsf@nvdebian.thelocal \
--to=apopple@nvidia.com \
--cc=bhelgaas@google.com \
--cc=catalin.marinas@arm.com \
--cc=dan.j.williams@intel.com \
--cc=dave.hansen@linux.intel.com \
--cc=dave.jiang@intel.com \
--cc=david@fromorbit.com \
--cc=david@redhat.com \
--cc=djwong@kernel.org \
--cc=hch@lst.de \
--cc=ira.weiny@intel.com \
--cc=jack@suse.cz \
--cc=jgg@ziepe.ca \
--cc=jhubbard@nvidia.com \
--cc=linmiaohe@huawei.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-xfs@vger.kernel.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=logang@deltatee.com \
--cc=mpe@ellerman.id.au \
--cc=npiggin@gmail.com \
--cc=nvdimm@lists.linux.dev \
--cc=peterx@redhat.com \
--cc=tytso@mit.edu \
--cc=vishal.l.verma@intel.com \
--cc=will@kernel.org \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).