From: Alistair Popple <apopple@nvidia.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org, vishal.l.verma@intel.com,
dave.jiang@intel.com, logang@deltatee.com, bhelgaas@google.com,
jack@suse.cz, jgg@ziepe.ca, catalin.marinas@arm.com,
will@kernel.org, mpe@ellerman.id.au, npiggin@gmail.com,
dave.hansen@linux.intel.com, ira.weiny@intel.com,
willy@infradead.org, djwong@kernel.org, tytso@mit.edu,
linmiaohe@huawei.com, david@redhat.com, peterx@redhat.com,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linuxppc-dev@lists.ozlabs.org, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org,
jhubbard@nvidia.com, hch@lst.de, david@fromorbit.com
Subject: Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages
Date: Wed, 30 Oct 2024 16:57:09 +1100 [thread overview]
Message-ID: <87zfmmp1z7.fsf@nvdebian.thelocal> (raw)
In-Reply-To: <6720428aa1fcc_bc6a929439@dwillia2-xfh.jf.intel.com.notmuch>
Dan Williams <dan.j.williams@intel.com> writes:
> Alistair Popple wrote:
> [..]
>
>> >> > It follows that that the DMA-idle condition still needs to look for the
>> >> > case where the refcount is > 1 rather than 0 since refcount == 1 is the
>> >> > page-mapped-but-DMA-idle condition.
>>
>> Because if the DAX page-cache holds a reference the refcount won't go to
>> zero until dax_delete_mapping_entry() is called. However this interface
>> seems really strange to me - filesystems call
>> dax_layout_busy_page()/dax_wait_page_idle() to make sure both user-space
>> and DMA[1] have finished with the page, but not the DAX code which still
>> has references in it's page-cache.
>
> First, I appreciate the clarification that I was mixing up "mapped"
> (elevated map count) with, for lack of a better term, "tracked" (mapping
> entry valid).
>
> So, to repeat back to you what I understand now, the proposal is to
> attempt to allow _count==0 as the DMA idle condition, but still have the
> final return of the block to the allocator (fs allocator) occur after
> dax_delete_mapping_entry().
Right, that is what I would like to achieve if possible. The outstanding
question I think is "should the DAX page-cache have a reference on the
page?". Or to use your terminology below "if a pfn is tracked should
pfn_to_page(pfn)->_refcount == 0 or 1?"
This version implements it as being zero because altering that requires
re-ordering all the existing filesystem and mm users of
dax_layout_busy_range() and dax_delete_mapping_entry(). Based on this
discussion though I'm beginning to think it probably should be one, but
I haven't been able to make that work yet.
>> Is there some reason for this? In order words why can't the interface to
>> the filesystem be something like calling dax_break_layouts() which
>> ensures everything, including core FS DAX code, has finished with the
>> page(s) in question? I don't see why that wouldn't work for at least
>> EXT4 and XFS (FUSE seemed a bit different but I haven't dug too deeply).
>>
>> If we could do that dax_break_layouts() would essentially:
>> 1. unmap userspace via eg. unmap_mapping_pages() to drive the refcount
>> down.
>
> Am I missing where unmap_mapping_pages() drops the _count? I can see
> where it drops _mapcount. I don't think that matters for the proposal,
> but that's my last gap in tracking the proposed refcount model.
It is suitably obtuse due to MMU_GATHER. unmap_mapping_pages() drops the
folio/page reference after flushing the TLB. Ie:
=> tlb_finish_mmu
=> tlb_flush_mmu
=> __tlb_batch_free_encoded_pages
=> free_pages_and_swap_cache
=> folios_put_refs
>> 2. delete the DAX page-cache entry to remove its refcount.
>> 3. wait for DMA to complete by waiting for the refcount to hit zero.
>>
>> The problem with the filesystem truncate code at the moment is steps 2
>> and 3 are reversed so step 3 has to wait for a refcount of 1 as you
>> pointed out previously. But does that matter? Are there ever cases when
>> a filesystem needs to wait for the page to be idle but maintain it's DAX
>> page-cache entry?
>
> No, not that I can think of. The filesystem just cares that the page was
> seen as part of the file at some point and that it is holding locks to
> keep the block associated with that page allocated to the file until it
> can complete its operation.
>
> I think what we are talking about is a pfn-state not a page state. If
> the block-pfn-page lifecycle from allocation to free is deconstructed as:
>
> block free
> block allocated
> pfn untracked
> pfn tracked
> page free
> page busy
> page free
> pfn untracked
> block free
>
> ...then I can indeed see cases where there is pfn metadata live even
> though the page is free.
>
> So I think I was playing victim to the current implementation that
> assumes that "pfn tracked" means the page is allocated and that
> pfn_to_folio(pfn)->mapping is valid and not NULL.
>
> All this to say I am at least on the same page as you that _count == 0
> can be used as the page free state even if the pfn tracking goes through
> delayed cleanup.
Great, and I like this terminology of pfn tracked, etc.
> However, if vmf_insert_XXX is increasing _count then, per my
> unmap_mapping_pages() question above, I think dax_wait_page_idle() needs
> to call try_to_unmap() to drop that _count, right?
At the moment filesystems open-code their own version of
XXXX_break_layouts() which typically calls dax_layout_busy_page()
followed by dax_wait_page_idle(). The former will call
unmap_mapping_range(), which for shared mappings I thought should be
sufficient to find and unmap all page table references (and therefore
folio/page _refcounts) based on the address space / index.
I think try_to_unmap() would only be neccessary if we only had the folio
and not the address space / index and therefore needed to find them from
the mm (not fs!) rmap.
> Similar observation for the memory_failure_dev_pagemap() path, I think
> that path only calls unmap_mapping_range() not try_to_unmap() and
> leaves _count elevated.
As noted above unmap_mapping_range() will drop the refcount whenever it
clears a pte/pmd mapping the folio and I think it should find all the
pte's mapping it.
> Lastly walking through the code again I think this fix is valid today:
>
> diff --git a/fs/dax.c b/fs/dax.c
> index fcbe62bde685..48f2c85690e1 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -660,7 +660,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
> pgoff_t end_idx;
> XA_STATE(xas, &mapping->i_pages, start_idx);
>
> - if (!dax_mapping(mapping) || !mapping_mapped(mapping))
> + if (!dax_mapping(mapping))
> return NULL;
>
> /* If end == LLONG_MAX, all pages from start to till end of file */
>
>
> ...because unmap_mapping_pages() will mark the mapping as unmapped even
> though there are "pfn tracked + page busy" entries to clean up.
Yep, I noticed this today when I was trying to figure out why my
re-ordering of the unmap/wait/untrack pfn wasn't working as expected. It
still isn't for some other reason, and I'm still figuring out if the
above is correct/valid, but it is on my list of things to look more
closely at.
> Appreciate you grappling this with me!
Not at all! And thank you as well ... I feel like this has helped me a
lot in getting a slightly better understanding of the problems. Also
unless you react violently to anything I've said here I think I have
enough material to post (and perhaps even explain!) the next version of
this series.
- Alistair
next prev parent reply other threads:[~2024-10-30 5:59 UTC|newest]
Thread overview: 53+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-10 4:14 [PATCH 00/12] fs/dax: Fix FS DAX page reference counts Alistair Popple
2024-09-10 4:14 ` [PATCH 01/12] mm/gup.c: Remove redundant check for PCI P2PDMA page Alistair Popple
2024-09-22 1:00 ` Dan Williams
2024-09-10 4:14 ` [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one Alistair Popple
2024-09-10 13:47 ` Bjorn Helgaas
2024-09-11 1:07 ` Alistair Popple
2024-09-11 13:51 ` Bjorn Helgaas
2024-09-11 0:48 ` Logan Gunthorpe
2024-10-11 0:20 ` Alistair Popple
2024-09-22 1:00 ` Dan Williams
2024-10-11 0:17 ` Alistair Popple
2024-09-10 4:14 ` [PATCH 03/12] fs/dax: Refactor wait for dax idle page Alistair Popple
2024-09-22 1:01 ` Dan Williams
2024-09-10 4:14 ` [PATCH 04/12] mm: Allow compound zone device pages Alistair Popple
2024-09-10 4:47 ` Matthew Wilcox
2024-09-10 6:57 ` Alistair Popple
2024-09-10 13:41 ` Matthew Wilcox
2024-09-12 12:44 ` kernel test robot
2024-09-12 12:44 ` kernel test robot
2024-09-22 1:01 ` Dan Williams
2024-09-10 4:14 ` [PATCH 05/12] mm/memory: Add dax_insert_pfn Alistair Popple
2024-09-22 1:41 ` Dan Williams
2024-10-01 10:43 ` Gerald Schaefer
2024-09-10 4:14 ` [PATCH 06/12] huge_memory: Allow mappings of PUD sized pages Alistair Popple
2024-09-22 2:07 ` Dan Williams
2024-10-14 6:33 ` Alistair Popple
2024-09-10 4:14 ` [PATCH 07/12] huge_memory: Allow mappings of PMD " Alistair Popple
2024-09-27 2:48 ` Dan Williams
2024-10-14 6:53 ` Alistair Popple
2024-10-23 23:14 ` Alistair Popple
2024-10-23 23:38 ` Dan Williams
2024-09-10 4:14 ` [PATCH 08/12] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
2024-09-25 0:17 ` Dan Williams
2024-09-27 2:52 ` Dan Williams
2024-10-14 7:03 ` Alistair Popple
2024-09-10 4:14 ` [PATCH 09/12] mm: Update vm_normal_page() callers to accept " Alistair Popple
2024-09-27 7:15 ` Dan Williams
2024-10-14 7:16 ` Alistair Popple
2024-09-10 4:14 ` [PATCH 10/12] fs/dax: Properly refcount fs dax pages Alistair Popple
2024-09-27 7:59 ` Dan Williams
2024-10-24 7:52 ` Alistair Popple
2024-10-24 23:52 ` Dan Williams
2024-10-25 2:46 ` Alistair Popple
2024-10-25 4:35 ` Dan Williams
2024-10-28 4:24 ` Alistair Popple
2024-10-29 2:03 ` Dan Williams
2024-10-30 5:57 ` Alistair Popple [this message]
2024-09-10 4:14 ` [PATCH 11/12] mm: Remove pXX_devmap callers Alistair Popple
2024-09-27 12:29 ` Alexander Gordeev
2024-10-14 7:14 ` Alistair Popple
2024-09-10 4:14 ` [PATCH 12/12] mm: Remove devmap related functions and page table bits Alistair Popple
2024-09-11 7:47 ` Chunyan Zhang
2024-09-12 12:55 ` kernel test robot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87zfmmp1z7.fsf@nvdebian.thelocal \
--to=apopple@nvidia.com \
--cc=bhelgaas@google.com \
--cc=catalin.marinas@arm.com \
--cc=dan.j.williams@intel.com \
--cc=dave.hansen@linux.intel.com \
--cc=dave.jiang@intel.com \
--cc=david@fromorbit.com \
--cc=david@redhat.com \
--cc=djwong@kernel.org \
--cc=hch@lst.de \
--cc=ira.weiny@intel.com \
--cc=jack@suse.cz \
--cc=jgg@ziepe.ca \
--cc=jhubbard@nvidia.com \
--cc=linmiaohe@huawei.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-xfs@vger.kernel.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=logang@deltatee.com \
--cc=mpe@ellerman.id.au \
--cc=npiggin@gmail.com \
--cc=nvdimm@lists.linux.dev \
--cc=peterx@redhat.com \
--cc=tytso@mit.edu \
--cc=vishal.l.verma@intel.com \
--cc=will@kernel.org \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.