Re: [PATCH 09/12] mm: Update vm_normal_page() callers to accept FS DAX pages

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Alistair Popple <apopple@nvidia.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org, vishal.l.verma@intel.com,
	dave.jiang@intel.com, logang@deltatee.com, bhelgaas@google.com,
	jack@suse.cz, jgg@ziepe.ca, catalin.marinas@arm.com,
	will@kernel.org, mpe@ellerman.id.au, npiggin@gmail.com,
	dave.hansen@linux.intel.com, ira.weiny@intel.com,
	willy@infradead.org, djwong@kernel.org, tytso@mit.edu,
	linmiaohe@huawei.com, david@redhat.com, peterx@redhat.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linuxppc-dev@lists.ozlabs.org, nvdimm@lists.linux.dev,
	linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org,
	jhubbard@nvidia.com, hch@lst.de, david@fromorbit.com
Subject: Re: [PATCH 09/12] mm: Update vm_normal_page() callers to accept FS DAX pages
Date: Mon, 14 Oct 2024 18:16:15 +1100	[thread overview]
Message-ID: <87y12rm8id.fsf@nvdebian.thelocal> (raw)
In-Reply-To: <66f65ba5be661_964f229415@dwillia2-xfh.jf.intel.com.notmuch>


Dan Williams <dan.j.williams@intel.com> writes:

> Alistair Popple wrote:
>> Currently if a PTE points to a FS DAX page vm_normal_page() will
>> return NULL as these have their own special refcounting scheme. A
>> future change will allow FS DAX pages to be refcounted the same as any
>> other normal page.
>> 
>> Therefore vm_normal_page() will start returning FS DAX pages. To avoid
>> any change in behaviour callers that don't expect FS DAX pages will
>> need to explicitly check for this. As vm_normal_page() can already
>> return ZONE_DEVICE pages most callers already include a check for any
>> ZONE_DEVICE page.
>> 
>> However some callers don't, so add explicit checks where required.
>
> I would expect justification for each of these conversions, and
> hopefully with fsdax returning fully formed folios there is less need to
> sprinkle these checks around.
>
> At a minimum I think this patch needs to be broken up by file touched.

Good idea.

>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> ---
>>  arch/x86/mm/pat/memtype.c |  4 +++-
>>  fs/proc/task_mmu.c        | 16 ++++++++++++----
>>  mm/memcontrol-v1.c        |  2 +-
>>  3 files changed, 16 insertions(+), 6 deletions(-)
>> 
>> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
>> index 1fa0bf6..eb84593 100644
>> --- a/arch/x86/mm/pat/memtype.c
>> +++ b/arch/x86/mm/pat/memtype.c
>> @@ -951,6 +951,7 @@ static void free_pfn_range(u64 paddr, unsigned long size)
>>  static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
>>  		resource_size_t *phys)
>>  {
>> +	struct folio *folio;
>>  	pte_t *ptep, pte;
>>  	spinlock_t *ptl;
>>  
>> @@ -960,7 +961,8 @@ static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
>>  	pte = ptep_get(ptep);
>>  
>>  	/* Never return PFNs of anon folios in COW mappings. */
>> -	if (vm_normal_folio(vma, vma->vm_start, pte)) {
>> +	folio = vm_normal_folio(vma, vma->vm_start, pte);
>> +	if (folio || (folio && !folio_is_device_dax(folio))) {
>
> ...for example, I do not immediately see why follow_phys() would need to
> be careful with fsdax pages?

The intent was to maintain the original behaviour as much as
possible, partly to reduce the chance of unintended bugs/consequences
and partly to maintain my sanity by not having to dig too deeply into
all the callers.

I see I got this a little bit wrong though - it only filters FSDAX pages
and not device DAX (my intent was to filter both).

> ...but I do see why copy_page_range() (which calls follow_phys() through
> track_pfn_copy()) might care. It just turns out that vma_needs_copy(),
> afaics, bypasses dax MAP_SHARED mappings.
>
> So this touch of memtype.c looks like it can be dropped.

Ok. Although it feels safer to leave it (along with a check for device
DAX). Someone can always remove it in future if they really do want DAX
pages but this is all x86 specific so will take your guidance here.

>>  		pte_unmap_unlock(ptep, ptl);
>>  		return -EINVAL;
>>  	}
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 5f171ad..456b010 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -816,6 +816,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>>  
>>  	if (pte_present(ptent)) {
>>  		page = vm_normal_page(vma, addr, ptent);
>> +		if (page && is_device_dax_page(page))
>> +			page = NULL;
>>  		young = pte_young(ptent);
>>  		dirty = pte_dirty(ptent);
>>  		present = true;
>> @@ -864,6 +866,8 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
>>  
>>  	if (pmd_present(*pmd)) {
>>  		page = vm_normal_page_pmd(vma, addr, *pmd);
>> +		if (page && is_device_dax_page(page))
>> +			page = NULL;
>>  		present = true;
>>  	} else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
>>  		swp_entry_t entry = pmd_to_swp_entry(*pmd);
>
> The above can be replaced with a catch like
>
>    if (folio_test_device(folio))
> 	return;
>
> ...in smaps_account() since ZONE_DEVICE pages are not suitable to
> account as they do not reflect any memory pressure on the system memory
> pool.

Sounds good.

>> @@ -1385,7 +1389,7 @@ static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr,
>>  	if (likely(!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags)))
>>  		return false;
>>  	folio = vm_normal_folio(vma, addr, pte);
>> -	if (!folio)
>> +	if (!folio || folio_is_device_dax(folio))
>>  		return false;
>>  	return folio_maybe_dma_pinned(folio);
>
> The whole point of ZONE_DEVICE is to account for DMA so I see no reason
> for pte_is_pinned() to special case dax. The caller of pte_is_pinned()
> is doing it for soft_dirty reasons, and I believe soft_dirty is already
> disabled for vma_is_dax(). I assume MEMORY_DEVICE_PRIVATE also does not
> support soft-dirty, so I expect all ZONE_DEVICE already opt-out of this.

Actually soft-dirty is theoretically supported on DEVICE_PRIVATE pages
in the sense that the soft-dirty bits are copied around. Whether or not
it actually works is a different question though, I've certainly never
tried it.

Again, was just trying to maintain previous behaviour but can drop this
check.

>>  }
>> @@ -1710,6 +1714,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>>  			frame = pte_pfn(pte);
>>  		flags |= PM_PRESENT;
>>  		page = vm_normal_page(vma, addr, pte);
>> +		if (page && is_device_dax_page(page))
>> +			page = NULL;
>>  		if (pte_soft_dirty(pte))
>>  			flags |= PM_SOFT_DIRTY;
>>  		if (pte_uffd_wp(pte))
>> @@ -2096,7 +2102,8 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
>>  
>>  		if (p->masks_of_interest & PAGE_IS_FILE) {
>>  			page = vm_normal_page(vma, addr, pte);
>> -			if (page && !PageAnon(page))
>> +			if (page && !PageAnon(page) &&
>> +			    !is_device_dax_page(page))
>>  				categories |= PAGE_IS_FILE;
>>  		}
>>  
>> @@ -2158,7 +2165,8 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
>>  
>>  		if (p->masks_of_interest & PAGE_IS_FILE) {
>>  			page = vm_normal_page_pmd(vma, addr, pmd);
>> -			if (page && !PageAnon(page))
>> +			if (page && !PageAnon(page) &&
>> +			    !is_device_dax_page(page))
>>  				categories |= PAGE_IS_FILE;
>>  		}
>>  
>> @@ -2919,7 +2927,7 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
>>  		return NULL;
>>  
>>  	page = vm_normal_page_pmd(vma, addr, pmd);
>> -	if (!page)
>> +	if (!page || is_device_dax_page(page))
>>  		return NULL;
>
> I am not immediately seeing a reason to block pagemap_read() from
> interrogating dax-backed virtual mappings. I think these protections can
> be dropped.

Ok.

>>  
>>  	if (PageReserved(page))
>> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
>> index b37c0d8..e16053c 100644
>> --- a/mm/memcontrol-v1.c
>> +++ b/mm/memcontrol-v1.c
>> @@ -667,7 +667,7 @@ static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
>>  {
>>  	struct page *page = vm_normal_page(vma, addr, ptent);
>>  
>> -	if (!page)
>> +	if (!page || is_device_dax_page(page))
>>  		return NULL;
>>  	if (PageAnon(page)) {
>>  		if (!(mc.flags & MOVE_ANON))
>
> I think this better handled with something like this to disable all
> memcg accounting for ZONE_DEVICE pages:

Ok, thanks for the review.

> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index b37c0d870816..cfc43e8c59fe 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -940,8 +940,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
>                  */
>                 if (folio_memcg(folio) == mc.from) {
>                         ret = MC_TARGET_PAGE;
> -                       if (folio_is_device_private(folio) ||
> -                           folio_is_device_coherent(folio))
> +                       if (folio_is_device(folio))
>                                 ret = MC_TARGET_DEVICE;
>                         if (target)
>                                 target->folio = folio;

next prev parent reply	other threads:[~2024-10-14  7:46 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-10  4:14 [PATCH 00/12] fs/dax: Fix FS DAX page reference counts Alistair Popple
2024-09-10  4:14 ` [PATCH 01/12] mm/gup.c: Remove redundant check for PCI P2PDMA page Alistair Popple
2024-09-22  1:00   ` Dan Williams
2024-09-10  4:14 ` [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one Alistair Popple
2024-09-10 13:47   ` Bjorn Helgaas
2024-09-11  1:07     ` Alistair Popple
2024-09-11 13:51       ` Bjorn Helgaas
2024-09-11  0:48   ` Logan Gunthorpe
2024-10-11  0:20     ` Alistair Popple
2024-09-22  1:00   ` Dan Williams
2024-10-11  0:17     ` Alistair Popple
2024-09-10  4:14 ` [PATCH 03/12] fs/dax: Refactor wait for dax idle page Alistair Popple
2024-09-22  1:01   ` Dan Williams
2024-09-10  4:14 ` [PATCH 04/12] mm: Allow compound zone device pages Alistair Popple
2024-09-10  4:47   ` Matthew Wilcox
2024-09-10  6:57     ` Alistair Popple
2024-09-10 13:41       ` Matthew Wilcox
2024-09-12 12:44   ` kernel test robot
2024-09-12 12:44   ` kernel test robot
2024-09-22  1:01   ` Dan Williams
2024-09-10  4:14 ` [PATCH 05/12] mm/memory: Add dax_insert_pfn Alistair Popple
2024-09-22  1:41   ` Dan Williams
2024-10-01 10:43     ` Gerald Schaefer
2024-09-10  4:14 ` [PATCH 06/12] huge_memory: Allow mappings of PUD sized pages Alistair Popple
2024-09-22  2:07   ` Dan Williams
2024-10-14  6:33     ` Alistair Popple
2024-09-10  4:14 ` [PATCH 07/12] huge_memory: Allow mappings of PMD " Alistair Popple
2024-09-27  2:48   ` Dan Williams
2024-10-14  6:53     ` Alistair Popple
2024-10-23 23:14       ` Alistair Popple
2024-10-23 23:38         ` Dan Williams
2024-09-10  4:14 ` [PATCH 08/12] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
2024-09-25  0:17   ` Dan Williams
2024-09-27  2:52     ` Dan Williams
2024-10-14  7:03       ` Alistair Popple
2024-09-10  4:14 ` [PATCH 09/12] mm: Update vm_normal_page() callers to accept " Alistair Popple
2024-09-27  7:15   ` Dan Williams
2024-10-14  7:16     ` Alistair Popple [this message]
2024-09-10  4:14 ` [PATCH 10/12] fs/dax: Properly refcount fs dax pages Alistair Popple
2024-09-27  7:59   ` Dan Williams
2024-10-24  7:52     ` Alistair Popple
2024-10-24 23:52       ` Dan Williams
2024-10-25  2:46         ` Alistair Popple
2024-10-25  4:35           ` Dan Williams
2024-10-28  4:24             ` Alistair Popple
2024-10-29  2:03               ` Dan Williams
2024-10-30  5:57                 ` Alistair Popple
2024-09-10  4:14 ` [PATCH 11/12] mm: Remove pXX_devmap callers Alistair Popple
2024-09-27 12:29   ` Alexander Gordeev
2024-10-14  7:14     ` Alistair Popple
2024-09-10  4:14 ` [PATCH 12/12] mm: Remove devmap related functions and page table bits Alistair Popple
2024-09-11  7:47   ` Chunyan Zhang
2024-09-12 12:55   ` kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87y12rm8id.fsf@nvdebian.thelocal \
    --to=apopple@nvidia.com \
    --cc=bhelgaas@google.com \
    --cc=catalin.marinas@arm.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=dave.jiang@intel.com \
    --cc=david@fromorbit.com \
    --cc=david@redhat.com \
    --cc=djwong@kernel.org \
    --cc=hch@lst.de \
    --cc=ira.weiny@intel.com \
    --cc=jack@suse.cz \
    --cc=jgg@ziepe.ca \
    --cc=jhubbard@nvidia.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=logang@deltatee.com \
    --cc=mpe@ellerman.id.au \
    --cc=npiggin@gmail.com \
    --cc=nvdimm@lists.linux.dev \
    --cc=peterx@redhat.com \
    --cc=tytso@mit.edu \
    --cc=vishal.l.verma@intel.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.