Re: [PATCH] arch/x86: memory_hotplug, do not bump up max_pfn for device private pages

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Balbir Singh <balbirs@nvidia.com>
To: dan.j.williams@intel.com, linux-kernel@vger.kernel.org, x86@kernel.org
Cc: "Christian König" <christian.koenig@amd.com>,
	"Ingo Molnar" <mingo@kernel.org>, "Kees Cook" <kees@kernel.org>,
	"Bjorn Helgaas" <bhelgaas@google.com>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Andy Lutomirski" <luto@kernel.org>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Bert Karwatzki" <spasswolf@web.de>,
	"Madhavan Srinivasan" <maddy@linux.ibm.com>,
	"Nicholas Piggin" <npiggin@gmail.com>,
	gourry@gourry.net, linux-mm@kvack.org
Subject: Re: [PATCH] arch/x86: memory_hotplug, do not bump up max_pfn for device private pages
Date: Wed, 3 Dec 2025 09:40:11 +1100	[thread overview]
Message-ID: <daf462c5-83c2-4e85-ae7b-490d5aeb42ff@nvidia.com> (raw)
In-Reply-To: <692e1297e3b6_261c110064@dwillia2-mobl4.notmuch>

On 12/2/25 09:11, dan.j.williams@intel.com wrote:
> [ add Gregory and linux-mm ]
> 
> [ full context for new Cc: ]
> Balbir Singh wrote:
>> Commit 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")
>> exposed a bug with nokaslr and zone device
>> interaction, as seen on a system with an AMD iGPU and dGPU (see [1]).
>> The root cause of the issue is that, the gpu driver registers a zone
>> device private memory region. When kaslr is disabled or the above commit
>> is applied, the direct_map_physmem_end is set to much higher than 10 TiB
>> typically to the 64TiB address. When zone device private memory is added
>> to the system via add_pages(), it bumps up the max_pfn to the same
>> value. This causes dma_addressing_limited() to return true, since the
>> device cannot address memory all the way up to max_pfn.
>>
>> This caused a regression for games played on the iGPU, as it resulted in
>> the DMA32 zone being used for GPU allocations.
>>
>> Fix this by not bumping up max_pfn on x86 systems, when pgmap is passed
>> into add_pages(). The presence of pgmap is used to determine if device
>> private memory is being added via add_pages().
>>
>> More details:
>>
>> devm_request_mem_region() and request_free_mem_region() request for
>> device private memory. iomem_resource is passed as the base resource
>> with start and end parameters. iomem_resource's end depends on several
>> factors, including the platform and virtualization. On x86 for example
>> on bare metal, this value is set to boot_cpu_data.x86_phys_bits.
>> boot_cpu_data.x86_phys_bits can change depending on support for MKTME.
>> By default it is set to the same as log2(direct_map_physmem_end) which
>> is 46 to 52 bits depending on the number of levels in the page table.
>> The allocation routines used iomem_resource's end and
>> direct_map_physmem_end to figure out where to allocate the region.
>>
>> arch/powerpc is also impacted by this bug, this patch does not fix
>> the issue for powerpc.
>>
>> Testing:
>> 1. Tested on a virtual machine with test_hmm for zone device inseration
>> 2. A previous version of this patch was tested by Bert, please see [2]
>>
>> Link: https://lore.kernel.org/lkml/20250310112206.4168-1-spasswolf@web.de/ [1]
>> Link: https://lore.kernel.org/lkml/d87680bab997fdc9fb4e638983132af235d9a03a.camel@web.de/ [2]
>> Fixes: 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")
>>
>> Cc: "Christian König" <christian.koenig@amd.com>
>> Cc: Ingo Molnar <mingo@kernel.org>
>> Cc: Kees Cook <kees@kernel.org>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Alex Deucher <alexander.deucher@amd.com>
>> Cc: Bert Karwatzki <spasswolf@web.de>
>> Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
>> Cc: Nicholas Piggin <npiggin@gmail.com>
>>
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>> I've left powerpc out of this regression change due to the time required
>> to setup and test via qemu. I wanted to address the regression quickly
>>
>>
>>  arch/x86/mm/init_64.c | 15 ++++++++++++---
>>  1 file changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index dce60767124f..cc60b57473a4 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -970,9 +970,18 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
>>  	ret = __add_pages(nid, start_pfn, nr_pages, params);
>>  	WARN_ON_ONCE(ret);
>>  
>> -	/* update max_pfn, max_low_pfn and high_memory */
>> -	update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
>> -				  nr_pages << PAGE_SHIFT);
>> +	/*
>> +	 * add_pages() is called by memremap_pages() for adding device private
>> +	 * pages. Do not bump up max_pfn in the device private path. max_pfn
>> +	 * changes affect dma_addressing_limited. dma_addressing_limited
>> +	 * returning true when max_pfn is the device's addressable memory,
>> +	 * can force device drivers to use bounce buffers and impact their
>> +	 * performance
>> +	 */
>> +	if (!params->pgmap)
>> +		/* update max_pfn, max_low_pfn and high_memory */
>> +		update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
>> +					  nr_pages << PAGE_SHIFT);
> 
> The comment says that this adjustment is only for the device-private
> case, but it applies to all driver-managed device memory.
> 
> Why not actually do what the comment says and limit this to
> DEVICE_PRIVATE? I.e.:
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 0e4270e20fad..4cc8175f9ffd 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -989,7 +989,7 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
>  	 * addressable memory can force device drivers to use bounce buffers
>  	 * and impact their performance negatively:
>  	 */
> -	if (!params->pgmap)
> +	if (!params->pgmap || params->pgmap->type != MEMORY_DEVICE_PRIVATE)
>  		/* update max_pfn, max_low_pfn and high_memory */
>  		update_end_of_memory_vars(start_pfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT);
>  

At that time when I audited the code, I did notice that max_pfn was already set to the upper
end of physical memory (even if it's not hotunplugged, because those regions were parsed and
added to memblock_add() and I noticed that all hotplug regions changing max_pfn were coming
from the device private).

I agree we should check for pgmap->type, so definitely the right fix

Reviewed-by: Balbir Singh <balbirs@nvidia.com>

     prev parent reply	other threads:[~2025-12-02 22:40 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-01  0:07 [PATCH] arch/x86: memory_hotplug, do not bump up max_pfn for device private pages Balbir Singh
2025-04-01  8:57 ` Ingo Molnar
2025-04-01  9:24   ` Balbir Singh
2025-04-01  9:08 ` [tip: x86/urgent] x86/mm/init: Handle the special case of device private pages in add_pages(), to not increase max_pfn and trigger dma_addressing_limited() bounce buffers tip-bot2 for Balbir Singh
2025-12-01 22:11 ` [PATCH] arch/x86: memory_hotplug, do not bump up max_pfn for device private pages dan.j.williams
2025-12-01 23:01   ` Gregory Price
2025-12-02 22:40   ` Balbir Singh [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=daf462c5-83c2-4e85-ae7b-490d5aeb42ff@nvidia.com \
    --to=balbirs@nvidia.com \
    --cc=alexander.deucher@amd.com \
    --cc=bhelgaas@google.com \
    --cc=christian.koenig@amd.com \
    --cc=dan.j.williams@intel.com \
    --cc=gourry@gourry.net \
    --cc=kees@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=maddy@linux.ibm.com \
    --cc=mingo@kernel.org \
    --cc=npiggin@gmail.com \
    --cc=peterz@infradead.org \
    --cc=spasswolf@web.de \
    --cc=torvalds@linux-foundation.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.