[PATCH] arch/x86: memory_hotplug, do not bump up max_pfn for device private pages

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Balbir Singh <balbirs@nvidia.com>
To: linux-kernel@vger.kernel.org, x86@kernel.org
Cc: "Balbir Singh" <balbirs@nvidia.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Ingo Molnar" <mingo@kernel.org>, "Kees Cook" <kees@kernel.org>,
	"Bjorn Helgaas" <bhelgaas@google.com>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Andy Lutomirski" <luto@kernel.org>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Bert Karwatzki" <spasswolf@web.de>,
	"Madhavan Srinivasan" <maddy@linux.ibm.com>,
	"Nicholas Piggin" <npiggin@gmail.com>
Subject: [PATCH] arch/x86: memory_hotplug, do not bump up max_pfn for device private pages
Date: Tue,  1 Apr 2025 11:07:52 +1100	[thread overview]
Message-ID: <20250401000752.249348-1-balbirs@nvidia.com> (raw)

Commit 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")
exposed a bug with nokaslr and zone device
interaction, as seen on a system with an AMD iGPU and dGPU (see [1]).
The root cause of the issue is that, the gpu driver registers a zone
device private memory region. When kaslr is disabled or the above commit
is applied, the direct_map_physmem_end is set to much higher than 10 TiB
typically to the 64TiB address. When zone device private memory is added
to the system via add_pages(), it bumps up the max_pfn to the same
value. This causes dma_addressing_limited() to return true, since the
device cannot address memory all the way up to max_pfn.

This caused a regression for games played on the iGPU, as it resulted in
the DMA32 zone being used for GPU allocations.

Fix this by not bumping up max_pfn on x86 systems, when pgmap is passed
into add_pages(). The presence of pgmap is used to determine if device
private memory is being added via add_pages().

More details:

devm_request_mem_region() and request_free_mem_region() request for
device private memory. iomem_resource is passed as the base resource
with start and end parameters. iomem_resource's end depends on several
factors, including the platform and virtualization. On x86 for example
on bare metal, this value is set to boot_cpu_data.x86_phys_bits.
boot_cpu_data.x86_phys_bits can change depending on support for MKTME.
By default it is set to the same as log2(direct_map_physmem_end) which
is 46 to 52 bits depending on the number of levels in the page table.
The allocation routines used iomem_resource's end and
direct_map_physmem_end to figure out where to allocate the region.

arch/powerpc is also impacted by this bug, this patch does not fix
the issue for powerpc.

Testing:
1. Tested on a virtual machine with test_hmm for zone device inseration
2. A previous version of this patch was tested by Bert, please see [2]

Link: https://lore.kernel.org/lkml/20250310112206.4168-1-spasswolf@web.de/ [1]
Link: https://lore.kernel.org/lkml/d87680bab997fdc9fb4e638983132af235d9a03a.camel@web.de/ [2]
Fixes: 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")

Cc: "Christian König" <christian.koenig@amd.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
I've left powerpc out of this regression change due to the time required
to setup and test via qemu. I wanted to address the regression quickly

 arch/x86/mm/init_64.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index dce60767124f..cc60b57473a4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -970,9 +970,18 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 	ret = __add_pages(nid, start_pfn, nr_pages, params);
 	WARN_ON_ONCE(ret);

-	/* update max_pfn, max_low_pfn and high_memory */
-	update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
-				  nr_pages << PAGE_SHIFT);
+	/*
+	 * add_pages() is called by memremap_pages() for adding device private
+	 * pages. Do not bump up max_pfn in the device private path. max_pfn
+	 * changes affect dma_addressing_limited. dma_addressing_limited
+	 * returning true when max_pfn is the device's addressable memory,
+	 * can force device drivers to use bounce buffers and impact their
+	 * performance
+	 */
+	if (!params->pgmap)
+		/* update max_pfn, max_low_pfn and high_memory */
+		update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
+					  nr_pages << PAGE_SHIFT);

 	return ret;
 }
-- 
2.48.1

next             reply	other threads:[~2025-04-01  0:08 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-01  0:07 Balbir Singh [this message]
2025-04-01  8:57 ` [PATCH] arch/x86: memory_hotplug, do not bump up max_pfn for device private pages Ingo Molnar
2025-04-01  9:24   ` Balbir Singh
2025-04-01  9:08 ` [tip: x86/urgent] x86/mm/init: Handle the special case of device private pages in add_pages(), to not increase max_pfn and trigger dma_addressing_limited() bounce buffers tip-bot2 for Balbir Singh
2025-12-01 22:11 ` [PATCH] arch/x86: memory_hotplug, do not bump up max_pfn for device private pages dan.j.williams
2025-12-01 23:01   ` Gregory Price
2025-12-02 22:40   ` Balbir Singh

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:dce60767124 dfblob:cc60b57473a )
 OR (
bs:"[PATCH] arch/x86: memory_hotplug, do not bump up max_pfn for device private pages" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250401000752.249348-1-balbirs@nvidia.com \
    --to=balbirs@nvidia.com \
    --cc=alexander.deucher@amd.com \
    --cc=bhelgaas@google.com \
    --cc=christian.koenig@amd.com \
    --cc=kees@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=maddy@linux.ibm.com \
    --cc=mingo@kernel.org \
    --cc=npiggin@gmail.com \
    --cc=peterz@infradead.org \
    --cc=spasswolf@web.de \
    --cc=torvalds@linux-foundation.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.