* Re: [PATCH] PCI/AER: Clear non-fatal errors on AER recovery failure
From: Yury M. @ 2026-05-18 20:49 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: bhelgaas, mahesh, oohall, corbet, skhan, linux-pci, linux-doc,
linux-kernel, linuxppc-dev, Lukas Wunner
In-Reply-To: <20260518202903.GA641158@bhelgaas>
Current behavior has existed for a long time and I could easily imagine
that there is software which relies on the fact that the system is in a
non-modified state if AER recovery failed. The software can analyze the
system and do cleanup afterwards. Sometimes, if something fails in the
system, it is better to have it in a non-modified state.
In short, I just wanted to preserve the current logic by default because
there is a chance that we have software which relies on the current
behavior.
On 5/18/26 21:29, Bjorn Helgaas wrote:
> [+cc Lukas]
>
> On Mon, May 18, 2026 at 02:23:36PM +0100, Yury Murashka wrote:
>> pci_aer_clear_nonfatal_status() is not called when AER recovery fails.
>> If a new AER error is subsequently reported, the AER driver calls
>> find_source_device() to find the source of the error. It rescans the
>> whole bus and picks the first device reporting an AER error. Because the
>> previous error was never cleared, the error is attributed to the wrong
>> device and AER recovery is started for the wrong device.
>>
>> Add a kernel boot parameter pci=aer_clear_on_recovery_failure to clear
>> AER error status even when recovery fails, preventing stale errors from
>> causing incorrect device identification on subsequent AER events.
> Why should we add a kernel parameter for this? How would a user
> decide whether to use the parameter? Are there cases where we
> find the source of the first error, but we *wouldn't* want to clear
> it if recovery fails?
^ permalink raw reply
* Re: [PATCH] nios2: remove the architecture
From: Wolfram Sang @ 2026-05-18 20:46 UTC (permalink / raw)
To: Simon Schuster
Cc: Peter Zijlstra, Arnd Bergmann, Ethan Nelson-Moore, Dinh Nguyen,
linux-doc, devicetree, workflows, Linux-Arch, dmaengine,
linux-i2c, linux-iio, Netdev, linux-pci, linux-pwm,
linux-hardening, linux-kbuild, linux-csky@vger.kernel.org,
Jonathan Corbet, Shuah Khan, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Daniel Lezcano, Thomas Gleixner, Alex Shi,
Yanteng Si, Dongliang Mu, Hu Haowen, Kees Cook, Oleg Nesterov,
Will Deacon, Aneesh Kumar K.V (Arm), Andrew Morton,
Nicholas Piggin, Vinod Koul, Frank Li, Dave Penkler, Andi Shyti,
Jonathan Cameron, David Lechner, Nuno Sá, Andy Shevchenko,
Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Lorenzo Pieralisi, Krzysztof WilczyDski,
Andreas Oetken
In-Reply-To: <20260518172444.zyd47mcagrcwu7wt@dev-vm-schuster>
Hi Simon,
> downstream. In other respects, we try to be good citizens and contribute
> bugfixes as well as required cleanups (such as implementing clone3 [2]
> and fixing its flag behaviour on 32-bit architectures) as they come up.
Well, this is definitely 'good citizen'...
> If desired, we also would be happy to intensify our support regarding
> reviews or testing to share the maintnance burden if it helps to keep
> nios2 in mainline a bit longer.
... but given this, you might want to get added in MAINTAINERS as
reviewer (or even maintainer) for nios2? Besides that your efforts are
already worth it in my book, it would also ensure you get CCed on
patches like this. Then, you are not depending on people like Arnd
putting you in the loop manually.
Happy hacking,
Wolfram
^ permalink raw reply
* Re: [PATCH v6 1/2] usb: xhci-pci: add AMD Promontory 21 PCI glue
From: Michal Pecio @ 2026-05-18 20:34 UTC (permalink / raw)
To: Guenter Roeck
Cc: Jihong Min, Greg Kroah-Hartman, Mathias Nyman, Jonathan Corbet,
Shuah Khan, Mario Limonciello, Basavaraj Natikar, linux-usb,
linux-hwmon, linux-doc, linux-pci, linux-kernel,
Mario Limonciello (AMD), Yaroslav Isakov
In-Reply-To: <f05e075d-a87e-49b5-95f8-5858d21acf64@roeck-us.net>
On Mon, 18 May 2026 03:55:52 -0700, Guenter Roeck wrote:
> On 5/17/26 14:21, Michal Pecio wrote:
> > Instead of the X86 heuristic, would it be possible to build glue
> > code if and only if SENSORS_PROM21_XHCI is enabled?
> >
> > This seems to work:
> >
> > config SENSORS_PROM21_XHCI
> > tristate "AMD Promontory 21 xHCI temperature sensor"
> > - depends on USB_XHCI_PCI_PROM21
> > + depends on USB_XHCI_PCI
> >
> > config USB_XHCI_PCI_PROM21
> > tristate
> > - depends on X86
> > depends on USB_XHCI_PCI
> > - default USB_XHCI_PCI
> > + default USB_XHCI_PCI if SENSORS_PROM21_XHCI != 'n'
> > select AUXILIARY_BUS
> >
> > I don't know if it's the best way, perhaps it would be preferable for
> > the hwmon driver to select the glue, but then I'm not sure how to force
> > glue to become 'y' when xhci-pci is 'y'.
> >
>
> Unless I am missing something, that would disable the entire controller
> if the hwmon device is not enabled. That seems a bit draconian to me.
I haven't tested (I don't have this chipset), but it should work like
the similar xhci-pci-renesas module, which I'm familiar with.
When the special unicorn module is disabled by Kconfig, xhci-pci no
longer rejects its devices and works with them normally, like it always
did before the unicorn module even existed.
It should be the same with xhci-pci-prom21. You don't need to enable
this module to use USB, only for the special functions. So if hwmon
is disabled then you can disable it too.
I always found this dual-driver solution (for Renesas) rather ugly and
confusing, but so far it's the least bad option tried. Hmm, maybe the
next iteration should be an aux bus interface for FW loaders...
Regards,
Michal
^ permalink raw reply
* Re: [PATCH v6 1/2] usb: xhci-pci: add AMD Promontory 21 PCI glue
From: Jihong Min @ 2026-05-18 20:30 UTC (permalink / raw)
To: Michal Pecio
Cc: Greg Kroah-Hartman, Mathias Nyman, Guenter Roeck, Jonathan Corbet,
Shuah Khan, Mario Limonciello, Basavaraj Natikar, linux-usb,
linux-hwmon, linux-doc, linux-pci, linux-kernel,
Mario Limonciello (AMD), Yaroslav Isakov
In-Reply-To: <20260517232147.34931718.michal.pecio@gmail.com>
On 5/18/26 06:21, Michal Pecio wrote:
> Instead of the X86 heuristic, would it be possible to build glue
> code if and only if SENSORS_PROM21_XHCI is enabled?
>
> This seems to work:
>
> config SENSORS_PROM21_XHCI
> tristate "AMD Promontory 21 xHCI temperature sensor"
> - depends on USB_XHCI_PCI_PROM21
> + depends on USB_XHCI_PCI
>
> config USB_XHCI_PCI_PROM21
> tristate
> - depends on X86
> depends on USB_XHCI_PCI
> - default USB_XHCI_PCI
> + default USB_XHCI_PCI if SENSORS_PROM21_XHCI != 'n'
> select AUXILIARY_BUS
>
> I don't know if it's the best way, perhaps it would be preferable for
> the hwmon driver to select the glue, but then I'm not sure how to force
> glue to become 'y' when xhci-pci is 'y'.
I think I should keep the current hidden glue option for now.
The PROM21 PCI glue is part of the PCI binding path for the xHCI controller
when enabled, while SENSORS_PROM21_XHCI is only the optional user-visible
hwmon driver. Tying the glue to the hwmon option would make the sensor
option
affect which driver binds the USB controller. As Guenter pointed out, that
would be too strong; the USB controller should not depend on whether the
optional hwmon driver is enabled.
So I would prefer to keep USB_XHCI_PCI_PROM21 as hidden plumbing that
follows
USB_XHCI_PCI, and keep SENSORS_PROM21_XHCI as the user-visible sensor
option.
> +static int prom21_xhci_create_auxdev(struct pci_dev *pdev)
> +{
> + struct prom21_xhci_auxdev *prom21_auxdev;
> + struct usb_hcd *hcd = pci_get_drvdata(pdev);
> +
> + if (!hcd)
> + return -ENODEV;
>
> Shouldn't be necessary after successful xhci_pci_common_probe().
Agreed. I removed the unnecessary NULL check from
prom21_xhci_create_auxdev() locally for v7.
> + prom21_auxdev->id = ida_alloc(&prom21_xhci_auxdev_ida, GFP_KERNEL);
> + if (prom21_auxdev->id < 0) {
> + int ret = prom21_auxdev->id;
> +
> + devres_free(prom21_auxdev);
> + return ret;
> + }
> +
> + prom21_auxdev->auxdev = auxiliary_device_create(&pdev->dev,
> + KBUILD_MODNAME, "hwmon",
> + &prom21_auxdev->pdata,
> + prom21_auxdev->id);
> + if (!prom21_auxdev->auxdev) {
> + ida_free(&prom21_xhci_auxdev_ida, prom21_auxdev->id);
> + devres_free(prom21_auxdev);
> + return -ENOMEM;
>
> The usual "goto error" pattern could be used instead of increasingly
> long sequences of xxx_free() calls.
Agreed. I changed prom21_xhci_create_auxdev() to use a goto-based
cleanup path
locally for v7.
> It seems that these three functions above are everything that you truly
> want to add; the rest is boilerplate required by this two-module scheme
> to work, plus ID tables which must be duplicated and kept in sync.
>
> I wonder if a separate module is really justified, as opposed to simply
> linking this file into xhci_pci.ko when directed by Kconfig.
>
> The downside would be slightly higher memory usage on systems where the
> hwmon driver is enabled but not needed. OTOH, same systems would likely
> see reduced disk waste.
I understand the concern. Linking the PROM21 auxiliary-device publisher
into xhci_pci.ko would reduce some boilerplate and avoid the extra PCI
driver, while still keeping the hwmon driver itself separate.
The reason I kept the current split is that the earlier review direction
was to keep the hwmon functionality out of xhci-pci and bind a
drivers/hwmon driver through an auxiliary device. The current PROM21 PCI
glue keeps the PROM21-specific auxiliary-device lifetime handling outside
the common xhci-pci driver and leaves xhci-pci.c with only the PCI ID
handoff, similar in spirit to the Renesas handoff path.
That said, I agree this is a tradeoff. If Mathias or the USB maintainers
prefer the PROM21 auxiliary-device publisher to be linked into xhci_pci.ko
instead of being a separate PCI glue driver, I can rework it in that
direction while still keeping the hwmon driver under drivers/hwmon and
bound through the auxiliary bus.
Sincerely,
Jihong Min
^ permalink raw reply
* Re: [PATCH] PCI/AER: Clear non-fatal errors on AER recovery failure
From: Bjorn Helgaas @ 2026-05-18 20:29 UTC (permalink / raw)
To: Yury Murashka
Cc: bhelgaas, mahesh, oohall, corbet, skhan, linux-pci, linux-doc,
linux-kernel, linuxppc-dev, Lukas Wunner
In-Reply-To: <CAPzpGcRCTCZtaX1EVaJNZ103THZKsoszZduY7=gwfYdcrMo-SQ@mail.gmail.com>
[+cc Lukas]
On Mon, May 18, 2026 at 02:23:36PM +0100, Yury Murashka wrote:
> pci_aer_clear_nonfatal_status() is not called when AER recovery fails.
> If a new AER error is subsequently reported, the AER driver calls
> find_source_device() to find the source of the error. It rescans the
> whole bus and picks the first device reporting an AER error. Because the
> previous error was never cleared, the error is attributed to the wrong
> device and AER recovery is started for the wrong device.
>
> Add a kernel boot parameter pci=aer_clear_on_recovery_failure to clear
> AER error status even when recovery fails, preventing stale errors from
> causing incorrect device identification on subsequent AER events.
Why should we add a kernel parameter for this? How would a user
decide whether to use the parameter? Are there cases where we
find the source of the first error, but we *wouldn't* want to clear
it if recovery fails?
^ permalink raw reply
* Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Lorenzo Stoakes @ 2026-05-18 19:32 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Wei Yang, Lance Yang, npache, linux-doc, linux-kernel, linux-mm,
linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
jannh, jglisse, joshua.hahnjy, kas, liam, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <9b33339e-157a-45b7-942e-3be3418a5142@kernel.org>
On Mon, May 18, 2026 at 03:16:11PM +0200, David Hildenbrand (Arm) wrote:
> On 5/14/26 05:10, Wei Yang wrote:
> > On Tue, May 12, 2026 at 03:42:02PM +0800, Lance Yang wrote:
> >>
> >> On Mon, May 11, 2026 at 12:58:04PM -0600, Nico Pache wrote:
> >>> generalize the order of the __collapse_huge_page_* and collapse_max_*
> >>> functions to support future mTHP collapse.
> >>>
> >>> The current mechanism for determining collapse with the
> >>> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
> >>> raises a key design issue: if we support user defined max_pte_none values
> >>> (even those scaled by order), a collapse of a lower order can introduces
> >>> an feedback loop, or "creep", when max_ptes_none is set to a value greater
> >>> than HPAGE_PMD_NR / 2. [1]
> >>>
> >>> With this configuration, a successful collapse to order N will populate
> >>> enough pages to satisfy the collapse condition on order N+1 on the next
> >>> scan. This leads to unnecessary work and memory churn.
> >>>
> >>> To fix this issue introduce a helper function that will limit mTHP
> >>> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> >>> This effectively supports two modes: [2]
> >>>
> >>> - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
> >>> that maps the shared zeropage. Consequently, no memory bloat.
> >>> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
> >>> available mTHP order.
> >>>
> >>> This removes the possiblilty of "creep", while not modifying any uAPI
> >>> expectations. A warning will be emitted if any non-supported
> >>> max_ptes_none value is configured with mTHP enabled.
> >>>
> >>> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> >>> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> >>> shared or swapped entry.
> >>>
> >>> No functional changes in this patch; however it defines future behavior
> >>> for mTHP collapse.
> >>>
> >>> [1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
> >>> [2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
> >>>
> >>> Co-developed-by: Dev Jain <dev.jain@arm.com>
> >>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >>> Signed-off-by: Nico Pache <npache@redhat.com>
> >>> ---
> >>> include/trace/events/huge_memory.h | 3 +-
> >>> mm/khugepaged.c | 117 ++++++++++++++++++++---------
> >>> 2 files changed, 85 insertions(+), 35 deletions(-)
> >>>
> >>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> >>> index bcdc57eea270..443e0bd13fdb 100644
> >>> --- a/include/trace/events/huge_memory.h
> >>> +++ b/include/trace/events/huge_memory.h
> >>> @@ -39,7 +39,8 @@
> >>> EM( SCAN_STORE_FAILED, "store_failed") \
> >>> EM( SCAN_COPY_MC, "copy_poisoned_page") \
> >>> EM( SCAN_PAGE_FILLED, "page_filled") \
> >>> - EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
> >>> + EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback") \
> >>> + EMe(SCAN_INVALID_PTES_NONE, "invalid_ptes_none")
> >>>
> >>> #undef EM
> >>> #undef EMe
> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index f68853b3caa7..27465161fa6d 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -61,6 +61,7 @@ enum scan_result {
> >>> SCAN_COPY_MC,
> >>> SCAN_PAGE_FILLED,
> >>> SCAN_PAGE_DIRTY_OR_WRITEBACK,
> >>> + SCAN_INVALID_PTES_NONE,
> >>> };
> >>>
> >>> #define CREATE_TRACE_POINTS
> >>> @@ -353,37 +354,60 @@ static bool pte_none_or_zero(pte_t pte)
> >>> * PTEs for the given collapse operation.
> >>> * @cc: The collapse control struct
> >>> * @vma: The vma to check for userfaultfd
> >>> + * @order: The folio order being collapsed to
> >>> *
> >>> * Return: Maximum number of none-page or zero-page PTEs allowed for the
> >>> * collapse operation.
> >>> */
> >>> -static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> >>> - struct vm_area_struct *vma)
> >>> +static int collapse_max_ptes_none(struct collapse_control *cc,
> >>> + struct vm_area_struct *vma, unsigned int order)
> >>> {
> >>> + unsigned int max_ptes_none = khugepaged_max_ptes_none;
> >>> // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
> >>
> >> One thing I still want to call out: kernel code usually uses C-style
> >> comments :)
> >>
> >>> if (vma && userfaultfd_armed(vma))
> >>> return 0;
> >>> // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> >>> if (!cc->is_khugepaged)
> >>> return HPAGE_PMD_NR;
> >>> - // For all other cases repect the user defined maximum.
> >>> - return khugepaged_max_ptes_none;
> >>> + // for PMD collapse, respect the user defined maximum.
> >>> + if (is_pmd_order(order))
> >>> + return max_ptes_none;
> >>> + /* Zero/non-present collapse disabled. */
> >>> + if (!max_ptes_none)
> >>> + return 0;
> >>> + // for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
> >>> + // scale the maximum number of PTEs to the order of the collapse.
> >>> + if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
> >>> + return (1 << order) - 1;
> >>> +
> >>> + // We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
> >>> + // Emit a warning and return -EINVAL.
> >>> + pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
> >>> + KHUGEPAGED_MAX_PTES_LIMIT);
> >>
> >> Maybe fallback to 0 instead, as David suggested earlier?
> >>
> >
> > It looks reasonable to fallback to 0.
> >
> > But as the updated Document says in patch 14:
> >
> > For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. Any other
> > value will emit a warning and no mTHP collapse will be attempted.
> >
> > This is why it does like this now.
> >
> > mthp_collapse()
> > max_ptes_none = collapse_max_ptes_none();
> > if (max_ptes_none < 0)
> > return collapsed;
> >
> >> max_ptes_none is mostly legacy PMD THP behavior. mTHP is new, and any
> >> intermediate value in (0, KHUGEPAGED_MAX_PTES_LIMIT) would implicitly
> >> disable it :(
> >>
> >
> > So it depends on what we want to do here :-)
> >
> > For me, I would vote for fallback to 0.
>
> At this point I'll prefer to not return errors from collapse_max_ptes_none().
> It's just rather awkward to return an error deep down in collapse code for a
> configuration problem.
>
> For mthp collapse, we only support max_ptes_none==0 and
> max_ptes_none=="HPAGE_PMD_NR - 1" (default).
>
> If another value is specified while collapsing mTHP, print a warning and treat
> it as 0 (save value, no creep, no memory waste).
>
> In a sense, this is similar to how we handle max_ptes_shared + max_ptes_swap:
> for mTHP: we always treat them as being 0 for mTHP collapse (and don't issue a
> warning, because we would issue a warning with the default settings).
>
> @Lorenzo, fine with you?
Yes 100%, this sounds sensible both in terms of the error and the default. Let's
keep our lives simple(-ish) please :)
>
> --
> Cheers,
>
> David
Cheers, Lorenzo
^ permalink raw reply
* [PATCH 6/6] selftests/damon: add a test for update_schemes_quota_goals
From: Maksym Shcherba @ 2026-05-18 19:09 UTC (permalink / raw)
To: sj, akpm
Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, corbet, skhan,
damon, linux-mm, linux-kernel, linux-doc, linux-kselftest,
Maksym Shcherba
In-Reply-To: <20260518190932.42270-1-maksym.shcherba@lnu.edu.ua>
The new update_schemes_quota_goals sysfs command allows users to manually
update the current_value of quota goals.
Add a selftest for the command. The test writes a dummy value to
current_value, executes the update command, and verifies that the dummy
value is successfully overwritten by the kernel.
Assisted-by: Antigravity:Gemini-3.1-Pro
Signed-off-by: Maksym Shcherba <maksym.shcherba@lnu.edu.ua>
---
tools/testing/selftests/damon/Makefile | 1 +
.../damon/sysfs_update_schemes_quota_goals.py | 86 +++++++++++++++++++
2 files changed, 87 insertions(+)
create mode 100755 tools/testing/selftests/damon/sysfs_update_schemes_quota_goals.py
diff --git a/tools/testing/selftests/damon/Makefile b/tools/testing/selftests/damon/Makefile
index 2180c328a825..a692ebaa6c8a 100644
--- a/tools/testing/selftests/damon/Makefile
+++ b/tools/testing/selftests/damon/Makefile
@@ -13,6 +13,7 @@ TEST_PROGS += sysfs.py
TEST_PROGS += sysfs_update_schemes_tried_regions_wss_estimation.py
TEST_PROGS += damos_quota.py damos_quota_goal.py damos_apply_interval.py
TEST_PROGS += damos_tried_regions.py damon_nr_regions.py
+TEST_PROGS += sysfs_update_schemes_quota_goals.py
TEST_PROGS += reclaim.sh lru_sort.sh
# regression tests (reproducers of previously found bugs)
diff --git a/tools/testing/selftests/damon/sysfs_update_schemes_quota_goals.py b/tools/testing/selftests/damon/sysfs_update_schemes_quota_goals.py
new file mode 100755
index 000000000000..745b97f75bc2
--- /dev/null
+++ b/tools/testing/selftests/damon/sysfs_update_schemes_quota_goals.py
@@ -0,0 +1,86 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"""
+Test the update_schemes_quota_goals sysfs command.
+
+Start DAMON with a scheme that has a some_mem_psi_us quota goal. Write a
+physically impossible dummy value to the goal's current_value sysfs file.
+Wait for a while, ensure the dummy value is not overwritten asynchronously,
+then write 'update_schemes_quota_goals' to the state file and verify that
+the dummy value is overwritten by the kernel.
+"""
+
+import os
+import time
+
+import _damon_sysfs
+
+
+def main():
+ goal = _damon_sysfs.DamosQuotaGoal(
+ metric=_damon_sysfs.qgoal_metric_some_mem_psi_us,
+ target_value=1000)
+ kdamonds = _damon_sysfs.Kdamonds([_damon_sysfs.Kdamond(
+ contexts=[_damon_sysfs.DamonCtx(
+ ops='paddr',
+ schemes=[_damon_sysfs.Damos(
+ action='stat',
+ quota=_damon_sysfs.DamosQuota(
+ goals=[goal], reset_interval_ms=100),
+ )] # schemes
+ )] # contexts
+ )]) # kdamonds
+
+ err = kdamonds.start()
+ if err is not None:
+ print('kdamond start failed: %s' % err)
+ exit(1)
+
+ # Write a dummy value to current_value to ensure the command actually
+ # overwrites it. We use 2x the quota reset interval in microseconds,
+ # which is a physically impossible value for the kernel to measure.
+ impossible_value = goal.quota.reset_interval_ms * 2000
+ err = _damon_sysfs.write_file(
+ os.path.join(goal.sysfs_dir(), 'current_value'),
+ '%d' % impossible_value)
+ if err is not None:
+ kdamonds.stop()
+ print('Writing dummy current_value failed: %s' % err)
+ exit(1)
+
+ # wait a couple of aggregation intervals so that the kernel has a chance
+ # to compute the first current_value measurement
+ time.sleep(0.5)
+
+ content, err = _damon_sysfs.read_file(
+ os.path.join(goal.sysfs_dir(), 'current_value'))
+ if err is not None:
+ kdamonds.stop()
+ print('Reading current_value before update failed: %s' % err)
+ exit(1)
+ if int(content) != impossible_value:
+ kdamonds.stop()
+ print('current_value changed before update (%s)' % content)
+ exit(1)
+
+ err = kdamonds.kdamonds[0].update_schemes_quota_goals()
+ if err is not None:
+ kdamonds.stop()
+ print('update_schemes_quota_goals failed: %s' % err)
+ exit(1)
+
+ # current_value must be updated and different from our dummy value
+ if goal.current_value is None or goal.current_value == impossible_value:
+ kdamonds.stop()
+ print('update_schemes_quota_goals failed to update current_value')
+ exit(1)
+
+ print('current_value after update_schemes_quota_goals: %d' %
+ goal.current_value)
+
+ kdamonds.stop()
+
+
+if __name__ == '__main__':
+ main()
--
2.43.0
^ permalink raw reply related
* [PATCH 5/6] selftests/damon/_damon_sysfs: support update_schemes_quota_goals
From: Maksym Shcherba @ 2026-05-18 19:09 UTC (permalink / raw)
To: sj, akpm
Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, corbet, skhan,
damon, linux-mm, linux-kernel, linux-doc, linux-kselftest,
Maksym Shcherba
In-Reply-To: <20260518190932.42270-1-maksym.shcherba@lnu.edu.ua>
Add update_schemes_quota_goals() method to the Kdamond class in
_damon_sysfs.py, which writes 'update_schemes_quota_goals' to the state
file and reads back the current_value of each quota goal.
Assisted-by: Antigravity:Gemini-3.1-Pro
Signed-off-by: Maksym Shcherba <maksym.shcherba@lnu.edu.ua>
---
tools/testing/selftests/damon/_damon_sysfs.py | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/tools/testing/selftests/damon/_damon_sysfs.py b/tools/testing/selftests/damon/_damon_sysfs.py
index 8b12cc048440..27cd94683f6d 100644
--- a/tools/testing/selftests/damon/_damon_sysfs.py
+++ b/tools/testing/selftests/damon/_damon_sysfs.py
@@ -806,6 +806,21 @@ class Kdamond:
goal.effective_bytes = int(content)
return None
+ def update_schemes_quota_goals(self):
+ err = write_file(os.path.join(self.sysfs_dir(), 'state'),
+ 'update_schemes_quota_goals')
+ if err is not None:
+ return err
+ for context in self.contexts:
+ for scheme in context.schemes:
+ for goal in scheme.quota.goals:
+ content, err = read_file(
+ os.path.join(goal.sysfs_dir(), 'current_value'))
+ if err is not None:
+ return err
+ goal.current_value = int(content)
+ return None
+
def commit(self):
nr_contexts_file = os.path.join(self.sysfs_dir(),
'contexts', 'nr_contexts')
--
2.43.0
^ permalink raw reply related
* [PATCH 4/6] Docs/admin-guide/mm/damon/usage: document update_schemes_quota_goals sysfs command
From: Maksym Shcherba @ 2026-05-18 19:09 UTC (permalink / raw)
To: sj, akpm
Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, corbet, skhan,
damon, linux-mm, linux-kernel, linux-doc, linux-kselftest,
Maksym Shcherba
In-Reply-To: <20260518190932.42270-1-maksym.shcherba@lnu.edu.ua>
Update the DAMON sysfs usage document to describe the newly
added update_schemes_quota_goals command, which allows users to read the
current values of the quota goals after explicitly triggering an update.
Assisted-by: Antigravity:Gemini-3.1-Pro
Signed-off-by: Maksym Shcherba <maksym.shcherba@lnu.edu.ua>
---
Documentation/admin-guide/mm/damon/usage.rst | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index 11c75a598393..097d8ebe960b 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -167,6 +167,9 @@ Users can write below commands for the kdamond to the ``state`` file.
- ``update_schemes_effective_quotas``: Update the contents of
``effective_bytes`` files for each DAMON-based operation scheme of the
kdamond. For more details, refer to :ref:`quotas directory <sysfs_quotas>`.
+- ``update_schemes_quota_goals``: Update the contents of ``current_value`` files
+ for each DAMON-based operation scheme quota goal of the kdamond. For more
+ details, refer to :ref:`goals directory <sysfs_schemes_quota_goals>`.
If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
@@ -448,7 +451,11 @@ get the five parameters for the quota auto-tuning goals that specified on the
:ref:`design doc <damon_design_damos_quotas_auto_tuning>` by writing to and
reading from each of the files. Note that users should further write
``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond
-directory <sysfs_kdamond>` to pass the feedback to DAMON.
+directory <sysfs_kdamond>` to pass the feedback to DAMON. The
+``current_value`` file is not updated in real time, so users should ask DAMON
+sysfs interface to periodically update it using ``refresh_ms``, or do a one time
+update by writing a special keyword, ``update_schemes_quota_goals`` to the
+relevant ``kdamonds/<N>/state`` file.
.. _sysfs_watermarks:
--
2.43.0
^ permalink raw reply related
* [PATCH 3/6] Docs/ABI/damon: document update_schemes_quota_goals command
From: Maksym Shcherba @ 2026-05-18 19:09 UTC (permalink / raw)
To: sj, akpm
Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, corbet, skhan,
damon, linux-mm, linux-kernel, linux-doc, linux-kselftest,
Maksym Shcherba
In-Reply-To: <20260518190932.42270-1-maksym.shcherba@lnu.edu.ua>
Update the DAMON ABI doc for the kdamond state file input command
for updating the current values of quota goals.
Assisted-by: Antigravity:Gemini-3.1-Pro
Signed-off-by: Maksym Shcherba <maksym.shcherba@lnu.edu.ua>
---
Documentation/ABI/testing/sysfs-kernel-mm-damon | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon
index ee29d4e204ff..0bd33c1e6790 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-damon
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon
@@ -36,7 +36,9 @@ Description: Writing 'on' or 'off' to this file makes the kdamond starts or
kdamond. Writing 'clear_schemes_tried_regions' to the file
removes contents of the 'tried_regions' directory. Writing
'update_schemes_effective_quotas' to the file updates
- '.../quotas/effective_bytes' files of this kdamond.
+ '.../quotas/effective_bytes' files of this kdamond. Writing
+ 'update_schemes_quota_goals' to the file updates
+ '.../quotas/goals/<G>/current_value' files of this kdamond.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/pid
Date: Mar 2022
--
2.43.0
^ permalink raw reply related
* [PATCH 2/6] mm/damon/sysfs: implement update_schemes_quota_goals command
From: Maksym Shcherba @ 2026-05-18 19:09 UTC (permalink / raw)
To: sj, akpm
Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, corbet, skhan,
damon, linux-mm, linux-kernel, linux-doc, linux-kselftest,
Maksym Shcherba
In-Reply-To: <20260518190932.42270-1-maksym.shcherba@lnu.edu.ua>
Add the logic to copy the current_value from the internal
damos_quota_goal structure to the damos_sysfs_quota_goal sysfs structure.
Introduce the DAMON_SYSFS_CMD_UPDATE_SCHEMES_QUOTA_GOALS command
and integrate it with the sysfs interface via the 'state' file.
Assisted-by: Antigravity:Gemini-3.1-Pro
Signed-off-by: Maksym Shcherba <maksym.shcherba@lnu.edu.ua>
---
mm/damon/sysfs-common.h | 4 ++++
mm/damon/sysfs-schemes.c | 29 +++++++++++++++++++++++++++++
mm/damon/sysfs.c | 21 +++++++++++++++++++++
3 files changed, 54 insertions(+)
diff --git a/mm/damon/sysfs-common.h b/mm/damon/sysfs-common.h
index 2099adee11d0..9703414fa15f 100644
--- a/mm/damon/sysfs-common.h
+++ b/mm/damon/sysfs-common.h
@@ -59,3 +59,7 @@ int damos_sysfs_set_quota_scores(struct damon_sysfs_schemes *sysfs_schemes,
void damos_sysfs_update_effective_quotas(
struct damon_sysfs_schemes *sysfs_schemes,
struct damon_ctx *ctx);
+
+void damos_sysfs_update_quota_goals(
+ struct damon_sysfs_schemes *sysfs_schemes,
+ struct damon_ctx *ctx);
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 5d966ac86419..5793659403ca 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -2812,6 +2812,35 @@ void damos_sysfs_update_effective_quotas(
}
}
+void damos_sysfs_update_quota_goals(
+ struct damon_sysfs_schemes *sysfs_schemes,
+ struct damon_ctx *ctx)
+{
+ struct damos *scheme;
+ int schemes_idx = 0;
+
+ damon_for_each_scheme(scheme, ctx) {
+ struct damos_sysfs_quota_goals *sysfs_goals;
+ struct damos_quota_goal *goal;
+ int goals_idx = 0;
+
+ /* user could have removed the scheme sysfs dir */
+ if (schemes_idx >= sysfs_schemes->nr)
+ break;
+
+ sysfs_goals =
+ sysfs_schemes->schemes_arr[schemes_idx++]->quotas->goals;
+
+ damos_for_each_quota_goal(goal, &scheme->quota) {
+ if (goals_idx >= sysfs_goals->nr)
+ break;
+
+ sysfs_goals->goals_arr[goals_idx++]->current_value =
+ goal->current_value;
+ }
+ }
+}
+
static int damos_sysfs_add_migrate_dest(struct damos *scheme,
struct damos_sysfs_dests *sysfs_dests)
{
diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c
index d5863cc33d23..ecc880b52b32 100644
--- a/mm/damon/sysfs.c
+++ b/mm/damon/sysfs.c
@@ -1320,6 +1320,11 @@ enum damon_sysfs_cmd {
* effective size quota of the scheme in bytes.
*/
DAMON_SYSFS_CMD_UPDATE_SCHEMES_EFFECTIVE_QUOTAS,
+ /*
+ * @DAMON_SYSFS_CMD_UPDATE_SCHEMES_QUOTA_GOALS: Update the
+ * current value of the scheme quota goals.
+ */
+ DAMON_SYSFS_CMD_UPDATE_SCHEMES_QUOTA_GOALS,
/*
* @DAMON_SYSFS_CMD_UPDATE_TUNED_INTERVALS: Update the tuned monitoring
* intervals.
@@ -1342,6 +1347,7 @@ static const char * const damon_sysfs_cmd_strs[] = {
"update_schemes_tried_regions",
"clear_schemes_tried_regions",
"update_schemes_effective_quotas",
+ "update_schemes_quota_goals",
"update_tuned_intervals",
};
@@ -1606,6 +1612,16 @@ static int damon_sysfs_upd_schemes_effective_quotas(void *data)
return 0;
}
+static int damon_sysfs_upd_schemes_quota_goals(void *data)
+{
+ struct damon_sysfs_kdamond *kdamond = data;
+ struct damon_ctx *ctx = kdamond->damon_ctx;
+
+ damos_sysfs_update_quota_goals(
+ kdamond->contexts->contexts_arr[0]->schemes, ctx);
+ return 0;
+}
+
static int damon_sysfs_upd_tuned_intervals(void *data)
{
struct damon_sysfs_kdamond *kdamond = data;
@@ -1656,6 +1672,7 @@ static int damon_sysfs_repeat_call_fn(void *data)
damon_sysfs_upd_tuned_intervals(sysfs_kdamond);
damon_sysfs_upd_schemes_stats(sysfs_kdamond);
damon_sysfs_upd_schemes_effective_quotas(sysfs_kdamond);
+ damon_sysfs_upd_schemes_quota_goals(sysfs_kdamond);
out:
mutex_unlock(&damon_sysfs_lock);
return 0;
@@ -1813,6 +1830,10 @@ static int damon_sysfs_handle_cmd(enum damon_sysfs_cmd cmd,
return damon_sysfs_damon_call(
damon_sysfs_upd_schemes_effective_quotas,
kdamond);
+ case DAMON_SYSFS_CMD_UPDATE_SCHEMES_QUOTA_GOALS:
+ return damon_sysfs_damon_call(
+ damon_sysfs_upd_schemes_quota_goals,
+ kdamond);
case DAMON_SYSFS_CMD_UPDATE_TUNED_INTERVALS:
return damon_sysfs_damon_call(
damon_sysfs_upd_tuned_intervals, kdamond);
--
2.43.0
^ permalink raw reply related
* [PATCH 1/6] mm/damon: fix missing parens in macro arguments
From: Maksym Shcherba @ 2026-05-18 19:09 UTC (permalink / raw)
To: sj, akpm
Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, corbet, skhan,
damon, linux-mm, linux-kernel, linux-doc, linux-kselftest,
Maksym Shcherba
The DAMON iterator macros do not wrap their pointer arguments with
parentheses. This can cause build failures when the argument is a
complex expression due to operator precedence issues.
Add missing parentheses around the arguments in the following macros
to prevent potential build failures:
- damon_for_each_region()
- damon_for_each_region_from()
- damon_for_each_region_safe()
- damos_for_each_quota_goal()
Assisted-by: Antigravity:Gemini-3.1-Pro
Signed-off-by: Maksym Shcherba <maksym.shcherba@lnu.edu.ua>
---
include/linux/damon.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index 4d4f031bcb45..32f2318ac77f 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -902,13 +902,13 @@ static inline unsigned long damon_sz_region(struct damon_region *r)
#define damon_for_each_region(r, t) \
- list_for_each_entry(r, &t->regions_list, list)
+ list_for_each_entry(r, &(t)->regions_list, list)
#define damon_for_each_region_from(r, t) \
- list_for_each_entry_from(r, &t->regions_list, list)
+ list_for_each_entry_from(r, &(t)->regions_list, list)
#define damon_for_each_region_safe(r, next, t) \
- list_for_each_entry_safe(r, next, &t->regions_list, list)
+ list_for_each_entry_safe(r, next, &(t)->regions_list, list)
#define damon_for_each_target(t, ctx) \
list_for_each_entry(t, &(ctx)->adaptive_targets, list)
@@ -923,7 +923,7 @@ static inline unsigned long damon_sz_region(struct damon_region *r)
list_for_each_entry_safe(s, next, &(ctx)->schemes, list)
#define damos_for_each_quota_goal(goal, quota) \
- list_for_each_entry(goal, "a->goals, list)
+ list_for_each_entry(goal, &(quota)->goals, list)
#define damos_for_each_quota_goal_safe(goal, next, quota) \
list_for_each_entry_safe(goal, next, &(quota)->goals, list)
--
2.43.0
^ permalink raw reply related
* [PATCH] Documentation: hwmon: ad7314: document sysfs interface
From: Chen-Shi-Hong @ 2026-05-18 18:27 UTC (permalink / raw)
To: Guenter Roeck
Cc: Jonathan Corbet, Shuah Khan, linux-hwmon, linux-doc, linux-kernel,
Chen-Shi-Hong
Document the temp1_input sysfs attribute supported by the ad7314
driver.
Signed-off-by: Chen-Shi-Hong <eric039eric@gmail.com>
---
Documentation/hwmon/ad7314.rst | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/Documentation/hwmon/ad7314.rst b/Documentation/hwmon/ad7314.rst
index bf389736bcd1..b454e617d48c 100644
--- a/Documentation/hwmon/ad7314.rst
+++ b/Documentation/hwmon/ad7314.rst
@@ -28,6 +28,12 @@ Driver supports the above parts. The ad7314 has a 10 bit
sensor with 1lsb = 0.25 degrees centigrade. The adt7301 and
adt7302 have 14 bit sensors with 1lsb = 0.03125 degrees centigrade.
+sysfs-Interface
+---------------
+
+temp1_input
+ temperature input
+
Notes
-----
--
2.53.0
^ permalink raw reply related
* Re: [PATCH v11 4/6] iio: adc: ad4691: add SPI offload support
From: Andy Shevchenko @ 2026-05-18 18:26 UTC (permalink / raw)
To: David Lechner
Cc: Sabau, Radu bogdan, Lars-Peter Clausen, Hennerich, Michael,
Jonathan Cameron, Sa, Nuno, Andy Shevchenko, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Uwe Kleine-König,
Liam Girdwood, Mark Brown, Linus Walleij, Bartosz Golaszewski,
Philipp Zabel, Jonathan Corbet, Shuah Khan,
linux-iio@vger.kernel.org, devicetree@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-pwm@vger.kernel.org,
linux-gpio@vger.kernel.org, linux-doc@vger.kernel.org
In-Reply-To: <60d66897-41cc-4f3f-afd2-64e49f0bb55e@baylibre.com>
On Mon, May 18, 2026 at 10:16:38AM -0500, David Lechner wrote:
> On 5/18/26 10:14 AM, Sabau, Radu bogdan wrote:
> >> -----Original Message-----
> >> From: David Lechner <dlechner@baylibre.com>
> >> Sent: Saturday, May 16, 2026 8:53 PM
...
> >>> + if (st->manual_mode && st->offload)
> >>> + return sysfs_emit(buf, "%llu\n", READ_ONCE(st->offload-
> >>> trigger_hz));
> >>
> >> Why do we need READ_ONCE?
> >
> > trigger_hz is u64 and if the target is 32-bit, a 64-bit access compiles to two 32-bit
> > instructions, so show() reading it without a lock and store() writing it concurrently
> > can produce a torn value at the compiler level. READ_ONCE/WRITE_ONCE suppress
> > the compiler transformations that would allow that splitting or caching. We could
> > have st->lock in show() instead, but that felt heavier than necessary for a single
> > scalar where a transiently stale-but-whole read is fine.
>
> I would go with the mutex. It will be easier for people to understand.
But why? READ_ONCE() here is exactly enough. We do not care about
serialisation, we care only about integrity. With mutex it will confuse
(some) people more, e.g., me. Because in that case I would think about
some specific access to it that may happen. Yes, I saw many times the show
functions that do mutex and then print the result when mutex is not held
anymore, but for simple cases like here, mutex is overkill. Interestingly
that using guard()() inside show makes the mentioned functions to print
(almost) latest value of the variable in question. It narrows window down
as printing will go inside critical section.
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply
* Re: [RFC PATCH v3 1/3] scripts: add kconfirm
From: Julian Braha @ 2026-05-18 18:19 UTC (permalink / raw)
To: Arnd Bergmann, Miguel Ojeda, Demi Marie Obenour
Cc: Nathan Chancellor, Nicolas Schier, Jani Nikula, Andrew Morton,
Gary Guo, ljs, Greg Kroah-Hartman, Masahiro Yamada, Miguel Ojeda,
Jonathan Corbet, qingfang.deng, yann.prono, ej, linux-kernel,
rust-for-linux, linux-doc, linux-kbuild
In-Reply-To: <ef59ee46-87e2-4f99-babf-4dc8ee3cbec5@app.fastmail.com>
On 5/18/26 09:08, Arnd Bergmann wrote:
> What about dependencies that are normally shipped by the distros
> along with the rust compiler? Would it be possible to allow a
> range of version that matches the ones that are present on
> common distros like we do with C libraries, or would that cause
> more problems than it solves?
Hi Arnd,
Yes it's something that I would like to enable, though we first need
to wait for all of kconfirm's dependencies to be available for these
distributions.
I've filed a GitHub issue for this in the repo of 'nom-kconfig':
https://github.com/Mcdostone/nom-kconfig/issues/149#issuecomment-4480419622
Also note that the author of that library is CC'd on these emails:
Yann Prono <yann.prono@telecomnancy.net>
I will need to do some testing once all is available, but as far as I
can tell, this would not create many additional problems, though we
would still need to provide crates.io as a source for distributions that
do not package Rust libraries.
- Julian Braha
^ permalink raw reply
* [PATCH v1 16/16] gpu: nova-core: mm: Add BAR1 memory management self-tests
From: Joel Fernandes @ 2026-05-18 18:11 UTC (permalink / raw)
To: linux-kernel
Cc: Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron, Benno Lossin,
Andreas Hindborg, Alice Ryhl, Trevor Gross, Danilo Krummrich,
Dave Airlie, Daniel Almeida, dri-devel, rust-for-linux, nova-gpu,
Nikola Djukic, David Airlie, Boqun Feng, John Hubbard,
Alistair Popple, Timur Tabi, Edwin Peer, Alexandre Courbot,
Andrea Righi, Andy Ritger, Zhi Wang, Balbir Singh,
Philipp Stanner, alexeyi, Eliot Courtney, joel, linux-doc,
Joel Fernandes
In-Reply-To: <20260518181126.2493572-1-joelagnelf@nvidia.com>
Add self-tests for BAR1 access during driver probe when
CONFIG_NOVA_MM_SELFTESTS is enabled (default disabled). This results in
testing the Vmm, GPU buddy allocator and BAR1 region all of which should
function correctly for the tests to pass.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
drivers/gpu/nova-core/gpu.rs | 8 +-
drivers/gpu/nova-core/mm.rs | 12 +-
drivers/gpu/nova-core/mm/bar_user.rs | 253 ++++++++++++++++++++++++++
drivers/gpu/nova-core/mm/pagetable.rs | 33 ++++
4 files changed, 303 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/nova-core/gpu.rs b/drivers/gpu/nova-core/gpu.rs
index b0eebe6406e5..6ed486503957 100644
--- a/drivers/gpu/nova-core/gpu.rs
+++ b/drivers/gpu/nova-core/gpu.rs
@@ -405,7 +405,13 @@ pub(crate) fn run_selftests(
self: Pin<&mut Self>,
pdev: &pci::Device<device::Bound>,
) -> Result {
- crate::mm::run_mm_selftests(pdev, &self.mm, self.spec.chipset)?;
+ crate::mm::run_mm_selftests(
+ pdev,
+ &self.mm,
+ &self.bar1,
+ self.gsp_static_info.bar1_pde_base,
+ self.spec.chipset,
+ )?;
Ok(())
}
}
diff --git a/drivers/gpu/nova-core/mm.rs b/drivers/gpu/nova-core/mm.rs
index 4741ef60593b..ed77162db848 100644
--- a/drivers/gpu/nova-core/mm.rs
+++ b/drivers/gpu/nova-core/mm.rs
@@ -55,7 +55,10 @@ macro_rules! impl_pfn_bounded {
};
use crate::{
- driver::Bar0,
+ driver::{
+ Bar0,
+ Bar1, //
+ },
gpu::Chipset, //
};
@@ -122,10 +125,15 @@ pub(crate) fn tlb(&self) -> &Tlb {
pub(crate) fn run_mm_selftests(
pdev: &pci::Device<device::Bound>,
mm: &Arc<GpuMm>,
+ bar1: &Arc<Devres<Bar1>>,
+ bar1_pde_base: u64,
chipset: Chipset,
) -> Result {
#[cfg(CONFIG_NOVA_MM_SELFTESTS)]
- pramin::run_self_test(pdev.as_ref(), mm.pramin(), chipset)?;
+ {
+ pramin::run_self_test(pdev.as_ref(), mm.pramin(), chipset)?;
+ bar_user::run_self_test(pdev.as_ref(), mm, bar1, bar1_pde_base, chipset)?;
+ }
Ok(())
}
diff --git a/drivers/gpu/nova-core/mm/bar_user.rs b/drivers/gpu/nova-core/mm/bar_user.rs
index bb9742c036b7..96e1389dcbe9 100644
--- a/drivers/gpu/nova-core/mm/bar_user.rs
+++ b/drivers/gpu/nova-core/mm/bar_user.rs
@@ -192,3 +192,256 @@ fn drop(&mut self) {
// identifying the leaked VA range.
}
}
+
+/// Run MM subsystem self-tests during probe.
+///
+/// Tests page table infrastructure and `BAR1` MMIO access using the `BAR1`
+/// address space. Uses the `GpuMm`'s buddy allocator to allocate page tables
+/// and test pages as needed.
+#[cfg(CONFIG_NOVA_MM_SELFTESTS)]
+pub(crate) fn run_self_test(
+ pdev: &device::Device<device::Bound>,
+ mm: &Arc<GpuMm>,
+ bar1_devres: &Arc<Devres<Bar1>>,
+ bar1_pdb: u64,
+ chipset: Chipset,
+) -> Result {
+ use kernel::gpu::buddy::{
+ GpuBuddyAllocFlags,
+ GpuBuddyAllocMode, //
+ };
+ use kernel::ptr::Alignment;
+ use kernel::sizes::{
+ SZ_16K,
+ SZ_32K,
+ SZ_4K,
+ SZ_64K, //
+ };
+
+ // Test patterns.
+ const PATTERN_PRAMIN: u32 = 0xDEAD_BEEF;
+ const PATTERN_BAR1: u32 = 0xCAFE_BABE;
+
+ let dev = pdev;
+ let bar1: &Bar1 = bar1_devres.access(pdev)?;
+ dev_info!(dev, "MM: Starting self-test...\n");
+
+ let pdb_addr = VramAddress::new(bar1_pdb);
+
+ // Check if initial page tables are in VRAM.
+ if crate::mm::pagetable::check_pdb_valid(pdev, mm.pramin(), pdb_addr, chipset).is_err() {
+ dev_info!(dev, "MM: Self-test SKIPPED - no valid VRAM page tables\n");
+ return Ok(());
+ }
+
+ // Set up a test page from the buddy allocator.
+ let test_page_blocks = KBox::pin_init(
+ mm.buddy().alloc_blocks(
+ GpuBuddyAllocMode::Simple,
+ SZ_4K.into_safe_cast(),
+ Alignment::new::<SZ_4K>(),
+ GpuBuddyAllocFlags::default(),
+ ),
+ GFP_KERNEL,
+ )?;
+ let test_vram_offset = test_page_blocks.iter().next().ok_or(ENOMEM)?.offset();
+ let test_vram = VramAddress::new(test_vram_offset);
+ let test_pfn = Pfn::from(test_vram);
+
+ // Create a VMM of size 64K to track virtual memory mappings.
+ let mut vmm = Vmm::new(pdb_addr, chipset.mmu_version(), SZ_64K.into_safe_cast())?;
+
+ // Create a test mapping.
+ let mapped = vmm.map_pages(pdev, mm, &[test_pfn], None, true)?;
+ let test_vfn = mapped.vfn_start;
+
+ // Pre-compute test addresses for the PRAMIN to BAR1 read test.
+ let vfn_offset: usize = test_vfn.raw().into_safe_cast();
+ let bar1_base_offset = vfn_offset.checked_mul(PAGE_SIZE).ok_or(EOVERFLOW)?;
+ let bar1_read_offset: usize = bar1_base_offset + 0x100;
+ let vram_read_addr = test_vram + 0x100;
+
+ // Test 1: Write via PRAMIN, read via BAR1.
+ {
+ let mut window = mm.pramin().get_window(pdev)?;
+ window.try_write32(vram_read_addr, PATTERN_PRAMIN)?;
+ }
+
+ // Read back via BAR1 aperture.
+ let bar1_value = bar1.try_read32(bar1_read_offset)?;
+
+ let test1_passed = if bar1_value == PATTERN_PRAMIN {
+ true
+ } else {
+ dev_err!(
+ dev,
+ "MM: Test 1 FAILED - Expected {:#010x}, got {:#010x}\n",
+ PATTERN_PRAMIN,
+ bar1_value
+ );
+ false
+ };
+
+ // Cleanup - invalidate PTE.
+ vmm.unmap_pages(pdev, mm, mapped)?;
+
+ // Test 2: Two-phase prepare/execute API.
+ let prepared = vmm.prepare_map(pdev, mm, 1, None)?;
+ let mapped2 = vmm.execute_map(pdev, mm, prepared, &[test_pfn], true)?;
+ let readback = vmm.read_mapping(pdev, mm, mapped2.vfn_start)?;
+ let test2_passed = if readback == Some(test_pfn) {
+ true
+ } else {
+ dev_err!(dev, "MM: Test 2 FAILED - Two-phase map readback mismatch\n");
+ false
+ };
+ vmm.unmap_pages(pdev, mm, mapped2)?;
+
+ // Test 3: Range-constrained allocation with a hole — exercises block.size()-driven
+ // BAR1 mapping. A 4K hole is punched at base+16K, then a single 32K allocation
+ // is requested within [base, base+36K). The buddy allocator must split around the
+ // hole, returning multiple blocks (expected: {16K, 4K, 8K, 4K} = 32K total).
+ // Each block is mapped into BAR1 and verified via PRAMIN read-back.
+ //
+ // Address layout (base = 0x10000):
+ // [ 16K ] [HOLE 4K] [4K] [ 8K ] [4K]
+ // 0x10000 0x14000 0x15000 0x16000 0x18000 0x19000
+ let range_base: u64 = SZ_64K.into_safe_cast();
+ let sz_4k: u64 = SZ_4K.into_safe_cast();
+ let sz_16k: u64 = SZ_16K.into_safe_cast();
+ let sz_32k_4k: u64 = (SZ_32K + SZ_4K).into_safe_cast();
+
+ // Punch a 4K hole at base+16K so the subsequent 32K allocation must split.
+ let _hole = KBox::pin_init(
+ mm.buddy().alloc_blocks(
+ GpuBuddyAllocMode::Range(range_base + sz_16k..range_base + sz_16k + sz_4k),
+ SZ_4K.into_safe_cast(),
+ Alignment::new::<SZ_4K>(),
+ GpuBuddyAllocFlags::default(),
+ ),
+ GFP_KERNEL,
+ )?;
+
+ // Allocate 32K within [base, base+36K). The hole forces the allocator to return
+ // split blocks whose sizes are determined by buddy alignment.
+ let blocks = KBox::pin_init(
+ mm.buddy().alloc_blocks(
+ GpuBuddyAllocMode::Range(range_base..range_base + sz_32k_4k),
+ SZ_32K.into_safe_cast(),
+ Alignment::new::<SZ_4K>(),
+ GpuBuddyAllocFlags::default(),
+ ),
+ GFP_KERNEL,
+ )?;
+
+ let mut test3_passed = true;
+ let mut total_size = 0usize;
+
+ for block in blocks.iter() {
+ total_size += IntoSafeCast::<usize>::into_safe_cast(block.size());
+
+ // Map all pages of this block.
+ let page_size: u64 = PAGE_SIZE.into_safe_cast();
+ let num_pages: usize = (block.size() / page_size).into_safe_cast();
+
+ let mut pfns = KVec::new();
+ for j in 0..num_pages {
+ let j_u64: u64 = j.into_safe_cast();
+ pfns.push(
+ Pfn::from(VramAddress::new(
+ block.offset() + j_u64.checked_mul(page_size).ok_or(EOVERFLOW)?,
+ )),
+ GFP_KERNEL,
+ )?;
+ }
+
+ let mapped = vmm.map_pages(pdev, mm, &pfns, None, true)?;
+ let bar1_base_vfn: usize = mapped.vfn_start.raw().into_safe_cast();
+ let bar1_base = bar1_base_vfn.checked_mul(PAGE_SIZE).ok_or(EOVERFLOW)?;
+
+ for j in 0..num_pages {
+ let page_bar1_off = bar1_base + j * PAGE_SIZE;
+ let j_u64: u64 = j.into_safe_cast();
+ let page_phys = block.offset()
+ + j_u64
+ .checked_mul(PAGE_SIZE.into_safe_cast())
+ .ok_or(EOVERFLOW)?;
+
+ bar1.try_write32(PATTERN_BAR1, page_bar1_off)?;
+
+ let pramin_val = {
+ let mut window = mm.pramin().get_window(pdev)?;
+ window.try_read32(VramAddress::new(page_phys))?
+ };
+
+ if pramin_val != PATTERN_BAR1 {
+ dev_err!(
+ dev,
+ "MM: Test 3 FAILED block offset {:#x} page {} (val={:#x})\n",
+ block.offset(),
+ j,
+ pramin_val
+ );
+ test3_passed = false;
+ }
+ }
+
+ vmm.unmap_pages(pdev, mm, mapped)?;
+ }
+
+ // Verify aggregate: all returned block sizes must sum to allocation size.
+ if total_size != SZ_32K {
+ dev_err!(
+ dev,
+ "MM: Test 3 FAILED - total size {} != expected {}\n",
+ total_size,
+ SZ_32K
+ );
+ test3_passed = false;
+ }
+
+ // Release Tests 1-3's Vmm before Test 4 constructs a fresh BarUser on
+ // the same PDB.
+ drop(vmm);
+
+ // Test 4: Exercise `BarUser::map()` end-to-end.
+ let bar_user = Arc::pin_init(
+ BarUser::new(
+ pdb_addr,
+ chipset,
+ SZ_64K.into_safe_cast(),
+ mm.clone(),
+ bar1_devres.clone(),
+ )?,
+ GFP_KERNEL,
+ )?;
+ let access = bar_user.map(pdev, &[test_pfn], true)?;
+
+ // Write pattern via PRAMIN, read via BarUserAccess.
+ {
+ let mut window = mm.pramin().get_window(pdev)?;
+ window.try_write32(test_vram, PATTERN_BAR1)?;
+ }
+
+ let readback = access.try_read32(pdev, 0)?;
+ let test4_passed = if readback == PATTERN_BAR1 {
+ true
+ } else {
+ dev_err!(
+ dev,
+ "MM: Test 4 FAILED - Expected {:#010x}, got {:#010x}\n",
+ PATTERN_BAR1,
+ readback
+ );
+ false
+ };
+ access.release(pdev)?;
+
+ if test1_passed && test2_passed && test3_passed && test4_passed {
+ dev_info!(dev, "MM: All self-tests PASSED\n");
+ Ok(())
+ } else {
+ dev_err!(dev, "MM: Self-tests FAILED\n");
+ Err(EIO)
+ }
+}
diff --git a/drivers/gpu/nova-core/mm/pagetable.rs b/drivers/gpu/nova-core/mm/pagetable.rs
index 042584e5178b..fb573f07b4cf 100644
--- a/drivers/gpu/nova-core/mm/pagetable.rs
+++ b/drivers/gpu/nova-core/mm/pagetable.rs
@@ -17,6 +17,9 @@
use kernel::num::Bounded;
+#[cfg(CONFIG_NOVA_MM_SELFTESTS)]
+use kernel::device;
+
use crate::gpu::Architecture;
use crate::mm::{
pramin,
@@ -379,3 +382,33 @@ fn from(val: AperturePde) -> Self {
Bounded::from_expr(val as u64 & 0x3)
}
}
+
+/// Check if the PDB has valid, VRAM-backed page tables.
+#[cfg(CONFIG_NOVA_MM_SELFTESTS)]
+fn check_pdb_inner<M: MmuConfig>(
+ dev: &device::Device<device::Bound>,
+ pramin: &pramin::Pramin,
+ pdb_addr: VramAddress,
+) -> Result {
+ let mut window = pramin.get_window(dev)?;
+ let raw = window.try_read64(pdb_addr)?;
+
+ if !M::Pde::from_raw(raw).is_valid_vram() {
+ return Err(ENOENT);
+ }
+ Ok(())
+}
+
+/// Check if the PDB has valid, VRAM-backed page tables, dispatching by MMU version.
+#[cfg(CONFIG_NOVA_MM_SELFTESTS)]
+pub(super) fn check_pdb_valid(
+ dev: &device::Device<device::Bound>,
+ pramin: &pramin::Pramin,
+ pdb_addr: VramAddress,
+ chipset: crate::gpu::Chipset,
+) -> Result {
+ match MmuVersion::from(chipset.arch()) {
+ MmuVersion::V2 => check_pdb_inner::<MmuV2>(dev, pramin, pdb_addr),
+ MmuVersion::V3 => check_pdb_inner::<MmuV3>(dev, pramin, pdb_addr),
+ }
+}
--
2.34.1
^ permalink raw reply related
* [PATCH v1 13/16] gpu: nova-core: mm: Add multi-page mapping API to VMM
From: Joel Fernandes @ 2026-05-18 18:11 UTC (permalink / raw)
To: linux-kernel
Cc: Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron, Benno Lossin,
Andreas Hindborg, Alice Ryhl, Trevor Gross, Danilo Krummrich,
Dave Airlie, Daniel Almeida, dri-devel, rust-for-linux, nova-gpu,
Nikola Djukic, David Airlie, Boqun Feng, John Hubbard,
Alistair Popple, Timur Tabi, Edwin Peer, Alexandre Courbot,
Andrea Righi, Andy Ritger, Zhi Wang, Balbir Singh,
Philipp Stanner, alexeyi, Eliot Courtney, joel, linux-doc,
Joel Fernandes
In-Reply-To: <20260518181126.2493572-1-joelagnelf@nvidia.com>
Add the page table mapping and unmapping API to the Virtual Memory
Manager, implementing a two-phase prepare/execute model suitable for
use both inside and outside the DMA fence signalling critical path.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
drivers/gpu/nova-core/mm/pagetable.rs | 1 +
drivers/gpu/nova-core/mm/pagetable/map.rs | 367 ++++++++++++++++++++++
drivers/gpu/nova-core/mm/vmm.rs | 270 ++++++++++++++--
3 files changed, 619 insertions(+), 19 deletions(-)
create mode 100644 drivers/gpu/nova-core/mm/pagetable/map.rs
diff --git a/drivers/gpu/nova-core/mm/pagetable.rs b/drivers/gpu/nova-core/mm/pagetable.rs
index 5e192679f27c..042584e5178b 100644
--- a/drivers/gpu/nova-core/mm/pagetable.rs
+++ b/drivers/gpu/nova-core/mm/pagetable.rs
@@ -8,6 +8,7 @@
#![expect(dead_code)]
+pub(super) mod map;
pub(super) mod ver2;
pub(super) mod ver3;
pub(super) mod walk;
diff --git a/drivers/gpu/nova-core/mm/pagetable/map.rs b/drivers/gpu/nova-core/mm/pagetable/map.rs
new file mode 100644
index 000000000000..b0678a36d406
--- /dev/null
+++ b/drivers/gpu/nova-core/mm/pagetable/map.rs
@@ -0,0 +1,367 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Page table mapping operations for NVIDIA GPUs.
+
+use core::marker::PhantomData;
+
+use kernel::{
+ device,
+ gpu::buddy::{
+ AllocatedBlocks,
+ GpuBuddyAllocFlags,
+ GpuBuddyAllocMode, //
+ },
+ prelude::*,
+ ptr::Alignment,
+ rbtree::{RBTree, RBTreeNode},
+ sizes::SZ_4K, //
+};
+
+use super::{
+ walk::{
+ PtWalkInner,
+ WalkPdeResult,
+ WalkResult, //
+ },
+ AperturePde,
+ AperturePte,
+ DualPdeOps,
+ MmuConfig,
+ MmuV2,
+ MmuV3,
+ MmuVersion,
+ PageTableLevel,
+ PdeOps,
+ PteOps, //
+};
+use crate::{
+ mm::{
+ GpuMm,
+ Pfn,
+ Vfn,
+ VramAddress,
+ PAGE_SIZE, //
+ },
+ num::{
+ IntoSafeCast, //
+ },
+};
+
+/// A pre-allocated and zeroed page table page.
+///
+/// Created during the mapping prepare phase and consumed during the execute phase.
+/// Stored in an [`RBTree`] keyed by the PDE slot address (`install_addr`).
+pub(in crate::mm) struct PreparedPtPage {
+ /// The allocated and zeroed page table page.
+ pub(in crate::mm) alloc: Pin<KBox<AllocatedBlocks>>,
+ /// Page table level -- needed to determine if this PT page is for a dual PDE.
+ pub(in crate::mm) level: PageTableLevel,
+}
+
+/// Page table mapper.
+pub(in crate::mm) struct PtMapInner<M: MmuConfig> {
+ walker: PtWalkInner<M>,
+ pdb_addr: VramAddress,
+ _phantom: PhantomData<M>,
+}
+
+impl<M: MmuConfig> PtMapInner<M> {
+ /// Create a new [`PtMapInner`].
+ pub(super) fn new(pdb_addr: VramAddress) -> Self {
+ Self {
+ walker: PtWalkInner::<M>::new(pdb_addr),
+ pdb_addr,
+ _phantom: PhantomData,
+ }
+ }
+
+ /// Allocate and zero a physical page table page.
+ fn alloc_and_zero_page(
+ dev: &device::Device<device::Bound>,
+ mm: &GpuMm,
+ level: PageTableLevel,
+ ) -> Result<PreparedPtPage> {
+ let blocks = KBox::pin_init(
+ mm.buddy().alloc_blocks(
+ GpuBuddyAllocMode::Simple,
+ SZ_4K.into_safe_cast(),
+ Alignment::new::<SZ_4K>(),
+ GpuBuddyAllocFlags::default(),
+ ),
+ GFP_KERNEL,
+ )?;
+
+ let page_vram = VramAddress::new(blocks.iter().next().ok_or(ENOMEM)?.offset());
+
+ // Zero via PRAMIN.
+ let mut window = mm.pramin().get_window(dev)?;
+ for off in (0..PAGE_SIZE).step_by(8) {
+ let off_u64: u64 = off.into_safe_cast();
+ window.try_write64(page_vram + off_u64, 0)?;
+ }
+
+ Ok(PreparedPtPage {
+ alloc: blocks,
+ level,
+ })
+ }
+
+ /// Ensure all intermediate page table pages exist for a single VFN.
+ ///
+ /// PRAMIN is released before each allocation and re-acquired after. Memory
+ /// allocations are done outside of holding this lock to prevent deadlocks with
+ /// the fence signalling critical path.
+ fn ensure_single_pte_path(
+ &self,
+ dev: &device::Device<device::Bound>,
+ mm: &GpuMm,
+ vfn: Vfn,
+ pt_pages: &mut RBTree<VramAddress, PreparedPtPage>,
+ ) -> Result {
+ let max_iter = 2 * M::PDE_LEVELS.len();
+
+ for _ in 0..max_iter {
+ let mut window = mm.pramin().get_window(dev)?;
+
+ let result = self
+ .walker
+ .walk_pde_levels(&mut window, vfn, |install_addr| {
+ pt_pages
+ .get(&install_addr)
+ .and_then(|p| p.alloc.iter().next().map(|b| VramAddress::new(b.offset())))
+ })?;
+
+ match result {
+ WalkPdeResult::Complete { .. } => {
+ return Ok(());
+ }
+ WalkPdeResult::Missing {
+ install_addr,
+ level,
+ } => {
+ // Drop PRAMIN before allocation.
+ drop(window);
+ let page = Self::alloc_and_zero_page(dev, mm, level)?;
+ let node = RBTreeNode::new(install_addr, page, GFP_KERNEL)?;
+ let old = pt_pages.insert(node);
+ if old.is_some() {
+ kernel::pr_warn_once!(
+ "VMM: duplicate install_addr in pt_pages (internal consistency error)\n"
+ );
+ return Err(EIO);
+ }
+ }
+ }
+ }
+
+ kernel::pr_warn!(
+ "VMM: ensure_pte_path: loop exhausted after {} iters (VFN {:?})\n",
+ max_iter,
+ vfn
+ );
+ Err(EIO)
+ }
+
+ /// Prepare page table resources for mapping `num_pages` pages starting at `vfn_start`.
+ ///
+ /// Reserves capacity in `page_table_allocs`, then walks the hierarchy
+ /// per-VFN to prepare pages for all missing PDEs.
+ pub(super) fn prepare_map(
+ &self,
+ dev: &device::Device<device::Bound>,
+ mm: &GpuMm,
+ vfn_start: Vfn,
+ num_pages: usize,
+ page_table_allocs: &mut KVec<Pin<KBox<AllocatedBlocks>>>,
+ pt_pages: &mut RBTree<VramAddress, PreparedPtPage>,
+ ) -> Result {
+ // Pre-reserve so install_mappings() can use push_within_capacity (no alloc
+ // in fence signalling critical path).
+ let pt_upper_bound = M::pt_pages_upper_bound(num_pages);
+ page_table_allocs.reserve(pt_upper_bound, GFP_KERNEL)?;
+
+ // Walk the hierarchy per-VFN to prepare pages for all missing PDEs.
+ for i in 0..num_pages {
+ let i_u64: u64 = i.into_safe_cast();
+ let vfn = Vfn::new(vfn_start.raw() + i_u64);
+ self.ensure_single_pte_path(dev, mm, vfn, pt_pages)?;
+ }
+ Ok(())
+ }
+
+ /// Install prepared PDEs and write PTEs, then flush TLB.
+ ///
+ /// Drains `pt_pages` and moves allocations into `page_table_allocs`.
+ #[expect(clippy::too_many_arguments)]
+ pub(super) fn install_mappings(
+ &self,
+ dev: &device::Device<device::Bound>,
+ mm: &GpuMm,
+ pt_pages: &mut RBTree<VramAddress, PreparedPtPage>,
+ page_table_allocs: &mut KVec<Pin<KBox<AllocatedBlocks>>>,
+ vfn_start: Vfn,
+ pfns: &[Pfn],
+ writable: bool,
+ ) -> Result {
+ let mut window = mm.pramin().get_window(dev)?;
+
+ // Drain prepared PT pages, install all pending PDEs.
+ let mut cursor = pt_pages.cursor_front_mut();
+ while let Some(c) = cursor {
+ let (next, node) = c.remove_current();
+ let (install_addr, page) = node.to_key_value();
+ let page_vram = VramAddress::new(page.alloc.iter().next().ok_or(ENOMEM)?.offset());
+
+ if page.level == M::DUAL_PDE_LEVEL {
+ let new_dpde = M::DualPde::new_small(Pfn::from(page_vram));
+ new_dpde.write(&mut window, install_addr)?;
+ } else {
+ let new_pde = M::Pde::new(AperturePde::VideoMemory, Pfn::from(page_vram));
+ new_pde.write(&mut window, install_addr)?;
+ }
+
+ page_table_allocs
+ .push_within_capacity(page.alloc)
+ .map_err(|_| ENOMEM)?;
+
+ cursor = next;
+ }
+
+ // Write PTEs (all PDEs now installed in HW).
+ for (i, &pfn) in pfns.iter().enumerate() {
+ let i_u64: u64 = i.into_safe_cast();
+ let vfn = Vfn::new(vfn_start.raw() + i_u64);
+ let result = self
+ .walker
+ .walk_to_pte_lookup_with_window(&mut window, vfn)?;
+
+ match result {
+ WalkResult::Unmapped { pte_addr } | WalkResult::Mapped { pte_addr, .. } => {
+ let pte = M::Pte::new(AperturePte::VideoMemory, pfn, writable);
+ pte.write(&mut window, pte_addr)?;
+ }
+ WalkResult::PageTableMissing => {
+ kernel::pr_warn_once!("VMM: page table missing for VFN {vfn:?}\n");
+ return Err(EIO);
+ }
+ }
+ }
+
+ drop(window);
+
+ // Flush TLB.
+ mm.tlb().flush(dev, self.pdb_addr)
+ }
+
+ /// Invalidate PTEs for a range and flush TLB.
+ pub(super) fn invalidate_ptes(
+ &self,
+ dev: &device::Device<device::Bound>,
+ mm: &GpuMm,
+ vfn_start: Vfn,
+ num_pages: usize,
+ ) -> Result {
+ let invalid_pte = M::Pte::invalid();
+
+ let mut window = mm.pramin().get_window(dev)?;
+ for i in 0..num_pages {
+ let i_u64: u64 = i.into_safe_cast();
+ let vfn = Vfn::new(vfn_start.raw() + i_u64);
+ let result = self
+ .walker
+ .walk_to_pte_lookup_with_window(&mut window, vfn)?;
+
+ match result {
+ WalkResult::Mapped { pte_addr, .. } | WalkResult::Unmapped { pte_addr } => {
+ invalid_pte.write(&mut window, pte_addr)?;
+ }
+ WalkResult::PageTableMissing => {
+ continue;
+ }
+ }
+ }
+ drop(window);
+
+ mm.tlb().flush(dev, self.pdb_addr)
+ }
+}
+
+macro_rules! pt_map_dispatch {
+ ($self:expr, $method:ident ( $($arg:expr),* $(,)? )) => {
+ match $self {
+ PtMap::V2(inner) => inner.$method($($arg),*),
+ PtMap::V3(inner) => inner.$method($($arg),*),
+ }
+ };
+}
+
+/// Page table mapper dispatch.
+pub(in crate::mm) enum PtMap {
+ /// MMU v2 (Turing/Ampere/Ada).
+ V2(PtMapInner<MmuV2>),
+ /// MMU v3 (Hopper+).
+ V3(PtMapInner<MmuV3>),
+}
+
+impl PtMap {
+ /// Create a new page table mapper for the given MMU version.
+ pub(in crate::mm) fn new(pdb_addr: VramAddress, version: MmuVersion) -> Self {
+ match version {
+ MmuVersion::V2 => Self::V2(PtMapInner::<MmuV2>::new(pdb_addr)),
+ MmuVersion::V3 => Self::V3(PtMapInner::<MmuV3>::new(pdb_addr)),
+ }
+ }
+
+ /// Prepare page table resources for a mapping.
+ pub(in crate::mm) fn prepare_map(
+ &self,
+ dev: &device::Device<device::Bound>,
+ mm: &GpuMm,
+ vfn_start: Vfn,
+ num_pages: usize,
+ page_table_allocs: &mut KVec<Pin<KBox<AllocatedBlocks>>>,
+ pt_pages: &mut RBTree<VramAddress, PreparedPtPage>,
+ ) -> Result {
+ pt_map_dispatch!(
+ self,
+ prepare_map(dev, mm, vfn_start, num_pages, page_table_allocs, pt_pages)
+ )
+ }
+
+ /// Install prepared PDEs and write PTEs, then flush TLB.
+ #[expect(clippy::too_many_arguments)]
+ pub(in crate::mm) fn install_mappings(
+ &self,
+ dev: &device::Device<device::Bound>,
+ mm: &GpuMm,
+ pt_pages: &mut RBTree<VramAddress, PreparedPtPage>,
+ page_table_allocs: &mut KVec<Pin<KBox<AllocatedBlocks>>>,
+ vfn_start: Vfn,
+ pfns: &[Pfn],
+ writable: bool,
+ ) -> Result {
+ pt_map_dispatch!(
+ self,
+ install_mappings(
+ dev,
+ mm,
+ pt_pages,
+ page_table_allocs,
+ vfn_start,
+ pfns,
+ writable
+ )
+ )
+ }
+
+ /// Invalidate PTEs for a range and flush TLB.
+ pub(in crate::mm) fn invalidate_ptes(
+ &self,
+ dev: &device::Device<device::Bound>,
+ mm: &GpuMm,
+ vfn_start: Vfn,
+ num_pages: usize,
+ ) -> Result {
+ pt_map_dispatch!(self, invalidate_ptes(dev, mm, vfn_start, num_pages))
+ }
+}
diff --git a/drivers/gpu/nova-core/mm/vmm.rs b/drivers/gpu/nova-core/mm/vmm.rs
index 05ff77c5f888..1cceea759f6a 100644
--- a/drivers/gpu/nova-core/mm/vmm.rs
+++ b/drivers/gpu/nova-core/mm/vmm.rs
@@ -3,22 +3,31 @@
//! Virtual Memory Manager for NVIDIA GPU page table management.
//!
//! The [`Vmm`] provides high-level page mapping and unmapping operations for GPU
-//! virtual address spaces (Channels, BAR1, BAR2). It wraps the page table walker
-//! and handles TLB flushing after modifications.
+//! virtual address spaces (Channels, BAR1, BAR2).
use kernel::{
device,
gpu::buddy::AllocatedBlocks,
maple_tree::MapleTreeAlloc,
- prelude::*, //
+ prelude::*,
+ rbtree::RBTree, //
};
-use core::ops::Range;
+use core::{
+ cell::Cell,
+ ops::Range, //
+};
use crate::{
mm::{
pagetable::{
- walk::{PtWalk, WalkResult},
+ map::{
+ PtMap, //
+ },
+ walk::{
+ PtWalk,
+ WalkResult, //
+ },
MmuVersion, //
},
GpuMm,
@@ -32,22 +41,108 @@
},
};
+/// Multi-page prepared mapping -- VA range allocated, ready for execute.
+///
+/// Produced by [`Vmm::prepare_map()`], consumed by [`Vmm::execute_map()`].
+/// The VA space allocation is tracked in the [`Vmm`]'s maple tree and freed
+/// on error or via [`Vmm::unmap_pages()`].
+///
+/// Dropping without calling [`Vmm::execute_map()`] logs a warning and leaks
+/// the VA range in the maple tree.
+pub(crate) struct PreparedMapping {
+ vfn_start: Vfn,
+ num_pages: usize,
+ /// Logs a warning if dropped without executing.
+ _drop_guard: MustExecuteGuard,
+}
+
+/// Result of a mapping operation -- tracks the active mapped range.
+///
+/// Returned by [`Vmm::execute_map()`] and [`Vmm::map_pages()`].
+/// Callers must call [`Vmm::unmap_pages()`] before dropping to invalidate
+/// PTEs and free the VA range. Dropping without unmapping logs a warning
+/// and leaks the VA range in the maple tree.
+pub(crate) struct MappedRange {
+ pub(super) vfn_start: Vfn,
+ pub(super) num_pages: usize,
+ /// Logs a warning if dropped without unmapping.
+ _drop_guard: MustUnmapGuard,
+}
+
+/// Guard that logs a warning if a [`PreparedMapping`] is dropped without
+/// being consumed by [`Vmm::execute_map()`].
+struct MustExecuteGuard {
+ armed: Cell<bool>,
+}
+
+impl MustExecuteGuard {
+ const fn new() -> Self {
+ Self {
+ armed: Cell::new(true),
+ }
+ }
+
+ fn disarm(&self) {
+ self.armed.set(false);
+ }
+}
+
+impl Drop for MustExecuteGuard {
+ fn drop(&mut self) {
+ if self.armed.get() {
+ kernel::pr_warn!("PreparedMapping dropped without calling execute_map()\n");
+ }
+ }
+}
+
+/// Guard that logs a warning if a [`MappedRange`] is dropped without
+/// calling [`Vmm::unmap_pages()`].
+struct MustUnmapGuard {
+ armed: Cell<bool>,
+}
+
+impl MustUnmapGuard {
+ const fn new() -> Self {
+ Self {
+ armed: Cell::new(true),
+ }
+ }
+
+ fn disarm(&self) {
+ self.armed.set(false);
+ }
+}
+
+impl Drop for MustUnmapGuard {
+ fn drop(&mut self) {
+ if self.armed.get() {
+ kernel::pr_warn!("MappedRange dropped without calling unmap_pages()\n");
+ }
+ }
+}
+
/// Virtual Memory Manager for a GPU address space.
///
/// Each [`Vmm`] instance manages a single address space identified by its Page
-/// Directory Base (`PDB`) address. The [`Vmm`] is used for Channel, BAR1 and
-/// BAR2 mappings.
+/// Directory Base (`PDB`) address. Used for Channel, BAR1 and BAR2 mappings.
pub(crate) struct Vmm {
/// Page Directory Base address for this address space.
pdb_addr: VramAddress,
- /// MMU version used for page table layout.
- mmu_version: MmuVersion,
+ /// Page table walker for reading existing mappings.
+ pt_walk: PtWalk,
+ /// Page table mapper for prepare/execute operations.
+ pt_map: PtMap,
/// Page table allocations required for mappings.
page_table_allocs: KVec<Pin<KBox<AllocatedBlocks>>>,
/// Maple tree allocator for virtual address range tracking.
virt_alloc: Pin<KBox<MapleTreeAlloc<()>>>,
/// Total number of pages in the virtual address space.
va_pages: usize,
+ /// Prepared PT pages pending PDE installation, keyed by `install_addr`.
+ ///
+ /// Populated during prepare phase and drained in execute phase. Shared by all
+ /// pending maps, preventing races on the same PDE slot.
+ pt_pages: RBTree<VramAddress, super::pagetable::map::PreparedPtPage>,
}
impl Vmm {
@@ -65,20 +160,16 @@ pub(crate) fn new(
Ok(Self {
pdb_addr,
- mmu_version,
+ pt_walk: PtWalk::new(pdb_addr, mmu_version),
+ pt_map: PtMap::new(pdb_addr, mmu_version),
page_table_allocs: KVec::new(),
virt_alloc,
va_pages,
+ pt_pages: RBTree::new(),
})
}
/// Allocate a contiguous virtual frame number range.
- ///
- /// # Arguments
- ///
- /// - `num_pages`: Number of pages to allocate.
- /// - `va_range`: `None` = allocate anywhere, `Some(range)` = constrain allocation to the given
- /// range.
fn alloc_vfn_range(&self, num_pages: usize, va_range: Option<Range<u64>>) -> Result<Vfn> {
let page_size: u64 = PAGE_SIZE.into_safe_cast();
@@ -119,11 +210,152 @@ pub(super) fn read_mapping(
mm: &GpuMm,
vfn: Vfn,
) -> Result<Option<Pfn>> {
- let walker = PtWalk::new(self.pdb_addr, self.mmu_version);
-
- match walker.walk_to_pte(dev, mm, vfn)? {
+ match self.pt_walk.walk_to_pte(dev, mm, vfn)? {
WalkResult::Mapped { pfn, .. } => Ok(Some(pfn)),
WalkResult::Unmapped { .. } | WalkResult::PageTableMissing => Ok(None),
}
}
+
+ /// Prepare resources for mapping `num_pages` pages.
+ ///
+ /// Allocates a contiguous VA range, then walks the hierarchy per-VFN to prepare pages
+ /// for all missing PDEs. Returns a [`PreparedMapping`] with the VA allocation.
+ ///
+ /// If `va_range` is not `None`, the VA range is constrained to the given range. Safe
+ /// to call outside the fence signalling critical path.
+ pub(crate) fn prepare_map(
+ &mut self,
+ dev: &device::Device<device::Bound>,
+ mm: &GpuMm,
+ num_pages: usize,
+ va_range: Option<Range<u64>>,
+ ) -> Result<PreparedMapping> {
+ if num_pages == 0 {
+ return Err(EINVAL);
+ }
+
+ // Allocate contiguous VA range.
+ let vfn_start = self.alloc_vfn_range(num_pages, va_range)?;
+
+ if let Err(e) = self.pt_map.prepare_map(
+ dev,
+ mm,
+ vfn_start,
+ num_pages,
+ &mut self.page_table_allocs,
+ &mut self.pt_pages,
+ ) {
+ self.free_vfn(vfn_start);
+ return Err(e);
+ }
+
+ Ok(PreparedMapping {
+ vfn_start,
+ num_pages,
+ _drop_guard: MustExecuteGuard::new(),
+ })
+ }
+
+ /// Execute a prepared multi-page mapping.
+ ///
+ /// Installs all prepared PDEs and writes PTEs into the page table, then flushes TLB.
+ pub(crate) fn execute_map(
+ &mut self,
+ dev: &device::Device<device::Bound>,
+ mm: &GpuMm,
+ prepared: PreparedMapping,
+ pfns: &[Pfn],
+ writable: bool,
+ ) -> Result<MappedRange> {
+ if pfns.len() != prepared.num_pages {
+ self.free_vfn(prepared.vfn_start);
+ return Err(EINVAL);
+ }
+
+ let PreparedMapping {
+ vfn_start,
+ num_pages,
+ _drop_guard,
+ } = prepared;
+ _drop_guard.disarm();
+
+ if let Err(e) = self.pt_map.install_mappings(
+ dev,
+ mm,
+ &mut self.pt_pages,
+ &mut self.page_table_allocs,
+ vfn_start,
+ pfns,
+ writable,
+ ) {
+ self.free_vfn(vfn_start);
+ return Err(e);
+ }
+
+ Ok(MappedRange {
+ vfn_start,
+ num_pages,
+ _drop_guard: MustUnmapGuard::new(),
+ })
+ }
+
+ /// Map pages doing prepare and execute in the same call.
+ ///
+ /// This is a convenience wrapper for callers outside the fence signalling critical
+ /// path (e.g., BAR mappings). For DRM usecases, [`Vmm::prepare_map()`] and
+ /// [`Vmm::execute_map()`] will be called separately.
+ pub(crate) fn map_pages(
+ &mut self,
+ dev: &device::Device<device::Bound>,
+ mm: &GpuMm,
+ pfns: &[Pfn],
+ va_range: Option<Range<u64>>,
+ writable: bool,
+ ) -> Result<MappedRange> {
+ if pfns.is_empty() {
+ return Err(EINVAL);
+ }
+
+ // Check if provided VA range is sufficient (if provided).
+ if let Some(ref range) = va_range {
+ let required: u64 = pfns
+ .len()
+ .checked_mul(PAGE_SIZE)
+ .ok_or(EOVERFLOW)?
+ .into_safe_cast();
+ let available = range.end.checked_sub(range.start).ok_or(EINVAL)?;
+ if available < required {
+ return Err(EINVAL);
+ }
+ }
+
+ let prepared = self.prepare_map(dev, mm, pfns.len(), va_range)?;
+ self.execute_map(dev, mm, prepared, pfns, writable)
+ }
+
+ /// Unmap all pages in a [`MappedRange`] with a single TLB flush.
+ pub(crate) fn unmap_pages(
+ &mut self,
+ dev: &device::Device<device::Bound>,
+ mm: &GpuMm,
+ range: MappedRange,
+ ) -> Result {
+ let result = self
+ .pt_map
+ .invalidate_ptes(dev, mm, range.vfn_start, range.num_pages);
+
+ // TODO: Internal page table pages (PDE, PTE pages) are still kept around.
+ // This is by design as repeated maps/unmaps will be fast. As a future TODO,
+ // we can add a reclaimer here to reclaim if VRAM is short. For now, the PT
+ // pages are dropped once the `Vmm` is dropped.
+
+ // Free the VA range regardless of PTE invalidation success, so that the VA
+ // range is recovered even on failure (PTEs may be stale, but that is better
+ // than leaking both PTEs and VA range).
+ self.free_vfn(range.vfn_start);
+
+ // Unmap complete, safe to drop `MappedRange`.
+ range._drop_guard.disarm();
+ result
+ }
}
--
2.34.1
^ permalink raw reply related
* [PATCH v1 15/16] gpu: nova-core: mm: Add BAR1 user interface
From: Joel Fernandes @ 2026-05-18 18:11 UTC (permalink / raw)
To: linux-kernel
Cc: Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron, Benno Lossin,
Andreas Hindborg, Alice Ryhl, Trevor Gross, Danilo Krummrich,
Dave Airlie, Daniel Almeida, dri-devel, rust-for-linux, nova-gpu,
Nikola Djukic, David Airlie, Boqun Feng, John Hubbard,
Alistair Popple, Timur Tabi, Edwin Peer, Alexandre Courbot,
Andrea Righi, Andy Ritger, Zhi Wang, Balbir Singh,
Philipp Stanner, alexeyi, Eliot Courtney, joel, linux-doc,
Joel Fernandes
In-Reply-To: <20260518181126.2493572-1-joelagnelf@nvidia.com>
Add the BAR1 user interface for CPU access to GPU virtual memory through
the BAR1 aperture.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
drivers/gpu/nova-core/driver.rs | 22 ++-
drivers/gpu/nova-core/gpu.rs | 41 +++++-
drivers/gpu/nova-core/gsp/commands.rs | 1 -
drivers/gpu/nova-core/mm.rs | 1 +
drivers/gpu/nova-core/mm/bar_user.rs | 194 ++++++++++++++++++++++++++
5 files changed, 255 insertions(+), 4 deletions(-)
create mode 100644 drivers/gpu/nova-core/mm/bar_user.rs
diff --git a/drivers/gpu/nova-core/driver.rs b/drivers/gpu/nova-core/driver.rs
index b14d4b599783..207ba164cf4e 100644
--- a/drivers/gpu/nova-core/driver.rs
+++ b/drivers/gpu/nova-core/driver.rs
@@ -2,10 +2,12 @@
use kernel::{
auxiliary,
+ device::Bound,
device::Core,
devres::Devres,
dma::Device,
dma::DmaMask,
+ io::resource,
pci,
pci::{
Class,
@@ -47,9 +49,27 @@ pub(crate) struct NovaCore {
const GPU_DMA_BITS: u32 = 47;
pub(crate) type Bar0 = pci::Bar<BAR0_SIZE>;
-#[expect(dead_code)]
pub(crate) type Bar1 = pci::Bar;
+/// Returns the Linux PCI resource index that holds BAR1 for an NVIDIA GPU.
+///
+/// On Maxwell through Ada, BAR0 is a 32-bit memory BAR occupying a single
+/// Linux PCI resource slot, so BAR1 lives at index 1. Starting with Blackwell
+/// (and on some Ampere GA100 / Hopper SKUs) BAR0 is a 64-bit memory BAR that
+/// consumes two consecutive resource slots: index 0 holds the low 32 bits and
+/// index 1 holds the high 32 bits (with no `flags` / or size of its own),
+/// shifting BAR1 to index 2.
+pub(crate) fn bar1_resource_index(pdev: &pci::Device<Bound>) -> Result<u32> {
+ // Probe the `IORESOURCE_MEM_64` flag of BAR0 as a robust way of exposing
+ // if BAR0 and hence BAR1 is 64-bit.
+ let flags0 = pdev.resource_flags(0)?;
+ if flags0.contains(resource::Flags::IORESOURCE_MEM_64) {
+ Ok(2)
+ } else {
+ Ok(1)
+ }
+}
+
kernel::pci_device_table!(
PCI_TABLE,
MODULE_PCI_TABLE,
diff --git a/drivers/gpu/nova-core/gpu.rs b/drivers/gpu/nova-core/gpu.rs
index f789d956cc49..b0eebe6406e5 100644
--- a/drivers/gpu/nova-core/gpu.rs
+++ b/drivers/gpu/nova-core/gpu.rs
@@ -19,7 +19,10 @@
use crate::{
bounded_enum,
- driver::Bar0,
+ driver::{
+ Bar0,
+ Bar1, //
+ },
falcon::{
gsp::Gsp as GspFalcon,
sec2::Sec2 as Sec2Falcon,
@@ -31,8 +34,11 @@
Gsp, //
},
mm::{
+ bar_user::BarUser,
+ pagetable::MmuVersion,
GpuMm,
- IntoVramRange, //
+ IntoVramRange,
+ VramAddress, //
},
regs,
};
@@ -145,6 +151,11 @@ pub(crate) const fn arch(self) -> Architecture {
pub(crate) const fn needs_fwsec_bootloader(self) -> bool {
matches!(self.arch(), Architecture::Turing) || matches!(self, Self::GA100)
}
+
+ /// Returns the MMU version for this chipset.
+ pub(crate) fn mmu_version(self) -> MmuVersion {
+ MmuVersion::from(self.arch())
+ }
}
// TODO
@@ -263,6 +274,8 @@ pub(crate) struct Gpu {
spec: Spec,
/// MMIO mapping of PCI BAR 0
bar: Arc<Devres<Bar0>>,
+ /// MMIO mapping of PCI BAR 1.
+ bar1: Arc<Devres<Bar1>>,
/// System memory page required for flushing all pending GPU-side memory writes done through
/// PCIE into system memory, via sysmembar (A GPU-initiated HW memory-barrier operation).
sysmem_flush: SysmemFlush,
@@ -276,6 +289,8 @@ pub(crate) struct Gpu {
#[pin]
gsp: Gsp,
gsp_static_info: GetGspStaticInfoReply,
+ /// BAR1 user interface for CPU access to GPU virtual memory.
+ bar_user: Arc<BarUser>,
}
impl Gpu {
@@ -348,6 +363,28 @@ pub(crate) fn new<'a>(
)?
},
+ bar1: {
+ let bar1_idx = crate::driver::bar1_resource_index(pdev)?;
+ Arc::pin_init(pdev.iomap_region(bar1_idx, c"nova-core/bar1"), GFP_KERNEL)?
+ },
+
+ // Create BAR1 user interface for CPU access to GPU virtual memory.
+ bar_user: {
+ let pdb_addr = VramAddress::new(gsp_static_info.bar1_pde_base);
+ let bar1_idx = crate::driver::bar1_resource_index(pdev)?;
+ let bar1_size = pdev.resource_len(bar1_idx)?;
+ Arc::pin_init(
+ BarUser::new(
+ pdb_addr,
+ spec.chipset,
+ bar1_size,
+ mm.clone(),
+ bar1.clone(),
+ )?,
+ GFP_KERNEL,
+ )?
+ },
+
bar: devres_bar,
})
}
diff --git a/drivers/gpu/nova-core/gsp/commands.rs b/drivers/gpu/nova-core/gsp/commands.rs
index bee7539eff60..301c95686efd 100644
--- a/drivers/gpu/nova-core/gsp/commands.rs
+++ b/drivers/gpu/nova-core/gsp/commands.rs
@@ -194,7 +194,6 @@ fn init(&self) -> impl Init<Self::Command, Self::InitError> {
pub(crate) struct GetGspStaticInfoReply {
gpu_name: [u8; 64],
/// BAR1 Page Directory Entry base address.
- #[expect(dead_code)]
pub(crate) bar1_pde_base: u64,
/// Usable FB (VRAM) region for driver memory allocation.
pub(crate) usable_fb_region: Range<u64>,
diff --git a/drivers/gpu/nova-core/mm.rs b/drivers/gpu/nova-core/mm.rs
index 502c7fdceba2..4741ef60593b 100644
--- a/drivers/gpu/nova-core/mm.rs
+++ b/drivers/gpu/nova-core/mm.rs
@@ -31,6 +31,7 @@ macro_rules! impl_pfn_bounded {
};
}
+pub(crate) mod bar_user;
pub(super) mod pagetable;
pub(crate) mod pramin;
pub(super) mod tlb;
diff --git a/drivers/gpu/nova-core/mm/bar_user.rs b/drivers/gpu/nova-core/mm/bar_user.rs
new file mode 100644
index 000000000000..bb9742c036b7
--- /dev/null
+++ b/drivers/gpu/nova-core/mm/bar_user.rs
@@ -0,0 +1,194 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! BAR1 user interface for CPU access to GPU virtual memory. Used for USERD
+//! for GPU work submission, and applications to access GPU buffers via mmap().
+
+use kernel::{
+ device,
+ devres::Devres,
+ io::Io,
+ new_mutex,
+ prelude::*,
+ sync::{
+ Arc,
+ Mutex, //
+ },
+};
+
+use crate::{
+ driver::Bar1,
+ gpu::Chipset,
+ mm::{
+ vmm::{
+ MappedRange,
+ Vmm, //
+ },
+ GpuMm,
+ Pfn,
+ Vfn,
+ VirtualAddress,
+ VramAddress,
+ PAGE_SIZE, //
+ },
+ num::IntoSafeCast,
+};
+
+/// BAR1 user interface for virtual memory mappings.
+///
+/// Owns the [`Vmm`] for the BAR1 address space.
+#[pin_data]
+pub(crate) struct BarUser {
+ #[pin]
+ vmm: Mutex<Vmm>,
+ mm: Arc<GpuMm>,
+ bar1: Arc<Devres<Bar1>>,
+}
+
+impl BarUser {
+ /// Create a pin-initializer for [`BarUser`].
+ pub(crate) fn new(
+ pdb_addr: VramAddress,
+ chipset: Chipset,
+ va_size: u64,
+ mm: Arc<GpuMm>,
+ bar1: Arc<Devres<Bar1>>,
+ ) -> Result<impl PinInit<Self>> {
+ let vmm = Vmm::new(pdb_addr, chipset.mmu_version(), va_size)?;
+ Ok(pin_init!(Self {
+ vmm <- new_mutex!(vmm, "bar_user_vmm"),
+ mm,
+ bar1,
+ }))
+ }
+
+ /// Map physical pages to a contiguous BAR1 virtual range.
+ pub(crate) fn map(
+ self: &Arc<Self>,
+ dev: &device::Device<device::Bound>,
+ pfns: &[Pfn],
+ writable: bool,
+ ) -> Result<BarUserAccess> {
+ if pfns.is_empty() {
+ return Err(EINVAL);
+ }
+ let mut vmm = self.vmm.lock();
+ let mapped = vmm.map_pages(dev, &self.mm, pfns, None, writable)?;
+
+ Ok(BarUserAccess {
+ bar_user: self.clone(),
+ mapped: Some(mapped),
+ })
+ }
+}
+
+/// Access object for a mapped BAR1 region.
+pub(crate) struct BarUserAccess {
+ bar_user: Arc<BarUser>,
+ /// [`BarUserAccess::release`] [`Option::take`]s this; `Some` at
+ /// drop time means `release()` was never called.
+ mapped: Option<MappedRange>,
+}
+
+impl BarUserAccess {
+ /// Tear down the BAR1 mapping using a caller-supplied bound device.
+ pub(crate) fn release(mut self, dev: &device::Device<device::Bound>) -> Result {
+ let mapped = self.mapped.take().ok_or(EINVAL)?;
+ let mut vmm = self.bar_user.vmm.lock();
+ vmm.unmap_pages(dev, &self.bar_user.mm, mapped)?;
+ Ok(())
+ }
+
+ /// Returns the active mapping.
+ fn mapped(&self) -> &MappedRange {
+ // `mapped` is only `None` after `take()` in `release`; hence unwrap()
+ // cannot panic here.
+ self.mapped.as_ref().unwrap()
+ }
+
+ /// Get the base virtual address of this mapping.
+ pub(crate) fn base(&self) -> VirtualAddress {
+ VirtualAddress::from(self.mapped().vfn_start)
+ }
+
+ /// Get the total size of the mapped region in bytes.
+ pub(crate) fn size(&self) -> usize {
+ self.mapped().num_pages * PAGE_SIZE
+ }
+
+ /// Get the starting virtual frame number.
+ pub(crate) fn vfn_start(&self) -> Vfn {
+ self.mapped().vfn_start
+ }
+
+ /// Get the number of pages in this mapping.
+ pub(crate) fn num_pages(&self) -> usize {
+ self.mapped().num_pages
+ }
+
+ /// Translate an offset within this mapping to a BAR1 aperture offset.
+ fn bar_offset(&self, offset: usize) -> Result<usize> {
+ if offset >= self.size() {
+ return Err(EINVAL);
+ }
+
+ let base_vfn: usize = self.mapped().vfn_start.raw().into_safe_cast();
+ let base = base_vfn.checked_mul(PAGE_SIZE).ok_or(EOVERFLOW)?;
+ base.checked_add(offset).ok_or(EOVERFLOW)
+ }
+
+ // Fallible accessors with runtime bounds checking.
+
+ /// Read a 32-bit value at the given offset.
+ pub(crate) fn try_read32(
+ &self,
+ dev: &device::Device<device::Bound>,
+ offset: usize,
+ ) -> Result<u32> {
+ let off = self.bar_offset(offset)?;
+ self.bar_user.bar1.access(dev)?.try_read32(off)
+ }
+
+ /// Write a 32-bit value at the given offset.
+ pub(crate) fn try_write32(
+ &self,
+ dev: &device::Device<device::Bound>,
+ value: u32,
+ offset: usize,
+ ) -> Result {
+ let off = self.bar_offset(offset)?;
+ self.bar_user.bar1.access(dev)?.try_write32(value, off)
+ }
+
+ /// Read a 64-bit value at the given offset.
+ pub(crate) fn try_read64(
+ &self,
+ dev: &device::Device<device::Bound>,
+ offset: usize,
+ ) -> Result<u64> {
+ let off = self.bar_offset(offset)?;
+ self.bar_user.bar1.access(dev)?.try_read64(off)
+ }
+
+ /// Write a 64-bit value at the given offset.
+ pub(crate) fn try_write64(
+ &self,
+ dev: &device::Device<device::Bound>,
+ value: u64,
+ offset: usize,
+ ) -> Result {
+ let off = self.bar_offset(offset)?;
+ self.bar_user.bar1.access(dev)?.try_write64(value, off)
+ }
+}
+
+impl Drop for BarUserAccess {
+ fn drop(&mut self) {
+ if self.mapped.is_some() {
+ kernel::pr_warn!(
+ "BarUserAccess dropped without calling release(). BarUser address space will leak.\n"
+ );
+ }
+ // The inner `MappedRange`'s own `MustUnmapGuard` will also fire,
+ // identifying the leaked VA range.
+ }
+}
--
2.34.1
^ permalink raw reply related
* [PATCH v1 09/16] gpu: nova-core: mm: pagetable: Add MmuConfig trait
From: Joel Fernandes @ 2026-05-18 18:11 UTC (permalink / raw)
To: linux-kernel
Cc: Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron, Benno Lossin,
Andreas Hindborg, Alice Ryhl, Trevor Gross, Danilo Krummrich,
Dave Airlie, Daniel Almeida, dri-devel, rust-for-linux, nova-gpu,
Nikola Djukic, David Airlie, Boqun Feng, John Hubbard,
Alistair Popple, Timur Tabi, Edwin Peer, Alexandre Courbot,
Andrea Righi, Andy Ritger, Zhi Wang, Balbir Singh,
Philipp Stanner, alexeyi, Eliot Courtney, joel, linux-doc,
Joel Fernandes
In-Reply-To: <20260518181126.2493572-1-joelagnelf@nvidia.com>
Introduce `MmuConfig`, the trait that ties the entry-operation traits
(`PteOps`, `PdeOps`, `DualPdeOps`) together with the version-specific
constants and helpers.
`MmuV2` and `MmuV3` are zero-sized marker structs that implement
`MmuConfig` for Turing/Ampere/Ada and Hopper/Blackwell respectively.
Dispatch is fully resolved at compile time through these markers, so
version-specific code is selected without runtime overhead and without
wrapper enums.
This enables version-agnostic page-table operations while keeping
version-specific implementation details encapsulated in the `ver2` and
`ver3` modules.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
drivers/gpu/nova-core/mm/pagetable.rs | 109 ++++++++++++++++++++++++++
1 file changed, 109 insertions(+)
diff --git a/drivers/gpu/nova-core/mm/pagetable.rs b/drivers/gpu/nova-core/mm/pagetable.rs
index 3cc546f94fdb..38f4f0c6e8ce 100644
--- a/drivers/gpu/nova-core/mm/pagetable.rs
+++ b/drivers/gpu/nova-core/mm/pagetable.rs
@@ -19,6 +19,7 @@
use crate::mm::{
pramin,
Pfn,
+ VirtualAddress,
VramAddress, //
};
@@ -196,6 +197,114 @@ fn write(&self, window: &mut pramin::PraminWindow<'_>, addr: VramAddress) -> Res
}
}
+/// MMU configuration trait -- encodes version-specific constants and types.
+pub(super) trait MmuConfig: 'static {
+ /// Page Table Entry type.
+ type Pte: PteOps;
+ /// Page Directory Entry type.
+ type Pde: PdeOps;
+ /// Dual Page Directory Entry type (128-bit).
+ type DualPde: DualPdeOps;
+
+ /// PDE levels (excluding PTE level) for page table walking.
+ const PDE_LEVELS: &'static [PageTableLevel];
+ /// PTE level for this MMU version.
+ const PTE_LEVEL: PageTableLevel;
+ /// Dual PDE level (128-bit entries) for this MMU version.
+ const DUAL_PDE_LEVEL: PageTableLevel;
+
+ /// Get the number of entries per page table page for a given level.
+ fn entries_per_page(level: PageTableLevel) -> usize;
+
+ /// Extract the page table index at `level` from `va`.
+ fn level_index(va: VirtualAddress, level: u64) -> u64;
+
+ /// Get the entry size in bytes for a given level.
+ fn entry_size(level: PageTableLevel) -> usize {
+ if level == Self::DUAL_PDE_LEVEL {
+ 16 // 128-bit dual PDE
+ } else {
+ 8 // 64-bit PDE/PTE
+ }
+ }
+
+ /// Compute upper bound on page table pages needed for `num_virt_pages`.
+ ///
+ /// Walks from PTE level up through PDE levels, accumulating the tree.
+ fn pt_pages_upper_bound(num_virt_pages: usize) -> usize {
+ let mut total = 0;
+
+ // PTE pages at the leaf level.
+ let pte_epp = Self::entries_per_page(Self::PTE_LEVEL);
+ let mut pages_at_level = num_virt_pages.div_ceil(pte_epp);
+ total += pages_at_level;
+
+ // Walk PDE levels bottom-up (reverse of PDE_LEVELS).
+ for &level in Self::PDE_LEVELS.iter().rev() {
+ let epp = Self::entries_per_page(level);
+
+ // How many pages at this level do we need to point to
+ // the previous pages_at_level?
+ pages_at_level = pages_at_level.div_ceil(epp);
+ total += pages_at_level;
+ }
+
+ total
+ }
+}
+
+/// Marker struct for MMU v2 (Turing/Ampere/Ada).
+pub(super) struct MmuV2;
+
+impl MmuConfig for MmuV2 {
+ type Pte = ver2::Pte;
+ type Pde = ver2::Pde;
+ type DualPde = ver2::DualPde;
+
+ const PDE_LEVELS: &'static [PageTableLevel] = ver2::PDE_LEVELS;
+ const PTE_LEVEL: PageTableLevel = ver2::PTE_LEVEL;
+ const DUAL_PDE_LEVEL: PageTableLevel = ver2::DUAL_PDE_LEVEL;
+
+ fn entries_per_page(level: PageTableLevel) -> usize {
+ // TODO: Calculate these values from the bitfield dynamically
+ // instead of hardcoding them.
+ match level {
+ PageTableLevel::Pdb => 4, // PD3 root: bits [48:47] = 2 bits
+ PageTableLevel::L3 => 256, // PD0 dual: bits [28:21] = 8 bits
+ _ => 512, // PD2, PD1, PT: 9 bits each
+ }
+ }
+
+ fn level_index(va: VirtualAddress, level: u64) -> u64 {
+ ver2::VirtualAddressV2::new(va).level_index(level)
+ }
+}
+
+/// Marker struct for MMU v3 (Hopper and later).
+pub(super) struct MmuV3;
+
+impl MmuConfig for MmuV3 {
+ type Pte = ver3::Pte;
+ type Pde = ver3::Pde;
+ type DualPde = ver3::DualPde;
+
+ const PDE_LEVELS: &'static [PageTableLevel] = ver3::PDE_LEVELS;
+ const PTE_LEVEL: PageTableLevel = ver3::PTE_LEVEL;
+ const DUAL_PDE_LEVEL: PageTableLevel = ver3::DUAL_PDE_LEVEL;
+
+ fn entries_per_page(level: PageTableLevel) -> usize {
+ match level {
+ PageTableLevel::Pdb => 2, // PDE4 root: bit [56] = 1 bit, 2 entries
+ PageTableLevel::L4 => 256, // PDE0 dual: bits [28:21] = 8 bits
+ _ => 512, // PDE3, PDE2, PDE1, PT: 9 bits each
+ }
+ }
+
+ fn level_index(va: VirtualAddress, level: u64) -> u64 {
+ ver3::VirtualAddressV3::new(va).level_index(level)
+ }
+}
+
/// Memory aperture for Page Table Entries (`PTE`s).
///
/// Determines which memory region the `PTE` points to.
--
2.34.1
^ permalink raw reply related
* [PATCH v1 14/16] gpu: nova-core: Add BAR1 aperture type and size constant
From: Joel Fernandes @ 2026-05-18 18:11 UTC (permalink / raw)
To: linux-kernel
Cc: Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron, Benno Lossin,
Andreas Hindborg, Alice Ryhl, Trevor Gross, Danilo Krummrich,
Dave Airlie, Daniel Almeida, dri-devel, rust-for-linux, nova-gpu,
Nikola Djukic, David Airlie, Boqun Feng, John Hubbard,
Alistair Popple, Timur Tabi, Edwin Peer, Alexandre Courbot,
Andrea Righi, Andy Ritger, Zhi Wang, Balbir Singh,
Philipp Stanner, alexeyi, Eliot Courtney, joel, linux-doc,
Joel Fernandes
In-Reply-To: <20260518181126.2493572-1-joelagnelf@nvidia.com>
Add BAR1_SIZE constant and Bar1 type alias for the 256MB BAR1 aperture.
These are prerequisites for BAR1 memory access functionality.
Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
drivers/gpu/nova-core/driver.rs | 2 ++
drivers/gpu/nova-core/gsp/commands.rs | 4 ++++
drivers/gpu/nova-core/gsp/fw/commands.rs | 8 ++++++++
3 files changed, 14 insertions(+)
diff --git a/drivers/gpu/nova-core/driver.rs b/drivers/gpu/nova-core/driver.rs
index 77746d6949d7..b14d4b599783 100644
--- a/drivers/gpu/nova-core/driver.rs
+++ b/drivers/gpu/nova-core/driver.rs
@@ -47,6 +47,8 @@ pub(crate) struct NovaCore {
const GPU_DMA_BITS: u32 = 47;
pub(crate) type Bar0 = pci::Bar<BAR0_SIZE>;
+#[expect(dead_code)]
+pub(crate) type Bar1 = pci::Bar;
kernel::pci_device_table!(
PCI_TABLE,
diff --git a/drivers/gpu/nova-core/gsp/commands.rs b/drivers/gpu/nova-core/gsp/commands.rs
index 5abd7950320b..bee7539eff60 100644
--- a/drivers/gpu/nova-core/gsp/commands.rs
+++ b/drivers/gpu/nova-core/gsp/commands.rs
@@ -193,6 +193,9 @@ fn init(&self) -> impl Init<Self::Command, Self::InitError> {
/// The reply from the GSP to the [`GetGspStaticInfo`] command.
pub(crate) struct GetGspStaticInfoReply {
gpu_name: [u8; 64],
+ /// BAR1 Page Directory Entry base address.
+ #[expect(dead_code)]
+ pub(crate) bar1_pde_base: u64,
/// Usable FB (VRAM) region for driver memory allocation.
pub(crate) usable_fb_region: Range<u64>,
/// End of VRAM.
@@ -212,6 +215,7 @@ fn read(
Ok(GetGspStaticInfoReply {
gpu_name: msg.gpu_name_str(),
+ bar1_pde_base: msg.bar1_pde_base(),
usable_fb_region: msg.usable_fb_regions_iter().next().ok_or(ENODEV)?,
total_fb_end,
})
diff --git a/drivers/gpu/nova-core/gsp/fw/commands.rs b/drivers/gpu/nova-core/gsp/fw/commands.rs
index ea663079d95c..13418b494a73 100644
--- a/drivers/gpu/nova-core/gsp/fw/commands.rs
+++ b/drivers/gpu/nova-core/gsp/fw/commands.rs
@@ -127,6 +127,14 @@ impl GspStaticConfigInfo {
self.0.gpuNameString
}
+ /// Returns the BAR1 Page Directory Entry base address.
+ ///
+ /// This is the root page table address for BAR1 virtual memory,
+ /// set up by GSP-RM firmware.
+ pub(crate) fn bar1_pde_base(&self) -> u64 {
+ self.0.bar1PdeBase
+ }
+
/// Returns an iterator over valid FB regions from GSP firmware data.
fn fb_regions(
&self,
--
2.34.1
^ permalink raw reply related
* [PATCH v1 08/16] gpu: nova-core: mm: Add MMU v3 page table types
From: Joel Fernandes @ 2026-05-18 18:11 UTC (permalink / raw)
To: linux-kernel
Cc: Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron, Benno Lossin,
Andreas Hindborg, Alice Ryhl, Trevor Gross, Danilo Krummrich,
Dave Airlie, Daniel Almeida, dri-devel, rust-for-linux, nova-gpu,
Nikola Djukic, David Airlie, Boqun Feng, John Hubbard,
Alistair Popple, Timur Tabi, Edwin Peer, Alexandre Courbot,
Andrea Righi, Andy Ritger, Zhi Wang, Balbir Singh,
Philipp Stanner, alexeyi, Eliot Courtney, joel, linux-doc,
Joel Fernandes
In-Reply-To: <20260518181126.2493572-1-joelagnelf@nvidia.com>
Add page table entry and directory structures for MMU version 3 used by
Hopper and later GPUs. The `Pte`, `Pde`, and `DualPde` types each
implement the `PteOps`, `PdeOps`, and `DualPdeOps` traits introduced
earlier in the series, providing the version-agnostic API used by the
forthcoming page-table walker and mapper.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
drivers/gpu/nova-core/mm/pagetable.rs | 1 +
drivers/gpu/nova-core/mm/pagetable/ver3.rs | 421 +++++++++++++++++++++
2 files changed, 422 insertions(+)
create mode 100644 drivers/gpu/nova-core/mm/pagetable/ver3.rs
diff --git a/drivers/gpu/nova-core/mm/pagetable.rs b/drivers/gpu/nova-core/mm/pagetable.rs
index df041fc89390..3cc546f94fdb 100644
--- a/drivers/gpu/nova-core/mm/pagetable.rs
+++ b/drivers/gpu/nova-core/mm/pagetable.rs
@@ -9,6 +9,7 @@
#![expect(dead_code)]
pub(super) mod ver2;
+pub(super) mod ver3;
use kernel::prelude::*;
diff --git a/drivers/gpu/nova-core/mm/pagetable/ver3.rs b/drivers/gpu/nova-core/mm/pagetable/ver3.rs
new file mode 100644
index 000000000000..805be90df45d
--- /dev/null
+++ b/drivers/gpu/nova-core/mm/pagetable/ver3.rs
@@ -0,0 +1,421 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! MMU v3 page table types for Hopper and later GPUs.
+//!
+//! This module defines MMU version 3 specific types (Hopper and later GPUs).
+//!
+//! Key differences from MMU v2:
+//! - Unified 40-bit address field for all apertures (v2 had separate sys/vid fields).
+//! - PCF (Page Classification Field) replaces separate privilege/RO/atomic/cache bits.
+//! - KIND field is 4 bits (not 8).
+//! - IS_PTE bit in PDE to support large pages directly.
+//! - No COMPTAGLINE field (compression handled differently in v3).
+//! - No separate ENCRYPTED bit.
+//!
+//! Bit field layouts derived from the NVIDIA OpenRM documentation:
+//! `open-gpu-kernel-modules/src/common/inc/swref/published/hopper/gh100/dev_mmu.h`
+
+#![allow(dead_code)]
+
+use kernel::bitfield;
+use kernel::num::Bounded;
+use kernel::prelude::*;
+use pin_init::Zeroable;
+
+use super::{
+ AperturePde,
+ AperturePte,
+ DualPdeOps,
+ PageTableLevel,
+ PdeOps,
+ PteOps,
+ VaLevelIndex, //
+};
+use crate::mm::{
+ Pfn,
+ VirtualAddress,
+ VramAddress, //
+};
+
+// Bounded to version 3 Pfn conversion.
+impl_pfn_bounded!(40);
+
+bitfield! {
+ /// MMU v3 57-bit virtual address layout.
+ pub(super) struct VirtualAddressV3(u64) {
+ /// Page offset [11:0].
+ 11:0 offset;
+ /// PT index [20:12].
+ 20:12 pt_idx;
+ /// PDE0 index [28:21].
+ 28:21 pde0_idx;
+ /// PDE1 index [37:29].
+ 37:29 pde1_idx;
+ /// PDE2 index [46:38].
+ 46:38 pde2_idx;
+ /// PDE3 index [55:47].
+ 55:47 pde3_idx;
+ /// PDE4 index [56].
+ 56:56 pde4_idx;
+ }
+}
+
+impl VirtualAddressV3 {
+ /// Create a [`VirtualAddressV3`] from a [`VirtualAddress`].
+ pub(super) fn new(va: VirtualAddress) -> Self {
+ Self::from_raw(va.into_raw())
+ }
+}
+
+impl VaLevelIndex for VirtualAddressV3 {
+ fn level_index(&self, level: u64) -> u64 {
+ match level {
+ 0 => *self.pde4_idx(),
+ 1 => *self.pde3_idx(),
+ 2 => *self.pde2_idx(),
+ 3 => *self.pde1_idx(),
+ 4 => *self.pde0_idx(),
+ 5 => *self.pt_idx(),
+ _ => 0,
+ }
+ }
+}
+
+/// PDE levels for MMU v3 (6-level hierarchy).
+pub(super) const PDE_LEVELS: &[PageTableLevel] = &[
+ PageTableLevel::Pdb,
+ PageTableLevel::L1,
+ PageTableLevel::L2,
+ PageTableLevel::L3,
+ PageTableLevel::L4,
+];
+
+/// PTE level for MMU v3.
+pub(super) const PTE_LEVEL: PageTableLevel = PageTableLevel::L5;
+
+/// Dual PDE level for MMU v3 (128-bit entries).
+pub(super) const DUAL_PDE_LEVEL: PageTableLevel = PageTableLevel::L4;
+
+bitfield! {
+ /// Page Classification Field for PTEs (5 bits) in MMU v3.
+ pub(in crate::mm) struct PtePcf(u8) {
+ /// Bypass L2 cache (0=cached, 1=bypass).
+ 0:0 uncached;
+ /// Access counting disabled (0=enabled, 1=disabled).
+ 1:1 acd;
+ /// Read-only access (0=read-write, 1=read-only).
+ 2:2 read_only;
+ /// Atomics disabled (0=enabled, 1=disabled).
+ 3:3 no_atomic;
+ /// Privileged access only (0=regular, 1=privileged).
+ 4:4 privileged;
+ }
+}
+
+impl PtePcf {
+ /// Create PCF for read-write mapping (cached, no atomics, regular mode).
+ fn rw() -> Self {
+ Self::zeroed().with_no_atomic(true)
+ }
+
+ /// Create PCF for read-only mapping (cached, no atomics, regular mode).
+ fn ro() -> Self {
+ Self::zeroed().with_read_only(true).with_no_atomic(true)
+ }
+
+ /// Get the raw `u8` value.
+ fn raw_u8(&self) -> u8 {
+ self.into_raw()
+ }
+}
+
+impl From<Bounded<u64, 5>> for PtePcf {
+ fn from(val: Bounded<u64, 5>) -> Self {
+ Self::from_raw(u8::from(val))
+ }
+}
+
+impl From<PtePcf> for Bounded<u64, 5> {
+ fn from(pcf: PtePcf) -> Self {
+ Bounded::from_expr(u64::from(pcf.into_raw()) & 0x1F)
+ }
+}
+
+bitfield! {
+ /// Page Classification Field for PDEs (3 bits) in MMU v3.
+ ///
+ /// Controls Address Translation Services (ATS) and caching.
+ pub(in crate::mm) struct PdePcf(u8) {
+ /// Bypass L2 cache (0=cached, 1=bypass).
+ 0:0 uncached;
+ /// ATS disabled (0=enabled, 1=disabled).
+ 1:1 no_ats;
+ }
+}
+
+impl PdePcf {
+ /// Create PCF for cached mapping with ATS enabled (default).
+ fn cached() -> Self {
+ Self::zeroed()
+ }
+
+ /// Get the raw `u8` value.
+ fn raw_u8(&self) -> u8 {
+ self.into_raw()
+ }
+}
+
+impl From<Bounded<u64, 3>> for PdePcf {
+ fn from(val: Bounded<u64, 3>) -> Self {
+ Self::from_raw(u8::from(val))
+ }
+}
+
+impl From<PdePcf> for Bounded<u64, 3> {
+ fn from(pcf: PdePcf) -> Self {
+ Bounded::from_expr(u64::from(pcf.into_raw()) & 0x7)
+ }
+}
+
+bitfield! {
+ /// Page Table Entry for MMU v3.
+ pub(in crate::mm) struct Pte(u64) {
+ /// Entry is valid.
+ 0:0 valid;
+ /// Memory aperture type.
+ 2:1 aperture => AperturePte;
+ /// Page Classification Field.
+ 7:3 pcf => PtePcf;
+ /// Surface kind (4 bits, 0x0=pitch, 0xF=invalid).
+ 11:8 kind;
+ /// Physical frame number (for all apertures).
+ 51:12 frame_number => Pfn;
+ /// Peer GPU ID for peer memory (0-7).
+ 63:61 peer_id;
+ }
+}
+
+impl PteOps for Pte {
+ fn from_raw(val: u64) -> Self {
+ Self::from_raw(val)
+ }
+
+ fn invalid() -> Self {
+ Self::zeroed()
+ }
+
+ fn new(aperture: AperturePte, pfn: Pfn, writable: bool) -> Self {
+ let pcf = match (aperture, writable) {
+ (AperturePte::VideoMemory, true) => PtePcf::rw(),
+ (AperturePte::VideoMemory, false) => PtePcf::ro(),
+ // Sysmem PTEs use uncached+no_atomic PCF for cache coherency.
+ (AperturePte::SystemCoherent, true) => PtePcf::zeroed()
+ .with_uncached(true)
+ .with_no_atomic(true),
+ (AperturePte::SystemCoherent, false) => PtePcf::zeroed()
+ .with_uncached(true)
+ .with_no_atomic(true)
+ .with_read_only(true),
+ (AperturePte::PeerMemory | AperturePte::SystemNonCoherent, _) => {
+ kernel::pr_warn!("MMU v3 PTE aperture {:?} not supported\n", aperture);
+ return Self::invalid();
+ }
+ };
+ Self::zeroed()
+ .with_valid(true)
+ .with_aperture(aperture)
+ .with_pcf(pcf)
+ .with_frame_number(pfn)
+ }
+
+ fn is_valid(&self) -> bool {
+ self.valid().into_bool()
+ }
+
+ fn frame_number(&self) -> Pfn {
+ Pte::frame_number(*self)
+ }
+}
+
+bitfield! {
+ /// Page Directory Entry for MMU v3 (Hopper+).
+ ///
+ /// ## Note
+ ///
+ /// v3 uses a unified 40-bit address field (v2 had separate sys/vid address fields).
+ pub(in crate::mm) struct Pde(u64) {
+ /// Entry is a PTE (0=PDE, 1=large page PTE).
+ 0:0 is_pte;
+ /// Memory aperture type.
+ 2:1 aperture => AperturePde;
+ /// Page Classification Field (3 bits for PDE).
+ 5:3 pcf => PdePcf;
+ /// Table frame number (40-bit unified address).
+ 51:12 table_frame => Pfn;
+ }
+}
+
+impl PdeOps for Pde {
+ fn from_raw(val: u64) -> Self {
+ Self::from_raw(val)
+ }
+
+ fn new(aperture: AperturePde, table_pfn: Pfn) -> Self {
+ match aperture {
+ AperturePde::VideoMemory => Self::zeroed()
+ .with_is_pte(false)
+ .with_aperture(aperture)
+ .with_table_frame(table_pfn),
+ AperturePde::Invalid
+ | AperturePde::SystemCoherent
+ | AperturePde::SystemNonCoherent => {
+ kernel::pr_warn!("MMU v3 PDE aperture {:?} not supported\n", aperture);
+ Self::invalid()
+ }
+ }
+ }
+
+ fn invalid() -> Self {
+ Self::zeroed().with_aperture(AperturePde::Invalid)
+ }
+
+ fn is_valid(&self) -> bool {
+ Pde::aperture(*self) != AperturePde::Invalid
+ }
+
+ fn aperture(&self) -> AperturePde {
+ Pde::aperture(*self)
+ }
+
+ fn table_vram_address(&self) -> VramAddress {
+ debug_assert!(
+ Pde::aperture(*self) == AperturePde::VideoMemory,
+ "table_vram_address called on non-VRAM PDE (aperture: {:?})",
+ Pde::aperture(*self)
+ );
+ VramAddress::from(self.table_frame())
+ }
+}
+
+bitfield! {
+ /// Big Page Table pointer in Dual PDE (MMU v3).
+ ///
+ /// 64-bit lower word of the 128-bit Dual PDE.
+ pub(super) struct DualPdeBig(u64) {
+ /// Entry is a PTE (for large pages).
+ 0:0 is_pte;
+ /// Memory aperture type.
+ 2:1 aperture => AperturePde;
+ /// Page Classification Field.
+ 5:3 pcf => PdePcf;
+ /// Table frame (table address 256-byte aligned).
+ 51:8 table_frame;
+ }
+}
+
+impl DualPdeBig {
+ /// Create an invalid big page table pointer.
+ fn invalid() -> Self {
+ Self::zeroed().with_aperture(AperturePde::Invalid)
+ }
+
+ /// Create a valid big PDE pointing to a page table in the given aperture.
+ fn new(aperture: AperturePde, table_addr: VramAddress) -> Result<Self> {
+ // Big page table addresses must be 256-byte aligned (shift 8).
+ if table_addr.raw() & 0xFF != 0 {
+ return Err(EINVAL);
+ }
+ let table_frame = Bounded::from_expr(table_addr.raw() >> 8);
+ match aperture {
+ AperturePde::VideoMemory => Ok(Self::zeroed()
+ .with_is_pte(false)
+ .with_aperture(aperture)
+ .with_table_frame(table_frame)),
+ AperturePde::Invalid
+ | AperturePde::SystemCoherent
+ | AperturePde::SystemNonCoherent => {
+ kernel::pr_warn!("MMU v3 DualPdeBig aperture {:?} not supported\n", aperture);
+ Ok(Self::invalid())
+ }
+ }
+ }
+
+ /// Check if this big PDE is valid.
+ fn is_valid(&self) -> bool {
+ self.aperture() != AperturePde::Invalid
+ }
+
+ /// Get the VRAM address of the big page table.
+ fn table_vram_address(&self) -> VramAddress {
+ debug_assert!(
+ self.aperture() == AperturePde::VideoMemory,
+ "table_vram_address called on non-VRAM DualPdeBig (aperture: {:?})",
+ self.aperture()
+ );
+ VramAddress::new(*self.table_frame() << 8)
+ }
+}
+
+/// Dual PDE at Level 4 for MMU v3 - 128-bit entry.
+///
+/// Contains both big (64KB) and small (4KB) page table pointers:
+/// - Lower 64 bits: Big Page Table pointer.
+/// - Upper 64 bits: Small Page Table pointer.
+///
+/// ## Note
+///
+/// The big and small page table pointers have different address layouts:
+/// - Big address = field value << 8 (256-byte alignment).
+/// - Small address = field value << 12 (4KB alignment).
+///
+/// This is why `DualPdeBig` is a separate type from `Pde`.
+#[repr(C)]
+#[derive(Debug, Clone, Copy)]
+pub(in crate::mm) struct DualPde {
+ /// Big Page Table pointer.
+ pub(super) big: DualPdeBig,
+ /// Small Page Table pointer.
+ pub(super) small: Pde,
+}
+
+// SAFETY: Both `DualPdeBig` and `Pde` fields are `Zeroable` (bitfield types are Zeroable).
+unsafe impl Zeroable for DualPde {}
+
+impl DualPde {
+ /// Check if the big page table pointer is valid.
+ fn has_big(&self) -> bool {
+ self.big.is_valid()
+ }
+}
+
+impl DualPdeOps for DualPde {
+ fn from_raw(big: u64, small: u64) -> Self {
+ Self {
+ big: DualPdeBig::from_raw(big),
+ small: PdeOps::from_raw(small),
+ }
+ }
+
+ fn new_small(table_pfn: Pfn) -> Self {
+ Self {
+ big: DualPdeBig::invalid(),
+ small: PdeOps::new(AperturePde::VideoMemory, table_pfn),
+ }
+ }
+
+ fn has_small(&self) -> bool {
+ PdeOps::is_valid(&self.small)
+ }
+
+ fn small_vram_address(&self) -> VramAddress {
+ PdeOps::table_vram_address(&self.small)
+ }
+
+ fn big_raw_u64(&self) -> u64 {
+ self.big.into_raw()
+ }
+
+ fn small_raw_u64(&self) -> u64 {
+ self.small.into_raw()
+ }
+}
--
2.34.1
^ permalink raw reply related
* [PATCH v1 07/16] gpu: nova-core: mm: Add MMU v2 page table types
From: Joel Fernandes @ 2026-05-18 18:11 UTC (permalink / raw)
To: linux-kernel
Cc: Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron, Benno Lossin,
Andreas Hindborg, Alice Ryhl, Trevor Gross, Danilo Krummrich,
Dave Airlie, Daniel Almeida, dri-devel, rust-for-linux, nova-gpu,
Nikola Djukic, David Airlie, Boqun Feng, John Hubbard,
Alistair Popple, Timur Tabi, Edwin Peer, Alexandre Courbot,
Andrea Righi, Andy Ritger, Zhi Wang, Balbir Singh,
Philipp Stanner, alexeyi, Eliot Courtney, joel, linux-doc,
Joel Fernandes
In-Reply-To: <20260518181126.2493572-1-joelagnelf@nvidia.com>
Add page table entry and directory structures for MMU version 2 used by
Hopper and later GPUs. The `Pte`, `Pde`, and `DualPde` types each
implement the `PteOps`, `PdeOps`, and `DualPdeOps` traits introduced
earlier in the series, providing the version-agnostic API used by the
forthcoming page-table walker and mapper.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
drivers/gpu/nova-core/mm/pagetable.rs | 2 +
drivers/gpu/nova-core/mm/pagetable/ver2.rs | 271 +++++++++++++++++++++
2 files changed, 273 insertions(+)
create mode 100644 drivers/gpu/nova-core/mm/pagetable/ver2.rs
diff --git a/drivers/gpu/nova-core/mm/pagetable.rs b/drivers/gpu/nova-core/mm/pagetable.rs
index 7ea090024d91..df041fc89390 100644
--- a/drivers/gpu/nova-core/mm/pagetable.rs
+++ b/drivers/gpu/nova-core/mm/pagetable.rs
@@ -8,6 +8,8 @@
#![expect(dead_code)]
+pub(super) mod ver2;
+
use kernel::prelude::*;
use kernel::num::Bounded;
diff --git a/drivers/gpu/nova-core/mm/pagetable/ver2.rs b/drivers/gpu/nova-core/mm/pagetable/ver2.rs
new file mode 100644
index 000000000000..089e5cc2bfc3
--- /dev/null
+++ b/drivers/gpu/nova-core/mm/pagetable/ver2.rs
@@ -0,0 +1,271 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! MMU v2 page table types for Turing, Ampere and Ada GPUs.
+//!
+//! This module defines MMU version 2 specific types (Turing, Ampere and Ada GPUs).
+//!
+//! Bit field layouts derived from the NVIDIA OpenRM documentation:
+//! `open-gpu-kernel-modules/src/common/inc/swref/published/turing/tu102/dev_mmu.h`
+
+#![allow(dead_code)]
+
+use kernel::bitfield;
+use kernel::num::Bounded;
+use pin_init::Zeroable;
+
+use super::{
+ AperturePde,
+ AperturePte,
+ DualPdeOps,
+ PageTableLevel,
+ PdeOps,
+ PteOps,
+ VaLevelIndex, //
+};
+use crate::mm::{
+ Pfn,
+ VirtualAddress,
+ VramAddress, //
+};
+
+// Bounded to version 2 Pfn bitfield conversions:
+// 25 bits for video memory frame numbers (bits 32:8).
+impl_pfn_bounded!(25);
+// 46 bits for system memory frame numbers (bits 53:8).
+impl_pfn_bounded!(46);
+
+bitfield! {
+ /// MMU v2 49-bit virtual address layout.
+ pub(super) struct VirtualAddressV2(u64) {
+ /// Page offset [11:0].
+ 11:0 offset;
+ /// PT index [20:12].
+ 20:12 pt_idx;
+ /// PDE0 index [28:21].
+ 28:21 pde0_idx;
+ /// PDE1 index [37:29].
+ 37:29 pde1_idx;
+ /// PDE2 index [46:38].
+ 46:38 pde2_idx;
+ /// PDE3 index [48:47].
+ 48:47 pde3_idx;
+ }
+}
+
+impl VirtualAddressV2 {
+ /// Create a [`VirtualAddressV2`] from a [`VirtualAddress`].
+ pub(super) fn new(va: VirtualAddress) -> Self {
+ Self::from_raw(va.into_raw())
+ }
+}
+
+impl VaLevelIndex for VirtualAddressV2 {
+ fn level_index(&self, level: u64) -> u64 {
+ match level {
+ 0 => *self.pde3_idx(),
+ 1 => *self.pde2_idx(),
+ 2 => *self.pde1_idx(),
+ 3 => *self.pde0_idx(),
+ 4 => *self.pt_idx(),
+ _ => 0,
+ }
+ }
+}
+
+/// `PDE` levels for MMU v2 (5-level hierarchy: `PDB` -> `L1` -> `L2` -> `L3` -> `L4`).
+pub(super) const PDE_LEVELS: &[PageTableLevel] = &[
+ PageTableLevel::Pdb,
+ PageTableLevel::L1,
+ PageTableLevel::L2,
+ PageTableLevel::L3,
+];
+
+/// `PTE` level for MMU v2.
+pub(super) const PTE_LEVEL: PageTableLevel = PageTableLevel::L4;
+
+/// Dual `PDE` level for MMU v2 (128-bit entries).
+pub(super) const DUAL_PDE_LEVEL: PageTableLevel = PageTableLevel::L3;
+
+// Page Table Entry (PTE) for MMU v2 - 64-bit entry at level 4.
+bitfield! {
+ /// Page Table Entry for MMU v2.
+ pub(in crate::mm) struct Pte(u64) {
+ /// Entry is valid.
+ 0:0 valid;
+ /// Memory aperture type.
+ 2:1 aperture => AperturePte;
+ /// Volatile (bypass L2 cache).
+ 3:3 volatile;
+ /// Encryption enabled (Confidential Computing).
+ 4:4 encrypted;
+ /// Privileged access only.
+ 5:5 privilege;
+ /// Write protection.
+ 6:6 read_only;
+ /// Atomic operations disabled.
+ 7:7 atomic_disable;
+ /// Frame number for system memory.
+ 53:8 frame_number_sys => Pfn;
+ /// Frame number for video memory.
+ 32:8 frame_number_vid => Pfn;
+ /// Peer GPU ID for peer memory (0-7).
+ 35:33 peer_id;
+ /// Compression tag line bits.
+ 53:36 comptagline;
+ /// Surface kind/format.
+ 63:56 kind;
+ }
+}
+
+impl PteOps for Pte {
+ fn from_raw(val: u64) -> Self {
+ Self::from_raw(val)
+ }
+
+ fn invalid() -> Self {
+ Self::zeroed()
+ }
+
+ fn new(aperture: AperturePte, pfn: Pfn, writable: bool) -> Self {
+ let base = Self::zeroed()
+ .with_valid(true)
+ .with_aperture(aperture)
+ .with_read_only(!writable);
+ match aperture {
+ AperturePte::VideoMemory => base.with_frame_number_vid(pfn),
+ // Sysmem PTEs use VOL=1 to bypass L2 for cache coherency.
+ AperturePte::SystemCoherent => base.with_frame_number_sys(pfn).with_volatile(true),
+ AperturePte::PeerMemory | AperturePte::SystemNonCoherent => {
+ kernel::pr_warn!("MMU v2 PTE aperture {:?} not supported\n", aperture);
+ Self::invalid()
+ }
+ }
+ }
+
+ fn is_valid(&self) -> bool {
+ self.valid().into_bool()
+ }
+
+ fn frame_number(&self) -> Pfn {
+ match self.aperture() {
+ AperturePte::VideoMemory => self.frame_number_vid(),
+ _ => self.frame_number_sys(),
+ }
+ }
+}
+
+// Page Directory Entry (PDE) for MMU v2 - 64-bit entry at levels 0-2.
+bitfield! {
+ /// Page Directory Entry for MMU v2.
+ pub(in crate::mm) struct Pde(u64) {
+ /// Valid bit (inverted logic).
+ 0:0 valid_inverted;
+ /// Memory aperture type.
+ 2:1 aperture => AperturePde;
+ /// Volatile (bypass L2 cache).
+ 3:3 volatile;
+ /// Disable Address Translation Services.
+ 5:5 no_ats;
+ /// Table frame number for system memory.
+ 53:8 table_frame_sys => Pfn;
+ /// Table frame number for video memory.
+ 32:8 table_frame_vid => Pfn;
+ /// Peer GPU ID (0-7).
+ 35:33 peer_id;
+ }
+}
+
+impl PdeOps for Pde {
+ fn from_raw(val: u64) -> Self {
+ Self::from_raw(val)
+ }
+
+ fn new(aperture: AperturePde, table_pfn: Pfn) -> Self {
+ let base = Self::zeroed()
+ .with_valid_inverted(false) // 0 = valid
+ .with_aperture(aperture);
+ match aperture {
+ AperturePde::VideoMemory => base.with_table_frame_vid(table_pfn),
+ // Sysmem PTEs use VOL=1 to bypass L2 for cache coherency.
+ AperturePde::SystemCoherent => base.with_table_frame_sys(table_pfn).with_volatile(true),
+ AperturePde::Invalid | AperturePde::SystemNonCoherent => {
+ kernel::pr_warn!("MMU v2 PDE aperture {:?} not supported\n", aperture);
+ Self::invalid()
+ }
+ }
+ }
+
+ fn invalid() -> Self {
+ Self::zeroed()
+ .with_valid_inverted(true)
+ .with_aperture(AperturePde::Invalid)
+ }
+
+ fn is_valid(&self) -> bool {
+ !self.valid_inverted().into_bool() && self.aperture() != AperturePde::Invalid
+ }
+
+ fn aperture(&self) -> AperturePde {
+ Pde::aperture(*self)
+ }
+
+ fn table_vram_address(&self) -> VramAddress {
+ debug_assert!(
+ Pde::aperture(*self) == AperturePde::VideoMemory,
+ "table_vram_address called on non-VRAM PDE (aperture: {:?})",
+ Pde::aperture(*self)
+ );
+ VramAddress::from(self.table_frame_vid())
+ }
+}
+
+/// Dual `PDE` at Level 3 - 128-bit entry of Large/Small Page Table pointers.
+///
+/// The dual `PDE` supports both large (64KB) and small (4KB) page tables.
+#[repr(C)]
+#[derive(Debug, Clone, Copy)]
+pub(in crate::mm) struct DualPde {
+ /// Large/Big Page Table pointer (lower 64 bits).
+ pub(super) big: Pde,
+ /// Small Page Table pointer (upper 64 bits).
+ pub(super) small: Pde,
+}
+
+impl DualPde {
+ /// Check if the big page table pointer is valid.
+ fn has_big(&self) -> bool {
+ PdeOps::is_valid(&self.big)
+ }
+}
+
+impl DualPdeOps for DualPde {
+ fn from_raw(big: u64, small: u64) -> Self {
+ Self {
+ big: PdeOps::from_raw(big),
+ small: PdeOps::from_raw(small),
+ }
+ }
+
+ fn new_small(table_pfn: Pfn) -> Self {
+ Self {
+ big: PdeOps::from_raw(0),
+ small: PdeOps::new(AperturePde::VideoMemory, table_pfn),
+ }
+ }
+
+ fn has_small(&self) -> bool {
+ PdeOps::is_valid(&self.small)
+ }
+
+ fn small_vram_address(&self) -> VramAddress {
+ PdeOps::table_vram_address(&self.small)
+ }
+
+ fn big_raw_u64(&self) -> u64 {
+ self.big.into_raw()
+ }
+
+ fn small_raw_u64(&self) -> u64 {
+ self.small.into_raw()
+ }
+}
--
2.34.1
^ permalink raw reply related
* [PATCH v1 10/16] gpu: nova-core: mm: Add page table walker for MMU v2/v3
From: Joel Fernandes @ 2026-05-18 18:11 UTC (permalink / raw)
To: linux-kernel
Cc: Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron, Benno Lossin,
Andreas Hindborg, Alice Ryhl, Trevor Gross, Danilo Krummrich,
Dave Airlie, Daniel Almeida, dri-devel, rust-for-linux, nova-gpu,
Nikola Djukic, David Airlie, Boqun Feng, John Hubbard,
Alistair Popple, Timur Tabi, Edwin Peer, Alexandre Courbot,
Andrea Righi, Andy Ritger, Zhi Wang, Balbir Singh,
Philipp Stanner, alexeyi, Eliot Courtney, joel, linux-doc,
Joel Fernandes
In-Reply-To: <20260518181126.2493572-1-joelagnelf@nvidia.com>
Add the page table walker implementation that traverses the page table
hierarchy for both MMU v2 (5-level) and MMU v3 (6-level) to resolve
virtual addresses to physical addresses or find PTE locations.
Currently only v2 has been tested (nova-core currently boots pre-hopper)
with some initial preparatory work done for v3.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
drivers/gpu/nova-core/mm/pagetable.rs | 1 +
drivers/gpu/nova-core/mm/pagetable/walk.rs | 258 +++++++++++++++++++++
2 files changed, 259 insertions(+)
create mode 100644 drivers/gpu/nova-core/mm/pagetable/walk.rs
diff --git a/drivers/gpu/nova-core/mm/pagetable.rs b/drivers/gpu/nova-core/mm/pagetable.rs
index 38f4f0c6e8ce..5e192679f27c 100644
--- a/drivers/gpu/nova-core/mm/pagetable.rs
+++ b/drivers/gpu/nova-core/mm/pagetable.rs
@@ -10,6 +10,7 @@
pub(super) mod ver2;
pub(super) mod ver3;
+pub(super) mod walk;
use kernel::prelude::*;
diff --git a/drivers/gpu/nova-core/mm/pagetable/walk.rs b/drivers/gpu/nova-core/mm/pagetable/walk.rs
new file mode 100644
index 000000000000..a5f6c461f96a
--- /dev/null
+++ b/drivers/gpu/nova-core/mm/pagetable/walk.rs
@@ -0,0 +1,258 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Page table walker implementation for NVIDIA GPUs.
+//!
+//! This module provides page table walking functionality for MMU v2 and v3.
+//! The walker traverses the page table hierarchy to resolve virtual addresses
+//! to physical addresses or to find PTE locations.
+//!
+//! # Page Table Hierarchy
+//!
+//! ## MMU v2 (Turing/Ampere/Ada) - 5 levels
+//!
+//! ```text
+//! +-------+ +-------+ +-------+ +---------+ +-------+
+//! | PDB |---->| L1 |---->| L2 |---->| L3 Dual |---->| L4 |
+//! | (L0) | | | | | | PDE | | (PTE) |
+//! +-------+ +-------+ +-------+ +---------+ +-------+
+//! 64-bit 64-bit 64-bit 128-bit 64-bit
+//! PDE PDE PDE (big+small) PTE
+//! ```
+//!
+//! ## MMU v3 (Hopper+) - 6 levels
+//!
+//! ```text
+//! +-------+ +-------+ +-------+ +-------+ +---------+ +-------+
+//! | PDB |---->| L1 |---->| L2 |---->| L3 |---->| L4 Dual |---->| L5 |
+//! | (L0) | | | | | | | | PDE | | (PTE) |
+//! +-------+ +-------+ +-------+ +-------+ +---------+ +-------+
+//! 64-bit 64-bit 64-bit 64-bit 128-bit 64-bit
+//! PDE PDE PDE PDE (big+small) PTE
+//! ```
+//!
+//! # Result of a page table walk
+//!
+//! The walker returns a [`WalkResult`] indicating the outcome.
+
+use core::marker::PhantomData;
+
+use kernel::{
+ device,
+ prelude::*, //
+};
+
+use super::{
+ DualPdeOps,
+ MmuConfig,
+ MmuV2,
+ MmuV3,
+ MmuVersion,
+ PageTableLevel,
+ PdeOps,
+ PteOps, //
+};
+use crate::{
+ mm::{
+ pramin,
+ GpuMm,
+ Pfn,
+ Vfn,
+ VirtualAddress,
+ VramAddress, //
+ },
+ num::{
+ IntoSafeCast, //
+ },
+};
+
+/// Result of walking to a PTE.
+#[derive(Debug, Clone, Copy)]
+pub(in crate::mm) enum WalkResult {
+ /// Intermediate page tables are missing (only returned in lookup mode).
+ PageTableMissing,
+ /// PTE exists but is invalid (page not mapped).
+ Unmapped { pte_addr: VramAddress },
+ /// PTE exists and is valid (page is mapped).
+ Mapped { pte_addr: VramAddress, pfn: Pfn },
+}
+
+/// Result of walking PDE levels only.
+///
+/// Returned by [`PtWalkInner::walk_pde_levels()`] to indicate whether all PDE
+/// levels resolved or a PDE is missing.
+#[derive(Debug, Clone, Copy)]
+pub(in crate::mm) enum WalkPdeResult {
+ /// All PDE levels resolved -- returns PTE page table address.
+ Complete {
+ /// VRAM address of the PTE-level page table.
+ pte_table: VramAddress,
+ },
+ /// A PDE is missing and no prepared page was provided by the closure.
+ Missing {
+ /// PDE slot address in the parent page table (where to install).
+ install_addr: VramAddress,
+ /// The page table level that is missing.
+ level: PageTableLevel,
+ },
+}
+
+/// Page table walker.
+pub(in crate::mm) struct PtWalkInner<M: MmuConfig> {
+ pdb_addr: VramAddress,
+ _phantom: PhantomData<M>,
+}
+
+impl<M: MmuConfig> PtWalkInner<M> {
+ /// Calculate the VRAM address of an entry within a page table.
+ fn entry_addr(table: VramAddress, level: PageTableLevel, index: u64) -> VramAddress {
+ let entry_size: u64 = M::entry_size(level).into_safe_cast();
+ table + index * entry_size
+ }
+
+ /// Create a new page table walker.
+ pub(super) fn new(pdb_addr: VramAddress) -> Self {
+ Self {
+ pdb_addr,
+ _phantom: PhantomData,
+ }
+ }
+
+ /// Walk PDE levels with closure-based resolution for missing PDEs.
+ ///
+ /// Traverses all PDE levels for the MMU version. At each level, reads the PDE.
+ /// If valid, extracts the child table address and continues. If missing, calls
+ /// `resolve_prepared(install_addr)` to resolve the missing PDE.
+ pub(super) fn walk_pde_levels(
+ &self,
+ window: &mut pramin::PraminWindow<'_>,
+ vfn: Vfn,
+ resolve_prepared: impl Fn(VramAddress) -> Option<VramAddress>,
+ ) -> Result<WalkPdeResult> {
+ let va = VirtualAddress::from(vfn);
+ let mut cur_table = self.pdb_addr;
+
+ for &level in M::PDE_LEVELS {
+ let idx = M::level_index(va, level.as_index());
+ let install_addr = Self::entry_addr(cur_table, level, idx);
+
+ if level == M::DUAL_PDE_LEVEL {
+ // 128-bit dual PDE with big+small page table pointers.
+ let dpde = M::DualPde::read(window, install_addr)?;
+ if dpde.has_small() {
+ cur_table = dpde.small_vram_address();
+ continue;
+ }
+ } else {
+ // Regular 64-bit PDE. Use `is_valid_vram()` because
+ // `table_vram_address()` only reads the VRAM frame-number
+ // bitfield; system-memory PDEs store the address in a
+ // different (wider) field and would be silently truncated.
+ let pde = M::Pde::read(window, install_addr)?;
+ if pde.is_valid_vram() {
+ cur_table = pde.table_vram_address();
+ continue;
+ }
+ }
+
+ // PDE missing in HW. Ask caller for resolution.
+ if let Some(prepared_addr) = resolve_prepared(install_addr) {
+ cur_table = prepared_addr;
+ continue;
+ }
+
+ return Ok(WalkPdeResult::Missing {
+ install_addr,
+ level,
+ });
+ }
+
+ Ok(WalkPdeResult::Complete {
+ pte_table: cur_table,
+ })
+ }
+
+ /// Walk to PTE for lookup only (no allocation).
+ ///
+ /// Returns [`WalkResult::PageTableMissing`] if intermediate tables don't exist.
+ pub(super) fn walk_to_pte_lookup(
+ &self,
+ dev: &device::Device<device::Bound>,
+ mm: &GpuMm,
+ vfn: Vfn,
+ ) -> Result<WalkResult> {
+ let mut window = mm.pramin().get_window(dev)?;
+ self.walk_to_pte_lookup_with_window(&mut window, vfn)
+ }
+
+ /// Walk to PTE using a caller-provided PRAMIN window (lookup only).
+ pub(super) fn walk_to_pte_lookup_with_window(
+ &self,
+ window: &mut pramin::PraminWindow<'_>,
+ vfn: Vfn,
+ ) -> Result<WalkResult> {
+ match self.walk_pde_levels(window, vfn, |_| None)? {
+ WalkPdeResult::Complete { pte_table } => {
+ Self::read_pte_at_level(window, vfn, pte_table)
+ }
+ WalkPdeResult::Missing { .. } => Ok(WalkResult::PageTableMissing),
+ }
+ }
+
+ /// Read the PTE at the PTE level given the PTE table address.
+ fn read_pte_at_level(
+ window: &mut pramin::PraminWindow<'_>,
+ vfn: Vfn,
+ pte_table: VramAddress,
+ ) -> Result<WalkResult> {
+ let va = VirtualAddress::from(vfn);
+ let pte_level = M::PTE_LEVEL;
+ let pte_idx = M::level_index(va, pte_level.as_index());
+ let pte_addr = Self::entry_addr(pte_table, pte_level, pte_idx);
+ let pte = M::Pte::read(window, pte_addr)?;
+
+ if pte.is_valid() {
+ return Ok(WalkResult::Mapped {
+ pte_addr,
+ pfn: pte.frame_number(),
+ });
+ }
+ Ok(WalkResult::Unmapped { pte_addr })
+ }
+}
+
+macro_rules! pt_walk_dispatch {
+ ($self:expr, $method:ident ( $($arg:expr),* $(,)? )) => {
+ match $self {
+ PtWalk::V2(inner) => inner.$method($($arg),*),
+ PtWalk::V3(inner) => inner.$method($($arg),*),
+ }
+ };
+}
+
+/// Page table walker dispatch.
+pub(in crate::mm) enum PtWalk {
+ /// MMU v2 (Turing/Ampere/Ada).
+ V2(PtWalkInner<MmuV2>),
+ /// MMU v3 (Hopper+).
+ V3(PtWalkInner<MmuV3>),
+}
+
+impl PtWalk {
+ /// Create a new page table walker for the given MMU version.
+ pub(in crate::mm) fn new(pdb_addr: VramAddress, version: MmuVersion) -> Self {
+ match version {
+ MmuVersion::V2 => Self::V2(PtWalkInner::<MmuV2>::new(pdb_addr)),
+ MmuVersion::V3 => Self::V3(PtWalkInner::<MmuV3>::new(pdb_addr)),
+ }
+ }
+
+ /// Walk to PTE for lookup.
+ pub(in crate::mm) fn walk_to_pte(
+ &self,
+ dev: &device::Device<device::Bound>,
+ mm: &GpuMm,
+ vfn: Vfn,
+ ) -> Result<WalkResult> {
+ pt_walk_dispatch!(self, walk_to_pte_lookup(dev, mm, vfn))
+ }
+}
--
2.34.1
^ permalink raw reply related
* [PATCH v1 12/16] gpu: nova-core: mm: Add virtual address range tracking to VMM
From: Joel Fernandes @ 2026-05-18 18:11 UTC (permalink / raw)
To: linux-kernel
Cc: Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron, Benno Lossin,
Andreas Hindborg, Alice Ryhl, Trevor Gross, Danilo Krummrich,
Dave Airlie, Daniel Almeida, dri-devel, rust-for-linux, nova-gpu,
Nikola Djukic, David Airlie, Boqun Feng, John Hubbard,
Alistair Popple, Timur Tabi, Edwin Peer, Alexandre Courbot,
Andrea Righi, Andy Ritger, Zhi Wang, Balbir Singh,
Philipp Stanner, alexeyi, Eliot Courtney, joel, linux-doc,
Joel Fernandes
In-Reply-To: <20260518181126.2493572-1-joelagnelf@nvidia.com>
Add virtual address range tracking to the VMM using a maple tree
allocator. This enables contiguous virtual address range allocation
for mappings.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
drivers/gpu/nova-core/mm/vmm.rs | 83 +++++++++++++++++++++++++++++----
1 file changed, 74 insertions(+), 9 deletions(-)
diff --git a/drivers/gpu/nova-core/mm/vmm.rs b/drivers/gpu/nova-core/mm/vmm.rs
index 3e18adc23b68..05ff77c5f888 100644
--- a/drivers/gpu/nova-core/mm/vmm.rs
+++ b/drivers/gpu/nova-core/mm/vmm.rs
@@ -9,18 +9,27 @@
use kernel::{
device,
gpu::buddy::AllocatedBlocks,
+ maple_tree::MapleTreeAlloc,
prelude::*, //
};
-use crate::mm::{
- pagetable::{
- walk::{PtWalk, WalkResult},
- MmuVersion, //
+use core::ops::Range;
+
+use crate::{
+ mm::{
+ pagetable::{
+ walk::{PtWalk, WalkResult},
+ MmuVersion, //
+ },
+ GpuMm,
+ Pfn,
+ Vfn,
+ VramAddress,
+ PAGE_SIZE, //
+ },
+ num::{
+ IntoSafeCast, //
},
- GpuMm,
- Pfn,
- Vfn,
- VramAddress, //
};
/// Virtual Memory Manager for a GPU address space.
@@ -35,18 +44,74 @@ pub(crate) struct Vmm {
mmu_version: MmuVersion,
/// Page table allocations required for mappings.
page_table_allocs: KVec<Pin<KBox<AllocatedBlocks>>>,
+ /// Maple tree allocator for virtual address range tracking.
+ virt_alloc: Pin<KBox<MapleTreeAlloc<()>>>,
+ /// Total number of pages in the virtual address space.
+ va_pages: usize,
}
impl Vmm {
/// Create a new [`Vmm`] for the given Page Directory Base address.
- pub(crate) fn new(pdb_addr: VramAddress, mmu_version: MmuVersion) -> Result<Self> {
+ ///
+ /// The [`Vmm`] will manage a virtual address space of `va_size` bytes.
+ pub(crate) fn new(
+ pdb_addr: VramAddress,
+ mmu_version: MmuVersion,
+ va_size: u64,
+ ) -> Result<Self> {
+ let page_size: u64 = PAGE_SIZE.into_safe_cast();
+ let va_pages: usize = (va_size / page_size).into_safe_cast();
+ let virt_alloc = KBox::pin_init(MapleTreeAlloc::<()>::new(), GFP_KERNEL)?;
+
Ok(Self {
pdb_addr,
mmu_version,
page_table_allocs: KVec::new(),
+ virt_alloc,
+ va_pages,
})
}
+ /// Allocate a contiguous virtual frame number range.
+ ///
+ /// # Arguments
+ ///
+ /// - `num_pages`: Number of pages to allocate.
+ /// - `va_range`: `None` = allocate anywhere, `Some(range)` = constrain allocation to the given
+ /// range.
+ fn alloc_vfn_range(&self, num_pages: usize, va_range: Option<Range<u64>>) -> Result<Vfn> {
+ let page_size: u64 = PAGE_SIZE.into_safe_cast();
+
+ let start_vfn = match va_range {
+ Some(r) => {
+ let num_pages_u64: u64 = num_pages.into_safe_cast();
+ let size = num_pages_u64.checked_mul(page_size).ok_or(EOVERFLOW)?;
+ let range_size = r.end.checked_sub(r.start).ok_or(EOVERFLOW)?;
+ if range_size != size {
+ return Err(EINVAL);
+ }
+ let start_vfn: usize = (r.start / page_size).into_safe_cast();
+ let end_vfn: usize = (r.end / page_size).into_safe_cast();
+ self.virt_alloc
+ .insert_range(start_vfn..end_vfn, (), GFP_KERNEL)?;
+ start_vfn
+ }
+ None => self
+ .virt_alloc
+ .alloc_range(num_pages, (), ..self.va_pages, GFP_KERNEL)?,
+ };
+
+ Ok(Vfn::new(start_vfn.into_safe_cast()))
+ }
+
+ /// Free a virtual frame number range back to the maple tree.
+ fn free_vfn(&self, vfn: Vfn) {
+ let vfn_index: usize = vfn.raw().into_safe_cast();
+ if self.virt_alloc.erase(vfn_index).is_none() {
+ kernel::pr_warn!("free_vfn: VFN {} not found in maple tree\n", vfn_index);
+ }
+ }
+
/// Read the [`Pfn`] for a mapped [`Vfn`] if one is mapped.
pub(super) fn read_mapping(
&self,
--
2.34.1
^ permalink raw reply related
* [PATCH v1 11/16] gpu: nova-core: mm: Add Virtual Memory Manager
From: Joel Fernandes @ 2026-05-18 18:11 UTC (permalink / raw)
To: linux-kernel
Cc: Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron, Benno Lossin,
Andreas Hindborg, Alice Ryhl, Trevor Gross, Danilo Krummrich,
Dave Airlie, Daniel Almeida, dri-devel, rust-for-linux, nova-gpu,
Nikola Djukic, David Airlie, Boqun Feng, John Hubbard,
Alistair Popple, Timur Tabi, Edwin Peer, Alexandre Courbot,
Andrea Righi, Andy Ritger, Zhi Wang, Balbir Singh,
Philipp Stanner, alexeyi, Eliot Courtney, joel, linux-doc,
Joel Fernandes
In-Reply-To: <20260518181126.2493572-1-joelagnelf@nvidia.com>
Add the Virtual Memory Manager (VMM) infrastructure for GPU address
space management. Each Vmm instance manages a single address space
identified by its Page Directory Base (PDB) address, used for Channel,
BAR1 and BAR2 mappings.
Mapping APIs and virtual address range tracking are added in later
commits.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
drivers/gpu/nova-core/mm.rs | 1 +
drivers/gpu/nova-core/mm/vmm.rs | 64 +++++++++++++++++++++++++++++++++
2 files changed, 65 insertions(+)
create mode 100644 drivers/gpu/nova-core/mm/vmm.rs
diff --git a/drivers/gpu/nova-core/mm.rs b/drivers/gpu/nova-core/mm.rs
index 66cc33389159..502c7fdceba2 100644
--- a/drivers/gpu/nova-core/mm.rs
+++ b/drivers/gpu/nova-core/mm.rs
@@ -34,6 +34,7 @@ macro_rules! impl_pfn_bounded {
pub(super) mod pagetable;
pub(crate) mod pramin;
pub(super) mod tlb;
+pub(super) mod vmm;
use core::ops::Range;
diff --git a/drivers/gpu/nova-core/mm/vmm.rs b/drivers/gpu/nova-core/mm/vmm.rs
new file mode 100644
index 000000000000..3e18adc23b68
--- /dev/null
+++ b/drivers/gpu/nova-core/mm/vmm.rs
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Virtual Memory Manager for NVIDIA GPU page table management.
+//!
+//! The [`Vmm`] provides high-level page mapping and unmapping operations for GPU
+//! virtual address spaces (Channels, BAR1, BAR2). It wraps the page table walker
+//! and handles TLB flushing after modifications.
+
+use kernel::{
+ device,
+ gpu::buddy::AllocatedBlocks,
+ prelude::*, //
+};
+
+use crate::mm::{
+ pagetable::{
+ walk::{PtWalk, WalkResult},
+ MmuVersion, //
+ },
+ GpuMm,
+ Pfn,
+ Vfn,
+ VramAddress, //
+};
+
+/// Virtual Memory Manager for a GPU address space.
+///
+/// Each [`Vmm`] instance manages a single address space identified by its Page
+/// Directory Base (`PDB`) address. The [`Vmm`] is used for Channel, BAR1 and
+/// BAR2 mappings.
+pub(crate) struct Vmm {
+ /// Page Directory Base address for this address space.
+ pdb_addr: VramAddress,
+ /// MMU version used for page table layout.
+ mmu_version: MmuVersion,
+ /// Page table allocations required for mappings.
+ page_table_allocs: KVec<Pin<KBox<AllocatedBlocks>>>,
+}
+
+impl Vmm {
+ /// Create a new [`Vmm`] for the given Page Directory Base address.
+ pub(crate) fn new(pdb_addr: VramAddress, mmu_version: MmuVersion) -> Result<Self> {
+ Ok(Self {
+ pdb_addr,
+ mmu_version,
+ page_table_allocs: KVec::new(),
+ })
+ }
+
+ /// Read the [`Pfn`] for a mapped [`Vfn`] if one is mapped.
+ pub(super) fn read_mapping(
+ &self,
+ dev: &device::Device<device::Bound>,
+ mm: &GpuMm,
+ vfn: Vfn,
+ ) -> Result<Option<Pfn>> {
+ let walker = PtWalk::new(self.pdb_addr, self.mmu_version);
+
+ match walker.walk_to_pte(dev, mm, vfn)? {
+ WalkResult::Mapped { pfn, .. } => Ok(Some(pfn)),
+ WalkResult::Unmapped { .. } | WalkResult::PageTableMissing => Ok(None),
+ }
+ }
+}
--
2.34.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox