From: Nicolin Chen <nicolinc@nvidia.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>, <iommu@lists.linux.dev>,
	Joerg Roedel <joro@8bytes.org>,
	Justin Stitt <justinstitt@google.com>,
	Kevin Tian <kevin.tian@intel.com>, <linux-doc@vger.kernel.org>,
	<linux-kselftest@vger.kernel.org>, <llvm@lists.linux.dev>,
	Bill Wendling <morbo@google.com>,
	Nathan Chancellor <nathan@kernel.org>,
	Nick Desaulniers <nick.desaulniers+lkml@gmail.com>,
	Miguel Ojeda <ojeda@kernel.org>,
	"Robin Murphy" <robin.murphy@arm.com>,
	Shuah Khan <shuah@kernel.org>,
	"Suravee Suthikulpanit" <suravee.suthikulpanit@amd.com>,
	Will Deacon <will@kernel.org>, Alexey Kardashevskiy <aik@amd.com>,
	Alejandro Jimenez <alejandro.j.jimenez@oracle.com>,
	James Gowans <jgowans@amazon.com>,
	"Michael Roth" <michael.roth@amd.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	<patches@lists.linux.dev>
Subject: Re: [PATCH v5 01/15] genpt: Generic Page Table base API
Date: Tue, 9 Sep 2025 20:40:28 -0700
Message-ID: <aMDzLC9nV47Xvud9@nvidia.com>
In-Reply-To: <1-v5-116c4948af3d+68091-iommu_pt_jgg@nvidia.com>

On Wed, Sep 03, 2025 at 02:46:28PM -0300, Jason Gunthorpe wrote:
> +/**
> + * pt_entry_oa_lg2sz() - Return the size of a OA entry

an OA

> + * @pts: Entry to query
> + *
> + * If the entry is not contiguous this returns pt_table_item_lg2sz(), otherwise
> + * it returns the total VA/OA size of the entire contiguous entry.
> + */
> +static inline unsigned int pt_entry_oa_lg2sz(const struct pt_state *pts)
> +{
> +	return pt_entry_num_contig_lg2(pts) + pt_table_item_lg2sz(pts);
> +}
------

> + *  level
> + *     The number of table hops from the lowest leaf. Level 0
> + *     is always a table of only leaves of the least significant VA bits. The

Hmm, I am a bit confused here. I thought "leaf" referred to a leaf
table? But the wording "a table of only leaves" here makes it sound
like a "leaf" is a table entry instead?

Also, isn't "the least significant VA bits" the page offset?

> + *  table
> + *     A linear array of entries representing the translation items for that
> + *     level.
> + *  index
> + *     The position in a table of an element: item = table[index]
> + *  item
> + *     A single position in a table
> + *  entry
> + *     A single logical element in a table. If contiguous pages are not
> + *     supported then item and entry are the same thing, otherwise entry refers
> + *     to the all the items that comprise a single contiguous translation.

So, an "entry" is a group of "items" when contiguous pages (huge
pages?) are supported. In that case "entry" sounds like a physical
(vs. "logical") table entry, i.e. what we usually call a PTE?

> +#if !IS_ENABLED(CONFIG_GENERIC_ATOMIC64)
> +static inline bool pt_table_install64(struct pt_state *pts, u64 table_entry)
> +{
> +	u64 *entryp = pt_cur_table(pts, u64) + pts->index;
> +	u64 old_entry = pts->entry;
> +	bool ret;
> +
> +	/*
> +	 * Ensure the zero'd table content itself is visible before its PTE can
> +	 * be. release is a NOP on !SMP, but the HW is still doing an acquire.
> +	 */
> +	if (!IS_ENABLED(CONFIG_SMP))
> +		dma_wmb();

Mind elaborating why SMP doesn't need this?

> +/*
> + * PT_WARN_ON is used for invariants that the kunit should be checking can't
> + * happen.
> + */
> +#if IS_ENABLED(CONFIG_DEBUG_GENERIC_PT)
> +#define PT_WARN_ON WARN_ON
> +#else
> +static inline bool PT_WARN_ON(bool condition)
> +{
> +	return false;

Should it "return condition"?

Otherwise, these validations wouldn't be effective?

drivers/iommu/generic_pt/pt_iter.h:388: if (PT_WARN_ON(!pts->table_lower))
drivers/iommu/generic_pt/pt_iter.h-389-         return -EINVAL;
--
drivers/iommu/generic_pt/pt_iter.h-429-
drivers/iommu/generic_pt/pt_iter.h:430: if (PT_WARN_ON(!pt_can_have_table(pts)) ||
drivers/iommu/generic_pt/pt_iter.h:431:     PT_WARN_ON(!pts->table_lower))
drivers/iommu/generic_pt/pt_iter.h-432-         return -EINVAL;
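
Something like this is what I had in mind (just a sketch of what I'm
assuming the intent is):

	static inline bool PT_WARN_ON(bool condition)
	{
		/*
		 * Without CONFIG_DEBUG_GENERIC_PT there is no WARN splat,
		 * but callers still see the failed condition.
		 */
		return condition;
	}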

> +/**
> + * pt_load_entry() - Read from the location pts points at into the pts
> + * @pts: Table index to load
> + *
> + * Set the type of entry that was loaded. pts->entry and pts->table_lower
> + * will be filled in with the entry's content.
> + */
> +static inline void pt_load_entry(struct pt_state *pts)
> +{
> +	pts->type = pt_load_entry_raw(pts);
> +	if (pts->type == PT_ENTRY_TABLE)
> +		pts->table_lower = pt_table_ptr(pts);
> +}

I see a couple of callers check pts->type. Maybe it could return
pts->type, to match pt_load_entry_raw()?
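
E.g. (a rough sketch from my side; I am guessing at the exact name of
the type behind pts->type):

	static inline enum pt_entry_type pt_load_entry(struct pt_state *pts)
	{
		pts->type = pt_load_entry_raw(pts);
		if (pts->type == PT_ENTRY_TABLE)
			pts->table_lower = pt_table_ptr(pts);
		return pts->type;
	}

Then a caller could simply do:

	if (pt_load_entry(pts) == PT_ENTRY_TABLE)
		...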

> diff --git a/drivers/iommu/generic_pt/pt_fmt_defaults.h b/drivers/iommu/generic_pt/pt_fmt_defaults.h
> new file mode 100644
> index 00000000000000..19e8f820c1dccf
> --- /dev/null
> +++ b/drivers/iommu/generic_pt/pt_fmt_defaults.h
> @@ -0,0 +1,193 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES
> + *
> + * Default definitions for formats that don't define these functions.
> + */
> +#ifndef __GENERIC_PT_PT_FMT_DEFAULTS_H
> +#define __GENERIC_PT_PT_FMT_DEFAULTS_H
> +
> +#include "pt_defs.h"
> +#include <linux/log2.h>

The <> include usually goes before the "" one.
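
I.e. something like:

	#include <linux/log2.h>
	#include "pt_defs.h"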

> +#ifndef pt_pgsz_lg2_to_level
> +static inline unsigned int pt_pgsz_lg2_to_level(struct pt_common *common,
> +						unsigned int pgsize_lg2)
> +{
> +	return (pgsize_lg2 - PT_GRANULE_LG2SZ) /
> +	       (PT_TABLEMEM_LG2SZ - ilog2(PT_ITEM_WORD_SIZE));
> +	return 0;

"return 0" should likely be dropped.

> +/*
> + * Format supplies either:
> + *   pt_entry_oa - OA is at the start of a contiguous entry
> + * or
> + *   pt_item_oa  - OA is correct for every item in a contiguous entry

What does "correct" mean here?

> +/**
> + * pt_range_to_end_index() - Ending index iteration
> + * @pts: Iteration State
> + *
> + * Return: the last index for the iteration in pts.
> + */
> +static inline unsigned int pt_range_to_end_index(const struct pt_state *pts)
> +{
> +	unsigned int isz_lg2 = pt_table_item_lg2sz(pts);
> +	struct pt_range *range = pts->range;
> +	unsigned int num_entries_lg2;
> +
> +	if (range->va == range->last_va)
> +		return pts->index + 1;
> +
> +	if (pts->range->top_level == pts->level)
> +		return log2_div(fvalog2_mod(pts->range->last_va,
> +					    pts->range->max_vasz_lg2),
> +				isz_lg2) +
> +		       1;

How about:
		return 1 + log2_div(...);
?

> +static __always_inline struct pt_range _pt_top_range(struct pt_common *common,
> +						     uintptr_t top_of_table)
> +{
> +	struct pt_range range = {
> +		.common = common,
> +		.top_table =
> +			(struct pt_table_p *)(top_of_table &
> +					      ~(uintptr_t)PT_TOP_LEVEL_MASK),
> +		.top_level = top_of_table % (1 << PT_TOP_LEVEL_BITS),

Since top_level is unsigned, would it be faster to do a bitwise AND:
		.top_level = top_of_table & PT_TOP_LEVEL_MASK,
?

> +/*

/**

> + * pt_walk_descend_all() - Recursively invoke the walker for a table item
> + * @parent_pts: Iteration State
> + * @fn: Walker function to call
> + * @arg: Value to pass to the function
> + *
> + * With pts pointing at a table item this will descend and over the entire lower
> + * table. This creates a new walk and does not alter pts or pts->range.
> + */
> +static __always_inline int
> +pt_walk_descend_all(const struct pt_state *parent_pts, pt_level_fn_t fn,
> +		    void *arg)
-------

> +/**
> + * pt_compute_best_pgsize() - Determine the best page size for leaf entries
> + * @pgsz_bitmap: Permitted page sizes
> + * @va: Starting virtual address for the leaf entry
> + * @last_va: Last virtual address for the leaf entry, sets the max page size
> + * @oa: Starting output address for the leaf entry
> + *
> + * Compute the largest page size for va, last_va, and oa together and return it
> + * in lg2. The largest page size depends on the format's supported page sizes at
> + * this level, and the relative alignment of the VA and OA addresses. 0 means
> + * the OA cannot be stored with the provided pgsz_bitmap.
> + */
> +static inline unsigned int pt_compute_best_pgsize(pt_vaddr_t pgsz_bitmap,
> +						  pt_vaddr_t va,
> +						  pt_vaddr_t last_va,
> +						  pt_oaddr_t oa)
> +{
> +	unsigned int best_pgsz_lg2;
> +	unsigned int pgsz_lg2;
> +	pt_vaddr_t len = last_va - va + 1;
> +	pt_vaddr_t mask;
> +
> +	if (PT_WARN_ON(va >= last_va))
> +		return 0;
> +
> +	/*
> +	 * Given a VA/OA pair the best page size is the largest page side
> +	 * where:
> +	 *
> +	 * 1) VA and OA start at the page. Bitwise this is the count of least
> +	 *    significant 0 bits.
> +	 *    This also implies that last_va/oa has the same prefix as va/oa.
> +	 */
> +	mask = va | oa;
> +
> +	/*
> +	 * 2) The page size is not larger than the last_va (length). Since page
> +	 *    sizes are always power of two this can't be larger than the
> +	 *    largest power of two factor of the length.
> +	 */
> +	mask |= log2_to_int(log2_fls(len) - 1);
> +
> +	best_pgsz_lg2 = log2_ffs(mask);
> +
> +	/* Choose the higest bit <= best_pgsz_lg2 */

highest
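
BTW, to double-check my understanding of the math above with a quick
example (assuming log2_ffs() counts trailing zeros and log2_fls() is
fls()-like):

	va = 0x6000, oa = 0x2000, last_va = 0x7fff  =>  len = 0x2000

	mask  = va | oa                        = 0x6000
	mask |= log2_to_int(log2_fls(len) - 1) = 0x6000 | 0x2000 = 0x6000
	best_pgsz_lg2 = log2_ffs(mask)         = 13	/* i.e. 8K */

and then the highest bit <= 13 in pgsz_bitmap gets picked, so e.g. a
bitmap with only 4K and 16K would fall back to 4K. Is that right?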

> +/* Compute a */
> +#define log2_to_int_t(type, a_lg2) ((type)(((type)1) << (a_lg2)))
> +static_assert(log2_to_int_t(unsigned int, 0) == 1);
> +
> +/* Compute a - 1 (aka all low bits set) */
> +#define log2_to_max_int_t(type, a_lg2) ((type)(log2_to_int_t(type, a_lg2) - 1))
> +
> +/* Compute a / b */
> +#define log2_div_t(type, a, b_lg2) ((type)(((type)a) >> (b_lg2)))
> +static_assert(log2_div_t(unsigned int, 4, 2) == 1);

I skimmed through these macros and their callers. They are mostly
dealing with VA, OA, and mask, which feels like straightforward
bit-masking/shifting operations.

E.g.
log2_to_int_t = 64bit ? BIT_ULL(lg2) : BIT(lg2);
log2_to_max_int_t = 64bit ? GENMASK_ULL(lg2 - 1, 0) : GENMASK(lg2 - 1, 0);

What's the benefit of defining them as arithmetic macros instead?
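
FWIW, roughly what I am picturing (just a sketch, assuming the types in
question are 64-bit wide):

	#define log2_to_int_t(type, a_lg2)     ((type)BIT_ULL(a_lg2))
	#define log2_to_max_int_t(type, a_lg2) ((type)(BIT_ULL(a_lg2) - 1))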

> diff --git a/include/linux/generic_pt/common.h b/include/linux/generic_pt/common.h
> new file mode 100644
> index 00000000000000..a29bdd7b244de6
> --- /dev/null
> +++ b/include/linux/generic_pt/common.h
> @@ -0,0 +1,134 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES
> + */
> +#ifndef __GENERIC_PT_COMMON_H
> +#define __GENERIC_PT_COMMON_H
> +
> +#include <linux/types.h>
> +#include <linux/build_bug.h>
> +#include <linux/bits.h>

In alphabetical order.
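
I.e.:

	#include <linux/bits.h>
	#include <linux/build_bug.h>
	#include <linux/types.h>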

Thanks
Nicolin
