Re: [PATCH v5 01/15] genpt: Generic Page Table base API

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Nicolin Chen <nicolinc@nvidia.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>, <iommu@lists.linux.dev>,
	Joerg Roedel <joro@8bytes.org>,
	Justin Stitt <justinstitt@google.com>,
	Kevin Tian <kevin.tian@intel.com>, <linux-doc@vger.kernel.org>,
	<linux-kselftest@vger.kernel.org>, <llvm@lists.linux.dev>,
	Bill Wendling <morbo@google.com>,
	Nathan Chancellor <nathan@kernel.org>,
	Nick Desaulniers <nick.desaulniers+lkml@gmail.com>,
	Miguel Ojeda <ojeda@kernel.org>,
	"Robin Murphy" <robin.murphy@arm.com>,
	Shuah Khan <shuah@kernel.org>,
	"Suravee Suthikulpanit" <suravee.suthikulpanit@amd.com>,
	Will Deacon <will@kernel.org>, Alexey Kardashevskiy <aik@amd.com>,
	Alejandro Jimenez <alejandro.j.jimenez@oracle.com>,
	James Gowans <jgowans@amazon.com>,
	"Michael Roth" <michael.roth@amd.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	<patches@lists.linux.dev>
Subject: Re: [PATCH v5 01/15] genpt: Generic Page Table base API
Date: Tue, 9 Sep 2025 20:40:28 -0700	[thread overview]
Message-ID: <aMDzLC9nV47Xvud9@nvidia.com> (raw)
In-Reply-To: <1-v5-116c4948af3d+68091-iommu_pt_jgg@nvidia.com>

On Wed, Sep 03, 2025 at 02:46:28PM -0300, Jason Gunthorpe wrote:
> +/**
> + * pt_entry_oa_lg2sz() - Return the size of a OA entry

an OA

> + * @pts: Entry to query
> + *
> + * If the entry is not contiguous this returns pt_table_item_lg2sz(), otherwise
> + * it returns the total VA/OA size of the entire contiguous entry.
> + */
> +static inline unsigned int pt_entry_oa_lg2sz(const struct pt_state *pts)
> +{
> +	return pt_entry_num_contig_lg2(pts) + pt_table_item_lg2sz(pts);
> +}
------

> + *  level
> + *     The number of table hops from the lowest leaf. Level 0
> + *     is always a table of only leaves of the least significant VA bits. The

Hmm, I am a bit confused here. I thought "leaf" was meant to be a
"leaf" table? But here "a table of only leaves" makes it feel like
a "leaf" table entry?

Also, isn't "the least significant VA bits" the page offset?

> + *  table
> + *     A linear array of entries representing the translation items for that
> + *     level.
> + *  index
> + *     The position in a table of an element: item = table[index]
> + *  item
> + *     A single position in a table
> + *  entry
> + *     A single logical element in a table. If contiguous pages are not
> + *     supported then item and entry are the same thing, otherwise entry refers
> + *     to the all the items that comprise a single contiguous translation.

So, an "entry" is a group of "items" if contiguous pages (huge
page?) are supported. Then, the "entry" sounds like a physical
(v.s. "logical") table entry, e.g. a PTE that we usually say?

> +#if !IS_ENABLED(CONFIG_GENERIC_ATOMIC64)
> +static inline bool pt_table_install64(struct pt_state *pts, u64 table_entry)
> +{
> +	u64 *entryp = pt_cur_table(pts, u64) + pts->index;
> +	u64 old_entry = pts->entry;
> +	bool ret;
> +
> +	/*
> +	 * Ensure the zero'd table content itself is visible before its PTE can
> +	 * be. release is a NOP on !SMP, but the HW is still doing an acquire.
> +	 */
> +	if (!IS_ENABLED(CONFIG_SMP))
> +		dma_wmb();

Mind elaborating why SMP doesn't need this?

> +/*
> + * PT_WARN_ON is used for invariants that the kunit should be checking can't
> + * happen.
> + */
> +#if IS_ENABLED(CONFIG_DEBUG_GENERIC_PT)
> +#define PT_WARN_ON WARN_ON
> +#else
> +static inline bool PT_WARN_ON(bool condition)
> +{
> +	return false;

Should it "return condition"?

Otherwise, these validations wouldn't be effective?

drivers/iommu/generic_pt/pt_iter.h:388: if (PT_WARN_ON(!pts->table_lower))
drivers/iommu/generic_pt/pt_iter.h-389-         return -EINVAL;
--
drivers/iommu/generic_pt/pt_iter.h-429-
drivers/iommu/generic_pt/pt_iter.h:430: if (PT_WARN_ON(!pt_can_have_table(pts)) ||
drivers/iommu/generic_pt/pt_iter.h:431:     PT_WARN_ON(!pts->table_lower))
drivers/iommu/generic_pt/pt_iter.h-432-         return -EINVAL;

> +/**
> + * pt_load_entry() - Read from the location pts points at into the pts
> + * @pts: Table index to load
> + *
> + * Set the type of entry that was loaded. pts->entry and pts->table_lower
> + * will be filled in with the entry's content.
> + */
> +static inline void pt_load_entry(struct pt_state *pts)
> +{
> +	pts->type = pt_load_entry_raw(pts);
> +	if (pts->type == PT_ENTRY_TABLE)
> +		pts->table_lower = pt_table_ptr(pts);
> +}

I see a couple of callers check pts->type. Maybe it could return
pts->type, matching with pt_load_entry_raw()?

> diff --git a/drivers/iommu/generic_pt/pt_fmt_defaults.h b/drivers/iommu/generic_pt/pt_fmt_defaults.h
> new file mode 100644
> index 00000000000000..19e8f820c1dccf
> --- /dev/null
> +++ b/drivers/iommu/generic_pt/pt_fmt_defaults.h
> @@ -0,0 +1,193 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES
> + *
> + * Default definitions for formats that don't define these functions.
> + */
> +#ifndef __GENERIC_PT_PT_FMT_DEFAULTS_H
> +#define __GENERIC_PT_PT_FMT_DEFAULTS_H
> +
> +#include "pt_defs.h"
> +#include <linux/log2.h>

<> precedes ""

> +#ifndef pt_pgsz_lg2_to_level
> +static inline unsigned int pt_pgsz_lg2_to_level(struct pt_common *common,
> +						unsigned int pgsize_lg2)
> +{
> +	return (pgsize_lg2 - PT_GRANULE_LG2SZ) /
> +	       (PT_TABLEMEM_LG2SZ - ilog2(PT_ITEM_WORD_SIZE));
> +	return 0;

"return 0" should likely be dropped.

> +/*
> + * Format supplies either:
> + *   pt_entry_oa - OA is at the start of a contiguous entry
> + * or
> + *   pt_item_oa  - OA is correct for every item in a contiguous entry

What does the "correct" mean here?

> +/**
> + * pt_range_to_end_index() - Ending index iteration
> + * @pts: Iteration State
> + *
> + * Return: the last index for the iteration in pts.
> + */
> +static inline unsigned int pt_range_to_end_index(const struct pt_state *pts)
> +{
> +	unsigned int isz_lg2 = pt_table_item_lg2sz(pts);
> +	struct pt_range *range = pts->range;
> +	unsigned int num_entries_lg2;
> +
> +	if (range->va == range->last_va)
> +		return pts->index + 1;
> +
> +	if (pts->range->top_level == pts->level)
> +		return log2_div(fvalog2_mod(pts->range->last_va,
> +					    pts->range->max_vasz_lg2),
> +				isz_lg2) +
> +		       1;

How about:
		return 1 + log2_div(...);
?

> +static __always_inline struct pt_range _pt_top_range(struct pt_common *common,
> +						     uintptr_t top_of_table)
> +{
> +	struct pt_range range = {
> +		.common = common,
> +		.top_table =
> +			(struct pt_table_p *)(top_of_table &
> +					      ~(uintptr_t)PT_TOP_LEVEL_MASK),
> +		.top_level = top_of_table % (1 << PT_TOP_LEVEL_BITS),

Since top_level is unsigned, would it be faster to do bitwise:
		.top_level = top_of_table & PT_TOP_LEVEL_MASK,
?

> +/*

/**

> + * pt_walk_descend_all() - Recursively invoke the walker for a table item
> + * @parent_pts: Iteration State
> + * @fn: Walker function to call
> + * @arg: Value to pass to the function
> + *
> + * With pts pointing at a table item this will descend and over the entire lower
> + * table. This creates a new walk and does not alter pts or pts->range.
> + */
> +static __always_inline int
> +pt_walk_descend_all(const struct pt_state *parent_pts, pt_level_fn_t fn,
> +		    void *arg)
-------

> +/**
> + * pt_compute_best_pgsize() - Determine the best page size for leaf entries
> + * @pgsz_bitmap: Permitted page sizes
> + * @va: Starting virtual address for the leaf entry
> + * @last_va: Last virtual address for the leaf entry, sets the max page size
> + * @oa: Starting output address for the leaf entry
> + *
> + * Compute the largest page size for va, last_va, and oa together and return it
> + * in lg2. The largest page size depends on the format's supported page sizes at
> + * this level, and the relative alignment of the VA and OA addresses. 0 means
> + * the OA cannot be stored with the provided pgsz_bitmap.
> + */
> +static inline unsigned int pt_compute_best_pgsize(pt_vaddr_t pgsz_bitmap,
> +						  pt_vaddr_t va,
> +						  pt_vaddr_t last_va,
> +						  pt_oaddr_t oa)
> +{
> +	unsigned int best_pgsz_lg2;
> +	unsigned int pgsz_lg2;
> +	pt_vaddr_t len = last_va - va + 1;
> +	pt_vaddr_t mask;
> +
> +	if (PT_WARN_ON(va >= last_va))
> +		return 0;
> +
> +	/*
> +	 * Given a VA/OA pair the best page size is the largest page side
> +	 * where:
> +	 *
> +	 * 1) VA and OA start at the page. Bitwise this is the count of least
> +	 *    significant 0 bits.
> +	 *    This also implies that last_va/oa has the same prefix as va/oa.
> +	 */
> +	mask = va | oa;
> +
> +	/*
> +	 * 2) The page size is not larger than the last_va (length). Since page
> +	 *    sizes are always power of two this can't be larger than the
> +	 *    largest power of two factor of the length.
> +	 */
> +	mask |= log2_to_int(log2_fls(len) - 1);
> +
> +	best_pgsz_lg2 = log2_ffs(mask);
> +
> +	/* Choose the higest bit <= best_pgsz_lg2 */

highest

> +/* Compute a */
> +#define log2_to_int_t(type, a_lg2) ((type)(((type)1) << (a_lg2)))
> +static_assert(log2_to_int_t(unsigned int, 0) == 1);
> +
> +/* Compute a - 1 (aka all low bits set) */
> +#define log2_to_max_int_t(type, a_lg2) ((type)(log2_to_int_t(type, a_lg2) - 1))
> +
> +/* Compute a / b */
> +#define log2_div_t(type, a, b_lg2) ((type)(((type)a) >> (b_lg2)))
> +static_assert(log2_div_t(unsigned int, 4, 2) == 1);

I skimmed through these macros and its callers. They are mostly
dealing with VA, OA, and mask, which feels like straightforward
bit-masking/shifting operations.

E.g.
log2_to_int_t = 64bit ? BIT_ULL(lg2) : BIT(lg2);
log2_to_max_int_t = 64bit ? GENMASK_ULL(lg2 - 1, 0) : GENMASK(lg2 - 1, 0);

What's the benefit from making them as arithmetic ones?

> diff --git a/include/linux/generic_pt/common.h b/include/linux/generic_pt/common.h
> new file mode 100644
> index 00000000000000..a29bdd7b244de6
> --- /dev/null
> +++ b/include/linux/generic_pt/common.h
> @@ -0,0 +1,134 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES
> + */
> +#ifndef __GENERIC_PT_COMMON_H
> +#define __GENERIC_PT_COMMON_H
> +
> +#include <linux/types.h>
> +#include <linux/build_bug.h>
> +#include <linux/bits.h>

In alphabetical order.

Thanks
Nicolin

next prev parent reply	other threads:[~2025-09-10  3:41 UTC|newest]

Thread overview: 66+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-03 17:46 [PATCH v5 00/15] Consolidate iommu page table implementations (AMD) Jason Gunthorpe
2025-09-03 17:46 ` [PATCH v5 01/15] genpt: Generic Page Table base API Jason Gunthorpe
2025-09-10  3:40   ` Nicolin Chen [this message]
2025-09-15 15:51     ` Jason Gunthorpe
2025-09-18  7:14       ` Nicolin Chen
2025-09-18 14:49         ` Jason Gunthorpe
2025-09-18 19:43           ` Nicolin Chen
2025-09-18  6:49   ` Tian, Kevin
2025-09-18 18:06     ` Jason Gunthorpe
2025-09-19  8:11       ` Tian, Kevin
2025-09-19 14:31         ` Jason Gunthorpe
2025-09-24  9:20           ` Tian, Kevin
2025-09-22 14:45   ` [External] : " ALOK TIWARI
2025-09-22 17:05     ` Jason Gunthorpe
2025-09-03 17:46 ` [PATCH v5 02/15] genpt: Add Documentation/ files Jason Gunthorpe
2025-09-11  4:23   ` Nicolin Chen
2025-09-15 15:42     ` Jason Gunthorpe
2025-09-18  6:55   ` Tian, Kevin
2025-09-19 14:42     ` Jason Gunthorpe
2025-09-24  9:21       ` Tian, Kevin
2025-09-03 17:46 ` [PATCH v5 03/15] iommupt: Add the basic structure of the iommu implementation Jason Gunthorpe
2025-09-11  5:38   ` Nicolin Chen
2025-09-15 15:36     ` Jason Gunthorpe
2025-09-18  6:58   ` Tian, Kevin
2025-09-19 15:26     ` Jason Gunthorpe
2025-09-24  9:22       ` Tian, Kevin
2025-09-03 17:46 ` [PATCH v5 04/15] iommupt: Add the AMD IOMMU v1 page table format Jason Gunthorpe
2025-09-18  7:05   ` Tian, Kevin
2025-09-19 18:19     ` Jason Gunthorpe
2025-09-24  9:23       ` Tian, Kevin
2025-10-07 12:28     ` Jason Gunthorpe
2025-10-08  9:43   ` Vasant Hegde
2025-10-08 13:08     ` Jason Gunthorpe
2025-10-09 11:44       ` Vasant Hegde
2025-09-03 17:46 ` [PATCH v5 05/15] iommupt: Add iova_to_phys op Jason Gunthorpe
2025-09-18  7:08   ` Tian, Kevin
2025-09-19 18:35     ` Jason Gunthorpe
2025-09-03 17:46 ` [PATCH v5 06/15] iommupt: Add unmap_pages op Jason Gunthorpe
2025-09-24  9:28   ` Tian, Kevin
2025-09-24 12:23     ` Jason Gunthorpe
2025-09-26  7:23       ` Tian, Kevin
2025-09-03 17:46 ` [PATCH v5 07/15] iommupt: Add map_pages op Jason Gunthorpe
2025-09-26  7:47   ` Tian, Kevin
2025-09-29 16:44     ` Jason Gunthorpe
2025-10-07 12:08   ` Vasant Hegde
2025-10-07 13:11     ` Jason Gunthorpe
2025-10-08  9:52       ` Vasant Hegde
2025-09-03 17:46 ` [PATCH v5 08/15] iommupt: Add read_and_clear_dirty op Jason Gunthorpe
2025-09-26  7:48   ` Tian, Kevin
2025-09-03 17:46 ` [PATCH v5 09/15] iommupt: Add a kunit test for Generic Page Table Jason Gunthorpe
2025-09-03 17:46 ` [PATCH v5 10/15] iommupt: Add a mock pagetable format for iommufd selftest to use Jason Gunthorpe
2025-09-26  7:50   ` Tian, Kevin
2025-09-03 17:46 ` [PATCH v5 11/15] iommufd: Change the selftest to use iommupt instead of xarray Jason Gunthorpe
2025-09-03 17:46 ` [PATCH v5 12/15] iommupt: Add the x86 64 bit page table format Jason Gunthorpe
2025-09-26  7:57   ` Tian, Kevin
2025-09-29 16:17     ` Jason Gunthorpe
2025-10-08 10:05   ` Vasant Hegde
2025-10-08 13:03     ` Jason Gunthorpe
2025-10-09 11:43       ` Vasant Hegde
2025-09-03 17:46 ` [PATCH v5 13/15] iommu/amd: Use the generic iommu page table Jason Gunthorpe
2025-09-25 12:07   ` Ankit Soni
2025-09-25 12:32     ` Jason Gunthorpe
2025-09-25 12:39       ` Ankit Soni
2025-10-08  9:47   ` Vasant Hegde
2025-09-03 17:46 ` [PATCH v5 14/15] iommu/amd: Remove AMD io_pgtable support Jason Gunthorpe
2025-09-03 17:46 ` [PATCH v5 15/15] iommupt: Add a kunit test for the IOMMU implementation Jason Gunthorpe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aMDzLC9nV47Xvud9@nvidia.com \
    --to=nicolinc@nvidia.com \
    --cc=aik@amd.com \
    --cc=alejandro.j.jimenez@oracle.com \
    --cc=corbet@lwn.net \
    --cc=iommu@lists.linux.dev \
    --cc=jgg@nvidia.com \
    --cc=jgowans@amazon.com \
    --cc=joro@8bytes.org \
    --cc=justinstitt@google.com \
    --cc=kevin.tian@intel.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=llvm@lists.linux.dev \
    --cc=michael.roth@amd.com \
    --cc=morbo@google.com \
    --cc=nathan@kernel.org \
    --cc=nick.desaulniers+lkml@gmail.com \
    --cc=ojeda@kernel.org \
    --cc=pasha.tatashin@soleen.com \
    --cc=patches@lists.linux.dev \
    --cc=robin.murphy@arm.com \
    --cc=shuah@kernel.org \
    --cc=suravee.suthikulpanit@amd.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.