Re: [PATCH v5 01/15] genpt: Generic Page Table base API

patches.lists.linux.dev archive mirror
 help / color / mirror / Atom feed

From: Jason Gunthorpe <jgg@nvidia.com>
To: "Tian, Kevin" <kevin.tian@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>,
	"iommu@lists.linux.dev" <iommu@lists.linux.dev>,
	Joerg Roedel <joro@8bytes.org>,
	Justin Stitt <justinstitt@google.com>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	"linux-kselftest@vger.kernel.org"
	<linux-kselftest@vger.kernel.org>,
	"llvm@lists.linux.dev" <llvm@lists.linux.dev>,
	Bill Wendling <morbo@google.com>,
	Nathan Chancellor <nathan@kernel.org>,
	Nick Desaulniers <nick.desaulniers+lkml@gmail.com>,
	Miguel Ojeda <ojeda@kernel.org>,
	Robin Murphy <robin.murphy@arm.com>,
	Shuah Khan <shuah@kernel.org>,
	Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>,
	Will Deacon <will@kernel.org>, Alexey Kardashevskiy <aik@amd.com>,
	Alejandro Jimenez <alejandro.j.jimenez@oracle.com>,
	James Gowans <jgowans@amazon.com>,
	Michael Roth <michael.roth@amd.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	"patches@lists.linux.dev" <patches@lists.linux.dev>
Subject: Re: [PATCH v5 01/15] genpt: Generic Page Table base API
Date: Thu, 18 Sep 2025 15:06:57 -0300	[thread overview]
Message-ID: <20250918180657.GA2132010@nvidia.com> (raw)
In-Reply-To: <BN9PR11MB527669A84AD24A550FDDD85A8C16A@BN9PR11MB5276.namprd11.prod.outlook.com>

On Thu, Sep 18, 2025 at 06:49:08AM +0000, Tian, Kevin wrote:
> > This is enough to implement the 8 initial format variations with all of
> > their features:
> >  * Entries comprised of contiguous blocks of IO PTEs for larger page
> >    sizes (AMDv1, ARMv8)
> >  * Multi-level tables, up to 6 levels. Runtime selected top level
> >  * Runtime variable table level size (ARM's concatenated tables)
> >  * Expandable top level (AMDv1)
> 
> any more context about this one? how is it different from the earlier
> "runtime selected top level"?

How about:

 * The size of the top table level can be selected at runtime (ARM's
   concatenated tables)
 * The number of levels in the table can optionally increase dynamically
   during map (AMDv1)

> > --- /dev/null
> > +++ b/drivers/iommu/generic_pt/pt_common.h
> > @@ -0,0 +1,355 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES
> > + *
> > + * This header is included after the format. It contains definitions
> > + * that build on the format definitions to create the basic format API.
> > + *
> > + * The format API is listed here, with kdocs, in alphabetical order. The
> 
> Is alphabetical order important here? It's not strictly followed, e.g.:

Let's drop it, I think you are right the grouping is better for the
generated kdoc anyhow.

> > + * functions without bodies are implemented in the format using the
> > pattern:
> > + *     static inline FMTpt_XXX(..) {..}
> > + *     #define pt_XXX FMTpt_XXX
> 
> or provided by pt_fmt_defaults.h

 * The format API is listed here, with kdocs. The functions without bodies are
 * implemented in the format using the pattern:
 *     static inline FMTpt_XXX(..) {..}
 *     #define pt_XXX FMTpt_XXX
 *
 * If the format doesn't implement a function then pt_fmt_defaults.h can provide
 * a generic version.

> > + * The routines marked "@pts: Entry to query" operate on the entire
> > contiguous
> > + * entry and can be called with a pts->index pointing to any sub item that
> > makes
> > + * up that entry.
> > + *
> > + * The header order is:
> > + *  pt_defs.h
> > + *  fmt_XX.h
> 
> s/fmt_XX.h/FMT.h/

Oops, yep

> > +/**
> > + * pt_entry_make_write_dirty() - Make an entry dirty
> > + * @pts: Table index to change
> 
> it's about the entire entry instead of a specific index? if yes then
> "entry to change" makes more sense.

Yes, cut and pasto. All the "_entry_" function should internally
replicate to all the contiguous items.

> > +/**
> > + * pt_entry_oa_full() - Return the full OA for an entry
> > + * @pts: Entry to query
> 
> s/full/exact/?

OK

> > +/**
> > + * pt_has_system_page() - True if level 0 can install a PAGE_SHIFT entry
> > + * @common: Page table to query
> 
> pt_has_system_page_size()

Ok

> > +/**
> > + * pt_install_leaf_entry() - Write a leaf entry to the table
> > + * @pts: Table index to change
> > + * @oa: Output Address for this leaf
> > + * @oasz_lg2: Size in VA for this leaf
> > + * @attrs: Attributes to modify the entry
> > + *
> > + * A leaf OA entry will return PT_ENTRY_OA from pt_load_entry(). It
> > translates
> > + * the VA indicated by pts to the given OA.
> > + *
> > + * For a single item non-contiguous entry oasz_lg2 is pt_table_item_lg2sz().
> > + * For contiguous it is pt_table_item_lg2sz() + num_contig_lg2.
> 
> this sounds a fixed thing then could be judged within the function instead of
> letting the caller to set?

The point is the caller specifies the size of the entry using
oasz_lg2, the documentation above is explaining how that math works.

> > +/**
> > + * pt_max_output_address_lg2() - Return the maximum OA the table
> > format can hold
> > + * @common: Page table to query
> 
> pt_max_oa_lg2()

Ok

> > + *  leaf
> > + *     An entry that results in an output address. I.e. a physical memory addr
> 
> "I.e. a physical ..." is redundant to what OA already explains

Yep

> > + *  table
> > + *     A linear array of entries representing the translation items for that
> > + *     level.
> 
> to not mix 'entry' and 'item' in one description:
> 
> "A linear array of translation items for that level"

Ok

> > + *  item
> > + *     A single position in a table
> 
> 'position' is called 'index'

Yep

> > + *  entry
> > + *     A single logical element in a table. If contiguous pages are not
> > + *     supported then item and entry are the same thing, otherwise entry
> > refers
> > + *     to the all the items that comprise a single contiguous translation.
> 
> 'refers to all the items"

Yep

> > + *  item/entry_size
> > + *     The number of bytes of VA the table translates for.
> > + *     If the item is a table entry then the next table covers
> > + *     this size. If the entry is an output address then the
> 
> s/is/translates/

Don't follow?

> > +enum pt_entry_type {
> > +	PT_ENTRY_EMPTY,
> > +	PT_ENTRY_TABLE,
> 
> add a comment to be consistent with following line

	/* Item points to a lower table level */

> > +/*
> > + * Try to install a new table pointer. The locking methodology requires this
> > to
> > + * be atomic (multiple threads can race to install a pointer) the losing
> > threads
> 
> "... install a pointer). The losing threads..."

Yep

> > +static inline bool pt_feature(const struct pt_common *common,
> > +			      unsigned int feature_nr)
> > +{
> > +	if (PT_FORCE_ENABLED_FEATURES & BIT(feature_nr))
> > +		return true;
> > +	if (!PT_SUPPORTED_FEATURE(feature_nr))
> > +		return false;
> > +	return common->features & BIT(feature_nr);
> > +}
> 
> common->features is already verified in pt_init_common(). So above is
> kind of an optimization using compiler to filter out static checks in fast
> path?

Yes.

> > +/*
> > + * PT_WARN_ON is used for invariants that the kunit should be checking
> > can't
> > + * happen.
> > + */
> > +#if IS_ENABLED(CONFIG_DEBUG_GENERIC_PT)
> > +#define PT_WARN_ON WARN_ON
> > +#else
> > +static inline bool PT_WARN_ON(bool condition)
> > +{
> > +	return false;
> > +}
> > +#endif
> 
> Then call it PT_DBG_WARN_ON() to be more explicit?

Ah, I'd rather leave it..
 
> btw looks there is no plain WARN_ON() used in generic-pt. Just be curious
> about the rationale behind. Is it a new trend to contain all warnings under
> a debug option?

It is sort of like VM_WARN_ON(), the places that got put are largely
performance path so you don't want it enabled unless doing debugging.

Generally the idea is to use PT_WARN_ON() on performance path cases
only, and leave normal stuff to normal WARN_ON.

> > +/* These all work on the VA type */
> > +#define log2_to_int(a_lg2) log2_to_int_t(pt_vaddr_t, a_lg2)
> > +#define log2_to_max_int(a_lg2) log2_to_max_int_t(pt_vaddr_t, a_lg2)
> > +#define log2_div(a, b_lg2) log2_div_t(pt_vaddr_t, a, b_lg2)
> > +#define log2_div_eq(a, b, c_lg2) log2_div_eq_t(pt_vaddr_t, a, b, c_lg2)
> > +#define log2_mod(a, b_lg2) log2_mod_t(pt_vaddr_t, a, b_lg2)
> > +#define log2_mod_eq_max(a, b_lg2) log2_mod_eq_max_t(pt_vaddr_t, a,
> > b_lg2)
> > +#define log2_set_mod(a, val, b_lg2) log2_set_mod_t(pt_vaddr_t, a, val,
> > b_lg2)
> > +#define log2_set_mod_max(a, b_lg2) log2_set_mod_max_t(pt_vaddr_t, a,
> > b_lg2)
> > +#define log2_mul(a, b_lg2) log2_mul_t(pt_vaddr_t, a, b_lg2)
> > +#define log2_ffs(a) log2_ffs_t(pt_vaddr_t, a)
> > +#define log2_fls(a) log2_fls_t(pt_vaddr_t, a)
> > +#define log2_ffz(a) log2_ffz_t(pt_vaddr_t, a)
> 
> above three are not related to log2

Hrm, maybe not, but I also don't have a better name.. It needs the
type multiplexer. Let me think on it.

> > +
> > +/* If not supplied by the format then contiguous pages are not supported */
> > +#ifndef pt_entry_num_contig_lg2
> > +static inline unsigned int pt_entry_num_contig_lg2(const struct pt_state
> > *pts)
> > +{
> > +	return ilog2(1);
> > +}
> > +
> > +static inline unsigned short pt_contig_count_lg2(const struct pt_state *pts)
> > +{
> > +	return ilog2(1);
> > +}
> 
> what is the difference between above two helpers?

Oh, pt_contig_count_lg2 didn't get kdocs because they are internal
helpers to build other functions..

Like this:

/*
 * If not supplied by the format then contiguous pages are not supported.
 *
 * If contiguous pages are supported then the format must also provide
 * pt_contig_count_lg2() if it supports a single contiguous size per level,
 * or pt_possible_sizes() if it supports multiple sizes per level.
 */
#ifndef pt_entry_num_contig_lg2
static inline unsigned int pt_entry_num_contig_lg2(const struct pt_state *pts)
{
	return ilog2(1);
}

/*
 * Return the number of contiguous OA items forming an entry at this table level
 */
static inline unsigned short pt_contig_count_lg2(const struct pt_state *pts)
{
	return ilog2(1);
}
#endif

> It's currently not implemented by any driver so will have the default version
> returning 0. and it is only used by default pt_possible_sizes(), which then
> returns only one page size accordingly.

ARM and RISCV use it. AMD is the only format that support more than
one size of contiguous per level.

> > +#ifndef pt_pgsz_lg2_to_level
> > +static inline unsigned int pt_pgsz_lg2_to_level(struct pt_common *common,
> > +						unsigned int pgsize_lg2)
> > +{
> > +	return (pgsize_lg2 - PT_GRANULE_LG2SZ) /
> > +	       (PT_TABLEMEM_LG2SZ - ilog2(PT_ITEM_WORD_SIZE));
> > +	return 0;
> > +}
> > +#endif
>
> remove the 2nd 'return'

Yep

> > +/* If not supplied by the format then dirty tracking is not supported */
> > +#ifndef pt_entry_write_is_dirty
> > +static inline bool pt_entry_write_is_dirty(const struct pt_state *pts)
> > +{
> > +	return false;
> > +}
> > +
> > +static inline void pt_entry_set_write_clean(struct pt_state *pts)
> > +{
> > +}
> > +
> > +static inline bool pt_dirty_supported(struct pt_common *common)
> > +{
> > +	return true;
> 
> should return false here.

Huh.. Seems like yes. This is only used by the kunit, but I have to
figure out why the kunit doesn't fail..
 
> > + * Format supplies either:
> > + *   pt_entry_oa - OA is at the start of a contiguous entry
> > + * or
> > + *   pt_item_oa  - OA is correct for every item in a contiguous entry
> 
> what is the meaning of 'correct'?

I will use 'adjusted' instead.

> I have a problem understanding _pt_entry_oa_fast() here.
> 
> Obviously pt_entry_oa/pt_item_oa generates different oa for
> a given pts, based on the aligned size. why is it ok to alias
> a common macro to either of them? looks the assumption
> is that the caller doesn't care about the offset within the
> entry range e.g. will do its own masking. 

Yes, this is correct. 

> Probably some comment is welcomed to clarify it.

 * The internal helper _pt_entry_oa_fast() allows generating
 * an efficient pt_entry_oa_exact(), it doesn't care which
 * option is selected.

> > +#ifndef pt_has_system_page
> > +static inline bool pt_has_system_page(const struct pt_common *common)
> > +{
> > +	return PT_GRANULE_LG2SZ == PAGE_SHIFT;
> > +}
> > +#endif
> 
> will there be a implementation supporting system page size while breaking
> above check? if not it could be moved to pt_common.h

Yes, DART does in my draft and I expect to rework ARMv8 to have
variable page sizes.

> > +
> > +/**
> > + * pt_item_fully_covered() - Check if the item or entry is entirely contained
> > + *                           within pts->range
> 
> when using pts it's more accurate to call it pt_entry_fully_covered()

Not so much related to PTS, as pts could be either, but it makes more
sense to refer to table poitner as an entry than does it to refer to a
contiguous entry an item.

> > +	PT_FEAT_FLUSH_RANGE,
> > +	/**
> > +	 * @PT_FEAT_FLUSH_RANGE_NO_GAPS: Like PT_FEAT_FLUSH_RANGE
> > except that
> > +	 * the optimization objective is to only flush IOVA that has been
> > +	 * changed. This mode is suitable for cases like hypervisor shadowing
> > +	 * where flushing unchanged ranges may cause the hypervisor to
> > reparse
> > +	 * significant amount of page table.
> > +	 */
> > +	PT_FEAT_FLUSH_RANGE_NO_GAPS,
> 
> FLUSH_RANGE and FLUSH_RANGE_NO_GAPS are mutually exclusive but
> one format must select one? then we could just keep one flag (NO_GAP)
> then feature off means FLUSH_RANGE.

I think at least one or two more flushing modes will be needed, and
they would be mutually exclusive. It is setup with this one hot
encoding because of how the pt_feature function compile optimization
works.

Thanks,
Jason

next prev parent reply	other threads:[~2025-09-18 18:07 UTC|newest]

Thread overview: 66+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-03 17:46 [PATCH v5 00/15] Consolidate iommu page table implementations (AMD) Jason Gunthorpe
2025-09-03 17:46 ` [PATCH v5 01/15] genpt: Generic Page Table base API Jason Gunthorpe
2025-09-10  3:40   ` Nicolin Chen
2025-09-15 15:51     ` Jason Gunthorpe
2025-09-18  7:14       ` Nicolin Chen
2025-09-18 14:49         ` Jason Gunthorpe
2025-09-18 19:43           ` Nicolin Chen
2025-09-18  6:49   ` Tian, Kevin
2025-09-18 18:06     ` Jason Gunthorpe [this message]
2025-09-19  8:11       ` Tian, Kevin
2025-09-19 14:31         ` Jason Gunthorpe
2025-09-24  9:20           ` Tian, Kevin
2025-09-22 14:45   ` [External] : " ALOK TIWARI
2025-09-22 17:05     ` Jason Gunthorpe
2025-09-03 17:46 ` [PATCH v5 02/15] genpt: Add Documentation/ files Jason Gunthorpe
2025-09-11  4:23   ` Nicolin Chen
2025-09-15 15:42     ` Jason Gunthorpe
2025-09-18  6:55   ` Tian, Kevin
2025-09-19 14:42     ` Jason Gunthorpe
2025-09-24  9:21       ` Tian, Kevin
2025-09-03 17:46 ` [PATCH v5 03/15] iommupt: Add the basic structure of the iommu implementation Jason Gunthorpe
2025-09-11  5:38   ` Nicolin Chen
2025-09-15 15:36     ` Jason Gunthorpe
2025-09-18  6:58   ` Tian, Kevin
2025-09-19 15:26     ` Jason Gunthorpe
2025-09-24  9:22       ` Tian, Kevin
2025-09-03 17:46 ` [PATCH v5 04/15] iommupt: Add the AMD IOMMU v1 page table format Jason Gunthorpe
2025-09-18  7:05   ` Tian, Kevin
2025-09-19 18:19     ` Jason Gunthorpe
2025-09-24  9:23       ` Tian, Kevin
2025-10-07 12:28     ` Jason Gunthorpe
2025-10-08  9:43   ` Vasant Hegde
2025-10-08 13:08     ` Jason Gunthorpe
2025-10-09 11:44       ` Vasant Hegde
2025-09-03 17:46 ` [PATCH v5 05/15] iommupt: Add iova_to_phys op Jason Gunthorpe
2025-09-18  7:08   ` Tian, Kevin
2025-09-19 18:35     ` Jason Gunthorpe
2025-09-03 17:46 ` [PATCH v5 06/15] iommupt: Add unmap_pages op Jason Gunthorpe
2025-09-24  9:28   ` Tian, Kevin
2025-09-24 12:23     ` Jason Gunthorpe
2025-09-26  7:23       ` Tian, Kevin
2025-09-03 17:46 ` [PATCH v5 07/15] iommupt: Add map_pages op Jason Gunthorpe
2025-09-26  7:47   ` Tian, Kevin
2025-09-29 16:44     ` Jason Gunthorpe
2025-10-07 12:08   ` Vasant Hegde
2025-10-07 13:11     ` Jason Gunthorpe
2025-10-08  9:52       ` Vasant Hegde
2025-09-03 17:46 ` [PATCH v5 08/15] iommupt: Add read_and_clear_dirty op Jason Gunthorpe
2025-09-26  7:48   ` Tian, Kevin
2025-09-03 17:46 ` [PATCH v5 09/15] iommupt: Add a kunit test for Generic Page Table Jason Gunthorpe
2025-09-03 17:46 ` [PATCH v5 10/15] iommupt: Add a mock pagetable format for iommufd selftest to use Jason Gunthorpe
2025-09-26  7:50   ` Tian, Kevin
2025-09-03 17:46 ` [PATCH v5 11/15] iommufd: Change the selftest to use iommupt instead of xarray Jason Gunthorpe
2025-09-03 17:46 ` [PATCH v5 12/15] iommupt: Add the x86 64 bit page table format Jason Gunthorpe
2025-09-26  7:57   ` Tian, Kevin
2025-09-29 16:17     ` Jason Gunthorpe
2025-10-08 10:05   ` Vasant Hegde
2025-10-08 13:03     ` Jason Gunthorpe
2025-10-09 11:43       ` Vasant Hegde
2025-09-03 17:46 ` [PATCH v5 13/15] iommu/amd: Use the generic iommu page table Jason Gunthorpe
2025-09-25 12:07   ` Ankit Soni
2025-09-25 12:32     ` Jason Gunthorpe
2025-09-25 12:39       ` Ankit Soni
2025-10-08  9:47   ` Vasant Hegde
2025-09-03 17:46 ` [PATCH v5 14/15] iommu/amd: Remove AMD io_pgtable support Jason Gunthorpe
2025-09-03 17:46 ` [PATCH v5 15/15] iommupt: Add a kunit test for the IOMMU implementation Jason Gunthorpe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250918180657.GA2132010@nvidia.com \
    --to=jgg@nvidia.com \
    --cc=aik@amd.com \
    --cc=alejandro.j.jimenez@oracle.com \
    --cc=corbet@lwn.net \
    --cc=iommu@lists.linux.dev \
    --cc=jgowans@amazon.com \
    --cc=joro@8bytes.org \
    --cc=justinstitt@google.com \
    --cc=kevin.tian@intel.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=llvm@lists.linux.dev \
    --cc=michael.roth@amd.com \
    --cc=morbo@google.com \
    --cc=nathan@kernel.org \
    --cc=nick.desaulniers+lkml@gmail.com \
    --cc=ojeda@kernel.org \
    --cc=pasha.tatashin@soleen.com \
    --cc=patches@lists.linux.dev \
    --cc=robin.murphy@arm.com \
    --cc=shuah@kernel.org \
    --cc=suravee.suthikulpanit@amd.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).