Linux Documentation
 help / color / mirror / Atom feed
* Re: [RFC PATCH v2 02/14] kcov: fix INIT_TRACK race in kcov_dataflow
From: Yunseong Kim @ 2026-06-12  7:25 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Andrey Konovalov,
	Dmitry Vyukov, Andrew Morton, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Nathan Chancellor, Nicolas Schier,
	Nick Desaulniers, Bill Wendling, Justin Stitt, Kees Cook,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Jonathan Corbet, Shuah Khan, linux-kernel, kasan-dev,
	rust-for-linux, linux-kbuild, llvm, linux-mm, linux-kselftest,
	workflows, linux-doc, Yeoreum Yun, sashiko-bot
In-Reply-To: <CAG_fn=V1+_xLgCZgdLnT7Y-muRO0CXkrNKkC8AzrqzWoL4eR8w@mail.gmail.com>

Hi Alexander,

> On Thu, Jun 11, 2026 at 6:21 PM Yunseong Kim <yunseong.kim@est.tech> wrote:
>>
>> [snip...]
>> Reported-by: sashiko-bot <sashiko-bot@kernel.org>
>> Closes: https://sashiko.dev/#/patchset/20260603-kcov-dataflow-next-20260603-v2-0-fee0939de2c4%40est.tech
>> Signed-off-by: Yunseong Kim <yunseong.kim@est.tech>
> 
> Can we please avoid this?
> kcov_dataflow.c is being introduced in the same series, there is no
> need to send a buggy commit and a follow-up fix - just squash the two
> together and note the changes after Signed-off-by: separated by a
> triple dash.

Thank you for your guide. I'll remove it in the next patch set.

Best regards,
Yunseong

^ permalink raw reply

* Re: [PATCH v2 0/4] mm: split the file's i_mmap tree for NUMA
From: Huang Shijie @ 2026-06-12  7:02 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, viro, brauner, jack, muchun.song, osalvador, david, surenb,
	mjguzik, liam, vbabka, shakeel.butt, rppt, mhocko, corbet, skhan,
	linux, dinguyen, schuster.simon, James.Bottomley, deller, djbw,
	willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei
In-Reply-To: <airY5q_SspdbQDbi@lucifer>

On Thu, Jun 11, 2026 at 05:00:49PM +0100, Lorenzo Stoakes wrote:
> Hi Huang,
> 
> You seem to be replacing the file rmap altogether here, so you really ought
> to have sent this as an RFC so we could discuss it as a community first.
No problem.

> 
> Especially so as Pedro had publicly mentioned his plans to implement
> something similar here, so coordination would have been appreciated.
Yes. I am very happy to work with Pedro.

> 
> Anyway, as Pedro has pointed out, the code is overly complicated, it's far
> too configurable (not always a good thing), and the locking implementation
> is questionable.
I can make the code more simple. :)

> 
> You seem to be adding a whole bunch of open-coded complexity too, which is
> not something we want. Abstraction is key for the rmap.
> 
> You're also not adding any kdoc comments or really many comments at all,
> and you've not added any tests (though perhaps it's difficult given how
> core this is).
Got it.

> 
> So I would suggest that perhaps any respin should be sent as an RFC so we
> can engage in that conversation and ensure we're all on the same page?
> 
> Especially since Pedro plans to send an alternative, simpler, solution I
> believe.
> 
> It's also not helpful that you haven't examined the non-NUMA case :)
> perhaps your particular server behaves a certain way that this approach
> aids, but regresses other NUMA configurations?

emm. I ever hoped someone can help me to test this patch set on the non-NUMA
server.

It seems I should find some non-NUMA server before I send out the patch set. :)

> 
> We'd really need to be sure of this before accepting invasive changes like
> this.
Okay.

Thanks
Huang Shijie


^ permalink raw reply

* Re: [RFC PATCH v2 02/14] kcov: fix INIT_TRACK race in kcov_dataflow
From: Alexander Potapenko @ 2026-06-12  6:55 UTC (permalink / raw)
  To: Yunseong Kim
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Andrey Konovalov,
	Dmitry Vyukov, Andrew Morton, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Nathan Chancellor, Nicolas Schier,
	Nick Desaulniers, Bill Wendling, Justin Stitt, Kees Cook,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Jonathan Corbet, Shuah Khan, linux-kernel, kasan-dev,
	rust-for-linux, linux-kbuild, llvm, linux-mm, linux-kselftest,
	workflows, linux-doc, Yeoreum Yun, sashiko-bot
In-Reply-To: <20260611-b4-kcov-dataflow-v2-v2-2-0a261da3987c@est.tech>

On Thu, Jun 11, 2026 at 6:21 PM Yunseong Kim <yunseong.kim@est.tech> wrote:
>
> Two threads calling KCOV_DF_INIT_TRACK concurrently could both observe
> df->area == NULL, drop the lock to allocate, and then both assign their
> allocation to df->area, leaking one buffer.
>
> Fix by rechecking df->area after re-acquiring the lock. If another
> thread won the race, free the allocation and return -EBUSY. This
> matches the pattern used by KCOV_INIT_TRACE in kernel/kcov.c.
>
> Reported-by: sashiko-bot <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260603-kcov-dataflow-next-20260603-v2-0-fee0939de2c4%40est.tech
> Signed-off-by: Yunseong Kim <yunseong.kim@est.tech>

Can we please avoid this?
kcov_dataflow.c is being introduced in the same series, there is no
need to send a buggy commit and a follow-up fix - just squash the two
together and note the changes after Signed-off-by: separated by a
triple dash.

^ permalink raw reply

* Re: [PATCH v6 01/12] PCI: liveupdate: Set up FLB handler for the PCI core
From: Mike Rapoport @ 2026-06-12  6:54 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: David Matlack, kexec, linux-doc, linux-kernel, linux-mm,
	linux-pci, Adithya Jayachandran, Alexander Graf, Alex Williamson,
	Bjorn Helgaas, Chris Li, David Rientjes, Jacob Pan,
	Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
	Lukas Wunner, Parav Pandit, Pranjal Shrivastava, Pratyush Yadav,
	Saeed Mahameed, Samiullah Khawaja, Shuah Khan, Vipin Sharma,
	William Tu, Yi Liu
In-Reply-To: <178124130274.908199.14827357870284807134.b4-review@b4>

On Fri, Jun 12, 2026 at 05:15:02AM +0000, Pasha Tatashin wrote:
> On Fri, 22 May 2026 20:23:59 +0000, David Matlack <dmatlack@google.com> wrote:
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 2fb1c75afd16..6c618830cf61 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -20530,6 +20530,16 @@ L:	linux-pci@vger.kernel.org
> >  S:	Supported
> >  F:	Documentation/PCI/pci-error-recovery.rst
> >  
> > +PCI LIVE UPDATE
> > +M:	David Matlack <dmatlack@google.com>
> 
> Please add Pratyush, Mike, and myself so we are notified directly of 
> incoming patches, the same as with other areas where the liveupdate/ 
> tree is specified.

Or we can add PCI liveupdate files to LIVEUPDATE entry.

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v2 3/4] mm/fs: split the file's i_mmap tree
From: Huang Shijie @ 2026-06-12  6:44 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: akpm, viro, brauner, jack, muchun.song, osalvador, david, surenb,
	mjguzik, liam, ljs, vbabka, shakeel.butt, rppt, mhocko, corbet,
	skhan, linux, dinguyen, schuster.simon, James.Bottomley, deller,
	djbw, willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, riel, harry,
	will, brian.ruley, rmk+kernel, dave.anglin, linux-mm, linux-doc,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-fsdevel,
	nvdimm, linux-perf-users, linux-trace-kernel, zhongyuan,
	fangbaoshun, yingzhiwei
In-Reply-To: <aiqFgGbIo1Psy3pI@pedro-suse.lan>

On Thu, Jun 11, 2026 at 12:11:27PM +0100, Pedro Falcato wrote:
> Hi,
> 
> On Thu, Jun 11, 2026 at 02:18:59PM +0800, Huang Shijie wrote:
> > In the UnixBench tests, there is a test "execl" which tests
> > the execve system call.
> >   For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> > When we test our server with "./Run -c 384 execl",
> > the test result is not good enough. The i_mmap locks contended heavily on
> > "libc.so" and "ld.so". The i_mmap tree for "libc.so" can be
> > over 6000 VMAs, all the VMAs can be in different NUMA mode. The insert/remove
> > operations do not run quickly enough.
> 
> I _really_ would have appreciated some coordination here, because I said I was
> going to take a look at it. I have something that I think is much simpler
Okay, no problem. 

I waited for more then a month, I thought you are busy at other
things. So I spent more then a week to finish the patch set v2.


> in practice. These patches are also way too complex to be dropped just before
> the merge window.
> 
> Some comments:
> 
> > 
> >  In order to reduce the competition of the i_mmap lock, this patch does
> > following:
> >    1.) Split the single i_mmap tree into several sibling trees:
> >        Each tree has a lock. The CONFIG_SPLIT_I_MMAP is used to
> >        turn on/off this feature.
> 
> There is no need for a config option. This needs to Just Work.
> 
> >    2.) Introduce a new field "tree_idx" for vm_area_struct to save the
> >        sibling tree index for this VMA.
> 
> This is possibly contentious, but there are holes in vm_area_struct.
> So I think this is fine.
> 
> >    3.) Introduce a new field "vma_count" for address_space.
> >        The new mapping_mapped() will use it.
> >    4.) Rewrite the vma_interval_tree_foreach()
> >    5.) Rewrite the lock functions.	
> > 
> >  After this patch, the VMA insert/remove operations will work faster,
> > and we can get over 400% performance improvement with the above test.
> > 
> > Signed-off-by: Huang Shijie <huangsj@hygon.cn>
> > ---
> >  fs/Kconfig               |   8 ++
> >  fs/hugetlbfs/inode.c     |  20 ++++-
> >  fs/inode.c               |  75 ++++++++++++++++-
> >  include/linux/fs.h       | 174 ++++++++++++++++++++++++++++++++++++++-
> >  include/linux/mm.h       |  80 ++++++++++++++++++
> >  include/linux/mm_types.h |   3 +
> >  mm/internal.h            |   3 +-
> >  mm/mmap.c                |  11 ++-
> >  mm/nommu.c               |  23 ++++--
> >  mm/pagewalk.c            |   2 +-
> >  mm/vma.c                 |  72 +++++++++++-----
> >  mm/vma_init.c            |   3 +
> >  12 files changed, 436 insertions(+), 38 deletions(-)
> > 
> > diff --git a/fs/Kconfig b/fs/Kconfig
> > index 43cb06de297f..e24804f70432 100644
> > --- a/fs/Kconfig
> > +++ b/fs/Kconfig
> > @@ -9,6 +9,14 @@ menu "File systems"
> >  config DCACHE_WORD_ACCESS
> >         bool
> >  
> > +config SPLIT_I_MMAP
> > +	bool "Split the file's i_mmap to several trees"
> > +	default n
> > +	help
> > +	  Split the file's i_mmap to several trees, each tree has a separate
> > +	  lock. This will reduce the lock contention of file's i_mmap tree,
> > +	  but it will cost more memory for per inode.
> > +
> >  config VALIDATE_FS_PARSER
> >  	bool "Validate filesystem parameter description"
> >  	help
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index da5b41ea5bdd..68d8308418dd 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -891,6 +891,23 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
> >   */
> >  static struct lock_class_key hugetlbfs_i_mmap_rwsem_key;
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +static void hugetlbfs_lockdep_set_class(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		lockdep_set_class(&mapping->i_mmap[i].rwsem,
> > +				&hugetlbfs_i_mmap_rwsem_key);
> > +	}
> > +}
> > +#else
> > +static void hugetlbfs_lockdep_set_class(struct address_space *mapping)
> > +{
> > +	lockdep_set_class(&mapping->i_mmap_rwsem, &hugetlbfs_i_mmap_rwsem_key);
> > +}
> > +#endif
> > +
> >  static struct inode *hugetlbfs_get_inode(struct super_block *sb,
> >  					struct mnt_idmap *idmap,
> >  					struct inode *dir,
> > @@ -915,8 +932,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
> >  
> >  		inode->i_ino = get_next_ino();
> >  		inode_init_owner(idmap, inode, dir, mode);
> > -		lockdep_set_class(&inode->i_mapping->i_mmap_rwsem,
> > -				&hugetlbfs_i_mmap_rwsem_key);
> > +		hugetlbfs_lockdep_set_class(inode->i_mapping);
> >  		inode->i_mapping->a_ops = &hugetlbfs_aops;
> >  		simple_inode_init_ts(inode);
> >  		info->resv_map = resv_map;
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 62c579a0cf7d..cb67ae83f5b3 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -214,6 +214,70 @@ static int no_open(struct inode *inode, struct file *file)
> >  	return -ENXIO;
> >  }
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +int split_tree_num;
> > +static int split_tree_align __maybe_unused = 32;
> > +
> > +static void __init init_split_tree_num(void)
> > +{
> > +#ifdef CONFIG_NUMA
> > +	split_tree_num = nr_node_ids;
> > +#else
> > +	split_tree_num = ALIGN(nr_cpu_ids, split_tree_align);
> > +#endif
> > +}
> 
> Again, too configurable. I think you're too stuck up on the NUMA case -

If you do not care about the NUMA. The performance will _NOT_ get improved
in our NUMA server. I had ever tested code which do not care about the NUMA,
and I got a bad performance. Avoid the remote access is a very important
thing for the NUMA server.

> which does not matter for many people - and may actively harm NUMA users. If
> I have a 128 core 2 NUMA node system, what should I shard by?
It is easy to extend the tree number for NUMA. :)

For the 128 core 2 NUMA, we can extend to more trees, such as:
   Two trees for each NUMA node.

> 
> > +
> > +static void free_mapping_i_mmap(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	if (!mapping->i_mmap)
> > +		return;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		kfree(mapping->i_mmap[i]);
> > +
> > +	kfree(mapping->i_mmap);
> > +	mapping->i_mmap = NULL;
> > +}
> > +
> > +static int init_mapping_i_mmap(struct address_space *mapping, gfp_t gfp)
> > +{
> > +	struct i_mmap_tree *tree;
> > +	int i;
> > +
> > +	/* The extra one is used as terminator in vma_interval_tree_foreach() */
> > +	mapping->i_mmap = kzalloc(sizeof(tree) * (split_tree_num + 1), gfp);
> > +	if (!mapping->i_mmap)
> > +		return -ENOMEM;
> > +
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		tree = kzalloc_node(sizeof(*tree), gfp, i);
> > +		if (!tree)
> > +			goto nomem;
> > +
> > +		tree->root = RB_ROOT_CACHED;
> > +		init_rwsem(&tree->rwsem);
> 
> This (as-is) should blow up with lockdep + the locking loops down there.
okay, I will check it later.

thanks a lot.
> 
> > +
> > +		mapping->i_mmap[i] = tree;
> > +	}
> > +	return 0;
> > +nomem:
> > +	free_mapping_i_mmap(mapping);
> > +	return -ENOMEM;
> > +}
> 
> Honestly, it's likely that a simple static array in struct address_space
The array size is not fixed, so we cannot add a static array in address_space.

> suffices. I would not go through the trouble of getting everything very
> tight and NUMA correct.
> 
> > +#else
> > +static int init_mapping_i_mmap(struct address_space *mapping, gfp_t gfp)
> > +{
> > +	mapping->i_mmap = RB_ROOT_CACHED;
> > +	init_rwsem(&mapping->i_mmap_rwsem);
> > +	return 0;
> > +}
> > +
> > +static void free_mapping_i_mmap(struct address_space *mapping) { }
> > +static void __init init_split_tree_num(void) {}
> > +#endif
> > +
> >  /**
> >   * inode_init_always_gfp - perform inode structure initialisation
> >   * @sb: superblock inode belongs to
> > @@ -302,9 +366,14 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
> >  #endif
> >  	inode->i_flctx = NULL;
> >  
> > -	if (unlikely(security_inode_alloc(inode, gfp)))
> > +	if (init_mapping_i_mmap(mapping, gfp))
> >  		return -ENOMEM;
> >  
> > +	if (unlikely(security_inode_alloc(inode, gfp))) {
> > +		free_mapping_i_mmap(mapping);
> > +		return -ENOMEM;
> > +	}
> > +
> >  	this_cpu_inc(nr_inodes);
> >  
> >  	return 0;
> > @@ -380,6 +449,7 @@ void __destroy_inode(struct inode *inode)
> >  	if (inode->i_default_acl && !is_uncached_acl(inode->i_default_acl))
> >  		posix_acl_release(inode->i_default_acl);
> >  #endif
> > +	free_mapping_i_mmap(&inode->i_data);
> >  	this_cpu_dec(nr_inodes);
> >  }
> >  EXPORT_SYMBOL(__destroy_inode);
> > @@ -480,9 +550,7 @@ EXPORT_SYMBOL(inc_nlink);
> >  static void __address_space_init_once(struct address_space *mapping)
> >  {
> >  	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
> > -	init_rwsem(&mapping->i_mmap_rwsem);
> >  	spin_lock_init(&mapping->i_private_lock);
> > -	mapping->i_mmap = RB_ROOT_CACHED;
> >  }
> >  
> >  void address_space_init_once(struct address_space *mapping)
> > @@ -2619,6 +2687,7 @@ void __init inode_init(void)
> >  					&i_hash_mask,
> >  					0,
> >  					0);
> > +	init_split_tree_num();
> >  }
> >  
> >  void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index cd46615b8f53..f4b3645b61df 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -450,6 +450,25 @@ struct mapping_metadata_bhs {
> >  	struct list_head list;	/* The list of bhs (b_assoc_buffers) */
> >  };
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +/*
> > + * struct i_mmap_tree - A single sibling tree of the file's split i_mmap.
> > + * @root: The red/black interval tree root.
> > + * @rwsem: Protects insert/remove operations on this sibling tree.
> > + * @vma_count: Number of VMAs in this sibling tree.
> > + *
> > + * When CONFIG_SPLIT_I_MMAP is enabled, the file's single i_mmap tree is
> > + * split into split_tree_num sibling trees, each with its own lock. This
> > + * reduces lock contention by allowing concurrent VMA insert/remove
> > + * operations on different sibling trees.
> > + */
> > +struct i_mmap_tree {
> > +	struct rb_root_cached	root;
> > +	struct rw_semaphore	rwsem;
> > +	atomic_t		vma_count;
> 
> I don't see what you need this vma_count for? I get the one in address_space,
> but this one does not seem useful.
For non-NUMA case, we can use it to determine which tree we should put the new
VMA.
Round-robin is not good enough for a dynamic system.

> 
> > +};
> > +#endif
> > +
> >  /**
> >   * struct address_space - Contents of a cacheable, mappable object.
> >   * @host: Owner, either the inode or the block_device.
> > @@ -461,8 +480,13 @@ struct mapping_metadata_bhs {
> >   * @gfp_mask: Memory allocation flags to use for allocating pages.
> >   * @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
> >   * @nr_thps: Number of THPs in the pagecache (non-shmem only).
> > - * @i_mmap: Tree of private and shared mappings.
> > - * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
> > + * @i_mmap: Tree of private and shared mappings. When CONFIG_SPLIT_I_MMAP
> > + *   is enabled, this is an array of split_tree_num struct i_mmap_tree
> > + *   pointers (plus a NULL terminator).
> 
> NULL terminator wastes more memory, so I would really strongly avoid it as
> well.
any better idea?

> 
> > + * @vma_count: Total number of VMAs across all sibling trees (only when
> > + *   CONFIG_SPLIT_I_MMAP is enabled). Used by mapping_mapped().
> > + * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable (only when
> > + *   CONFIG_SPLIT_I_MMAP is disabled; otherwise per-tree rwsem is used).
> 
> So, there are very good reasons why you still need an i_mmap_rwsem protecting
> state, even with split mmap trees. Which I'll go into later.
> 
> >   * @nrpages: Number of page entries, protected by the i_pages lock.
> >   * @writeback_index: Writeback starts here.
> >   * @a_ops: Methods.
> > @@ -480,14 +504,19 @@ struct address_space {
> >  	/* number of thp, only for non-shmem files */
> >  	atomic_t		nr_thps;
> >  #endif
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +	struct i_mmap_tree	**i_mmap;
> > +	atomic_t		vma_count;
> > +#else
> >  	struct rb_root_cached	i_mmap;
> > +	struct rw_semaphore	i_mmap_rwsem;
> > +#endif
> >  	unsigned long		nrpages;
> >  	pgoff_t			writeback_index;
> >  	const struct address_space_operations *a_ops;
> >  	unsigned long		flags;
> >  	errseq_t		wb_err;
> >  	spinlock_t		i_private_lock;
> > -	struct rw_semaphore	i_mmap_rwsem;
> 
> See d3b1a9a778e1 ("fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.")
Got it.
> 
> >  } __attribute__((aligned(sizeof(long)))) __randomize_layout;
> >  	/*
> >  	 * On most architectures that alignment is already the case; but
> > @@ -508,6 +537,133 @@ static inline bool mapping_tagged(const struct address_space *mapping, xa_mark_t
> >  	return xa_marked(&mapping->i_pages, tag);
> >  }
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +static inline int mapping_mapped(const struct address_space *mapping)
> > +{
> > +	return	atomic_read(&mapping->vma_count);
> 
> Now that I think of it, I don't think we need atomic_t, only unsigned long +
> READ_ONCE() suffices. Increments can race just fine, we don't expect any 
> consistency there - if you want consistency you probably hold the i_mmap lock.
> 
okay. I will check it.

> > +}
> > +
> > +static inline void inc_mapping_vma(struct address_space *mapping,
> > +				struct vm_area_struct *vma)
> > +{
> > +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> > +
> > +	atomic_inc(&tree->vma_count);
> > +	atomic_inc(&mapping->vma_count);
> > +}
> > +
> > +static inline void dec_mapping_vma(struct address_space *mapping,
> > +				struct vm_area_struct *vma)
> > +{
> > +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> > +
> > +	atomic_dec(&tree->vma_count);
> > +	atomic_dec(&mapping->vma_count);
> > +}
> 
> This probably shouldn't be in linux/fs.h.
> 
> > +
> > +static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mapping)
> > +{
> > +	return (struct rb_root_cached *)mapping->i_mmap;
> > +}
> > +
> > +static inline void i_mmap_tree_lock_write(struct address_space *mapping,
> > +					struct vm_area_struct *vma)
> > +{
> > +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> > +
> > +	down_write(&tree->rwsem);
> > +}
> > +
> > +static inline void i_mmap_tree_unlock_write(struct address_space *mapping,
> > +					struct vm_area_struct *vma)
> > +{
> > +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> > +
> > +	up_write(&tree->rwsem);
> > +}
> > +
> > +#define i_mmap_lock_write_prepare(mapping)
> > +#define i_mmap_unlock_write_complete(mapping)
> 
> It's unclear to me why you added write_prepare() and write_complete().
> 
> > +
> > +extern int split_tree_num;
> > +static inline void i_mmap_lock_write(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		down_write(&mapping->i_mmap[i]->rwsem);
> 
> Oof, this is an incredibly large hammer. This is basically why I think keeping
> i_mmap_rwsem (in a different form) is required. You do not want to take $nr_cpus
> locks (read _or_ write). For my design, I keep i_mmap_rwsem, but I invert its
> meaning - taking it in write = I'm reading from the tree; taking it in read =
> I'm writing to the tree. This provides some lighter-weight exclusion between
> rmap walks and rmap tree manipulation.
okay, it seem your method is better. I am waiting for your patch.

> 
> _Technically_, you shouldn't need to always take a lock when manipulating the
> tree. A pattern like mnt_hold_writers()/mnt_get_write_access() can probably
> work well. But it may be too complex ATM.
> 
> 
> Also, note that you pretty much do not want i_mmap_lock_write() users after
> the conversion is done.
> 
> > +}
> > +
> > +static inline int i_mmap_trylock_write(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		if (!down_write_trylock(&mapping->i_mmap[i]->rwsem)) {
> > +			while (i--)
> > +				up_write(&mapping->i_mmap[i]->rwsem);
> > +			return 0;
> > +		}
> > +	}
> > +	return 1;
> > +}
> > +
> > +static inline void i_mmap_unlock_write(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		up_write(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +static inline int i_mmap_trylock_read(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		if (!down_read_trylock(&mapping->i_mmap[i]->rwsem)) {
> > +			while (i--)
> > +				up_read(&mapping->i_mmap[i]->rwsem);
> > +			return 0;
> > +		}
> > +	}
> > +	return 1;
> > +}
> > +
> > +static inline void i_mmap_lock_read(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		down_read(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +static inline void i_mmap_unlock_read(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		up_read(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +static inline void i_mmap_assert_locked(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		lockdep_assert_held(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +static inline void i_mmap_assert_write_locked(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		lockdep_assert_held_write(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +#else
> > +
> >  static inline void i_mmap_lock_write(struct address_space *mapping)
> >  {
> >  	down_write(&mapping->i_mmap_rwsem);
> > @@ -561,6 +717,18 @@ static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mappi
> >  	return &mapping->i_mmap;
> >  }
> >  
> > +static inline void inc_mapping_vma(struct address_space *mapping,
> > +				struct vm_area_struct *vma) { }
> > +static inline void dec_mapping_vma(struct address_space *mapping,
> > +				struct vm_area_struct *vma) { }
> > +
> > +#define i_mmap_lock_write_prepare(mapping)	i_mmap_lock_write(mapping)
> > +#define i_mmap_unlock_write_complete(mapping)	i_mmap_unlock_write(mapping)
> > +#define i_mmap_tree_lock_write(mapping, vma)
> > +#define i_mmap_tree_unlock_write(mapping, vma)
> > +
> > +#endif
> > +
> >  /*
> >   * Might pages of this file have been modified in userspace?
> >   * Note that i_mmap_writable counts all VM_SHARED, VM_MAYWRITE vmas: do_mmap
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0a45c6a8b9f2..9aa8119fa9bf 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4041,11 +4041,91 @@ struct vm_area_struct *vma_interval_tree_iter_first(struct rb_root_cached *root,
> >  struct vm_area_struct *vma_interval_tree_iter_next(struct vm_area_struct *node,
> >  				unsigned long start, unsigned long last);
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +extern int split_tree_num;
> > +
> > +static inline int smallest_tree_idx(struct file *file)
> > +{
> > +	struct address_space *mapping = file->f_mapping;
> > +	int tmp = INT_MAX, count;
> > +	int i, j = 0;
> > +
> > +	/*
> > +	 * Since a not 100% accurate value is still okay,
> > +	 * we do not need any lock here.
> > +	 */
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		count = atomic_read(&mapping->i_mmap[i]->vma_count);
> > +		if (count < tmp) {
> > +			j = i;
> > +			tmp = count;
> > +			if (!tmp)
> > +				break;
> > +		}
> > +	}
> 
> Ohh, I see why you want the per-subtree vma_count now. But is this a net-win?
It keep the trees as even as possible.

> I think doing something like vma-pointer-hashing or just smp_processor_id()
> would work a-ok.
> 
> > +	return j;
> > +}
> > +
> > +static inline void vma_set_tree_idx(struct vm_area_struct *vma)
> > +{
> > +#ifdef CONFIG_NUMA
> > +	vma->tree_idx = numa_node_id();
> > +#else
> > +	vma->tree_idx = smallest_tree_idx(vma->vm_file);
> > +#endif
> > +}
> > +
> > +static inline struct rb_root_cached *get_rb_root(struct vm_area_struct *vma,
> > +					struct address_space *mapping)
> > +{
> > +	return &mapping->i_mmap[vma->tree_idx]->root;
> > +}
> > +
> > +/* Find the first valid VMA in the sibling trees */
> > +static inline struct vm_area_struct *first_vma(struct i_mmap_tree ***__r,
> > +				unsigned long start, unsigned long last)
> > +{
> > +	struct vm_area_struct *vma = NULL;
> > +	struct i_mmap_tree **tree = *__r;
> > +	struct rb_root_cached *root;
> > +
> > +	while (*tree) {
> > +		root = &(*tree)->root;
> > +		tree++;
> > +		vma = vma_interval_tree_iter_first(root, start, last);
> > +		if (vma)
> > +			break;
> > +	}
> > +
> > +	/* Save for the next loop */
> > +	*__r = tree;
> > +	return vma;
> > +}
> > +
> > +/*
> > + * Please use get_i_mmap_root() to get the @root.
> > + * @_tmp is referenced to avoid unused variable warning.
> > + */
> > +#define vma_interval_tree_foreach(vma, root, start, last)		\
> > +	for (struct i_mmap_tree **_r = (struct i_mmap_tree **)(root),	\
> > +		**_tmp = (vma = first_vma(&_r, start, last)) ? _r : NULL;\
> > +	     ((_tmp && vma) || (vma = first_vma(&_r, start, last)));	\
> > +		vma = vma_interval_tree_iter_next(vma, start, last))
> > +#else
> >  /* Please use get_i_mmap_root() to get the @root */
> >  #define vma_interval_tree_foreach(vma, root, start, last)		\
> >  	for (vma = vma_interval_tree_iter_first(root, start, last);	\
> >  	     vma; vma = vma_interval_tree_iter_next(vma, start, last))
> >  
> > +static inline void vma_set_tree_idx(struct vm_area_struct *vma) { }
> > +
> > +static inline struct rb_root_cached *get_rb_root(struct vm_area_struct *vma,
> > +					struct address_space *mapping)
> > +{
> > +	return &mapping->i_mmap;
> > +}
> > +#endif
> > +
> >  void anon_vma_interval_tree_insert(struct anon_vma_chain *node,
> >  				   struct rb_root_cached *root);
> >  void anon_vma_interval_tree_remove(struct anon_vma_chain *node,
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index a308e2c23b82..8d6aab3346ce 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -1072,6 +1072,9 @@ struct vm_area_struct {
> >  #ifdef __HAVE_PFNMAP_TRACKING
> >  	struct pfnmap_track_ctx *pfnmap_track_ctx;
> >  #endif
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +	int tree_idx;			/* The sibling tree index for the VMA */
> > +#endif
> 
> FTR the struct hole isn't here, but right after vm_lock_seq or vm_refcnt in
> most configs.
okay, thanks.
I did not notice the struct hole issue.
> 
> >  } __randomize_layout;
> >  
> >  /* Clears all bits in the VMA flags bitmap, non-atomically. */
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 5a2ddcf68e0b..2d35cacffd19 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1888,7 +1888,8 @@ static inline void maybe_rmap_unlock_action(struct vm_area_struct *vma,
> >  
> >  	VM_WARN_ON_ONCE(vma_is_anonymous(vma));
> >  	file = vma->vm_file;
> > -	i_mmap_unlock_write(file->f_mapping);
> > +	i_mmap_tree_unlock_write(file->f_mapping, vma);
> > +	i_mmap_unlock_write_complete(file->f_mapping);
> >  	action->hide_from_rmap_until_complete = false;
> >  }
> >  
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index d714fdb357e5..70036ec9dcaa 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1825,15 +1825,20 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> >  			struct address_space *mapping = file->f_mapping;
> >  
> >  			get_file(file);
> > -			i_mmap_lock_write(mapping);
> > +			i_mmap_lock_write_prepare(mapping);
> > +			i_mmap_tree_lock_write(mapping, mpnt);
> > +
> >  			if (vma_is_shared_maywrite(tmp))
> >  				mapping_allow_writable(mapping);
> >  			flush_dcache_mmap_lock(mapping);
> >  			/* insert tmp into the share list, just after mpnt */
> >  			vma_interval_tree_insert_after(tmp, mpnt,
> > -					get_i_mmap_root(mapping));
> > +					get_rb_root(mpnt, mapping));
> > +			inc_mapping_vma(mapping, tmp);
> 
> Honestly, would prefer to hide all of these details from mmap.
yes, we can. 

But we need to change the functions in mm/interval_tree.c

> 
> >  			flush_dcache_mmap_unlock(mapping);
> > -			i_mmap_unlock_write(mapping);
> > +
> > +			i_mmap_tree_unlock_write(mapping, mpnt);
> > +			i_mmap_unlock_write_complete(mapping);
> >  		}
> >  
> >  		if (!(tmp->vm_flags & VM_WIPEONFORK))
> > diff --git a/mm/nommu.c b/mm/nommu.c
> > index 0f18ffc658e9..1f2c60a220f6 100644
> > --- a/mm/nommu.c
> > +++ b/mm/nommu.c
> > @@ -567,11 +567,16 @@ static void setup_vma_to_mm(struct vm_area_struct *vma, struct mm_struct *mm)
> >  	if (vma->vm_file) {
> >  		struct address_space *mapping = vma->vm_file->f_mapping;
> >  
> > -		i_mmap_lock_write(mapping);
> > +		i_mmap_lock_write_prepare(mapping);
> > +		i_mmap_tree_lock_write(mapping, vma);
> > +
> >  		flush_dcache_mmap_lock(mapping);
> > -		vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
> > +		vma_interval_tree_insert(vma, get_rb_root(vma, mapping));
> > +		inc_mapping_vma(mapping, vma);
> >  		flush_dcache_mmap_unlock(mapping);
> > -		i_mmap_unlock_write(mapping);
> > +
> > +		i_mmap_tree_unlock_write(mapping, vma);
> > +		i_mmap_unlock_write_complete(mapping);
> >  	}
> >  }
> >  
> > @@ -583,11 +588,16 @@ static void cleanup_vma_from_mm(struct vm_area_struct *vma)
> >  		struct address_space *mapping;
> >  		mapping = vma->vm_file->f_mapping;
> >  
> > -		i_mmap_lock_write(mapping);
> > +		i_mmap_lock_write_prepare(mapping);
> > +		i_mmap_tree_lock_write(mapping, vma);
> > +
> >  		flush_dcache_mmap_lock(mapping);
> > -		vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
> > +		vma_interval_tree_remove(vma, get_rb_root(vma, mapping));
> > +		dec_mapping_vma(mapping, vma);
> >  		flush_dcache_mmap_unlock(mapping);
> > -		i_mmap_unlock_write(mapping);
> > +
> > +		i_mmap_tree_unlock_write(mapping, vma);
> > +		i_mmap_unlock_write_complete(mapping);
> >  	}
> >  }
> >  
> > @@ -1063,6 +1073,7 @@ unsigned long do_mmap(struct file *file,
> >  	if (file) {
> >  		region->vm_file = get_file(file);
> >  		vma->vm_file = get_file(file);
> > +		vma_set_tree_idx(vma);
> 
> This is unrelated, shouldn't be done here.
> 
> >  	}
> >  
> >  	down_write(&nommu_region_sem);
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index 8df1b5077951..d5745519d95a 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -809,7 +809,7 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
> >  	if (!check_ops_safe(ops))
> >  		return -EINVAL;
> >  
> > -	lockdep_assert_held(&mapping->i_mmap_rwsem);
> > +	i_mmap_assert_locked(mapping);
> 
> This kind of conversion should be done in a separate step.
> 
> >  	vma_interval_tree_foreach(vma, get_i_mmap_root(mapping), first_index,
> >  				  first_index + nr - 1) {
> >  		/* Clip to the vma */
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 6159650c1b42..2055758064a9 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -234,22 +234,23 @@ static void __vma_link_file(struct vm_area_struct *vma,
> >  		mapping_allow_writable(mapping);
> >  
> >  	flush_dcache_mmap_lock(mapping);
> > -	vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
> > +	vma_interval_tree_insert(vma, get_rb_root(vma, mapping));
> > +	inc_mapping_vma(mapping, vma);
> 
> inc_mapping_vma() should probably be done implicitly by insertion?
Yes, we can. 
It is more grace to hide it in vma_interval_tree_insert.

> 
> >  	flush_dcache_mmap_unlock(mapping);
> >  }
> >  
> > -/*
> > - * Requires inode->i_mapping->i_mmap_rwsem
> > - */
> >  static void __remove_shared_vm_struct(struct vm_area_struct *vma,
> >  				      struct address_space *mapping)
> >  {
> > +	i_mmap_tree_lock_write(mapping, vma);
> >  	if (vma_is_shared_maywrite(vma))
> >  		mapping_unmap_writable(mapping);
> >  
> >  	flush_dcache_mmap_lock(mapping);
> > -	vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
> > +	vma_interval_tree_remove(vma, get_rb_root(vma, mapping));
> > +	dec_mapping_vma(mapping, vma);
> >  	flush_dcache_mmap_unlock(mapping);
> > +	i_mmap_tree_unlock_write(mapping, vma);
> >  }
> >  
> >  /*
> > @@ -297,8 +298,9 @@ static void vma_prepare(struct vma_prepare *vp)
> >  			uprobe_munmap(vp->adj_next, vp->adj_next->vm_start,
> >  				      vp->adj_next->vm_end);
> >  
> > -		i_mmap_lock_write(vp->mapping);
> > +		i_mmap_lock_write_prepare(vp->mapping);
> >  		if (vp->insert && vp->insert->vm_file) {
> > +			i_mmap_tree_lock_write(vp->mapping, vp->insert);
> >  			/*
> >  			 * Put into interval tree now, so instantiated pages
> >  			 * are visible to arm/parisc __flush_dcache_page
> > @@ -307,6 +309,7 @@ static void vma_prepare(struct vma_prepare *vp)
> >  			 */
> >  			__vma_link_file(vp->insert,
> >  					vp->insert->vm_file->f_mapping);
> > +			i_mmap_tree_unlock_write(vp->mapping, vp->insert);
> >  		}
> >  	}
> >  
> > @@ -318,12 +321,17 @@ static void vma_prepare(struct vma_prepare *vp)
> >  	}
> >  
> >  	if (vp->file) {
> > +		i_mmap_tree_lock_write(vp->mapping, vp->vma);
> >  		flush_dcache_mmap_lock(vp->mapping);
> >  		vma_interval_tree_remove(vp->vma,
> > -					get_i_mmap_root(vp->mapping));
> > -		if (vp->adj_next)
> > +					get_rb_root(vp->vma, vp->mapping));
> > +		dec_mapping_vma(vp->mapping, vp->vma);
> > +		if (vp->adj_next) {
> > +			i_mmap_tree_lock_write(vp->mapping, vp->adj_next);
> >  			vma_interval_tree_remove(vp->adj_next,
> > -					get_i_mmap_root(vp->mapping));
> > +					get_rb_root(vp->adj_next, vp->mapping));
> > +			dec_mapping_vma(vp->mapping, vp->adj_next);
> > +		}
> >  	}
> >  
> >  }
> > @@ -340,12 +348,17 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> >  			 struct mm_struct *mm)
> >  {
> >  	if (vp->file) {
> > -		if (vp->adj_next)
> > +		if (vp->adj_next) {
> >  			vma_interval_tree_insert(vp->adj_next,
> > -					get_i_mmap_root(vp->mapping));
> > +					get_rb_root(vp->adj_next, vp->mapping));
> > +			inc_mapping_vma(vp->mapping, vp->adj_next);
> > +			i_mmap_tree_unlock_write(vp->mapping, vp->adj_next);
> > +		}
> >  		vma_interval_tree_insert(vp->vma,
> > -					get_i_mmap_root(vp->mapping));
> > +					get_rb_root(vp->vma, vp->mapping));
> > +		inc_mapping_vma(vp->mapping, vp->vma);
> >  		flush_dcache_mmap_unlock(vp->mapping);
> > +		i_mmap_tree_unlock_write(vp->mapping, vp->vma);
> >  	}
> >  
> >  	if (vp->remove && vp->file) {
> > @@ -370,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> >  	}
> >  
> >  	if (vp->file) {
> > -		i_mmap_unlock_write(vp->mapping);
> > +		i_mmap_unlock_write_complete(vp->mapping);
> >  
> >  		if (!vp->skip_vma_uprobe) {
> >  			uprobe_mmap(vp->vma);
> > @@ -1799,12 +1812,12 @@ static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb)
> >  	int i;
> >  
> >  	mapping = vb->vmas[0]->vm_file->f_mapping;
> > -	i_mmap_lock_write(mapping);
> > +	i_mmap_lock_write_prepare(mapping);
> >  	for (i = 0; i < vb->count; i++) {
> >  		VM_WARN_ON_ONCE(vb->vmas[i]->vm_file->f_mapping != mapping);
> >  		__remove_shared_vm_struct(vb->vmas[i], mapping);
> >  	}
> > -	i_mmap_unlock_write(mapping);
> > +	i_mmap_unlock_write_complete(mapping);
> >  
> >  	unlink_file_vma_batch_init(vb);
> >  }
> > @@ -1836,10 +1849,13 @@ static void vma_link_file(struct vm_area_struct *vma, bool hold_rmap_lock)
> >  
> >  	if (file) {
> >  		mapping = file->f_mapping;
> > -		i_mmap_lock_write(mapping);
> > +		i_mmap_lock_write_prepare(mapping);
> > +		i_mmap_tree_lock_write(mapping, vma);
> >  		__vma_link_file(vma, mapping);
> > -		if (!hold_rmap_lock)
> > -			i_mmap_unlock_write(mapping);
> > +		if (!hold_rmap_lock) {
> > +			i_mmap_tree_unlock_write(mapping, vma);
> > +			i_mmap_unlock_write_complete(mapping);
> > +		}
> >  	}
> >  }
> >  
> > @@ -2164,6 +2180,23 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma)
> >  	}
> >  }
> 
> I can but hope that all of the above is quite simplified before we get to the
> "making file rmap more complicated" bit.
:(
If we can do not care about the ARM device, we can make it simple.

Thanks
Huang Shijie


^ permalink raw reply

* Re: [PATCH v2 1/4] mm: use mapping_mapped to simplify the code
From: Huang Shijie @ 2026-06-12  6:03 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, viro, brauner, jack, muchun.song, osalvador, david, surenb,
	mjguzik, liam, vbabka, shakeel.butt, rppt, mhocko, corbet, skhan,
	linux, dinguyen, schuster.simon, James.Bottomley, deller, djbw,
	willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei
In-Reply-To: <airZn524Ip8VsWra@lucifer>

Hi Lorenzo & Pedro,
On Thu, Jun 11, 2026 at 04:52:54PM +0100, Lorenzo Stoakes wrote:
> On Thu, Jun 11, 2026 at 02:18:57PM +0800, Huang Shijie wrote:
> > Use mapping_mapped() to simplify the code, make
> > the code tidy and clean.
> >
> > Signed-off-by: Huang Shijie <huangsj@hygon.cn>
> 
> Yeah as Pedro said this one could just be sent separately, and I in fact
> suggest you do that :) So:
> 
Thank you Pedro and Lorenzo.
I can send a separate patch later.

Thanks
Huang Shijie


^ permalink raw reply

* Re: [PATCH v6 01/12] PCI: liveupdate: Set up FLB handler for the PCI core
From: Pasha Tatashin @ 2026-06-12  5:15 UTC (permalink / raw)
  To: David Matlack
  Cc: kexec, linux-doc, linux-kernel, linux-mm, linux-pci,
	Adithya Jayachandran, Alexander Graf, Alex Williamson,
	Bjorn Helgaas, Chris Li, David Rientjes, Jacob Pan,
	Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
	Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
	Pranjal Shrivastava, Pratyush Yadav, Saeed Mahameed,
	Samiullah Khawaja, Shuah Khan, Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <20260522202410.3104264-2-dmatlack@google.com>

On Fri, 22 May 2026 20:23:59 +0000, David Matlack <dmatlack@google.com> wrote:
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2fb1c75afd16..6c618830cf61 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -20530,6 +20530,16 @@ L:	linux-pci@vger.kernel.org
>  S:	Supported
>  F:	Documentation/PCI/pci-error-recovery.rst
>  
> +PCI LIVE UPDATE
> +M:	David Matlack <dmatlack@google.com>

Please add Pratyush, Mike, and myself so we are notified directly of 
incoming patches, the same as with other areas where the liveupdate/ 
tree is specified.

>
> diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
> new file mode 100644
> index 000000000000..737e7b9366db
> --- /dev/null
> +++ b/drivers/pci/liveupdate.c
> @@ -0,0 +1,145 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright (c) 2026, Google LLC.
> + * David Matlack <dmatlack@google.com>
> + */
> +
> +/**
> + * DOC: PCI Live Update
> + *
> + * The PCI subsystem participates in the Live Update process to enable drivers
> + * to preserve their PCI devices across kexec.
> + *
> + * File-Lifecycle-Bound (FLB) Data
> + * ===============================

...

> + *
> + * PCI device preservation across Live Update is built on top of the Live Update
> + * Orchestrator's (LUO) support for file preservation across kexec. Drivers

I prefer to just use acronyms FLB, and LUO, but have links to the actual 
documentations about them.

So, something like this:

  * :ref:`FLB <flb>` Data
  * =====================
  *
  * PCI device preservation across Live Update is built on top of the
  * :ref:`LUO <luo>` support for file preservation across kexec. Drivers

And also add _luo and _flb to Documentation/core-api/liveupdate.rst

.. _luo:

 ========================
 Live Update Orchestrator
 ========================

.. _flb:

 LUO File Lifecycle Bound Global Data
 ====================================

> [ ... skip 17 lines ... ]
> + *
> + *  * ``pci_liveupdate_register_flb(driver_file_handler)``
> + *  * ``pci_liveupdate_unregister_flb(driver_file_handler)``
> + */
> +
> +#define pr_fmt(fmt) "PCI: liveupdate: " fmt

Nit, may be:

> +
> +#include <linux/io.h>
> +#include <linux/kexec_handover.h>
> +#include <linux/kho/abi/pci.h>
> +#include <linux/liveupdate.h>
> +#include <linux/mutex.h>
> +#include <linux/mm.h>

Please sort alphabetically.

> [ ... skip 12 lines ... ]
> +	 * future to increase the chances that there is enough room to preserve
> +	 * devices that are not yet present on the system (e.g. VFs, hot-plugged
> +	 * devices).
> +	 */
> +	for_each_pci_dev(dev)
> +		max_nr_devices++;

I think, we want to use kho_block [1] (it is in liveupdate/next branch) 
to allow number of supported devices to be dynamic.

To support this, we would redefine the ABI and tracking structures like 
so:

/* include/linux/kho/abi/pci.h */
struct pci_ser {
	u64 devices;      /* Phys address of the first block header of kho_block_set */
	u64 nr_devices;   /* Total count of active preserved devices */
} __packed;

/* drivers/pci/liveupdate.c */
struct pci_flb_outgoing {
	struct pci_ser *ser;            /* Points to the FDT/KHO-allocated ABI struct */
	struct kho_block_set block_set;  /* Controls the active blocks on the fly */
};

In  __pci_liveupdate_preserve_device() , we would search for 
and reuse any inactive  pci_dev_ser  slot first, and only call 
kho_block_set_grow() to expand if no inactive slots are available.

In pci_liveupdate_unpreserve_device(), we would simply 
mark the  pci_dev_ser as inactive.

>
> diff --git a/include/linux/pci_liveupdate.h b/include/linux/pci_liveupdate.h
> new file mode 100644
> index 000000000000..8ec98beefcb4
> --- /dev/null
> +++ b/include/linux/pci_liveupdate.h
> @@ -0,0 +1,30 @@
> [ ... skip 24 lines ... ]
> +static inline void pci_liveupdate_unregister_flb(struct liveupdate_file_handler *fh)
> +{
> +}
> +#endif
> +
> +#endif /* LINUX_PCI_LIVEUPDATE_H */

[1] https://lore.kernel.org/all/20260603154402.468928-1-pasha.tatashin@soleen.com/

Preserving: In  __pci_liveupdate_preserve_device() , we would search for 
Unpreserving: In  pci_liveupdate_unpreserve_device(), we would simply

Preserving: In  __pci_liveupdate_preserve_device() , we would search for 
Unpreserving: In  pci_liveupdate_unpreserve_device(), we would simply 

-- 
Pasha Tatashin <pasha.tatashin@soleen.com>

^ permalink raw reply

* [RFC PATCH 2/2] kasan: hw_tags: Add boot option to elide free time poisoning
From: Dev Jain @ 2026-06-12  4:44 UTC (permalink / raw)
  To: ryabinin.a.a, akpm, corbet
  Cc: Dev Jain, glider, andreyknvl, dvyukov, vincenzo.frascino,
	kasan-dev, linux-mm, linux-kernel, skhan, workflows, linux-doc,
	linux-arm-kernel, ryan.roberts, anshuman.khandual, kaleshsingh,
	21cnbao, david, will, catalin.marinas
In-Reply-To: <20260612044425.763060-1-dev.jain@arm.com>

Introduce a boot option to tag only at allocation time of the objects. This
reduces KASAN MTE overhead, the tradeoff being reduced ability
of catching bugs.

Now, when a memory object will be freed, it will retain the random tag it
had at allocation time. This compromises on catching UAF bugs, till the
time the object is not reallocated.

Hence, not catching "use-after-free-before-reallocation" and not catching
"double-free" will be the compromise for reduced KASAN overhead.

Keep this as a boot time feature to prevent building two kernel images.

To implement the feature, we need to effectively render kasan_poison()
redundant for hw tags case, but keep it working in the case where it is
used not in an object-freeing code path, but the redzoning path (which
means, poisoning the tail end of a vmalloc or kmalloc allocation).

We achieve this by overloading the poison values for the hw tags case: we
define the four poison values as 0x0E, 0x1E, 0x2E, 0x3E. In kasan_poison(),
if we arrive with KASAN_SLAB_REDZONE or KASAN_PAGE_REDZONE, do a bitwise
OR on the value of the tag to make it equal to KASAN_TAG_INVALID.

If not, then, if init is true, zero out the memory and bail out.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 Documentation/dev-tools/kasan.rst |  4 +++
 mm/kasan/hw_tags.c                | 43 ++++++++++++++++++++++++++++++-
 mm/kasan/kasan.h                  | 23 ++++++++++++++++-
 3 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst
index 4968b2aa60c80..b0c30584b5062 100644
--- a/Documentation/dev-tools/kasan.rst
+++ b/Documentation/dev-tools/kasan.rst
@@ -146,6 +146,10 @@ disabling KASAN altogether or controlling its features:
 - ``kasan.vmalloc=off`` or ``=on`` disables or enables tagging of vmalloc
   allocations (default: ``on``).
 
+- ``kasan.tag_only_on_alloc=off`` or ``=on`` disables or enables skipping
+  free-time tagging (poisoning) while keeping allocation-time tagging enabled
+  (default: ``off``).
+
 - ``kasan.page_alloc.sample=<sampling interval>`` makes KASAN tag only every
   Nth page_alloc allocation with the order equal or greater than
   ``kasan.page_alloc.sample.order``, where N is the value of the ``sample``
diff --git a/mm/kasan/hw_tags.c b/mm/kasan/hw_tags.c
index c1a2b48808ed7..a392e34d11e3a 100644
--- a/mm/kasan/hw_tags.c
+++ b/mm/kasan/hw_tags.c
@@ -41,9 +41,16 @@ enum kasan_arg_vmalloc {
 	KASAN_ARG_VMALLOC_ON,
 };
 
+enum kasan_arg_tag_only_on_alloc {
+	KASAN_ARG_TAG_ONLY_ON_ALLOC_DEFAULT,
+	KASAN_ARG_TAG_ONLY_ON_ALLOC_OFF,
+	KASAN_ARG_TAG_ONLY_ON_ALLOC_ON,
+};
+
 static enum kasan_arg kasan_arg __ro_after_init;
 static enum kasan_arg_mode kasan_arg_mode __ro_after_init;
 static enum kasan_arg_vmalloc kasan_arg_vmalloc __initdata;
+static enum kasan_arg_tag_only_on_alloc kasan_arg_tag_only_on_alloc __initdata;
 
 /*
  * Whether the selected mode is synchronous, asynchronous, or asymmetric.
@@ -63,6 +70,10 @@ EXPORT_SYMBOL_GPL(kasan_flag_vmalloc);
 /* Whether to check write accesses only. */
 static bool kasan_flag_write_only = false;
 
+/* Whether to skip free-time tagging. */
+DEFINE_STATIC_KEY_FALSE(kasan_flag_tag_only_on_alloc);
+EXPORT_SYMBOL_GPL(kasan_flag_tag_only_on_alloc);
+
 #define PAGE_ALLOC_SAMPLE_DEFAULT	1
 #define PAGE_ALLOC_SAMPLE_ORDER_DEFAULT	3
 
@@ -154,6 +165,23 @@ static int __init early_kasan_flag_write_only(char *arg)
 }
 early_param("kasan.write_only", early_kasan_flag_write_only);
 
+/* kasan.tag_only_on_alloc=off/on */
+static int __init early_kasan_flag_tag_only_on_alloc(char *arg)
+{
+	if (!arg)
+		return -EINVAL;
+
+	if (!strcmp(arg, "off"))
+		kasan_arg_tag_only_on_alloc = KASAN_ARG_TAG_ONLY_ON_ALLOC_OFF;
+	else if (!strcmp(arg, "on"))
+		kasan_arg_tag_only_on_alloc = KASAN_ARG_TAG_ONLY_ON_ALLOC_ON;
+	else
+		return -EINVAL;
+
+	return 0;
+}
+early_param("kasan.tag_only_on_alloc", early_kasan_flag_tag_only_on_alloc);
+
 static inline const char *kasan_mode_info(void)
 {
 	if (kasan_mode == KASAN_MODE_ASYNC)
@@ -270,14 +298,27 @@ void __init kasan_init_hw_tags(void)
 		break;
 	}
 
+	switch (kasan_arg_tag_only_on_alloc) {
+	case KASAN_ARG_TAG_ONLY_ON_ALLOC_DEFAULT:
+		/* Default is specified by kasan_flag_tag_only_on_alloc. */
+		break;
+	case KASAN_ARG_TAG_ONLY_ON_ALLOC_OFF:
+		static_branch_disable(&kasan_flag_tag_only_on_alloc);
+		break;
+	case KASAN_ARG_TAG_ONLY_ON_ALLOC_ON:
+		static_branch_enable(&kasan_flag_tag_only_on_alloc);
+		break;
+	}
+
 	kasan_init_tags();
 
 	/* KASAN is now initialized, enable it. */
 	kasan_enable();
 
-	pr_info("KernelAddressSanitizer initialized (hw-tags, mode=%s, vmalloc=%s, stacktrace=%s, write_only=%s)\n",
+	pr_info("KernelAddressSanitizer initialized (hw-tags, mode=%s, vmalloc=%s, tag_only_on_alloc=%s, stacktrace=%s, write_only=%s)\n",
 		kasan_mode_info(),
 		str_on_off(kasan_vmalloc_enabled()),
+		str_on_off(kasan_tag_only_on_alloc_enabled()),
 		str_on_off(kasan_stack_collection_enabled()),
 		str_on_off(kasan_flag_write_only));
 }
diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
index fc9169a547662..4fa8abb312faa 100644
--- a/mm/kasan/kasan.h
+++ b/mm/kasan/kasan.h
@@ -33,6 +33,7 @@ static inline bool kasan_stack_collection_enabled(void)
 #include "../slab.h"
 
 DECLARE_STATIC_KEY_TRUE(kasan_flag_vmalloc);
+DECLARE_STATIC_KEY_FALSE(kasan_flag_tag_only_on_alloc);
 
 enum kasan_mode {
 	KASAN_MODE_SYNC,
@@ -52,6 +53,11 @@ static inline bool kasan_vmalloc_enabled(void)
 	return static_branch_likely(&kasan_flag_vmalloc);
 }
 
+static inline bool kasan_tag_only_on_alloc_enabled(void)
+{
+	return static_branch_unlikely(&kasan_flag_tag_only_on_alloc);
+}
+
 static inline bool kasan_async_fault_possible(void)
 {
 	return kasan_mode == KASAN_MODE_ASYNC || kasan_mode == KASAN_MODE_ASYMM;
@@ -145,12 +151,17 @@ static inline bool kasan_requires_meta(void)
 #define KASAN_SLAB_REDZONE	0xFC  /* redzone for slab object */
 #define KASAN_SLAB_FREE		0xFB  /* freed slab object */
 #define KASAN_VMALLOC_INVALID	0xF8  /* inaccessible space in vmap area */
+#elif defined(CONFIG_KASAN_HW_TAGS)
+#define KASAN_PAGE_FREE		0x0E
+#define KASAN_PAGE_REDZONE	0x1E
+#define KASAN_SLAB_REDZONE	0x2E
+#define KASAN_SLAB_FREE		0x3E
 #else
 #define KASAN_PAGE_FREE		KASAN_TAG_INVALID
 #define KASAN_PAGE_REDZONE	KASAN_TAG_INVALID
 #define KASAN_SLAB_REDZONE	KASAN_TAG_INVALID
 #define KASAN_SLAB_FREE		KASAN_TAG_INVALID
-#define KASAN_VMALLOC_INVALID	KASAN_TAG_INVALID /* only used for SW_TAGS */
+#define KASAN_VMALLOC_INVALID	KASAN_TAG_INVALID
 #endif
 
 #ifdef CONFIG_KASAN_GENERIC
@@ -478,6 +489,16 @@ static inline u8 kasan_random_tag(void) { return 0; }
 
 static inline void kasan_poison(const void *addr, size_t size, u8 value, bool init)
 {
+	if (kasan_tag_only_on_alloc_enabled()) {
+		if ((value != KASAN_SLAB_REDZONE) && (value != KASAN_PAGE_REDZONE)) {
+			if (init)
+				memset((void *)kasan_reset_tag(addr), 0, size);
+			return;
+		}
+	}
+
+	value |= 0xF0;
+
 	if (WARN_ON((unsigned long)addr & KASAN_GRANULE_MASK))
 		return;
 	if (WARN_ON(size & KASAN_GRANULE_MASK))
-- 
2.43.0


^ permalink raw reply related

* [RFC PATCH 1/2] kasan: hw_tags: Use KASAN_PAGE_REDZONE for vmalloc redzoning
From: Dev Jain @ 2026-06-12  4:44 UTC (permalink / raw)
  To: ryabinin.a.a, akpm, corbet
  Cc: Dev Jain, glider, andreyknvl, dvyukov, vincenzo.frascino,
	kasan-dev, linux-mm, linux-kernel, skhan, workflows, linux-doc,
	linux-arm-kernel, ryan.roberts, anshuman.khandual, kaleshsingh,
	21cnbao, david, will, catalin.marinas
In-Reply-To: <20260612044425.763060-1-dev.jain@arm.com>

In preparation for adding "tag only on alloc" boot time option, use
KASAN_PAGE_REDZONE instead of KASAN_TAG_INVALID for poisoning the tail end
of the vmalloc allocation.

Although both values are the same for hw tags, KASAN_SLAB_REDZONE is used
for poisoning the tail end of a kmalloc object allocation, so maintain
the pattern.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/kasan/hw_tags.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/kasan/hw_tags.c b/mm/kasan/hw_tags.c
index cbef5e450954e..c1a2b48808ed7 100644
--- a/mm/kasan/hw_tags.c
+++ b/mm/kasan/hw_tags.c
@@ -375,7 +375,7 @@ void *__kasan_unpoison_vmalloc(const void *start, unsigned long size,
 	redzone_start = round_up((unsigned long)start + size,
 				 KASAN_GRANULE_SIZE);
 	redzone_size = round_up(redzone_start, PAGE_SIZE) - redzone_start;
-	kasan_poison((void *)redzone_start, redzone_size, KASAN_TAG_INVALID,
+	kasan_poison((void *)redzone_start, redzone_size, KASAN_PAGE_REDZONE,
 		     flags & KASAN_VMALLOC_INIT);
 
 	/*
-- 
2.43.0


^ permalink raw reply related

* [RFC PATCH 0/2] kasan: hw_tags: Add option to tag only at allocation time
From: Dev Jain @ 2026-06-12  4:44 UTC (permalink / raw)
  To: ryabinin.a.a, akpm, corbet
  Cc: Dev Jain, glider, andreyknvl, dvyukov, vincenzo.frascino,
	kasan-dev, linux-mm, linux-kernel, skhan, workflows, linux-doc,
	linux-arm-kernel, ryan.roberts, anshuman.khandual, kaleshsingh,
	21cnbao, david, will, catalin.marinas

Introduce a boot option to tag only at allocation time of the objects. This
reduces KASAN MTE overhead, the tradeoff being reduced ability of
catching bugs.

Now, when a memory object will be freed, it will retain the random tag it
had at allocation time. This compromises on catching UAF bugs, till the
time the object is not reallocated, at which point it will have a new
random tag.

Hence, not catching "use-after-free-before-reallocation" and not catching
"double-free" will be the compromise for reduced KASAN overhead.

This is an RFC because we are not clear about the performance benefit.

Android folks, please help with testing!

---
Applies on Linus master (9716c086c8e8).

Dev Jain (2):
  kasan: hw_tags: Use KASAN_PAGE_REDZONE for vmalloc redzoning
  kasan: hw_tags: Add boot option to elide free time poisoning

 Documentation/dev-tools/kasan.rst |  4 +++
 mm/kasan/hw_tags.c                | 45 +++++++++++++++++++++++++++++--
 mm/kasan/kasan.h                  | 23 +++++++++++++++-
 3 files changed, 69 insertions(+), 3 deletions(-)

-- 
2.43.0


^ permalink raw reply

* Re: [PATCH v2 3/7] seg6: add End.M.GTP6.E behavior
From: Yuya Kusakabe @ 2026-06-12  3:14 UTC (permalink / raw)
  To: andrea
  Cc: Yuya Kusakabe, andrea.mayer, davem, edumazet, dsahern, kuba,
	pabeni, horms, justin.iurman, shuah, corbet, skhan, linux-kernel,
	netdev, linux-kselftest, linux-doc, stefano.salsano, ahabdels
In-Reply-To: <20260605032001.2f46e6a55f69896d29da69df@common-net.org>

From: Yuya Kusakabe <yuya.kusakabe@gmail.com>

Hi Andrea,

Thank you for the review. The points shared with patch 2 (NF_HOOK
split removal, drop reasons via your prep series, reverse christmas
tree, the missing frag_off check, BAD_INNER scoping, the repeated
size-selection ternary, iptunnel_handle_offloads(), the fixed source
port, and the RFC 6040 wording) will be addressed as described in my
patch 2 reply and apply here the same way. Below are the
End.M.GTP6.E-specific points.

> SEG6_LOCAL_MOBILE_SRC_ADDR (the "src" attribute) is copied verbatim into
> the outer IPv6 source address. In patch 2 (End.M.GTP4.E) the same
> attribute is used as a template from which bits are extracted to form
> the IPv4 source address, and may be entirely unused depending on
> v4_mask_len.
> This UAPI overload needs revision.

Agreed. With v4_mask_len gone, End.M.GTP4.E will not take src at all
(the IPv4 SA will be recovered purely from the inbound IPv6 SA, see
the patch 2 reply), which removes the verbatim-vs-template overload.
In the new SEG6_MOBILE_* namespace I plan to give SEG6_MOBILE_SRC_ADDR
a single meaning for the IPv6-emitting behaviors
(End.M.GTP6.E/D/D.Di): the outer IPv6 source address, used verbatim.
The one remaining non-verbatim consumer would be H.M.GTP4.D, where the
configured address acts as the RFC 9433 Figure 12 "Source UPF Prefix"
template with exactly the 32 IPv4 SA bits overlaid at
v6_src_prefix_len. H.M.GTP4.D posts last in the per-behavior order, so
if you prefer the two semantics not to share one attribute name, I can
give the template a distinctly named attribute in that series.

> udp6_set_csum() already handles the CHECKSUM_PARTIAL + pseudo-header seed
> setup and also covers the GSO case. Using it would avoid open-coding this
> sequence.

Will switch to udp6_set_csum(), thanks. It is also more correct than
the open-coded sequence: for a non-GSO inner that arrives
CHECKSUM_PARTIAL it resolves the inner checksum via local checksum
offload instead of clobbering csum_start.

> seg6_lookup_any_nexthop() already calls skb_dst_drop() internally. The
> explicit call above is redundant.

Will remove.

> Nit: fc_dst_len is int in struct fib6_config (IPv6 prefix length, range
> 0..128); the (unsigned int) cast is not needed.

This check will move into the attribute parser of the new explicit
locator-length attribute (see the patch 2 reply), so the fib6_config
peek and the cast both go away.

Thanks,
Yuya

^ permalink raw reply

* [PATCH v4 3/3] hwmon: Add documentation for SQ24860
From: Ziming Zhu @ 2026-06-12  3:03 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet,
	Shuah Khan, linux-hwmon, devicetree, linux-kernel, linux-doc,
	Ziming Zhu
In-Reply-To: <20260612030304.5165-1-zmzhu0630@163.com>

From: Ziming Zhu <ziming.zhu@silergycorp.com>

Document the supported sysfs attributes for the Silergy SQ24860 PMBus
hwmon driver.

Signed-off-by: Ziming Zhu <ziming.zhu@silergycorp.com>
---
 Documentation/hwmon/index.rst   |  1 +
 Documentation/hwmon/sq24860.rst | 96 +++++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+)
 create mode 100644 Documentation/hwmon/sq24860.rst

diff --git a/Documentation/hwmon/index.rst b/Documentation/hwmon/index.rst
index 8b655e5d6b68..6184b88e2095 100644
--- a/Documentation/hwmon/index.rst
+++ b/Documentation/hwmon/index.rst
@@ -243,6 +243,7 @@ Hardware Monitoring Kernel Drivers
    smsc47m1
    sparx5-temp
    spd5118
+   sq24860
    stpddc60
    surface_fan
    sy7636a-hwmon
diff --git a/Documentation/hwmon/sq24860.rst b/Documentation/hwmon/sq24860.rst
new file mode 100644
index 000000000000..f0182b955d8a
--- /dev/null
+++ b/Documentation/hwmon/sq24860.rst
@@ -0,0 +1,96 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Kernel driver sq24860
+=====================
+
+Supported chips:
+
+  * Silergy SQ24860
+
+    Prefix: 'sq24860'
+
+Author:
+
+	Ziming Zhu <ziming.zhu@silergycorp.com>
+
+Description
+------------
+
+This driver implements support for the Silergy SQ24860 eFuse. The device is an
+integrated circuit protection and power management device with a PMBus
+interface.
+
+The device supports direct format for reading input voltage, output voltage,
+auxiliary voltage, input current, input power, and temperature.
+
+The current and power measurement scale depends on the resistor connected
+between the IMON pin and ground. The resistor value can be configured with the
+``silergy,rimon-micro-ohms`` device tree property. See
+``Documentation/devicetree/bindings/hwmon/pmbus/silergy,sq24860.yaml`` for details.
+
+Due to the specificities of the chip, all history reset attributes are tied
+together. Resetting the history of one sensor resets the history of all sensors.
+
+Sysfs entries
+-------------
+
+The following attributes are supported. Limits are read-write; all other
+attributes are read-only.
+
+======================= ======================================================
+in1_label               "vin"
+in1_input               Measured input voltage.
+in1_average             Average measured input voltage.
+in1_min                 Minimum input voltage limit.
+in1_lcrit               Critical low input voltage limit.
+in1_max                 Maximum input voltage limit.
+in1_crit                Critical high input voltage limit.
+in1_min_alarm           Input voltage low warning alarm.
+in1_lcrit_alarm         Input voltage low fault alarm.
+in1_max_alarm           Input voltage high warning alarm.
+in1_crit_alarm          Input voltage high fault alarm.
+in1_highest             Historical maximum input voltage.
+in1_lowest              Historical minimum input voltage.
+in1_reset_history       Write any value to reset history.
+
+in2_label               "vmon"
+in2_input               Measured auxiliary input voltage.
+
+in3_label               "vout1"
+in3_input               Measured output voltage.
+in3_average             Average measured output voltage.
+in3_min                 Minimum output voltage limit.
+in3_min_alarm           Output voltage low alarm.
+in3_lowest              Historical minimum output voltage.
+in3_reset_history       Write any value to reset history.
+
+curr1_label             "iin"
+curr1_input             Measured input current.
+curr1_average           Average measured input current.
+curr1_max               Maximum input current warning limit.
+curr1_crit              Critical input over-current fault limit.
+curr1_max_alarm         Input current warning alarm.
+curr1_crit_alarm        Input over-current fault alarm.
+curr1_highest           Historical maximum input current.
+curr1_reset_history     Write any value to reset history.
+
+power1_label            "pin"
+power1_input            Measured input power.
+power1_average          Average measured input power.
+power1_max              Maximum input power warning limit.
+power1_alarm            Input power warning alarm.
+power1_input_highest    Historical maximum input power.
+power1_reset_history    Write any value to reset history.
+
+temp1_input             Measured temperature.
+temp1_average           Average measured temperature.
+temp1_max               Maximum temperature warning limit.
+temp1_crit              Critical temperature fault limit.
+temp1_max_alarm         Temperature warning alarm.
+temp1_crit_alarm        Temperature fault alarm.
+temp1_highest           Historical maximum temperature.
+temp1_reset_history     Write any value to reset history.
+
+samples                 Number of samples used for average values.
+======================= ======================================================
+
-- 
2.25.1


^ permalink raw reply related

* [PATCH v4 0/3] Add Silergy SQ24860 support
From: Ziming Zhu @ 2026-06-12  3:03 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet,
	Shuah Khan, linux-hwmon, devicetree, linux-kernel, linux-doc,
	Ziming Zhu

From: Ziming Zhu <ziming.zhu@silergycorp.com>

Changes in v4:
- dt-bindings: Collected Reviewed-by tag from Conor Dooley.
- hwmon: pmbus: sq24860: Fixed signedness issue on PMBus limits where
  negative user inputs were silently parsed as large positive unsigned 
  values. Now casting limit values to s16 to properly intercept negative
  bounds.
- hwmon: pmbus: sq24860: Fixed PMBUS_IIN_OC_FAULT_LIMIT handling to 
  silently clamp out-of-range lower limits to the nearest supported
  hardware value (SQ24860_IIN_OCF_OFF) instead of returning -EINVAL, 
  complying with hwmon ABI conventions.
- Fixed function parenthesis alignments reported by checkpatch.

Changes in v3:
- fix remaining checkpatch issues in the SQ24860 driver
- use C comments consistently in the driver
- drop unused header files
- make GIMON a constant in the gain calculation helper
- use proper 64-bit division for the calibration gain calculation
- return -EINVAL when the calculated gain does not fit
- reject PMBUS_IIN_OC_FAULT_LIMIT values outside the hardware range
- treat malformed silergy,rimon-micro-ohms as an error
- sort sq24860 correctly in Documentation/hwmon/index.rst

Ziming Zhu (3):
  dt-bindings: hwmon: pmbus: Add bindings for Silergy SQ24860
  hwmon: pmbus: Add support for Silergy SQ24860
  hwmon: Add documentation for SQ24860

 .../bindings/hwmon/pmbus/silergy,sq24860.yaml |  74 +++
 Documentation/hwmon/index.rst                 |   1 +
 Documentation/hwmon/sq24860.rst               |  96 ++++
 drivers/hwmon/pmbus/Kconfig                   |  19 +
 drivers/hwmon/pmbus/Makefile                  |   1 +
 drivers/hwmon/pmbus/sq24860.c                 | 430 ++++++++++++++++++
 6 files changed, 621 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/hwmon/pmbus/silergy,sq24860.yaml
 create mode 100644 Documentation/hwmon/sq24860.rst
 create mode 100644 drivers/hwmon/pmbus/sq24860.c

-- 
2.25.1


^ permalink raw reply

* [PATCH v4 2/3] hwmon: pmbus: Add support for Silergy SQ24860
From: Ziming Zhu @ 2026-06-12  3:03 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet,
	Shuah Khan, linux-hwmon, devicetree, linux-kernel, linux-doc,
	Ziming Zhu
In-Reply-To: <20260612030304.5165-1-zmzhu0630@163.com>

From: Ziming Zhu <ziming.zhu@silergycorp.com>

Add PMBus hwmon support for the Silergy SQ24860 eFuse.

The driver reports input voltage, output voltage, auxiliary voltage,
input current, input power, and temperature. It also exposes peak,
average, and minimum history attributes, sample count configuration,
and maps the manufacturer-specific VIREF register to the generic input
over-current fault limit attribute.

The IMON resistor value is read from the silergy,rimon-micro-ohms device
property and used to configure the input current calibration gain.

Signed-off-by: Ziming Zhu <ziming.zhu@silergycorp.com>
---
 drivers/hwmon/pmbus/Kconfig   |  19 ++
 drivers/hwmon/pmbus/Makefile  |   1 +
 drivers/hwmon/pmbus/sq24860.c | 430 ++++++++++++++++++++++++++++++++++
 3 files changed, 450 insertions(+)
 create mode 100644 drivers/hwmon/pmbus/sq24860.c

diff --git a/drivers/hwmon/pmbus/Kconfig b/drivers/hwmon/pmbus/Kconfig
index 8f4bff375ecb..a905b5af137c 100644
--- a/drivers/hwmon/pmbus/Kconfig
+++ b/drivers/hwmon/pmbus/Kconfig
@@ -612,6 +612,25 @@ config SENSORS_STEF48H28
 	  This driver can also be built as a module. If so, the module will
 	  be called stef48h28.
 
+config SENSORS_SQ24860
+	tristate "Silergy SQ24860"
+	help
+	  If you say yes here you get hardware monitoring support for Silergy
+	  SQ24860 eFuse.
+
+	  This driver can also be built as a module. If so, the module will
+	  be called sq24860.
+
+config SENSORS_SQ24860_REGULATOR
+	bool "Regulator support for SQ24860"
+	depends on SENSORS_SQ24860 && REGULATOR
+	default SENSORS_SQ24860
+	help
+	  If you say yes here you get regulator support for Silergy SQ24860.
+	  The regulator is registered through the PMBus regulator framework and
+	  can be used to control the output exposed by the device.
+	  This option is only useful if regulator framework support is needed.
+
 config SENSORS_STPDDC60
 	tristate "ST STPDDC60"
 	help
diff --git a/drivers/hwmon/pmbus/Makefile b/drivers/hwmon/pmbus/Makefile
index 7129b62bc00f..86bc93c6c091 100644
--- a/drivers/hwmon/pmbus/Makefile
+++ b/drivers/hwmon/pmbus/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_SENSORS_PM6764TR)	+= pm6764tr.o
 obj-$(CONFIG_SENSORS_PXE1610)	+= pxe1610.o
 obj-$(CONFIG_SENSORS_Q54SJ108A2)	+= q54sj108a2.o
 obj-$(CONFIG_SENSORS_STEF48H28)	+= stef48h28.o
+obj-$(CONFIG_SENSORS_SQ24860)	+= sq24860.o
 obj-$(CONFIG_SENSORS_STPDDC60)	+= stpddc60.o
 obj-$(CONFIG_SENSORS_TDA38640)	+= tda38640.o
 obj-$(CONFIG_SENSORS_TPS25990)	+= tps25990.o
diff --git a/drivers/hwmon/pmbus/sq24860.c b/drivers/hwmon/pmbus/sq24860.c
new file mode 100644
index 000000000000..30202a4b34cf
--- /dev/null
+++ b/drivers/hwmon/pmbus/sq24860.c
@@ -0,0 +1,430 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Author: Ziming Zhu <ziming.zhu@silergycorp.com>
+ */
+
+#include <linux/bitfield.h>
+#include <linux/err.h>
+#include <linux/i2c.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/math64.h>
+
+#include "pmbus.h"
+
+#define SQ24860_IIN_CAL_GAIN		0x38
+#define SQ24860_READ_VAUX		0xd0
+#define SQ24860_READ_VIN_MIN		0xd1
+#define SQ24860_READ_VIN_PEAK		0xd2
+#define SQ24860_READ_IIN_PEAK		0xd4
+#define SQ24860_READ_PIN_PEAK		0xd5
+#define SQ24860_READ_TEMP_AVG		0xd6
+#define SQ24860_READ_TEMP_PEAK		0xd7
+#define SQ24860_READ_VOUT_MIN		0xda
+#define SQ24860_READ_VIN_AVG		0xdc
+#define SQ24860_READ_VOUT_AVG		0xdd
+#define SQ24860_READ_IIN_AVG		0xde
+#define SQ24860_READ_PIN_AVG		0xdf
+#define SQ24860_VIREF			0xe0
+#define SQ24860_PK_MIN_AVG		0xea
+#define PK_MIN_AVG_RST_PEAK		BIT(7)
+#define PK_MIN_AVG_RST_AVG		BIT(6)
+#define PK_MIN_AVG_RST_MIN		BIT(5)
+#define PK_MIN_AVG_AVG_CNT		GENMASK(2, 0)
+#define SQ24860_MFR_WRITE_PROTECT	0xf8
+#define SQ24860_UNLOCKED		BIT(7)
+
+#define SQ24860_8B_SHIFT		2
+#define SQ24860_IIN_OCF_NUM		1000000
+#define SQ24860_IIN_OCF_DIV		129278
+#define SQ24860_IIN_OCF_OFF		165
+
+#define PK_MIN_AVG_RST_MASK		(PK_MIN_AVG_RST_PEAK | \
+					 PK_MIN_AVG_RST_AVG  | \
+					 PK_MIN_AVG_RST_MIN)
+#define SQ24860_MAX_SAMPLES		BIT(FIELD_MAX(PK_MIN_AVG_AVG_CNT))
+/*
+ * Arbitrary default Rimon value: 1.6kOhm
+ */
+#define SQ24860_DEFAULT_RIMON		1600000000
+#define SQ24860_GIMON			18180
+
+#define SQ24860_VAUX_DIV		20
+
+static int sq24860_write_iin_cal_gain(struct i2c_client *client, u32 rimon)
+{
+	u64 temp = 6400ULL * 1000000000ULL * 1000ULL;
+	u64 denom;
+	u64 word;
+
+	if (!rimon)
+		return -EINVAL;
+
+	denom = (u64)rimon * SQ24860_GIMON;
+	word = div64_u64(temp, denom);
+	if (!word || word > U16_MAX)
+		return -EINVAL;
+
+	return i2c_smbus_write_word_data(client, SQ24860_IIN_CAL_GAIN,
+					(u16)word);
+}
+
+static int sq24860_mfr_write_protect_set(struct i2c_client *client,
+					 u8 protect)
+{
+	u8 val;
+
+	switch (protect) {
+	case 0:
+		val = 0xa2;
+		break;
+	case PB_WP_ALL:
+		val = 0x0;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return pmbus_write_byte_data(client, -1, SQ24860_MFR_WRITE_PROTECT,
+				     val);
+}
+
+static int sq24860_mfr_write_protect_get(struct i2c_client *client)
+{
+	int ret = pmbus_read_byte_data(client, -1, SQ24860_MFR_WRITE_PROTECT);
+
+	if (ret < 0)
+		return ret;
+
+	return (ret & SQ24860_UNLOCKED) ? 0 : PB_WP_ALL;
+}
+
+static int sq24860_read_word_data(struct i2c_client *client,
+				  int page, int phase, int reg)
+{
+	int ret;
+
+	switch (reg) {
+	case PMBUS_VIRT_READ_VIN_MAX:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_VIN_PEAK);
+		break;
+
+	case PMBUS_VIRT_READ_VIN_MIN:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_VIN_MIN);
+		break;
+
+	case PMBUS_VIRT_READ_VIN_AVG:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_VIN_AVG);
+		break;
+
+	case PMBUS_VIRT_READ_VOUT_MIN:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_VOUT_MIN);
+		break;
+
+	case PMBUS_VIRT_READ_VOUT_AVG:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_VOUT_AVG);
+		break;
+
+	case PMBUS_VIRT_READ_IIN_AVG:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_IIN_AVG);
+		break;
+
+	case PMBUS_VIRT_READ_IIN_MAX:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_IIN_PEAK);
+		break;
+
+	case PMBUS_VIRT_READ_TEMP_AVG:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_TEMP_AVG);
+		break;
+
+	case PMBUS_VIRT_READ_TEMP_MAX:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_TEMP_PEAK);
+		break;
+
+	case PMBUS_VIRT_READ_PIN_AVG:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_PIN_AVG);
+		break;
+
+	case PMBUS_VIRT_READ_PIN_MAX:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_PIN_PEAK);
+		break;
+
+	case PMBUS_VIRT_READ_VMON:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_VAUX);
+		if (ret < 0)
+			break;
+		ret = DIV_ROUND_CLOSEST(ret, SQ24860_VAUX_DIV);
+		break;
+
+	case PMBUS_VIN_UV_WARN_LIMIT:
+	case PMBUS_VIN_UV_FAULT_LIMIT:
+	case PMBUS_VIN_OV_WARN_LIMIT:
+	case PMBUS_VIN_OV_FAULT_LIMIT:
+	case PMBUS_VOUT_UV_WARN_LIMIT:
+	case PMBUS_IIN_OC_WARN_LIMIT:
+	case PMBUS_OT_WARN_LIMIT:
+	case PMBUS_OT_FAULT_LIMIT:
+	case PMBUS_PIN_OP_WARN_LIMIT:
+		/*
+		 * These registers provide an 8 bits value instead of a
+		 * 10bits one. Just shifting twice the register value is
+		 * enough to make the sensor type conversion work, even
+		 * if the datasheet provides different m, b and R for
+		 * those.
+		 */
+		ret = pmbus_read_word_data(client, page, phase, reg);
+		if (ret < 0)
+			break;
+		ret <<= SQ24860_8B_SHIFT;
+		break;
+
+	case PMBUS_IIN_OC_FAULT_LIMIT:
+		/*
+		 * VIREF directly sets the over-current limit at which the eFuse
+		 * will turn the FET off and trigger a fault. Expose it through
+		 * this generic property instead of a manufacturer specific one.
+		 */
+		ret = pmbus_read_byte_data(client, page, SQ24860_VIREF);
+		if (ret < 0)
+			break;
+		ret = DIV_ROUND_CLOSEST(ret * SQ24860_IIN_OCF_NUM,
+					SQ24860_IIN_OCF_DIV);
+		ret += SQ24860_IIN_OCF_OFF;
+		break;
+
+	case PMBUS_VIRT_SAMPLES:
+		ret = pmbus_read_byte_data(client, page, SQ24860_PK_MIN_AVG);
+		if (ret < 0)
+			break;
+		ret = BIT(FIELD_GET(PK_MIN_AVG_AVG_CNT, ret));
+		break;
+
+	case PMBUS_VIRT_RESET_TEMP_HISTORY:
+	case PMBUS_VIRT_RESET_VIN_HISTORY:
+	case PMBUS_VIRT_RESET_IIN_HISTORY:
+	case PMBUS_VIRT_RESET_PIN_HISTORY:
+	case PMBUS_VIRT_RESET_VOUT_HISTORY:
+		ret = 0;
+		break;
+
+	default:
+		ret = -ENODATA;
+		break;
+	}
+
+	return ret;
+}
+
+static int sq24860_write_word_data(struct i2c_client *client,
+				   int page, int reg, u16 value)
+{
+	int ret;
+
+	switch (reg) {
+	case PMBUS_VIN_UV_WARN_LIMIT:
+	case PMBUS_VIN_UV_FAULT_LIMIT:
+	case PMBUS_VIN_OV_WARN_LIMIT:
+	case PMBUS_VIN_OV_FAULT_LIMIT:
+	case PMBUS_VOUT_UV_WARN_LIMIT:
+	case PMBUS_IIN_OC_WARN_LIMIT:
+	case PMBUS_OT_WARN_LIMIT:
+	case PMBUS_OT_FAULT_LIMIT:
+	case PMBUS_PIN_OP_WARN_LIMIT:
+		value = max_t(s16, (s16)value, 0);
+		value >>= SQ24860_8B_SHIFT;
+		value = clamp_val(value, 0, 0xff);
+		ret = pmbus_write_word_data(client, page, reg, value);
+		break;
+
+	case PMBUS_IIN_OC_FAULT_LIMIT:
+		value = max_t(s16, (s16)value, SQ24860_IIN_OCF_OFF);
+		value -= SQ24860_IIN_OCF_OFF;
+		value = DIV_ROUND_CLOSEST(((unsigned int)value) * SQ24860_IIN_OCF_DIV,
+					  SQ24860_IIN_OCF_NUM);
+		value = clamp_val(value, 0, 0x3f);
+		ret = pmbus_write_byte_data(client, page, SQ24860_VIREF, value);
+		break;
+
+	case PMBUS_VIRT_SAMPLES:
+		value = clamp_val(value, 1, SQ24860_MAX_SAMPLES);
+		value = ilog2(value);
+		ret = pmbus_update_byte_data(client, page, SQ24860_PK_MIN_AVG,
+					     PK_MIN_AVG_AVG_CNT,
+					     FIELD_PREP(PK_MIN_AVG_AVG_CNT, value));
+		break;
+
+	case PMBUS_VIRT_RESET_TEMP_HISTORY:
+	case PMBUS_VIRT_RESET_VIN_HISTORY:
+	case PMBUS_VIRT_RESET_IIN_HISTORY:
+	case PMBUS_VIRT_RESET_PIN_HISTORY:
+	case PMBUS_VIRT_RESET_VOUT_HISTORY:
+		/*
+		 * SQ24860 has history resets based on MIN/AVG/PEAK instead of per
+		 * sensor type. Exposing this quirk in hwmon is not desirable so
+		 * reset MIN, AVG and PEAK together. Even is there effectively only
+		 * one reset, which resets everything, expose the 5 entries so
+		 * userspace is not required map a sensor type to another to trigger
+		 * a reset
+		 */
+		ret = pmbus_update_byte_data(client, 0, SQ24860_PK_MIN_AVG,
+					     PK_MIN_AVG_RST_MASK,
+					     PK_MIN_AVG_RST_MASK);
+		break;
+
+	default:
+		ret = -ENODATA;
+		break;
+	}
+
+	return ret;
+}
+
+static int sq24860_read_byte_data(struct i2c_client *client,
+				  int page, int reg)
+{
+	int ret;
+
+	switch (reg) {
+	case PMBUS_WRITE_PROTECT:
+		ret = sq24860_mfr_write_protect_get(client);
+		break;
+
+	default:
+		ret = -ENODATA;
+		break;
+	}
+
+	return ret;
+}
+
+static int sq24860_write_byte_data(struct i2c_client *client,
+				   int page, int reg, u8 byte)
+{
+	int ret;
+
+	switch (reg) {
+	case PMBUS_WRITE_PROTECT:
+		ret = sq24860_mfr_write_protect_set(client, byte);
+		break;
+
+	default:
+		ret = -ENODATA;
+		break;
+	}
+
+	return ret;
+}
+
+#if IS_ENABLED(CONFIG_SENSORS_SQ24860_REGULATOR)
+static const struct regulator_desc sq24860_reg_desc[] = {
+	PMBUS_REGULATOR_ONE_NODE("vout"),
+};
+#endif
+
+static const struct pmbus_driver_info sq24860_base_info = {
+	.pages = 1,
+	.format[PSC_VOLTAGE_IN] = direct,
+	.m[PSC_VOLTAGE_IN] = 64,
+	.b[PSC_VOLTAGE_IN] = 0,
+	.R[PSC_VOLTAGE_IN] = 0,
+	.format[PSC_VOLTAGE_OUT] = direct,
+	.m[PSC_VOLTAGE_OUT] = 64,
+	.b[PSC_VOLTAGE_OUT] = 0,
+	.R[PSC_VOLTAGE_OUT] = 0,
+	.format[PSC_TEMPERATURE] = direct,
+	.m[PSC_TEMPERATURE] = 1,
+	.b[PSC_TEMPERATURE] = 0,
+	.R[PSC_TEMPERATURE] = 0,
+	/*
+	 * Current and power measurements depend on the calibration gain
+	 * programmed from the board-specific IMON resistor value.
+	 */
+	.format[PSC_CURRENT_IN] = direct,
+	.m[PSC_CURRENT_IN] = 16,
+	.b[PSC_CURRENT_IN] = 0,
+	.R[PSC_CURRENT_IN] = 0,
+	.format[PSC_POWER] = direct,
+	.m[PSC_POWER] = 2,
+	.b[PSC_POWER] = 0,
+	.R[PSC_POWER] = 0,
+	.func[0] = PMBUS_HAVE_VIN |
+		   PMBUS_HAVE_VOUT |
+		   PMBUS_HAVE_VMON |
+		   PMBUS_HAVE_IIN |
+		   PMBUS_HAVE_PIN |
+		   PMBUS_HAVE_TEMP |
+		   PMBUS_HAVE_STATUS_VOUT |
+		   PMBUS_HAVE_STATUS_IOUT |
+		   PMBUS_HAVE_STATUS_INPUT |
+		   PMBUS_HAVE_STATUS_TEMP |
+		   PMBUS_HAVE_SAMPLES,
+	.read_word_data = sq24860_read_word_data,
+	.write_word_data = sq24860_write_word_data,
+	.read_byte_data = sq24860_read_byte_data,
+	.write_byte_data = sq24860_write_byte_data,
+
+#if IS_ENABLED(CONFIG_SENSORS_SQ24860_REGULATOR)
+	.reg_desc = sq24860_reg_desc,
+	.num_regulators = ARRAY_SIZE(sq24860_reg_desc),
+#endif
+};
+
+static const struct i2c_device_id sq24860_i2c_id[] = {
+	{ "sq24860" },
+	{}
+};
+MODULE_DEVICE_TABLE(i2c, sq24860_i2c_id);
+
+static const struct of_device_id sq24860_of_match[] = {
+	{ .compatible = "silergy,sq24860" },
+	{}
+};
+MODULE_DEVICE_TABLE(of, sq24860_of_match);
+
+static int sq24860_probe(struct i2c_client *client)
+{
+	struct device *dev = &client->dev;
+	struct pmbus_driver_info *info;
+	u32 rimon;
+	int ret;
+
+	if (device_property_read_u32(dev, "silergy,rimon-micro-ohms", &rimon))
+		rimon = SQ24860_DEFAULT_RIMON;
+	ret = sq24860_write_iin_cal_gain(client, rimon);
+	if (ret < 0)
+		return dev_err_probe(&client->dev, ret,
+					     "Failed to set gain\n");
+	info = devm_kmemdup(dev, &sq24860_base_info, sizeof(*info), GFP_KERNEL);
+	if (!info)
+		return -ENOMEM;
+
+	return pmbus_do_probe(client, info);
+}
+
+static struct i2c_driver sq24860_driver = {
+	.driver = {
+		.name = "sq24860",
+		.of_match_table = sq24860_of_match,
+	},
+	.probe = sq24860_probe,
+	.id_table = sq24860_i2c_id,
+};
+module_i2c_driver(sq24860_driver);
+
+MODULE_AUTHOR("Ziming Zhu <ziming.zhu@silergycorp.com>");
+MODULE_DESCRIPTION("PMBUS driver for SQ24860 eFuse");
+MODULE_LICENSE("GPL");
+MODULE_IMPORT_NS("PMBUS");
-- 
2.25.1


^ permalink raw reply related

* [PATCH v4 1/3] dt-bindings: hwmon: pmbus: Add bindings for Silergy SQ24860
From: Ziming Zhu @ 2026-06-12  3:03 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet,
	Shuah Khan, linux-hwmon, devicetree, linux-kernel, linux-doc,
	Ziming Zhu, Conor Dooley
In-Reply-To: <20260612030304.5165-1-zmzhu0630@163.com>

From: Ziming Zhu <ziming.zhu@silergycorp.com>

Add devicetree binding documentation for the Silergy SQ24860 eFuse.

The device is a PMBus hardware monitoring device which reports voltage,
current, power, and temperature telemetry. The board-specific IMON
resistor value is described with silergy,rimon-micro-ohms.

Signed-off-by: Ziming Zhu <ziming.zhu@silergycorp.com>

Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
---
 .../bindings/hwmon/pmbus/silergy,sq24860.yaml | 74 +++++++++++++++++++
 1 file changed, 74 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/hwmon/pmbus/silergy,sq24860.yaml

diff --git a/Documentation/devicetree/bindings/hwmon/pmbus/silergy,sq24860.yaml b/Documentation/devicetree/bindings/hwmon/pmbus/silergy,sq24860.yaml
new file mode 100644
index 000000000000..03ef82c11e1a
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/pmbus/silergy,sq24860.yaml
@@ -0,0 +1,74 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+
+$id: http://devicetree.org/schemas/hwmon/pmbus/silergy,sq24860.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Silergy SQ24860 eFuse
+
+maintainers:
+  - Ziming Zhu <ziming.zhu@silergycorp.com>
+
+description:
+  The Silergy SQ24860 is an integrated, high-current circuit protection and
+  power management device with PMBus interface.
+
+properties:
+  compatible:
+    const: silergy,sq24860
+
+  reg:
+    maxItems: 1
+
+  silergy,rimon-micro-ohms:
+    description:
+      Micro-ohms value of the resistance installed between the IMON pin and
+      the ground reference.
+
+  interrupts:
+    description: PMBus SMBAlert interrupt.
+    maxItems: 1
+
+  regulators:
+    type: object
+    description:
+      List of regulators provided by this controller.
+
+    properties:
+      vout:
+        $ref: /schemas/regulator/regulator.yaml#
+        type: object
+        unevaluatedProperties: false
+
+    additionalProperties: false
+
+required:
+  - compatible
+  - reg
+  - silergy,rimon-micro-ohms
+
+additionalProperties: false
+
+examples:
+  - |
+
+    i2c {
+        #address-cells = <1>;
+        #size-cells = <0>;
+
+        hw-monitor@40 {
+            compatible = "silergy,sq24860";
+            reg = <0x40>;
+
+            interrupt-parent = <&gpio>;
+            interrupts = <42 8>;
+            silergy,rimon-micro-ohms = <1600000000>;
+
+            regulators {
+                cpu0_vout: vout {
+                    regulator-name = "main_cpu0";
+                };
+            };
+        };
+    };
-- 
2.25.1


^ permalink raw reply related

* [nsa:xlnx/fix/buf-mmap-multibuffer 27173/27391] htmldocs: Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/i3c/adi,i3c-master.yaml
From: kernel test robot @ 2026-06-12  2:58 UTC (permalink / raw)
  To: Jorge Marques
  Cc: oe-kbuild-all, Nuno Sa, Frank Li, Alexandre Belloni, linux-doc

tree:   https://github.com/nunojsa/linux xlnx/fix/buf-mmap-multibuffer
head:   a26a8baba71e866951f6abf4fc6c0504770c272e
commit: c54362d581209a7421fe0d0caf70ff172289da75 [27173/27391] i3c: master: Add driver for Analog Devices I3C Controller IP
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project f43d6834093b19baf79beda8c0337ab020ac5f17)
docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
reproduce: (https://download.01.org/0day-ci/archive/20260612/202606120453.2rP7tq7s-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606120453.2rP7tq7s-lkp@intel.com/

All warnings (new ones prefixed by >>):

   from /zdci/src/kernel-tests/bisect-test-build-error.sh:102: main
   Warning: Documentation/devicetree/bindings/iio/adc/adi,ltc2308.yaml references a file that doesn't exist: Documentation/devicetree/bindings/iio/adc/adc.txt
   Warning: Documentation/devicetree/bindings/iio/adc/adi,ltc2308.yaml references a file that doesn't exist: Documentation/devicetree/bindings/iio/adc/adc.txt
   Warning: Documentation/devicetree/bindings/regulator/siliconmitus,sm5703-regulator.yaml references a file that doesn't exist: Documentation/devicetree/bindings/mfd/siliconmitus,sm5703.yaml
   Warning: Documentation/hwmon/g762.rst references a file that doesn't exist: Documentation/devicetree/bindings/hwmon/g762.txt
>> Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/i3c/adi,i3c-master.yaml
   Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt
   Using alabaster theme
--
     from /zdci/src/kernel-tests/bisect-test-build-error.sh:102: main
   Warning: Documentation/devicetree/bindings/iio/adc/adi,ltc2308.yaml references a file that doesn't exist: Documentation/devicetree/bindings/iio/adc/adc.txt
   Warning: Documentation/devicetree/bindings/iio/adc/adi,ltc2308.yaml references a file that doesn't exist: Documentation/devicetree/bindings/iio/adc/adc.txt
   Warning: Documentation/devicetree/bindings/regulator/siliconmitus,sm5703-regulator.yaml references a file that doesn't exist: Documentation/devicetree/bindings/mfd/siliconmitus,sm5703.yaml
   Warning: Documentation/hwmon/g762.rst references a file that doesn't exist: Documentation/devicetree/bindings/hwmon/g762.txt
>> Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/i3c/adi,i3c-master.yaml
   Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
From: Shanker Donthineni @ 2026-06-12  1:13 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Vladimir Murzin, Jason Gunthorpe,
	linux-arm-kernel@lists.infradead.org, Mark Rutland,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	Vikram Sethi, Jason Sequeira, Shanker Donthineni
In-Reply-To: <IA1PR12MB6089049028A73A2078FC6831C71B2@IA1PR12MB6089.namprd12.prod.outlook.com>

Hi Will,

On 6/11/2026 8:39 AM, sdonthineni@nvidia.com wrote:
>
> -----Original Message-----
> From: Will Deacon <will@kernel.org>
> Sent: Thursday, June 11, 2026 8:34 AM
> To: Shanker Donthineni <sdonthineni@nvidia.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>; Vladimir Murzin <vladimir.murzin@arm.com>; Jason Gunthorpe <jgg@nvidia.com>; linux-arm-kernel@lists.infradead.org; Mark Rutland <mark.rutland@arm.com>; linux-kernel@vger.kernel.org; linux-doc@vger.kernel.org; Vikram Sethi <vsethi@nvidia.com>; Jason Sequeira <jsequeira@nvidia.com>
> Subject: Re: [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
>
> External email: Use caution opening links or attachments
>
>
> On Wed, Jun 10, 2026 at 11:48:22AM -0500, Shanker Donthineni wrote:
>> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
>> observed by a peripheral before an older, non-overlapping Device-nGnR*
>> store to the same peripheral. This breaks the program-order guarantee
>> that software expects for Device-nGnR* accesses and can leave a
>> peripheral in an incorrect state, as a load is observed before an
>> earlier store takes effect.
>>
>> The erratum can occur only when all of the following apply:
>>
>>    - A PE executes a Device-nGnR* store followed by a younger
>>      Device-nGnR* load.
>>    - The store is not a store-release.
>>    - The accesses target the same peripheral and do not overlap in bytes.
>>    - There is at most one intervening Device-nGnR* store in program
>>      order, and there are no intervening Device-nGnR* loads.
>>    - There is no DSB, and no DMB that orders loads, between the store and
>>      the load.
>>    - Specific micro-architectural and timing conditions occur.
>>
>> Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain
>> str* to stlr* (Store-Release), which removes the "store is not a
>> store-release" condition for every device write the kernel issues.
>> Because writel() and writel_relaxed() are both built on __raw_writel()
>> in asm-generic/io.h, patching the raw variants covers both the
>> non-relaxed and relaxed APIs without touching the higher layers. Note
>> that writel()'s own barrier sits before the store, so it does not
>> order the store against a subsequent readl(); the store-release
>> promotion is what provides that ordering.
>>
>> Like ARM64_ERRATUM_832075 on the load side, the change is gated on a
>> new ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only
>> activated on parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs
>> continue to use the plain str* sequence.
>>
>> Note: stlr* only supports base-register addressing, so affected CPUs
>> use a base-register stlr* path. Unaffected CPUs keep the original
>> offset-addressed str* sequence introduced by commit d044d6ba6f02
>> ("arm64: io: permit offset addressing").
>>
>> The __const_memcpy_toio_aligned32() and
>> __const_memcpy_toio_aligned64() helpers are left unchanged. These
>> helpers are intended for write-combining mappings, which are Normal-NC
>> on arm64. Replacing their contiguous str* groups would defeat the
>> write-combining behavior used to improve store performance.
>>
>> Co-developed-by: Vikram Sethi <vsethi@nvidia.com>
>> Signed-off-by: Vikram Sethi <vsethi@nvidia.com>
>> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
>> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
>> ---
>> Changes since v2:
>>    - Reworked the raw MMIO write helpers so unaffected CPUs keep the
>>      existing offset-addressed STR sequence, while affected CPUs use the
>>      base-register STLR path.
>>    - Updated the commit message to match the code changes.
>>    - Rebased on top of the arm64 for-next/errata branch:
>>      
>> https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h
>> =for-next/errata
>>
>> Changes since v1:
>>    - Updated the commit message based on feedback from Vladimir Murzin.
>>
>>   Documentation/arch/arm64/silicon-errata.rst |  2 ++
>>   arch/arm64/Kconfig                          | 23 ++++++++++++++++
>>   arch/arm64/include/asm/io.h                 | 30 +++++++++++++++++++++
>>   arch/arm64/kernel/cpu_errata.c              |  8 ++++++
>>   arch/arm64/tools/cpucaps                    |  1 +
>>   5 files changed, 64 insertions(+)
>>
>> diff --git a/Documentation/arch/arm64/silicon-errata.rst
>> b/Documentation/arch/arm64/silicon-errata.rst
>> index ad09bbb10da80..fc45125dc2f80 100644
>> --- a/Documentation/arch/arm64/silicon-errata.rst
>> +++ b/Documentation/arch/arm64/silicon-errata.rst
>> @@ -298,6 +298,8 @@ stable kernels.
>>   +----------------+-----------------+-----------------+-----------------------------+
>>   | NVIDIA         | Carmel Core     | N/A             | NVIDIA_CARMEL_CNP_ERRATUM   |
>>   
>> +----------------+-----------------+-----------------+----------------
>> -------------+
>> +| NVIDIA         | Olympus core    | T410-OLY-1027   | NVIDIA_OLYMPUS_1027_ERRATUM |
>> ++----------------+-----------------+-----------------+-----------------------------+
>>   | NVIDIA         | Olympus core    | T410-OLY-1029   | ARM64_ERRATUM_4118414       |
>>   +----------------+-----------------+-----------------+-----------------------------+
>>   | NVIDIA         | T241 GICv3/4.x  | T241-FABRIC-4   | N/A                         |
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index
>> c65cef81be86a..d633eb70de1ac 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -564,6 +564,29 @@ config ARM64_ERRATUM_832075
>>
>>          If unsure, say Y.
>>
>> +config NVIDIA_OLYMPUS_1027_ERRATUM
>> +     bool "NVIDIA Olympus: device store/load ordering erratum"
>> +     default y
>> +     help
>> +       This option adds an alternative code sequence to work around an
>> +       NVIDIA Olympus core erratum where a Device-nGnR* store can be
>> +       observed by a peripheral after a younger Device-nGnR* load to the
>> +       same peripheral. This breaks the program order that drivers rely
>> +       on for MMIO and can leave a device in an incorrect state.
>> +
>> +       The workaround promotes the raw MMIO store helpers
>> +       (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
>> +       required ordering. Because writel() and writel_relaxed() are built
>> +       on __raw_writel(), both are covered without changes to the higher
>> +       layers.
>> +
>> +       The fix is applied through the alternatives framework, so enabling
>> +       this option does not by itself activate the workaround: it is
>> +       patched in only when an affected CPU is detected, and is a no-op on
>> +       unaffected CPUs.
>> +
>> +       If unsure, say Y.
>> +
>>   config ARM64_ERRATUM_834220
>>        bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
>>        depends on KVM
>> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
>> index 8cbd1e96fd50b..801223e754c90 100644
>> --- a/arch/arm64/include/asm/io.h
>> +++ b/arch/arm64/include/asm/io.h
>> @@ -22,10 +22,22 @@
>>   /*
>>    * Generic IO read/write.  These perform native-endian accesses.
>>    */
>> +static __always_inline bool arm64_needs_device_store_release(void)
>> +{
>> +     return alternative_has_cap_unlikely(
>> +                             ARM64_WORKAROUND_DEVICE_STORE_RELEASE);
>> +}
>> +
>>   #define __raw_writeb __raw_writeb
>>   static __always_inline void __raw_writeb(u8 val, volatile void
>> __iomem *addr)  {
>>        volatile u8 __iomem *ptr = addr;
>> +
>> +     if (arm64_needs_device_store_release()) {
>> +             asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
>> +             return;
>> +     }
>> +
>>        asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));  }
> Use an 'else' clause instead of the early return? (similarly for the other changes).
>
> I still reckon you should do something with the memcpy-to-io routines.
> A simple option could be to make dgh() a dmb on parts with the erratum?
> That at least moves the barrier out of the loop.

Thanks Will. I looked again at both the arm64 comments and the generic iomap_copy.c
contract, and I’m not convinced that making dgh() a dmb is the right fit for this
path. Based on the documented comments, callers should not assume ordering from
these helpers; if ordering is required around a memcpy, the call site should already
be providing the necessary barriers.

Related data point in generic lib/iomap_copy.c:

/**
  * __iowrite32_copy - copy data to MMIO space, in 32-bit units
  * @to: destination, in MMIO space (must be 32-bit aligned)
  * @from: source (must be 32-bit aligned)
  * @count: number of 32-bit quantities to copy
  *
  * Copy data from kernel space to MMIO space, in units of 32 bits at a
  * time.  Order of access is not guaranteed, nor is a memory barrier
  * performed afterwards.
  */
#ifndef __iowrite32_copy
void __iowrite32_copy(void __iomem *to, const void *from, size_t count)

/**
  * __iowrite64_copy - copy data to MMIO space, in 64-bit or 32-bit units
  * @to: destination, in MMIO space (must be 64-bit aligned)
  * @from: source (must be 64-bit aligned)
  * @count: number of 64-bit quantities to copy
  *
  * Copy data from kernel space to MMIO space, in units of 32 or 64 bits at a
  * time.  Order of access is not guaranteed, nor is a memory barrier
  * performed afterwards.
  */
#ifndef __iowrite64_copy
void __iowrite64_copy(void __iomem *to, const void *from, size_t count)

/**
  * __iowrite32_copy - copy data to MMIO space, in 32-bit units
  * @to: destination, in MMIO space (must be 32-bit aligned)
  * @from: source (must be 32-bit aligned)
  * @count: number of 32-bit quantities to copy
  *
  * Copy data from kernel space to MMIO space, in units of 32 bits at a
  * time.  Order of access is not guaranteed, nor is a memory barrier
  * performed afterwards.
  */
#ifndef __iowrite32_copy
void __iowrite32_copy(void __iomem *to, const void *from, size_t count)


The arm64 comment says in arch/arm64/asm/io.h:

/*
  * The ARM64 iowrite implementation is intended to support drivers that want to
  * use write combining. For instance PCI drivers using write combining with a 64
  * byte __iowrite64_copy() expect to get a 64 byte MemWr TLP on the PCIe bus.
  *
  * Newer ARM core have sensitive write combining buffers, it is important that
  * the stores be contiguous blocks of store instructions. Normal memcpy
  * approaches have a very low chance to generate write combining.
  *
  * Since this is the only API on ARM64 that should be used with write combining
  * it also integrates the DGH hint which is supposed to lower the latency to
  * emit the large TLP from the CPU.
  */

So my reading is that dgh() in the arm64 implementation is there for the
write-combining/gathering behavior. Replacing it with dmb would make this
path stronger than the generic API contract and could penalize performance
of the WC use case.

For the scalar MMIO helpers, the workaround promotes the raw writes to
store-release on affected CPUs as v1/v2 shown below. For the memcpy-toIO
helpers, could you please clarify the specific reason for adding a dmb despite
the documented no-ordering contract? Is the concern that some drivers may
be relying on ordering across memcpy_toio_*() today even though the API
does not guarantee it, and that we should cover those cases defensively?

Would prefer to avoid replacing DGH() with DMB unless there is a strong
reason to do so. Please let me know if I can post the v4 patch with
the change below, while keeping DGH() as-is in the memcpy-toIO path.

  #define __raw_writeb __raw_writeb
  static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
  {
-       volatile u8 __iomem *ptr = addr;
-       asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+       asm volatile(ALTERNATIVE("strb %w0, [%1]",
+                                "stlrb %w0, [%1]",
+                                ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+                    : : "rZ" (val), "r" (addr));
  }

  #define __raw_writew __raw_writew
  static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
  {
-       volatile u16 __iomem *ptr = addr;
-       asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+       asm volatile(ALTERNATIVE("strh %w0, [%1]",
+                                "stlrh %w0, [%1]",
+                                ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+                    : : "rZ" (val), "r" (addr));
  }

  #define __raw_writel __raw_writel
  static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
  {
-       volatile u32 __iomem *ptr = addr;
-       asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+       asm volatile(ALTERNATIVE("str %w0, [%1]",
+                                "stlr %w0, [%1]",
+                                ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+                    : : "rZ" (val), "r" (addr));
  }

  #define __raw_writeq __raw_writeq
  static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
  {
-       volatile u64 __iomem *ptr = addr;
-       asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
+       asm volatile(ALTERNATIVE("str %x0, [%1]",
+                                "stlr %x0, [%1]",
+                                ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+                    : : "rZ" (val), "r" (addr));
  }


-Shanker


^ permalink raw reply

* Re: [PATCH v3 02/12] x86/resctrl: Add data structures and definitions for PLZA configuration
From: Reinette Chatre @ 2026-06-11 23:40 UTC (permalink / raw)
  To: Babu Moger, corbet, tony.luck, Dave.Martin, james.morse, tglx, bp,
	dave.hansen
  Cc: skhan, x86, mingo, hpa, akpm, rdunlap, pawan.kumar.gupta,
	feng.tang, dapeng1.mi, kees, elver, lirongqing, paulmck, bhelgaas,
	seanjc, alexandre.chartre, yazen.ghannam, peterz, chang.seok.bae,
	kim.phillips, xin, naveen, thomas.lendacky, linux-doc,
	linux-kernel, eranian, peternewman
In-Reply-To: <e84fdbc324b312ff137d279ec154e3827c0aed81.1777591497.git.babu.moger@amd.com>

Hi Babu,

On 4/30/26 4:24 PM, Babu Moger wrote:
> Privilege Level Zero Association (PLZA) is configured per logical processor
> via MSR_IA32_PQR_PLZA_ASSOC (0xc00003fc). Software must program RMID and
> CLOSID association fields and their enable bits using the layout defined
> for the MSR.
> 
> Define MSR_IA32_PQR_PLZA_ASSOC and the RMID_EN, CLOSID_EN, and PLZA_EN bit
> masks in asm/msr-index.h. Add union msr_pqr_plza_assoc in arch resctrl
> internal.h

Above paragraph captures what can be seen from the patch. Please check entire
series for this since many changelogs in this series verbatim describes the code
changes in patch without helping reader understand why those changes are made.


> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---

> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 9dc6b610e4e2..623628d3c643 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1287,10 +1287,17 @@
>  /* - AMD: */
>  #define MSR_IA32_MBA_BW_BASE		0xc0000200
>  #define MSR_IA32_SMBA_BW_BASE		0xc0000280
> +#define MSR_IA32_PQR_PLZA_ASSOC		0xc00003fc
>  #define MSR_IA32_L3_QOS_ABMC_CFG	0xc00003fd
>  #define MSR_IA32_L3_QOS_EXT_CFG		0xc00003ff
>  #define MSR_IA32_EVT_CFG_BASE		0xc0000400
>  
> +/* Lower 32 bits of MSR_IA32_PQR_PLZA_ASSOC */
> +#define RMID_EN				BIT(31)
> +/* Upper 32 bits of MSR_IA32_PQR_PLZA_ASSOC */
> +#define CLOSID_EN			BIT(15)
> +#define PLZA_EN				BIT(31)
> +

This is unexpected. So far resctrl has only defined the MSR numbers in this file, not
the individual fields. This seems a legitimate use of msr-index.h but creates inconsistency
with how the fields of the other resctrl registers are defined. This may be ok so I am
looking past this for now. Since I am not familiar with this use I am looking at other
patterns of this and it seems that the register fields are usually defined right after
the register to make this relationship clear and also use more verbose naming to establish
this relationship ... I do not think such cryptic names should be used without context
in such a global scope. Please compare with how other fields are defined at this scope.

> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index e3cfa0c10e92..1c2f87ffb0ea 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -222,6 +222,33 @@ union l3_qos_abmc_cfg {
>  	unsigned long full;
>  };
>  
> +/*
> + * PLZA is programmed by writing to MSR_IA32_PQR_PLZA_ASSOC. Bitfield
> + * layout for MSR_IA32_PQR_PLZA_ASSOC (Privilege Level Zero Association).

These comments are valuable to describe how resctrl should interact with
this register so it would help to be specific and document any and all
constraints.

For example, I seem to remember that all fields except PLZA_EN are required
to be identical on all CPUs. Please document that and any other constraints here.

> + *
> + * @rmid		: The RMID to be configured for PLZA.

What does "to be configured" mean? It seems to imply that when resctrl
writes to @rmid then the setting does not take immediate effect but would
take effect at some future "configure" time?

> + * @reserved1		: Reserved.
> + * @rmid_en		: Associate RMID or not.

Please elaborate ... what is RMID associated with? What does "or not" imply? 
Here it will help to document relationship with MSR_IA32_PQR_ASSOC.

> + * @closid		: The CLOSID to be configured for PLZA.
> + * @reserved2		: Reserved.
> + * @closid_en		: Associate CLOSID or not.

Same comments as for RMID

> + * @reserved3		: Reserved.
> + * @plza_en		: Configure PLZA or not.

plza_en implies "enable" but the comment mentions "configure". Considering
the other fields are "to be configured" there seems to be relationship but
that is not documented at all. For example, if @plza_en is 1 and resctrl modifies
@rmid should resctrl write "1" to @plza_en again to "configure" the new RMID?

Please add specific detail to help understand how best to interact with this
register. 

> + */
> +union msr_pqr_plza_assoc {
> +	struct {
> +		unsigned long rmid	:12,
> +			      reserved1	:19,
> +			      rmid_en	: 1,
> +			      closid	: 4,
> +			      reserved2	:11,
> +			      closid_en	: 1,
> +			      reserved3	:15,
> +			      plza_en	: 1;
> +	} split;
> +	unsigned long full;
> +};
> +
>  void rdt_ctrl_update(void *arg);
>  
>  int rdt_get_l3_mon_config(struct rdt_resource *r);

Reinette

^ permalink raw reply

* Re: [PATCH v3 01/12] x86/resctrl: Support Privilege-Level Zero Association (PLZA)
From: Reinette Chatre @ 2026-06-11 23:23 UTC (permalink / raw)
  To: Babu Moger, corbet, tony.luck, Dave.Martin, james.morse, tglx, bp,
	dave.hansen
  Cc: skhan, x86, mingo, hpa, akpm, rdunlap, pawan.kumar.gupta,
	feng.tang, dapeng1.mi, kees, elver, lirongqing, paulmck, bhelgaas,
	seanjc, alexandre.chartre, yazen.ghannam, peterz, chang.seok.bae,
	kim.phillips, xin, naveen, thomas.lendacky, linux-doc,
	linux-kernel, eranian, peternewman, sos-linux-ext-patches
In-Reply-To: <f59c7f5404f29b2901af68d8032ee615b7f0efea.1777591496.git.babu.moger@amd.com>

Hi Babu,

On 4/30/26 4:24 PM, Babu Moger wrote:
> Customers have identified an issue while using the QoS resource Control

"Control" -> "control"?

> feature. If a memory bandwidth associated with a CLOSID is aggressively

"a memory bandwidth" -> "memory bandwidth"?

> throttled, and it moves into Kernel mode, the Kernel operations are also

What does "it" refer to here? From text it seems to be the "CLOSID" but that
does not sound right? Should "it" instead be something like "a task with that
CLOSID"?

"Kernel" -> "kernel"?

> aggressively throttled. This can stall forward progress and eventually
> degrade overall system performance. AMD hardware supports a feature
> Privilege-Level Zero Association (PLZA) to change the association of the
> thread as soon as it begins executing.

"change the association of the thread as soon as it begins executing." I am
not able to parse this.

> 
> Privilege-Level Zero Association (PLZA) allows the user to specify a CLOSID
> and/or RMID associated with execution in Privilege-Level Zero. When enabled
> on a HW thread, when the thread enters Privilege-Level Zero, transactions

Could you please use consistent terminology throughout this series? This patch
uses "HW thread"/"thread", the next patch then switches to "logical processor",
and then by patch #4 the term seems to settle on "CPU". Could this just be
"CPU" from here and throughout series to be consistent and easier to read?

What is meant with "transactions"?  Is this just about memory transactions?
Using this term combined with earlier "memory bandwidth" related problem description
hints that this feature just impacts memory bandwidth allocation but from what
I understand this impacts all allocation (CLOSID of all resources) and monitoring.

Could "transactions" be replaced with "allocation and monitoring" and be
more accurate?

> associated with that thread will be associated with the PLZA CLOSID and/or
> RMID. Otherwise, the HW thread will be associated with the CLOSID and RMID
> identified by PQR_ASSOC.
> 
> Add PLZA support to resctrl and introduce a kernel parameter that allows
> enabling or disabling the feature at boot time.
> 
> The GLBE feature details are documented in:

"GLBE" -> "PLZA"?

> 
>   AMD64 Zen6 Platform Quality of Service (PQOS) Extensions:
>   Publication # 69193 Revision: 1.00, Issue Date: March 2026
> 
> available at https://bugzilla.kernel.org/show_bug.cgi?id=206537

Please follow same style as what you used in the assignable counter enabling where
this URL is provided via a "Link:" tag and then the text can refer to it. Specifically,
	Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [1]

> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---
> v3: Code did not change. Patch order cahnged.
>     Added documentation link.
> 
> v2: Rebased on top of the latest tip.
> ---
>  Documentation/admin-guide/kernel-parameters.txt | 2 +-
>  arch/x86/include/asm/cpufeatures.h              | 1 +
>  arch/x86/kernel/cpu/resctrl/core.c              | 2 ++
>  arch/x86/kernel/cpu/scattered.c                 | 1 +

Please split changes to other subsystems and make these changes
obvious with their own subject prefix to avoid sneaking changes into
other subsystems via resctrl.

Reinette

^ permalink raw reply

* Re: [PATCH net-next v2 00/15] mptcp: pm: drop TCP TS with ADD_ADDRv6 + port
From: patchwork-bot+netdevbpf @ 2026-06-11 22:50 UTC (permalink / raw)
  To: Matthieu Baerts
  Cc: martineau, geliang, davem, edumazet, kuba, pabeni, horms, netdev,
	mptcp, linux-kernel, corbet, skhan, linux-doc, linux-kselftest,
	ncardwell, kuniyu, shuah
In-Reply-To: <20260605-net-next-mptcp-add-addr6-port-ts-v2-0-758e7ca73f4d@kernel.org>

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 05 Jun 2026 19:21:44 +1000 you wrote:
> Up to this series, it was possible to add a "signal" MPTCP endpoint with
> an IPv6 address and a port, or to directly request to send an ADD_ADDR
> with a v6 address and a port, but the expected ADD_ADDR wasn't sent when
> TCP timestamps was used for the connection.
> 
> In fact, such signalling option cannot be sent when TCP timestamps is
> used due to a lack of option space: the limit is at 40 bytes, and, with
> padding, TCP timestamps is taking 12 bytes, while an ADD_ADDR IPv6 +
> port is taking 30 bytes. The selected solution here is to simply drop
> the TCP timestamps option when such ADD_ADDR of 30 bytes needs to be
> sent.
> 
> [...]

Here is the summary with links:
  - [net-next,v2,01/15] mptcp: options: suboptions sizes can be negative
    https://git.kernel.org/netdev/net-next/c/f4a58ffbd4cf
  - [net-next,v2,02/15] mptcp: pm: avoid computing rm_addr size twice
    https://git.kernel.org/netdev/net-next/c/a8bffec089d5
  - [net-next,v2,03/15] mptcp: pm: avoid computing add_addr size twice
    https://git.kernel.org/netdev/net-next/c/06c62385be85
  - [net-next,v2,04/15] mptcp: introduce add_addr_v6_port_drop_ts sysctl knob
    https://git.kernel.org/netdev/net-next/c/30ff28fdc4da
  - [net-next,v2,05/15] tcp: allow mptcp to drop TS for some packets
    https://git.kernel.org/netdev/net-next/c/1c3e7e043977
  - [net-next,v2,06/15] mptcp: pm: drop TCP TS with ADD_ADDRv6 + port
    https://git.kernel.org/netdev/net-next/c/23eeaad0d89d
  - [net-next,v2,07/15] selftests: mptcp: validate ADD_ADDRv6 + TS + port
    https://git.kernel.org/netdev/net-next/c/dd7fb53c21c3
  - [net-next,v2,08/15] selftests: mptcp: always check sent/dropped ADD_ADDRs
    https://git.kernel.org/netdev/net-next/c/5558517b0001
  - [net-next,v2,09/15] mptcp: pm: use for_each_subflow helper
    https://git.kernel.org/netdev/net-next/c/f81689172429
  - [net-next,v2,10/15] mptcp: pm: rename add_entry structure to add_addr
    https://git.kernel.org/netdev/net-next/c/350d76dd6e79
  - [net-next,v2,11/15] mptcp: pm: uniform announced addresses helpers
    https://git.kernel.org/netdev/net-next/c/7d4dacc8ccca
  - [net-next,v2,12/15] mptcp: pm: remove add_ prefix from timer
    https://git.kernel.org/netdev/net-next/c/938490767e37
  - [net-next,v2,13/15] mptcp: pm: make mptcp_pm_add_addr_send_ack static
    https://git.kernel.org/netdev/net-next/c/d0f866e64897
  - [net-next,v2,14/15] mptcp: pm: avoid using del_timer directly
    https://git.kernel.org/netdev/net-next/c/6ea199a938da
  - [net-next,v2,15/15] mptcp: options: rst: drop unused skb parameter
    https://git.kernel.org/netdev/net-next/c/6545a8c34703

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH v3 00/12] [PATCH v3 00/12] x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Reinette Chatre @ 2026-06-11 21:53 UTC (permalink / raw)
  To: Babu Moger, corbet, tony.luck, Dave.Martin, james.morse, tglx, bp,
	dave.hansen
  Cc: skhan, x86, mingo, hpa, akpm, rdunlap, pawan.kumar.gupta,
	feng.tang, dapeng1.mi, kees, elver, lirongqing, paulmck, bhelgaas,
	seanjc, alexandre.chartre, yazen.ghannam, peterz, chang.seok.bae,
	kim.phillips, xin, naveen, thomas.lendacky, linux-doc,
	linux-kernel, eranian, peternewman, sos-linux-ext-patches
In-Reply-To: <cover.1777591496.git.babu.moger@amd.com>

Hi Babu,

On 4/30/26 4:24 PM, Babu Moger wrote:
> Design
> ======
> 
> A new sysfs file, info/kernel_mode, holds a single global policy that
> selects what kernel work is steered and which rdtgroup it is steered

How should "selects *what* kernel work is steered" be interpreted? Do these
modes not all apply to *all* kernel work? 

> to.  Reads describe the supported modes and the currently-active
> binding; writes change the policy or rebind to a different group.
> Look at the thread below for design discussion.
> https://lore.kernel.org/lkml/14a8ad0a-e842-4268-871a-0762f1169e03@intel.com/
> 

...

> Examples
> ========
> 
> (See Documentation/filesystems/resctrl.rst, "kernel_mode" and
> "kmode_cpus" sections, for the full UAPI.)
> 
>   # Mount resctrl
>   # mount -t resctrl resctrl /sys/fs/resctrl
>   # cd /sys/fs/resctrl
> 
>   # Read the supported modes.  The active mode is bracketed and reports
>   # the bound "<ctrl>/<mon>/" group; other supported modes report
>   # ":group=none" because nothing is bound to them.
>   # cat info/kernel_mode
>   [inherit_ctrl_and_mon:group=//]

This is unexpected since associating a group to this mode implies that this
group is used to manage allocations and monitoring of kernel work but this
is not true, right? From what I understand there should be no group associated with
this default "inherit_ctrl_and_mon" mode. 

>   global_assign_ctrl_inherit_mon_per_cpu:group=none
>   global_assign_ctrl_assign_mon_per_cpu:group=none

nit: "none" does not reflect state as clearly as "unset"/"uninitialized"/"NA" 

> 
>   # Create a CTRL_MON group plus a MON child and bind both the kernel
>   # CLOSID and RMID to them.
>   # mkdir ctrl1
>   # mkdir ctrl1/mon_groups/mon1
>   # echo "global_assign_ctrl_assign_mon_per_cpu:group=ctrl1/mon1/" \
>           > info/kernel_mode
>   # cat info/kernel_mode
>   inherit_ctrl_and_mon:group=none
>   global_assign_ctrl_inherit_mon_per_cpu:group=none
>   [global_assign_ctrl_assign_mon_per_cpu:group=ctrl1/mon1/]
> 
>   # kmode_cpus and kmode_cpus_list are visible only on the bound group.
>   # ls ctrl1/kmode_cpus*
>   ctrl1/kmode_cpus  ctrl1/kmode_cpus_list

Since it is ctrl1/mon1 that was bound, should these CPU files not appear
in ctrl1/mon_groups/mon1 ?

> 
>   # Restrict the binding to a CPU subset; the write is incremental.

Does "incremental" mean that if the file contains CPUs 0-3 then writing
"4" would set the CPUs to 0-4? This does not sound right since it is
expected that user space can remove CPUs also?

>   # echo 0-3 > ctrl1/kmode_cpus_list
>   # cat ctrl1/kmode_cpus
>   f
>   # cat ctrl1/kmode_cpus_list
>   0-3
> 
>   # Empty masks are rejected; use info/kernel_mode to reset to
>   # "every online CPU".
>   # echo "" > ctrl1/kmode_cpus_list
>   bash: echo: write error: Invalid argument
>   # cat info/last_cmd_status
>   Empty mask not allowed; use info/kernel_mode to unbind

Why are empty masks rejected/not allowed?

> 
>   # Disable kernel-mode steering (back to inherit, default group).

This sounds like kernel work is steered to default group which I 
do not think is accurate for the "inherit_ctrl_and_mon" mode.

>   # echo "inherit_ctrl_and_mon" > info/kernel_mode
> 
> Tested on AMD with PLZA; the generic bits build clean on x86 without
> PLZA support and are no-ops at runtime.

Reinette



^ permalink raw reply

* Re: [PATCH v6 02/20] nfsd: add protocol support for CB_NOTIFY
From: Chuck Lever @ 2026-06-11 21:33 UTC (permalink / raw)
  To: Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan
  Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
	Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
	linux-doc, linux-nfs
In-Reply-To: <20260611-dir-deleg-v6-2-4c45080e5f3f@kernel.org>


On Thu, Jun 11, 2026, at 1:50 PM, Jeff Layton wrote:

> diff --git a/Documentation/sunrpc/xdr/nfs4_1.x 
> b/Documentation/sunrpc/xdr/nfs4_1.x
> index 5b45547b2ebc..632f5b579c39 100644
> --- a/Documentation/sunrpc/xdr/nfs4_1.x
> +++ b/Documentation/sunrpc/xdr/nfs4_1.x
> @@ -45,19 +45,165 @@ pragma header nfs4;
>  /*
>   * Basic typedefs for RFC 1832 data type definitions
>   */
> -typedef hyper		int64_t;
> -typedef unsigned int	uint32_t;
> +typedef int                  int32_t;
> +typedef unsigned int         uint32_t;
> +typedef hyper                int64_t;
> +typedef unsigned hyper       uint64_t;
> +
> +const NFS4_VERIFIER_SIZE        = 8;
> +const NFS4_FHSIZE               = 128;
> +
> +enum nfsstat4 {
> + NFS4_OK                = 0,    /* everything is okay      */
> + NFS4ERR_PERM           = 1,    /* caller not privileged   */
> + NFS4ERR_NOENT          = 2,    /* no such file/directory  */
> + NFS4ERR_IO             = 5,    /* hard I/O error          */
> + NFS4ERR_NXIO           = 6,    /* no such device          */
> + NFS4ERR_ACCESS         = 13,   /* access denied           */
> + NFS4ERR_EXIST          = 17,   /* file already exists     */
> + NFS4ERR_XDEV           = 18,   /* different filesystems   */
> +
> + /*
> +  * Please do not allocate value 19; it was used in NFSv3
> +  * and we do not want a value in NFSv3 to have a different
> +  * meaning in NFSv4.x.
> +  */
> +
> + NFS4ERR_NOTDIR         = 20,   /* should be a directory   */
> + NFS4ERR_ISDIR          = 21,   /* should not be directory */
> + NFS4ERR_INVAL          = 22,   /* invalid argument        */
> + NFS4ERR_FBIG           = 27,   /* file exceeds server max */
> + NFS4ERR_NOSPC          = 28,   /* no space on filesystem  */
> + NFS4ERR_ROFS           = 30,   /* read-only filesystem    */
> + NFS4ERR_MLINK          = 31,   /* too many hard links     */
> + NFS4ERR_NAMETOOLONG    = 63,   /* name exceeds server max */
> + NFS4ERR_NOTEMPTY       = 66,   /* directory not empty     */
> + NFS4ERR_DQUOT          = 69,   /* hard quota limit reached*/
> + NFS4ERR_STALE          = 70,   /* file no longer exists   */
> + NFS4ERR_BADHANDLE      = 10001,/* Illegal filehandle      */
> + NFS4ERR_BAD_COOKIE     = 10003,/* READDIR cookie is stale */
> + NFS4ERR_NOTSUPP        = 10004,/* operation not supported */
> + NFS4ERR_TOOSMALL       = 10005,/* response limit exceeded */
> + NFS4ERR_SERVERFAULT    = 10006,/* undefined server error  */
> + NFS4ERR_BADTYPE        = 10007,/* type invalid for CREATE */
> + NFS4ERR_DELAY          = 10008,/* file "busy" - retry     */
> + NFS4ERR_SAME           = 10009,/* nverify says attrs same */
> + NFS4ERR_DENIED         = 10010,/* lock unavailable        */
> + NFS4ERR_EXPIRED        = 10011,/* lock lease expired      */
> + NFS4ERR_LOCKED         = 10012,/* I/O failed due to lock  */
> + NFS4ERR_GRACE          = 10013,/* in grace period         */
> + NFS4ERR_FHEXPIRED      = 10014,/* filehandle expired      */
> + NFS4ERR_SHARE_DENIED   = 10015,/* share reserve denied    */
> + NFS4ERR_WRONGSEC       = 10016,/* wrong security flavor   */
> + NFS4ERR_CLID_INUSE     = 10017,/* clientid in use         */
> +
> + /* NFS4ERR_RESOURCE is not a valid error in NFSv4.1 */
> + NFS4ERR_RESOURCE       = 10018,/* resource exhaustion     */
> +
> + NFS4ERR_MOVED          = 10019,/* filesystem relocated    */
> + NFS4ERR_NOFILEHANDLE   = 10020,/* current FH is not set   */
> + NFS4ERR_MINOR_VERS_MISMATCH= 10021,/* minor vers not supp */
> + NFS4ERR_STALE_CLIENTID = 10022,/* server has rebooted     */
> + NFS4ERR_STALE_STATEID  = 10023,/* server has rebooted     */
> + NFS4ERR_OLD_STATEID    = 10024,/* state is out of sync    */
> + NFS4ERR_BAD_STATEID    = 10025,/* incorrect stateid       */
> + NFS4ERR_BAD_SEQID      = 10026,/* request is out of seq.  */
> + NFS4ERR_NOT_SAME       = 10027,/* verify - attrs not same */
> + NFS4ERR_LOCK_RANGE     = 10028,/* overlapping lock range  */
> + NFS4ERR_SYMLINK        = 10029,/* should be file/directory*/
> + NFS4ERR_RESTOREFH      = 10030,/* no saved filehandle     */
> + NFS4ERR_LEASE_MOVED    = 10031,/* some filesystem moved   */
> + NFS4ERR_ATTRNOTSUPP    = 10032,/* recommended attr not sup*/
> + NFS4ERR_NO_GRACE       = 10033,/* reclaim outside of grace*/
> + NFS4ERR_RECLAIM_BAD    = 10034,/* reclaim error at server */
> + NFS4ERR_RECLAIM_CONFLICT= 10035,/* conflict on reclaim    */
> + NFS4ERR_BADXDR         = 10036,/* XDR decode failed       */
> + NFS4ERR_LOCKS_HELD     = 10037,/* file locks held at CLOSE*/
> + NFS4ERR_OPENMODE       = 10038,/* conflict in OPEN and I/O*/
> + NFS4ERR_BADOWNER       = 10039,/* owner translation bad   */
> + NFS4ERR_BADCHAR        = 10040,/* utf-8 char not supported*/
> + NFS4ERR_BADNAME        = 10041,/* name not supported      */
> + NFS4ERR_BAD_RANGE      = 10042,/* lock range not supported*/
> + NFS4ERR_LOCK_NOTSUPP   = 10043,/* no atomic up/downgrade  */
> + NFS4ERR_OP_ILLEGAL     = 10044,/* undefined operation     */
> + NFS4ERR_DEADLOCK       = 10045,/* file locking deadlock   */
> + NFS4ERR_FILE_OPEN      = 10046,/* open file blocks op.    */
> + NFS4ERR_ADMIN_REVOKED  = 10047,/* lockowner state revoked */
> + NFS4ERR_CB_PATH_DOWN   = 10048,/* callback path down      */
> +
> + /* NFSv4.1 errors start here. */
> +
> + NFS4ERR_BADIOMODE      = 10049,
> + NFS4ERR_BADLAYOUT      = 10050,
> + NFS4ERR_BAD_SESSION_DIGEST = 10051,
> + NFS4ERR_BADSESSION     = 10052,
> + NFS4ERR_BADSLOT        = 10053,
> + NFS4ERR_COMPLETE_ALREADY = 10054,
> + NFS4ERR_CONN_NOT_BOUND_TO_SESSION = 10055,
> + NFS4ERR_DELEG_ALREADY_WANTED = 10056,
> + NFS4ERR_BACK_CHAN_BUSY = 10057,/*backchan reqs outstanding*/
> + NFS4ERR_LAYOUTTRYLATER = 10058,
> + NFS4ERR_LAYOUTUNAVAILABLE = 10059,
> + NFS4ERR_NOMATCHING_LAYOUT = 10060,
> + NFS4ERR_RECALLCONFLICT = 10061,
> + NFS4ERR_UNKNOWN_LAYOUTTYPE = 10062,
> + NFS4ERR_SEQ_MISORDERED = 10063,/* unexpected seq.ID in req*/
> + NFS4ERR_SEQUENCE_POS   = 10064,/* [CB_]SEQ. op not 1st op */
> + NFS4ERR_REQ_TOO_BIG    = 10065,/* request too big         */
> + NFS4ERR_REP_TOO_BIG    = 10066,/* reply too big           */
> + NFS4ERR_REP_TOO_BIG_TO_CACHE =10067,/* rep. not all cached*/
> + NFS4ERR_RETRY_UNCACHED_REP =10068,/* retry & rep. uncached*/
> + NFS4ERR_UNSAFE_COMPOUND =10069,/* retry/recovery too hard */
> + NFS4ERR_TOO_MANY_OPS   = 10070,/*too many ops in [CB_]COMP*/
> + NFS4ERR_OP_NOT_IN_SESSION =10071,/* op needs [CB_]SEQ. op */
> + NFS4ERR_HASH_ALG_UNSUPP = 10072, /* hash alg. not supp.   */
> +                                /* Error 10073 is unused.  */
> + NFS4ERR_CLIENTID_BUSY  = 10074,/* clientid has state      */
> + NFS4ERR_PNFS_IO_HOLE   = 10075,/* IO to _SPARSE file hole */
> + NFS4ERR_SEQ_FALSE_RETRY= 10076,/* Retry != original req.  */
> + NFS4ERR_BAD_HIGH_SLOT  = 10077,/* req has bad highest_slot*/
> + NFS4ERR_DEADSESSION    = 10078,/*new req sent to dead sess*/
> + NFS4ERR_ENCR_ALG_UNSUPP= 10079,/* encr alg. not supp.     */
> + NFS4ERR_PNFS_NO_LAYOUT = 10080,/* I/O without a layout    */
> + NFS4ERR_NOT_ONLY_OP    = 10081,/* addl ops not allowed    */
> + NFS4ERR_WRONG_CRED     = 10082,/* op done by wrong cred   */
> + NFS4ERR_WRONG_TYPE     = 10083,/* op on wrong type object */
> + NFS4ERR_DIRDELEG_UNAVAIL=10084,/* delegation not avail.   */
> + NFS4ERR_REJECT_DELEG   = 10085,/* cb rejected delegation  */
> + NFS4ERR_RETURNCONFLICT = 10086,/* layout get before return*/
> + NFS4ERR_DELEG_REVOKED  = 10087, /* deleg./layout revoked   */
> + NFS4ERR_PARTNER_NOTSUPP = 10088,
> + NFS4ERR_PARTNER_NO_AUTH = 10089,
> + NFS4ERR_UNION_NOTSUPP = 10090,
> + NFS4ERR_OFFLOAD_DENIED = 10091,
> + NFS4ERR_WRONG_LFS = 10092,
> + NFS4ERR_BADLABEL = 10093,
> + NFS4ERR_OFFLOAD_NO_REQS = 10094,
> + NFS4ERR_NOXATTR = 10095,
> + NFS4ERR_XATTR2BIG = 10096,
> +
> + /* always set this to one more than the last one in the enum */
> + NFS4ERR_FIRST_FREE = 10097
> +};

This value can be leaked onto the wire. The basic enum encoder
checks that these values are part of the .x before sticking
them on the wire.

Please keep the .x document aligned with the specification. If
you need a "maximum value" symbolic constant, please define it
in one of the hand-rolled headers. (I guess this one was copied
over from the existing hand-rolled NFS4ERR definitions).

I see that NFS4ERR_FIRST_FREE is used to determine the numeric
value for NFSERR_EOF.

fs/nfsd/nfs3xdr.c:              if (xdr_stream_encode_bool(xdr, resp->common.err == nfserr_eof) < 0)
fs/nfsd/nfs4xdr.c:      return nfsd4_encode_bool(xdr, readdir->common.err == nfserr_eof);
fs/nfsd/nfsd.h: __be32                  err;    /* 0, nfserr, or nfserr_eof */
fs/nfsd/nfsd.h:#define  nfserr_eof              cpu_to_be32(NFSERR_EOF)
fs/nfsd/nfsxdr.c:               if (xdr_stream_encode_bool(xdr, resp->common.err == nfserr_eof) < 0)
fs/nfsd/vfs.c:          cdp->err = nfserr_eof; /* will be cleared on successful read */
fs/nfsd/vfs.c:  if (err == nfserr_eof || err == nfserr_toosmall)

A better interim approach might be to select an impossible value
for NFSERR_EOF, as is done for the internal NLM error status codes:

fs/lockd/lockd.h:#define nlm__int__drop_reply   cpu_to_be32(30000)
fs/lockd/lockd.h:#define nlm__int__deadlock     cpu_to_be32(30001)
fs/lockd/lockd.h:#define nlm__int__stale_fh     cpu_to_be32(30002)
fs/lockd/lockd.h:#define nlm__int__failed       cpu_to_be32(30003)


> @@ -245,3 +406,88 @@ const FATTR4_ACL_TRUEFORM	= 89;
>  const FATTR4_ACL_TRUEFORM_SCOPE	= 90;
>  const FATTR4_POSIX_DEFAULT_ACL	= 91;
>  const FATTR4_POSIX_ACCESS_ACL	= 92;
> +
> +/*
> + * Directory notification types.
> + */
> +enum notify_type4 {
> +        NOTIFY4_CHANGE_CHILD_ATTRS = 0,
> +        NOTIFY4_CHANGE_DIR_ATTRS = 1,
> +        NOTIFY4_REMOVE_ENTRY = 2,
> +        NOTIFY4_ADD_ENTRY = 3,
> +        NOTIFY4_RENAME_ENTRY = 4,
> +        NOTIFY4_CHANGE_COOKIE_VERIFIER = 5
> +};
> +
> +/* Changed entry information.  */
> +struct notify_entry4 {
> +        component4      ne_file;
> +        fattr4          ne_attrs;
> +};
> +
> +/* Previous entry information */
> +struct prev_entry4 {
> +        notify_entry4   pe_prev_entry;
> +        /* what READDIR returned for this entry */
> +        nfs_cookie4     pe_prev_entry_cookie;
> +};
> +
> +struct notify_remove4 {
> +        notify_entry4   nrm_old_entry;
> +        nfs_cookie4     nrm_old_entry_cookie;
> +};
> +pragma public notify_remove4;
> +
> +struct notify_add4 {
> +        /*
> +         * Information on object
> +         * possibly renamed over.
> +         */
> +        notify_remove4      nad_old_entry<1>;
> +        notify_entry4       nad_new_entry;
> +        /* what READDIR would have returned for this entry */
> +        nfs_cookie4         nad_new_entry_cookie<1>;
> +        prev_entry4         nad_prev_entry<1>;
> +        bool                nad_last_entry;
> +};
> +pragma public notify_add4;
> +
> +struct notify_attr4 {
> +        notify_entry4   na_changed_entry;
> +};
> +pragma public notify_attr4;
> +
> +struct notify_rename4 {
> +        notify_remove4  nrn_old_entry;
> +        notify_add4     nrn_new_entry;
> +};
> +pragma public notify_rename4;
> +
> +struct notify_verifier4 {
> +        verifier4       nv_old_cookieverf;
> +        verifier4       nv_new_cookieverf;
> +};
> +
> +/*
> + * Objects of type notify_<>4 and
> + * notify_device_<>4 are encoded in this.
> + */
> +typedef opaque notifylist4<>;
> +
> +struct notify4 {
> +        /* composed from notify_type4 or notify_deviceid_type4 */
> +        bitmap4         notify_mask;
> +        notifylist4     notify_vals;
> +};
> +
> +struct CB_NOTIFY4args {
> +        stateid4    cna_stateid;
> +        nfs_fh4     cna_fh;
> +        notify4     cna_changes<>;
> +};
> +pragma public CB_NOTIFY4args;
> +
> +struct CB_NOTIFY4res {
> +        nfsstat4    cnr_status;
> +};
> +pragma public CB_NOTIFY4res;

Let's add the "pragma public" directives in the patches where
they are first needed, instead of here. As subsequent patches
are modified, the need for these directives might vanish.


-- 
Chuck Lever

^ permalink raw reply

* Re: [PATCH v4 09/16] riscv: Add Zic64b to cpufeature and hwprobe
From: Andrew Jones @ 2026-06-11 20:50 UTC (permalink / raw)
  To: Guodong Xu
  Cc: Jonathan Corbet, Shuah Khan, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexandre Ghiti, Zong Li, Deepak Gupta, Anup Patel,
	Atish Patra, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Yixun Lan, Chen Wang, Inochi Amaoto, linux-doc, linux-riscv,
	linux-kernel, kvm, kvm-riscv, Paul Walmsley, Conor Dooley,
	devicetree, spacemit, sophgo, linux-kselftest, Palmer Dabbelt,
	Qingwei Hu
In-Reply-To: <20260611-rva23u64-hwprobe-v2-v4-9-3f01a2449488@gmail.com>

On Thu, Jun 11, 2026 at 04:12:46PM -0400, Guodong Xu wrote:
> From: Qingwei Hu <qingwei.hu@bytedance.com>
> 
> Zic64b mandates 64-byte naturally aligned cache blocks and is a
> mandatory extension of the RVA22 and RVA23 profiles.  Allocate a
> RISCV_ISA_EXT_ZIC64B id, parse "zic64b" from the ISA string with a
> validate callback that requires each cbom/cbop/cboz cache block size to
> be 64 bytes when it is present, and export it through hwprobe.
> 
> Link: https://lists.riscv.org/g/tech-unprivileged/topic/question_about_zic64b_and/119631059
> Signed-off-by: Qingwei Hu <qingwei.hu@bytedance.com>
> Co-developed-by: Guodong Xu <docular.xu@gmail.com>
> Signed-off-by: Guodong Xu <docular.xu@gmail.com>
> ---
> v4:
> - Credit Qingwei Hu's earlier Zic64b cpufeature patch: set him as
>   author, with Co-developed-by (Guodong Xu).
> - Validate only the cbom/cbop/cboz block sizes that are present; Zic64b
>   does not imply the CMO extensions (Conor, Qingwei, Greg).
> - Add a Link: to Greg's confirmation on the tech-unprivileged list.
> - Add the missing blank line before the ZIC64B hwprobe.rst entry
>   (Andrew).
> - Did not carry Andrew Jones's v3 Reviewed-by: the validation was
>   rewritten (present block sizes only) and the patch is now authored by
>   Qingwei, so it warrants a fresh review.
> v3: New patch.
> ---
>  Documentation/arch/riscv/hwprobe.rst  |  4 ++++
>  arch/riscv/include/asm/hwcap.h        |  1 +
>  arch/riscv/include/uapi/asm/hwprobe.h |  1 +
>  arch/riscv/kernel/cpufeature.c        | 19 +++++++++++++++++++
>  arch/riscv/kernel/sys_hwprobe.c       |  1 +
>  5 files changed, 26 insertions(+)
>

Reviewed-by: Andrew Jones <andrew.jones@oss.qualcomm.com>

^ permalink raw reply

* Re: [PATCH v5 0/4] PCI: Add DOE support for endpoint
From: Frank Li @ 2026-06-11 20:47 UTC (permalink / raw)
  To: Aksh Garg
  Cc: linux-pci, linux-doc, mani, kwilczynski, bhelgaas, corbet, kishon,
	skhan, lukas, cassel, alistair, linux-arm-kernel, linux-kernel,
	s-vadapalli, danishanwar, srk
In-Reply-To: <20260610100256.1889111-1-a-garg7@ti.com>

On Wed, Jun 10, 2026 at 03:32:52PM +0530, Aksh Garg wrote:
> This patch series introduces the framework for supporting the Data
> Object Exchange (DOE) feature for PCIe endpoint devices. Please refer
> to the documentation added in patch 4 for details on the feature and
> implementation architecture.
>
> The implementation provides a common framework for all PCIe endpoint
> controllers, not specific to any particular SoC vendor.
>

General question, does DOE generate irq when received msg for HOST? I have
not related irq handle code.

Any program to test it? such as pci_endpoint_test, need at least one real
user to use it.

Frank

> The changes since v1 are documented in the respective patch descriptions.
>
> v4: https://lore.kernel.org/all/20260522052434.802034-1-a-garg7@ti.com/
> v3: https://lore.kernel.org/all/20260427051725.223704-1-a-garg7@ti.com/
> v2: https://lore.kernel.org/all/20260401073022.215805-1-a-garg7@ti.com/
> v1 (RFC): https://lore.kernel.org/all/20260213123603.420941-1-a-garg7@ti.com/
>
> Below is a code demonstration showing the integration of DOE-EP APIs with
> EPC drivers.
>
> Note: The provided code is just to show how an EPC driver is expected to
>       utilize the pci_ep_doe_process_request() and pci_ep_doe_abort() APIs,
>       and might not cover all the corner cases. The below implementation
>       also expects the EPC hardware to have some memory buffer to store the
>       data from(for) write_mailbox(read_mailbox) DOE capability registers.
>
> ============================================================================
>
> /* ========== DOE Completion Callback (invoked by DOE-EP core) ========== */
>
> static void doe_completion_cb(struct pci_epc *epc, u8 func_no, u16 cap_offset,
> 			       int status, u16 vendor, u8 type,
> 			       void *response_pl, size_t response_pl_sz)
> {
> 	struct epc_driver *drv = epc_get_drvdata(epc);
> 	u32 *response = (u32 *)response_pl;
> 	u32 header1, header2;
> 	int payload_dw, i;
>
> 	if (readl(drv->base + PF_DOE_CTRL_REG(func_no, cap_offset)) & DOE_CTRL_ABORT) {
> 		/* Aborted: do not send response */
> 		goto free;
> 	}
>
> 	if (status < 0) {
> 		/* Error: set ERROR bit in DOE Status register */
> 		writel(1 << DOE_STATUS_ERROR,
> 		       drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
> 		goto free;
> 	}
>
> 	/* Success: write DOE headers first, then response to the read memory */
>
> 	/* Header 1: Vendor ID (bits 15:0) | Type (bits 23:16) */
> 	header1 = (type << 16) | vendor;
> 	writel(header1, drv->base + PF_DOE_RD_MEMORY_WR_REG(func_no, cap_offset));
>
> 	/* Header 2: Length in DW (including 2 DW of headers + payload) */
> 	payload_dw = DIV_ROUND_UP(response_pl_sz, sizeof(u32));
> 	header2 = 2 + payload_dw;  /* 2 header DWs + payload */
> 	writel(header2, drv->base + PF_DOE_RD_MEMORY_WR_REG(func_no, cap_offset));
>
> 	/* Set READY bit to signal response ready */
> 	writel(1 << DOE_STATUS_READY,
> 	       drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
>
> 	/* Write response payload DWORDs to Read memory */
> 	for (i = 0; i < payload_dw; i++)
> 		writel(response[i],
> 		       drv->base + PF_DOE_RD_MEMORY_WR_REG(func_no, cap_offset));
>
> 	/* Wait for the memory to empty before clearing the READY bit */
> 	while (!RD_MEMORY_EMPTY()) {/* wait */}
>
> 	writel(0 << DOE_STATUS_READY,
> 	       drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
>
> free:
> 	/* unset BUSY bit */
> 	writel(0 << DOE_STATUS_BUSY,
> 	       drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
>
> 	kfree(response_pl);
> }
>
> /* ========== DOE Interrupt Handler (triggered on GO bit from root complex) ========== */
>
> static irqreturn_t doe_interrupt_handler(int irq, void *priv)
> {
> 	struct epc_driver *drv = priv;
> 	u16 cap_offset = extract_cap_offset_from_irq(irq);
> 	u8 func_no = extract_func_from_irq(irq);
> 	u32 header1, header2, length_dw, *request;
> 	u16 vendor;
> 	u8 type;
> 	int i, ret;
>
> 	/* Read first header DWORD: Vendor ID (bits 15:0) | Type (bits 23:16) */
> 	header1 = readl(drv->base + PF_DOE_WR_MEMORY_RD_REG(func_no, cap_offset));
> 	vendor = header1 & 0xFFFF;
> 	type = (header1 >> 16) & 0xFF;
>
> 	/* Read second header DWORD: Length in DW (includes 2 DW of headers) */
> 	header2 = readl(drv->base + PF_DOE_WR_MEMORY_RD_REG(func_no, cap_offset));
> 	length_dw = header2 & 0x3FFFF;  /* Bits 17:0 */
>
> 	if (!length_dw)
> 		length_dw = PCI_DOE_MAX_LENGTH;
>
> 	length_dw -= 2;  /* Subtract 2 DW of headers to get payload length */
> 	/* Allocate buffer for complete request (headers + payload) */
> 	request = kzalloc(length_dw * sizeof(u32), GFP_ATOMIC);
> 	if (!request) {
> 		writel(1 << DOE_STATUS_ERROR,
> 		       drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
> 		return IRQ_HANDLED;
> 	}
>
> 	/* Read remaining payload DWORDs from Write memory */
> 	for (i = 0; i < length_dw; i++) {
> 		while (WR_MEMORY_EMPTY()) { /* wait */ }
> 		request[i] = readl(drv->base + PF_DOE_WR_MEMORY_RD_REG(func_no, cap_offset));
> 	}
>
> 	mutex_lock(&lock);
> 	/* Check the ABORT bit, if set then return */
> 	if (readl(drv->base + PF_DOE_CTRL_REG(func_no, cap_offset)) & DOE_CTRL_ABORT) {
> 		kfree(request);
> 		mutex_unlock(&lock);
> 		return IRQ_HANDLED;
> 	}
>
> 	/* Set BUSY bit */
> 	writel(1 << DOE_STATUS_BUSY,
> 	       drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
> 	mutex_unlock(&lock);
>
> 	/* Hand off to DOE-EP core for asynchronous processing */
> 	ret = pci_ep_doe_process_request(drv->epc, func_no, cap_offset,
> 					 vendor, type, (void *)request,
> 					 length_dw * sizeof(u32),
> 					 doe_completion_cb);
> 	if (ret) {
> 		writel(1 << DOE_STATUS_ERROR,
> 		       drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
> 		kfree(request);
> 	}
>
> 	return IRQ_HANDLED;
> }
>
> /* ========== Abort Handler (triggered on ABORT bit from root complex) ========== */
>
> static irqreturn_t doe_abort_handler(int irq, void *priv)
> {
> 	struct epc_driver *drv = priv;
> 	u16 cap_offset = extract_cap_offset_from_irq(irq);
> 	u8 func_no = extract_func_from_irq(irq);
>
> 	mutex_lock(&lock);
>
> 	/* call abort API only if BUSY bit set (pci_ep_doe_process_request() called) */
> 	if (readl(drv->base + PF_DOE_STATUS_REG(func_no, cap_offset)) & DOE_STATUS_BUSY)
> 		pci_ep_doe_abort(drv->epc, func_no, cap_offset);
>
> 	mutex_unlock(&lock);
>
> 	/* Discard Write memory contents */
> 	writel(DOE_WR_MEMORY_CTRL_DISCARD,
> 	       drv->base + PF_DOE_WR_MEMORY_CTRL_REG(func_no, cap_offset));
>
> 	/* Clear status bits */
> 	writel((0 << DOE_STATUS_ERROR) | (0 << DOE_STATUS_READY),
> 	       drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
>
> 	return IRQ_HANDLED;
> }
>
> ====================================================================================
>
> Aksh Garg (4):
>   PCI/DOE: Move common definitions to the header file
>   PCI: endpoint: Add DOE mailbox support for endpoint functions
>   PCI: endpoint: Add support for DOE initialization and setup in EPC
>     core
>   Documentation: PCI: Add documentation for DOE endpoint support
>
>  Documentation/PCI/endpoint/index.rst          |   1 +
>  .../PCI/endpoint/pci-endpoint-doe.rst         | 333 ++++++++++
>  drivers/pci/doe.c                             |  11 -
>  drivers/pci/endpoint/Kconfig                  |  14 +
>  drivers/pci/endpoint/Makefile                 |   1 +
>  drivers/pci/endpoint/pci-ep-doe.c             | 594 ++++++++++++++++++
>  drivers/pci/endpoint/pci-epc-core.c           | 104 +++
>  drivers/pci/pci.h                             |  48 ++
>  include/linux/pci-doe.h                       |   8 +
>  include/linux/pci-epc.h                       |   9 +
>  10 files changed, 1112 insertions(+), 11 deletions(-)
>  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-doe.rst
>  create mode 100644 drivers/pci/endpoint/pci-ep-doe.c
>
> --
> 2.34.1
>

^ permalink raw reply

* Re: [PATCH v5 1/4] PCI/DOE: Move common definitions to the header file
From: Frank Li @ 2026-06-11 20:36 UTC (permalink / raw)
  To: Aksh Garg
  Cc: linux-pci, linux-doc, mani, kwilczynski, bhelgaas, corbet, kishon,
	skhan, lukas, cassel, alistair, linux-arm-kernel, linux-kernel,
	s-vadapalli, danishanwar, srk
In-Reply-To: <20260610100256.1889111-2-a-garg7@ti.com>

On Wed, Jun 10, 2026 at 03:32:53PM +0530, Aksh Garg wrote:
> Move common macros and structures from drivers/pci/doe.c to
> drivers/pci/pci.h to allow reuse across root complex and
> endpoint DOE implementations.
>
> PCI_DOE_MAX_LENGTH macro can be used outside the PCI core as well,
> hence move the macro to include/linux/pci-doe.h.
>
> These changes prepare the groundwork for the DOE endpoint implementation
> that will reuse these common definitions.
>
> Co-developed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
> Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
> Signed-off-by: Aksh Garg <a-garg7@ti.com>
> ---

Reviewed-by: Frank Li <Frank.Li@nxp.com>

>
> Changes from v4 to v5:
> - None.
>
> Changes from v3 to v4:
> - None.
>
> Changes from v2 to v3:
> - Rebased on 7.1-rc1.
>
> Changes since v1:
> - Moved the common macros that need not be visible outside the PCI core
>   to drivers/pci/pci.h instead to include/linux/pci-doe.h as suggested
>   by Lukas Wunner
> - Removed the redundant empty inlines guarded with CONFIG_PCI_DOE in
>   include/linux/pci-doe.h.
>
> v4: https://lore.kernel.org/all/20260522052434.802034-2-a-garg7@ti.com/
> v3: https://lore.kernel.org/all/20260427051725.223704-2-a-garg7@ti.com/
> v2: https://lore.kernel.org/all/20260401073022.215805-2-a-garg7@ti.com/
> v1: https://lore.kernel.org/all/20260213123603.420941-3-a-garg7@ti.com/
>
>  drivers/pci/doe.c       | 11 -----------
>  drivers/pci/pci.h       |  9 +++++++++
>  include/linux/pci-doe.h |  3 +++
>  3 files changed, 12 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
> index 7b41da4ec11a..e8d9e95644b3 100644
> --- a/drivers/pci/doe.c
> +++ b/drivers/pci/doe.c
> @@ -28,12 +28,6 @@
>  #define PCI_DOE_TIMEOUT HZ
>  #define PCI_DOE_POLL_INTERVAL	(PCI_DOE_TIMEOUT / 128)
>
> -#define PCI_DOE_FLAG_CANCEL	0
> -#define PCI_DOE_FLAG_DEAD	1
> -
> -/* Max data object length is 2^18 dwords */
> -#define PCI_DOE_MAX_LENGTH	(1 << 18)
> -
>  /**
>   * struct pci_doe_mb - State for a single DOE mailbox
>   *
> @@ -63,11 +57,6 @@ struct pci_doe_mb {
>  #endif
>  };
>
> -struct pci_doe_feature {
> -	u16 vid;
> -	u8 type;
> -};
> -
>  /**
>   * struct pci_doe_task - represents a single query/response
>   *
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 4a14f88e543a..5844deee2b5f 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -683,6 +683,15 @@ struct pci_sriov {
>  	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
>  };
>
> +/* DOE Mailbox state flags */
> +#define PCI_DOE_FLAG_CANCEL	0
> +#define PCI_DOE_FLAG_DEAD	1
> +
> +struct pci_doe_feature {
> +	u16 vid;
> +	u8 type;
> +};
> +
>  #ifdef CONFIG_PCI_DOE
>  void pci_doe_init(struct pci_dev *pdev);
>  void pci_doe_destroy(struct pci_dev *pdev);
> diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
> index bd4346a7c4e7..abb9b7ae8029 100644
> --- a/include/linux/pci-doe.h
> +++ b/include/linux/pci-doe.h
> @@ -19,6 +19,9 @@ struct pci_doe_mb;
>  #define PCI_DOE_FEATURE_CMA 1
>  #define PCI_DOE_FEATURE_SSESSION 2
>
> +/* Max data object length is 2^18 dwords */
> +#define PCI_DOE_MAX_LENGTH		(1 << 18)
> +
>  struct pci_doe_mb *pci_find_doe_mailbox(struct pci_dev *pdev, u16 vendor,
>  					u8 type);
>
> --
> 2.34.1
>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox