Linux Documentation


* Re: [PATCH v4 1/4] KVM: PPC: Introduce KVM_CAP_PPC_COMPAT_CAPS and wire up ioctl
From: Amit Machhiwal @ 2026-06-23 11:11 UTC (permalink / raw)
  To: Vaibhav Jain
  Cc: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan,
	Anushree Mathur, Paolo Bonzini, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP), Jonathan Corbet, Shuah Khan, kvm,
	linux-kernel, linux-doc, lkp
In-Reply-To: <871pe3cazk.fsf@vajain21.in.ibm.com>

Hi Vaibhav,

Thanks for the detailed review. My responses are inline below.

On 2026/06/19 11:44 AM, Vaibhav Jain wrote:
> Hi Amit.
> 
> Thanks for the patch and incorporating V3 review comments. Further
> review comments inline below:
> 
> Amit Machhiwal <amachhiw@linux.ibm.com> writes:
> 
> > Introduce a new capability and ioctl to expose CPU compatibility modes
> > supported by the host processor for nested guests.
> >
> > On IBM POWER systems, newer processor generations (N) can operate in
> > compatibility modes corresponding to earlier generations, like (N-1) and
> > (N-2). This is particularly relevant for nested virtualization, where
> > nested KVM guests may need to run with a specific processor compatibility
> > level.
> >
> > Introduce KVM_CAP_PPC_COMPAT_CAPS capability and the corresponding
> > KVM_PPC_GET_COMPAT_CAPS vm ioctl. The ioctl returns a bitmap describing
> > the compatibility modes supported by the host in respective bit numbers,
> > allowing userspace (e.g., QEMU) to select an appropriate compatibility
> > level when configuring nested KVM guests.
> >
> > The ioctl handling is added in kvm_arch_vm_ioctl() and retrieves host
> > CPU compatibility capabilities via a PowerPC-specific backend
> > implementation when available. The implementation validates the structure
> > size from userspace to ensure forward compatibility and returns
> > appropriate error codes (EINVAL for invalid size, EFAULT for copy
> > failures, ENOTTY if backend is not implemented). The struct
> > kvm_ppc_compat_caps includes a size field to support future ABI
> > extensions.
> >
> > Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> > Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> > ---
> >  arch/powerpc/include/asm/kvm_ppc.h  |  1 +
> >  arch/powerpc/include/uapi/asm/kvm.h |  7 ++++++
> >  arch/powerpc/kvm/powerpc.c          | 35 +++++++++++++++++++++++++++++
> >  include/uapi/linux/kvm.h            |  4 ++++
> >  4 files changed, 47 insertions(+)
> >
> > diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> > index 0953f2daa466..169ea6a7fbad 100644
> > --- a/arch/powerpc/include/asm/kvm_ppc.h
> > +++ b/arch/powerpc/include/asm/kvm_ppc.h
> > @@ -319,6 +319,7 @@ struct kvmppc_ops {
> >  	bool (*hash_v3_possible)(void);
> >  	int (*create_vm_debugfs)(struct kvm *kvm);
> >  	int (*create_vcpu_debugfs)(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry);
> > +	int (*get_compat_caps)(struct kvm_ppc_compat_caps *host_caps);
> >  };
> >  
> >  extern struct kvmppc_ops *kvmppc_hv_ops;
> > diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> > index 077c5437f521..8a38be6c3b03 100644
> > --- a/arch/powerpc/include/uapi/asm/kvm.h
> > +++ b/arch/powerpc/include/uapi/asm/kvm.h
> > @@ -437,6 +437,13 @@ struct kvm_ppc_cpu_char {
> >  	__u64	behaviour_mask;		/* valid bits in behaviour */
> >  };
> >  
> > +/* For KVM_PPC_GET_COMPAT_CAPS */
> > +struct kvm_ppc_compat_caps {
> > +	__u64	flags;			/* Reserved for future use */
> > +	__u64	size;			/* Size of this structure */
> Suggest moving 'size' to be the first member of the struct. That way
> copying the struct from userspace becomes a bit easier.

Yeah, I think it would make more sense and will simplify the
copy_from_user() call. I will make the change in v5. I will change to:

  struct kvm_ppc_compat_caps {
  	__u64	size;
  	__u64	flags;
  	__u64	compat_capabilities;
  };

> 
> > +	__u64	compat_capabilities;	/* Capabilities supported by the host */
> > +};
> > +
> >  /*
> >   * Values for character and character_mask.
> >   * These are identical to the values used by H_GET_CPU_CHARACTERISTICS.
> > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> > index 98de68379b18..9153b0034b45 100644
> > --- a/arch/powerpc/kvm/powerpc.c
> > +++ b/arch/powerpc/kvm/powerpc.c
> > @@ -701,6 +701,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >  			}
> >  		}
> >  		break;
> > +#if defined(CONFIG_KVM_BOOK3S_HV_POSSIBLE)
> > +	case KVM_CAP_PPC_COMPAT_CAPS:
> > +		r = 0;
> > +		if (kvmhv_on_pseries())
> > +			r = 1;
> > +		break;
> > +#endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
> >  	default:
> >  		r = 0;
> >  		break;
> > @@ -2467,6 +2474,34 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> >  		r = kvm->arch.kvm_ops->svm_off(kvm);
> >  		break;
> >  	}
> > +	case KVM_PPC_GET_COMPAT_CAPS: {
> > +		struct kvm_ppc_compat_caps host_caps;
> > +		u64 user_size;
> > +
> > +		r = -EFAULT;
> > +		/* First, get the size field from userspace to validate */
> > +		if (copy_from_user(&user_size, &((struct kvm_ppc_compat_caps
> > +		     __user *)argp)->size, sizeof(user_size))) {
> Move the struct size member to the first field. That way the
> copy_from_user() call is simplified and you won't have to do some weird
> pointer arithmetic.

Will do as mentioned above.

> 
> 
> > +			goto out;
> > +		}
> > +
> > +		/* Validate size - must be at least the current structure size */
> > +		r = -EINVAL;
> > +		if (user_size < sizeof(host_caps))
> > +			goto out;
> The check should be strengthened to
>  if (user_size != sizeof(host_caps))
> so that if userspace sends a struct larger than what the kernel knows
> about, it will be rejected. This will prevent surprises in the future if a
> VMM sends a larger struct expecting the kernel to know about it, but an
> older kernel only knows about the older, smaller struct. Also look at the
> review comment below.

Agreed. I'll change the validation to use strict equality. This is
simpler and clearer - userspace must provide exactly the size the kernel
expects.
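
For illustration only, the agreed behaviour (size as the first member,
strict equality on it) can be modelled in plain userspace C. This is a
sketch under stated assumptions, not the v5 kernel code: the names
compat_caps_model and model_get_compat_caps() are hypothetical, memcpy()
stands in for copy_from_user()/copy_to_user(), and the user_buf_len
argument exists only so the model can emulate a faulting copy.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace mirror of the planned v5 layout (size first). */
struct compat_caps_model {
	uint64_t size;
	uint64_t flags;
	uint64_t compat_capabilities;
};

static_assert(sizeof(struct compat_caps_model) == 24,
	      "three u64 members, no padding expected");

/*
 * Model of the check: read only the leading size field, then require an
 * exact match before filling in and copying back the whole structure.
 */
static int model_get_compat_caps(void *user_buf, size_t user_buf_len,
				 uint64_t host_bitmap)
{
	struct compat_caps_model caps;
	uint64_t user_size;

	/* Stands in for a failed copy_from_user() of the size field. */
	if (user_buf == NULL || user_buf_len < sizeof(user_size))
		return -EFAULT;
	memcpy(&user_size, user_buf, sizeof(user_size));

	/* Strict equality: both smaller and larger structs are rejected. */
	if (user_size != sizeof(caps))
		return -EINVAL;

	/* Stands in for a failed copy_to_user() of the full struct. */
	if (user_buf_len < sizeof(caps))
		return -EFAULT;

	memset(&caps, 0, sizeof(caps));
	caps.size = sizeof(caps);
	caps.compat_capabilities = host_bitmap;
	memcpy(user_buf, &caps, sizeof(caps));
	return 0;
}
```

With size at offset 0, the first read needs no member-offset pointer
arithmetic, and a caller built against a larger future struct gets
-EINVAL instead of a partially filled buffer.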

> 
> > +
> > +		r = -ENOTTY;
> > +		memset(&host_caps, 0, sizeof(host_caps));
> > +		if (!kvm->arch.kvm_ops->get_compat_caps)
> > +			goto out;
> > +
> > +		r = kvm->arch.kvm_ops->get_compat_caps(&host_caps);
> > +		/* Set the actual size of the structure we're returning */
> > +		host_caps.size = sizeof(host_caps);
> > +		if (!r && copy_to_user(argp, &host_caps, sizeof(host_caps)))
> > +			r = -EFAULT;
> You are allowing a future userspace VMM to potentially send a larger
> 'struct kvm_ppc_compat_caps' than what the kernel knows about. This makes
> error handling in userspace a bit involved, since some fields in the
> 'struct kvm_ppc_compat_caps' given from userspace may remain
> uninitialized when userspace sees it. So please mention this subtle
> behaviour in the patch description and also update the doc in the later
> patch.

With the strict equality check (user_size != sizeof(host_caps)), this
concern should be addressed - we won't accept larger structs from
userspace. However, I'll still improve the documentation to:

1. In the commit message:
   - Explain the size field validation
   - Document that exact size match is required
   - Clarify error handling behavior

2. In Documentation/virt/kvm/api.rst:
   - Add improved documentation for KVM_PPC_GET_COMPAT_CAPS
   - Document the size field requirement and validation

Thanks,
Amit

> 
> > +		break;
> > +	}
> >  	default: {
> >  		struct kvm *kvm = filp->private_data;
> >  		r = kvm->arch.kvm_ops->arch_vm_ioctl(filp, ioctl, arg);
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 6c8afa2047bf..1788a0068662 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -996,6 +996,7 @@ struct kvm_enable_cap {
> >  #define KVM_CAP_S390_USER_OPEREXEC 246
> >  #define KVM_CAP_S390_KEYOP 247
> >  #define KVM_CAP_S390_VSIE_ESAMODE 248
> > +#define KVM_CAP_PPC_COMPAT_CAPS 249
> >  
> >  struct kvm_irq_routing_irqchip {
> >  	__u32 irqchip;
> > @@ -1349,6 +1350,9 @@ struct kvm_s390_keyop {
> >  #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
> >  #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
> >  
> > +/* Available with KVM_CAP_PPC_COMPAT_CAPS */
> > +#define KVM_PPC_GET_COMPAT_CAPS	_IOR(KVMIO,  0xe4, struct kvm_ppc_compat_caps)
> > +
> >  /*
> >   * ioctls for vcpu fds
> >   */
> > -- 
> > 2.50.1 (Apple Git-155)
> >
> >
> 
> -- 
> Cheers
> ~ Vaibhav

^ permalink raw reply

* Re: [PATCH 1/4] nfs: store the full NFS fileid in inode->i_ino
From: Jeff Layton @ 2026-06-23 11:04 UTC (permalink / raw)
  To: Mark Brown
  Cc: Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
	linux-nfs, linux-kernel, linux-doc
In-Reply-To: <655d0d2a5f8203c52c78d37462328449e49b7feb.camel@kernel.org>

On Mon, 2026-06-22 at 18:38 -0400, Jeff Layton wrote:
> On Mon, 2026-06-22 at 22:05 +0100, Mark Brown wrote:
> > On Tue, May 12, 2026 at 12:12:42PM -0400, Jeff Layton wrote:
> > > Now that inode->i_ino is a 64-bit value, store the full NFS fileid in
> > > it directly instead of an XOR-folded hash. This makes NFS_FILEID() and
> > > set_nfs_fileid() operate on inode->i_ino rather than the separate
> > > nfsi->fileid field.
> > 
> > This patch is in -next now and is triggering a failure in the LTP
> > ioctl10.c test for me on arm:
> > 
> > tst_buffers.c:57: TINFO: Test is using guarded buffers
> > tst_test.c:2047: TINFO: LTP version: 20260130
> > tst_test.c:2050: TINFO: Tested kernel: 7.1.0-next-20260622 #1 SMP @1782128788 armv7l
> > 
> > ...
> > 
> > ioctl10.c:111: TFAIL: q->inode (11493907226) != entry.vm_inode (4294967295)
> > 
> 
> Note that the vm_inode value is arm32's ULONG_MAX.
> 
> > arm64 seems unaffected, I didn't really investigate but I'll note that
> > unsigned long is 32 bit on arm.
> > 
> > Full log:
> > 
> >    https://lava.sirena.org.uk/scheduler/job/2904745#L3852
> > 
> > bisect log with more test job links:
> > 
> 
> 
> The testcase does this:
> 
> static void parse_maps_file(const char *filename, const char *keyword, struct map_entry *entry)
> {
>         FILE *fp = SAFE_FOPEN(filename, "r");
> 
>         char line[1024];
> 
>         while (fgets(line, sizeof(line), fp) != NULL) {
>                 if (fnmatch(keyword, line, 0) == 0) {
>                         if (sscanf(line, "%lx-%lx %s %lx %x:%x %lu %s",
>                                                 &entry->vm_start, &entry->vm_end, entry->vm_flags_str,
>                                                 &entry->vm_pgoff, &entry->vm_major, &entry->vm_minor,
>                                                 &entry->vm_inode, entry->vm_name) < 7)
>                                 tst_brk(TFAIL, "parse maps file /proc/self/maps failed");
> 
>                         entry->vm_flags = parse_vm_flags(entry->vm_flags_str);
> 
>                         SAFE_FCLOSE(fp);
>                         return;
>                 }
>         }
> 
>         SAFE_FCLOSE(fp);
>         tst_brk(TFAIL, "parse maps file /proc/self/maps failed");
> }
> 
> Note that it's trying to stuff the inode number field into an unsigned
> long. Before this patch, the maps file would have printed the old
> (hashed) inode number on 32-bit. Now, it prints the full 64-bit inode
> number.
> 
> I asked The Big Pickle and it says:
> 
> "In glibc (userspace): The C standard says this is undefined behavior.
> In practice, glibc's scanf internally uses strtoul/strtoull, which on
> overflow store ULONG_MAX/ULLONG_MAX and set errno = ERANGE. However,
> scanf itself does not propagate ERANGE to the caller — it still returns
> 1 (success). So you'd silently get ULONG_MAX stored."
> 
> We could argue that this is a bug in the testcase. It assumes that the
> maps file will never print a value larger than ULONG_MAX in that field,
> and I don't see why it would make that assumption in this day and age.
> 
> Are there actual programs in the field that scrape the maps file that
> might be affected by this change?

This testcase patch should fix it. I'll plan to send this to the LTP
list, but it would be nice if someone could confirm the fix on arm32:

-----------------------8<---------------------

[PATCH LTP] ioctl10: fix the sscanf() call to handle 64-bit inode on 32-bit arch

This test started failing recently on arm32, when we switched the
kernel to displaying the full 64-bit inode number in the maps file.
Change the testcase to allow for a full 64-bit inode number on all
arches. The value it's compared to is already 64 bits wide, so widening this
field is all that is necessary.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 testcases/kernel/syscalls/ioctl/ioctl10.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/testcases/kernel/syscalls/ioctl/ioctl10.c b/testcases/kernel/syscalls/ioctl/ioctl10.c
index b668c9e93889..d7e40f3c8643 100644
--- a/testcases/kernel/syscalls/ioctl/ioctl10.c
+++ b/testcases/kernel/syscalls/ioctl/ioctl10.c
@@ -35,7 +35,7 @@ struct map_entry {
 	unsigned long vm_pgoff;
 	unsigned int vm_major;
 	unsigned int vm_minor;
-	unsigned long vm_inode;
+	uint64_t vm_inode;
 	char vm_name[256];
 	unsigned int vm_flags;
 };
@@ -68,7 +68,7 @@ static void parse_maps_file(const char *filename, const char *keyword, struct ma
 
 	while (fgets(line, sizeof(line), fp) != NULL) {
 		if (fnmatch(keyword, line, 0) == 0) {
-			if (sscanf(line, "%lx-%lx %s %lx %x:%x %lu %s",
+			if (sscanf(line, "%lx-%lx %s %lx %x:%x %llu %s",
 						&entry->vm_start, &entry->vm_end, entry->vm_flags_str,
 						&entry->vm_pgoff, &entry->vm_major, &entry->vm_minor,
 						&entry->vm_inode, entry->vm_name) < 7)
-- 
2.54.0
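
A portability note on the conversion specifier, offered as a sketch and
not as part of the patch above: uint64_t is not guaranteed to be
unsigned long long (on 64-bit glibc targets it is typically unsigned
long), so pairing it with %llu can draw a -Wformat warning there. The
SCNu64 macro from <inttypes.h> always matches uint64_t. The struct and
function names below are hypothetical and only mirror the fields the
testcase parses; the field widths (%7s, %255s) are an added precaution,
not something the testcase does.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the fields read from one maps line. */
struct maps_fields {
	unsigned long start, end, pgoff;
	unsigned int major, minor;
	uint64_t inode;		/* wide enough for a full 64-bit fileid */
	char perms[8];
	char name[256];
};

/*
 * Parse one /proc/<pid>/maps-style line. Returns the number of fields
 * converted: 8 when a path is present, 7 for mappings without a name.
 */
static int parse_maps_line(const char *line, struct maps_fields *f)
{
	assert(line != NULL && f != NULL);

	f->name[0] = '\0';
	return sscanf(line, "%lx-%lx %7s %lx %x:%x %" SCNu64 " %255s",
		      &f->start, &f->end, f->perms, &f->pgoff,
		      &f->major, &f->minor, &f->inode, f->name);
}
```

Fed a line carrying the inode value from the failing run (11493907226),
the inode field holds the full value even where unsigned long is 32 bits,
because the destination type and the specifier are both 64-bit.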


^ permalink raw reply related

* Re: [PATCH v2 05/11] hugetlb: Convert the vmf->pgoff to PAGE_SIZE granularity
From: XIAO WU @ 2026-06-23 10:54 UTC (permalink / raw)
  To: Jane Chu, akpm
  Cc: willy, jack, viro, brauner, muchun.song, osalvador, david, hughd,
	baolin.wang, linmiaohe, nao.horiguchi, lorenzo, rppt, peterx,
	corbet, linux-doc, linux-mm, linux-kernel, linux-fsdevel
In-Reply-To: <20260617172534.1740152-6-jane.chu@oracle.com>

Hi Jane,

Thanks for this series — the conversion to PAGE-granularity indexing is a
nice cleanup.

I came across a Sashiko AI review of this patch series, which flagged
several issues, one of which I was able to confirm triggers a real kernel
crash:

https://sashiko.dev/#/patchset/20260617172534.1740152-1-jane.chu@oracle.com

 > +++ b/mm/hugetlb.c
 > @@ -5952,8 +5955,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 >          .address = address & huge_page_mask(h),
 >          .real_address = address,
 >          .flags = flags,
 > -        .pgoff = vma_hugecache_offset(h, vma,
 > -                address & huge_page_mask(h)),
 > +        .pgoff = linear_page_index(vma, address),

This change sets vmf.pgoff to linear_page_index(vma, address), but
`address` here is the raw unaligned fault address, not the huge-page-aligned
address.  Previously, vma_hugecache_offset() used
`address & huge_page_mask(h)`, which produced a huge-page-aligned index.

When a page fault occurs at a non-huge-page-aligned address within a hugetlb
mapping (e.g., vm_start + 0x1000 for a 2MB page), the resulting pgoff is not
a multiple of pages_per_huge_page (512 for 2MB).  This unaligned index
propagates through:

   hugetlb_fault() → hugetlb_no_page() → hugetlb_add_to_page_cache()
   → __filemap_add_folio()

where this assertion fires (mm/filemap.c:862):

   VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);

With CONFIG_DEBUG_VM=y, this becomes a BUG() and panics the kernel.

I was able to reproduce this in a QEMU VM.  The fix should be trivial:
pass the aligned address to linear_page_index().
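
The arithmetic can be checked in isolation with a small userspace model.
This is a sketch, assuming 4 KiB base pages and 2 MiB huge pages; the
model_* names are hypothetical, and model_linear_page_index() only
mirrors the computation linear_page_index() performs
(((address - vm_start) >> PAGE_SHIFT) + vm_pgoff), not the kernel helper
itself.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT  12u			/* 4 KiB base pages */
#define MODEL_HPAGE_SIZE  (2ull << 20)		/* 2 MiB huge page */
#define MODEL_HPAGE_PAGES (MODEL_HPAGE_SIZE >> MODEL_PAGE_SHIFT) /* 512 */

static_assert((MODEL_HPAGE_PAGES & (MODEL_HPAGE_PAGES - 1)) == 0,
	      "pages per huge page must be a power of two");

/* Base-page index of 'address' within the file backing the mapping. */
static uint64_t model_linear_page_index(uint64_t vm_start, uint64_t vm_pgoff,
					uint64_t address)
{
	return ((address - vm_start) >> MODEL_PAGE_SHIFT) + vm_pgoff;
}

/* Same condition the VM_BUG_ON_FOLIO() tests, for a 2 MiB folio. */
static int model_index_is_hpage_aligned(uint64_t index)
{
	return (index & (MODEL_HPAGE_PAGES - 1)) == 0;
}

/* Equivalent of 'address & huge_page_mask(h)' for a 2 MiB huge page. */
static uint64_t model_hpage_align_down(uint64_t address)
{
	return address & ~(MODEL_HPAGE_SIZE - 1);
}
```

For a fault at vm_start + 0x1000 the raw address yields index 1, which
fails the alignment test, while the aligned-down address yields index 0
and passes.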

=== Reproduction ===

Kernel: 7.1.0-rc5-g7ba451f8a24f #1 SMP PREEMPT_DYNAMIC x86_64
Config: CONFIG_HUGETLBFS=y, CONFIG_DEBUG_VM=y, CONFIG_KASAN=y

Trigger: mmap a hugetlbfs file, then access an address at offset 0x1000
(one 4K page) into the mapping, which is unaligned relative to the 2MB
huge page boundary.

=== Full PoC ===

Compile with: gcc -o poc poc.c -static

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <fcntl.h>
#include <errno.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000
#endif
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif

/*
  * Bug: hugetlb_fault() sets vmf.pgoff = linear_page_index(vma, address)
  * using the raw unaligned fault address.  This unaligned pgoff reaches
  * __filemap_add_folio() which VM_BUG_ON_FOLIO's on it.
  */

static long get_hugepage_size(void)
{
     FILE *f;
     char line[256];
     long size = 2 * 1024 * 1024;

     f = fopen("/proc/meminfo", "r");
     if (!f)
         return size;
     while (fgets(line, sizeof(line), f)) {
         if (sscanf(line, "Hugepagesize: %ld kB", &size) == 1)
             size *= 1024;
     }
     fclose(f);
     return size;
}

int main(void)
{
     void *addr;
     size_t hpage_size;
     const char *hugetlbfs_path = "/mnt/huge/testfile";
     int fd;
     int ret;

     hpage_size = get_hugepage_size();
     printf("[+] Huge page size: %zu bytes\n", hpage_size);

     /* Mount hugetlbfs */
     mkdir("/mnt/huge", 0755);
     ret = syscall(__NR_mount, "hugetlbfs", "/mnt/huge", "hugetlbfs", 0,
                   NULL);
     if (ret < 0 && errno != EBUSY && errno != ENOENT)
         perror("mount hugetlbfs");

     /* Reserve 1 huge page */
     {
         FILE *f = fopen("/proc/sys/vm/nr_hugepages", "w");
         if (f) { fprintf(f, "1"); fclose(f); }
     }

     /* Create hugetlbfs file and mmap it */
     fd = open(hugetlbfs_path, O_CREAT | O_RDWR, 0644);
     if (fd < 0) {
         perror("open hugetlbfs");
         printf("[!] Trying anonymous MAP_HUGETLB\n");
         addr = mmap(NULL, hpage_size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
         if (addr == MAP_FAILED) {
             perror("mmap MAP_HUGETLB");
             return 1;
         }
     } else {
         ftruncate(fd, hpage_size);
         addr = mmap(NULL, hpage_size, PROT_READ | PROT_WRITE,
                 MAP_SHARED, fd, 0);
         close(fd);
         if (addr == MAP_FAILED) {
             perror("mmap hugetlbfs file");
             return 1;
         }
     }
     printf("[+] Mapping at %p\n", addr);

     /*
      * Trigger: access address at offset 0x1000 into the huge page.
      * vm_start is huge-page-aligned, but vm_start + 0x1000 is not.
      * hugetlb_fault() sets vmf.pgoff = linear_page_index(vma, address)
      * with the unaligned address, producing an unaligned pgoff.
      */
     printf("[+] Triggering fault at unaligned offset (%p + 0x1000)...\n",
            addr);
     fflush(stdout);
     volatile char *trigger = (volatile char *)addr + 0x1000;
     *trigger = 0x41;

     printf("[+] Survived: value = 0x%02x\n", *trigger);
     return 0;
}

=== Crash Log ===

Linux syzkaller 7.1.0-rc5-g7ba451f8a24f #1 SMP PREEMPT_DYNAMIC x86_64

[  527.288433][ T9873] page dumped because: VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1))
[  527.300642][ T9873] kernel BUG at mm/filemap.c:862!
[  527.301090][ T9873] Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
[  527.301640][ T9873] CPU: 0 UID: 0 PID: 9873 Comm: poc Not tainted
[  527.303803][ T9873] RIP: 0010:__filemap_add_folio+0xf39/0x1200
[  527.311913][ T9873] Call Trace:
[  527.312345][ T9873]  <TASK>
[  527.312676][ T9873]  hugetlb_add_to_page_cache+0xe3/0x240
[  527.313414][ T9873]  hugetlb_no_page+0x1301/0x21b0
[  527.314402][ T9873]  hugetlb_fault+0x531/0x1570
[  527.315259][ T9873]  handle_mm_fault+0x970/0xaf0
[  527.316565][ T9873]  do_user_addr_fault+0x60b/0x14c0
[  527.317434][ T9873]  asm_exc_page_fault+0x26/0x30
[  527.318733][ T9873] RIP: 0033:0x401fa2
[  527.326921][ T9873]  <TASK>
[  527.327245][ T9873] RIP: 0010:__filemap_add_folio+0xf39/0x1200
[  527.335300][ T9873] Kernel panic - not syncing: Fatal exception

The Sashiko review also flagged a few other pre-existing issues in
this series that I haven't verified yet:

1. [Critical] remove_inode_hugepages() in patch 9: passing folio->index
    (base-page index) to hugetlb_unmap_file_folio() which multiplies by
    pages_per_huge_page(h), effectively squaring the offset and causing
    the interval tree search to miss VMAs (potential UAF).

2. [High] hugetlbfs_zero_partial_page() in patch 7: Usama already
    pointed out the start >> PAGE_SHIFT question — `start` is a byte
    offset but filemap_lock_folio() expects a page index.

3. [Critical] filemap_get_pages() in patch 4: the `if (is_hugetlbfs)
    goto done` path returns 0 with an empty batch, which could cause
    filemap_read() to loop forever when reading a hole in a hugetlbfs
    file.

Thanks,
Xiao

^ permalink raw reply

* Re: [PATCH v4 0/2] cpufreq: CPPC: add autonomous mode boot parameter support
From: Sumit Gupta @ 2026-06-23 10:17 UTC (permalink / raw)
  To: Pierre Gondois, Viresh Kumar
  Cc: rafael, ionela.voinescu, zhenglifeng1, zhanjie9, corbet, skhan,
	rdunlap, mario.limonciello, linux-pm, linux-doc, linux-kernel,
	linux-tegra, treding, jonathanh, vsethi, ksitaraman, sanjayc,
	mochs, bbasu, sumitg
In-Reply-To: <f269fbc4-8b8f-4829-97bc-cf4cc9246aec@nvidia.com>


On 22/06/26 14:58, Sumit Gupta wrote:
>
> On 19/06/26 14:59, Pierre Gondois wrote:
>>
>> On 6/18/26 07:28, Viresh Kumar wrote:
>>> On 16-06-26, 18:22, Sumit Gupta wrote:
>>>> The dependency it was waiting on, the "cpufreq: Set policy->min and
>>>> max as real QoS constraints" series, is now in linux-pm (linux-next).
>>>> I rebased on top and verified autonomous mode works as expected, and
>>>> it applies cleanly on the current linux-next.
>>>>
>>>> The [1] reference in patch 2/2 points to v2 of that series; the merged
>>>> version is v3 [2].
>>>>
>>>> If there are no further comments, please consider acking and queuing
>>>> this for the next cycle.
>>> I was waiting for CPPC reviewers to provide some feedback.
>>>
>>> Jie / Lifeng / Pierre ?
>>>
>> I think the patchset has the same issue described at:
>>
>> https://lore.kernel.org/all/86780f97-29ee-4a72-b311-38c89434b707@arm.com/ 
>>
>>
>> I don't know if this is important to other people,
>> but IMO it would be preferable to have a solution to this issue
>> before adding more functionality that relies on registers that are left
>> in an unknown state.
>>
>> Are there any other opinions?
>>
>
> The concern is valid, but this isn't a new gap. The registers the boot
> parameter programs are already writable via existing sysfs:
>  - auto_sel via auto_select
>  - EPP via energy_performance_preference_val
> So userspace can already leave these in a non-default state across
> unload / CPU hotplug in mainline. The boot parameter just sets the
> same registers at boot via the same paths.
>
> I am already working on the save/restore change we discussed on
> the ospm_nominal_perf thread, as a dedicated follow-up grouping
> all OSPM-set registers (ospm_nominal_perf, auto_sel, EPP) together.
> I think doing it once uniformly is cleaner.
>
> Both features are already under review, so my preference is to take
> them first and add the save/restore on top, rather than merging it
> first and respinning both features under it. Either order works for me
> if you and the maintainers prefer infra-first.
>
> Thanks,
> Sumit
>
>

I have sent v5 of the autonomous mode series [1] with a small fix.

Also posted patch [3] to preserve OSPM set regs across hotplug/unload.
It applies on top of [1] & [2] (both not yet merged).

[1]
   [PATCH v5 0/2] cpufreq: CPPC: add autonomous mode boot parameter support
https://lore.kernel.org/lkml/20260623080652.3353386-1-sumitg@nvidia.com/

[2]
   [PATCH v5] ACPI: CPPC: Add ospm_nominal_perf support
https://lore.kernel.org/lkml/20260615185934.2383514-1-sumitg@nvidia.com/

[3]
   [PATCH] cpufreq: CPPC: Preserve OSPM-set registers across hotplug and unload
https://lore.kernel.org/lkml/20260623095403.3407436-1-sumitg@nvidia.com/

Thanks,
Sumit



^ permalink raw reply

* Re: [PATCH v3 1/2] dt-bindings: iio: dac: Add AD5529R
From: Janani Sunil @ 2026-06-23 10:07 UTC (permalink / raw)
  To: David Lechner, Nuno Sá, Rodrigo Alencar
  Cc: Jonathan Cameron, Conor Dooley, Janani Sunil, Lars-Peter Clausen,
	Michael Hennerich, Nuno Sá, Andy Shevchenko, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Philipp Zabel, Jonathan Corbet,
	Shuah Khan, linux-iio, devicetree, linux-kernel, linux-doc,
	Mark Brown
In-Reply-To: <c72fb508-05a4-429a-9ca7-86e42a115fa8@baylibre.com>


On 6/22/26 17:36, David Lechner wrote:
> On 6/22/26 7:20 AM, Nuno Sá wrote:
>> On Mon, Jun 22, 2026 at 12:51:20PM +0100, Rodrigo Alencar wrote:
>>> On 22/06/26 11:29, Nuno Sá wrote:
>>>> On Mon, Jun 22, 2026 at 10:24:05AM +0100, Rodrigo Alencar wrote:
>>>>> On 21/06/26 15:33, Jonathan Cameron wrote:
>>>>>> On Fri, 19 Jun 2026 16:54:11 +0100
>>>>>> Nuno Sá <noname.nuno@gmail.com> wrote:
>>>>>>
>>>>>>> On Fri, Jun 19, 2026 at 03:12:07PM +0100, Conor Dooley wrote:
>>>>>>>> On Fri, Jun 19, 2026 at 02:01:08PM +0100, Nuno Sá wrote:
>>>>>>>>> On Fri, Jun 19, 2026 at 12:40:54PM +0100, Conor Dooley wrote:
>>>>>>>>>> On Fri, Jun 19, 2026 at 12:36:55PM +0100, Conor Dooley wrote:
>>>>>>>>>>> On Fri, Jun 19, 2026 at 12:33:11PM +0200, Janani Sunil wrote:
>>>>>>>>>>>> On 6/14/26 21:44, Jonathan Cameron wrote:
>>>>>>>>>>>>> On Tue, 9 Jun 2026 16:47:23 +0200
>>>>>>>>>>>>> Janani Sunil <jan.sun97@gmail.com> wrote:
>>>>>>>>>>>>>    
>>>>>>>>>>>>>> On 5/26/26 15:11, Rodrigo Alencar wrote:
>>>>>>>>>>>>>>> On 26/05/19 05:42PM, Janani Sunil wrote:
>>>>>>>>>>>>>>>> Devicetree bindings for AD5529R 16 channel 12/16 bit high voltage,
>>>>>>>>>>>>>>>> buffered voltage output digital-to-analog converter (DAC) with an
>>>>>>>>>>>>>>>> integrated precision reference.
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>> Probably others may comment on that, but...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This parent node may support device addressing for multi-device support through
>>>>>>>>>>>>>>> those ID pins. I suppose that each device may have its own power supplies or
>>>>>>>>>>>>>>> other resources like the toggle pins or reset and enable.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That way I suppose that an example would look like...
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +patternProperties:
>>>>>>>>>>>>>>>> +  "^channel@([0-9]|1[0-5])$":
>>>>>>>>>>>>>>>> +    type: object
>>>>>>>>>>>>>>>> +    description: Child nodes for individual channel configuration
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +    properties:
>>>>>>>>>>>>>>>> +      reg:
>>>>>>>>>>>>>>>> +        description: Channel number.
>>>>>>>>>>>>>>>> +        minimum: 0
>>>>>>>>>>>>>>>> +        maximum: 15
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +      adi,output-range-microvolt:
>>>>>>>>>>>>>>>> +        description: |
>>>>>>>>>>>>>>>> +          Output voltage range for this channel as [min, max] in microvolts.
>>>>>>>>>>>>>>>> +          If not specified, defaults to 0V to 5V range.
>>>>>>>>>>>>>>>> +        oneOf:
>>>>>>>>>>>>>>>> +          - items:
>>>>>>>>>>>>>>>> +              - const: 0
>>>>>>>>>>>>>>>> +              - enum: [5000000, 10000000, 20000000, 40000000]
>>>>>>>>>>>>>>>> +          - items:
>>>>>>>>>>>>>>>> +              - const: -5000000
>>>>>>>>>>>>>>>> +              - const: 5000000
>>>>>>>>>>>>>>>> +          - items:
>>>>>>>>>>>>>>>> +              - const: -10000000
>>>>>>>>>>>>>>>> +              - const: 10000000
>>>>>>>>>>>>>>>> +          - items:
>>>>>>>>>>>>>>>> +              - const: -15000000
>>>>>>>>>>>>>>>> +              - const: 15000000
>>>>>>>>>>>>>>>> +          - items:
>>>>>>>>>>>>>>>> +              - const: -20000000
>>>>>>>>>>>>>>>> +              - const: 20000000
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +    required:
>>>>>>>>>>>>>>>> +      - reg
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +    additionalProperties: false
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +required:
>>>>>>>>>>>>>>>> +  - compatible
>>>>>>>>>>>>>>>> +  - reg
>>>>>>>>>>>>>>>> +  - vdd-supply
>>>>>>>>>>>>>>>> +  - avdd-supply
>>>>>>>>>>>>>>>> +  - hvdd-supply
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +dependencies:
>>>>>>>>>>>>>>>> +  spi-cpha: [ spi-cpol ]
>>>>>>>>>>>>>>>> +  spi-cpol: [ spi-cpha ]
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +allOf:
>>>>>>>>>>>>>>>> +  - $ref: /schemas/spi/spi-peripheral-props.yaml#
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +unevaluatedProperties: false
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +examples:
>>>>>>>>>>>>>>>> +  - |
>>>>>>>>>>>>>>>> +    #include <dt-bindings/gpio/gpio.h>
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +    spi {
>>>>>>>>>>>>>>>> +        #address-cells = <1>;
>>>>>>>>>>>>>>>> +        #size-cells = <0>;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +        dac@0 {
>>>>>>>>>>>>>>>> +            compatible = "adi,ad5529r-16";
>>>>>>>>>>>>>>>> +            reg = <0>;
>>>>>>>>>>>>>>>> +            spi-max-frequency = <25000000>;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +            vdd-supply = <&vdd_regulator>;
>>>>>>>>>>>>>>>> +            avdd-supply = <&avdd_regulator>;
>>>>>>>>>>>>>>>> +            hvdd-supply = <&hvdd_regulator>;
>>>>>>>>>>>>>>>> +            hvss-supply = <&hvss_regulator>;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +            reset-gpios = <&gpio0 87 GPIO_ACTIVE_LOW>;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +            #address-cells = <1>;
>>>>>>>>>>>>>>>> +            #size-cells = <0>;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +            channel@0 {
>>>>>>>>>>>>>>>> +                reg = <0>;
>>>>>>>>>>>>>>>> +                adi,output-range-microvolt = <0 5000000>;
>>>>>>>>>>>>>>>> +            };
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +            channel@1 {
>>>>>>>>>>>>>>>> +                reg = <1>;
>>>>>>>>>>>>>>>> +                adi,output-range-microvolt = <(-10000000) 10000000>;
>>>>>>>>>>>>>>>> +            };
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +            channel@2 {
>>>>>>>>>>>>>>>> +                reg = <2>;
>>>>>>>>>>>>>>>> +                adi,output-range-microvolt = <0 40000000>;
>>>>>>>>>>>>>>>> +            };
>>>>>>>>>>>>>>>> +        };
>>>>>>>>>>>>>>>> +    };
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 	spi {
>>>>>>>>>>>>>>> 		#address-cells = <1>;
>>>>>>>>>>>>>>> 		#size-cells = <0>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 		multi-dac@0 {
>>>>>>>>>>>>>>> 			compatible = "adi,ad5529r-16";
>>>>>>>>>>>>>>> 			reg = <0>;
>>>>>>>>>>>>>>> 			spi-max-frequency = <25000000>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 			#address-cells = <1>;
>>>>>>>>>>>>>>> 			#size-cells = <0>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 			dac@0 {
>>>>>>>>>>>>>>> 				reg = <0>;
>>>>>>>>>>>>>>> 				vdd-supply = <&vdd_regulator>;
>>>>>>>>>>>>>>> 				avdd-supply = <&avdd_regulator>;
>>>>>>>>>>>>>>> 				hvdd-supply = <&hvdd_regulator>;
>>>>>>>>>>>>>>> 				hvss-supply = <&hvss_regulator>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 				reset-gpios = <&gpio0 87 GPIO_ACTIVE_LOW>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 				#address-cells = <1>;
>>>>>>>>>>>>>>> 				#size-cells = <0>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 				channel@0 {
>>>>>>>>>>>>>>> 					reg = <0>;
>>>>>>>>>>>>>>> 					adi,output-range-microvolt = <0 5000000>;
>>>>>>>>>>>>>>> 				};
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 				channel@1 {
>>>>>>>>>>>>>>> 					reg = <1>;
>>>>>>>>>>>>>>> 					adi,output-range-microvolt = <(-10000000) 10000000>;
>>>>>>>>>>>>>>> 				};
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 				channel@2 {
>>>>>>>>>>>>>>> 					reg = <2>;
>>>>>>>>>>>>>>> 					adi,output-range-microvolt = <0 40000000>;
>>>>>>>>>>>>>>> 				};
>>>>>>>>>>>>>>> 			};
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 			dac@1 {
>>>>>>>>>>>>>>> 				reg = <1>;
>>>>>>>>>>>>>>> 				vdd-supply = <&vdd_regulator>;
>>>>>>>>>>>>>>> 				avdd-supply = <&avdd_regulator>;
>>>>>>>>>>>>>>> 				hvdd-supply = <&hvdd_regulator>;
>>>>>>>>>>>>>>> 				hvss-supply = <&hvss_regulator>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 				reset-gpios = <&gpio0 88 GPIO_ACTIVE_LOW>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 				#address-cells = <1>;
>>>>>>>>>>>>>>> 				#size-cells = <0>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 				channel@0 {
>>>>>>>>>>>>>>> 					reg = <0>;
>>>>>>>>>>>>>>> 					adi,output-range-microvolt = <0 5000000>;
>>>>>>>>>>>>>>> 				};
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 				channel@1 {
>>>>>>>>>>>>>>> 					reg = <1>;
>>>>>>>>>>>>>>> 					adi,output-range-microvolt = <(-10000000) 10000000>;
>>>>>>>>>>>>>>> 				};
>>>>>>>>>>>>>>> 			};
>>>>>>>>>>>>>>> 		};
>>>>>>>>>>>>>>> 	};
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> then you might need something like:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 	patternProperties:
>>>>>>>>>>>>>>> 		"^dac@[0-3]$":
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and put most of the things under this node pattern.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So the main driver that you're putting together might need to handle up to four instances.
>>>>>>>>>>>>>>> Even if your current driver cannot handle this, the dt-bindings might need to cover that.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Need to double check if each dac node needs a separate compatible, so you would maybe populate
>>>>>>>>>>>>>>> a platform data to be shared with the child nodes, which would be a separate driver.
>>>>>>>>>>>>>>> (not sure if it would make sense to mix and match ad5529r-16 and ad5529r-12).
>>>>>>>>>>>>>> Hi Rodrigo,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you for looking at this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For now, I would prefer to keep the binding scoped to a single AD5529R device instance. The current
>>>>>>>>>>>>>> hardware/use case we have only needs one device node and the driver is written around that model as well.
>>>>>>>>>>>>>> While the device addressing pins could allow multi-device topology, we do not have an actual platform using
>>>>>>>>>>>>>> that configuration at the moment, so I would prefer not to introduce an extra parent/child binding structure
>>>>>>>>>>>>>> speculatively without a validating use case.
>>>>>>>>>>>>> Interesting feature - kind of similar to address control on a typical i2c bus device, or
>>>>>>>>>>>>> looking at it another way a kind of distributed SPI mux.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Challenge of a binding is we need to anticipate the future.  So I think we do need something
>>>>>>>>>>>>> like Rodrigo is suggesting even if we only (for now) support a single instance in the driver.
>>>>>>>>>>>>> That would leave the path open to supporting the addressing at a later date.
>>>>>>>>>>>>> An alternative might be to look at it like a chained device setup. In those we pretend there
>>>>>>>>>>>>> is just one device with a lot of channels etc.  The snag is that here things are more loosely
>>>>>>>>>>>>> coupled whereas for those devices it tends to be you have to read / write the same register
>>>>>>>>>>>>> in all devices in the chain as one big SPI message.
>>>>>>>>>>>>>
>>>>>>>>>>>>> +CC Mark Brown as he may know of some precedent for this feature. For his reference..
>>>>>>>>>>>>> - Each of these devices has 2 ID pins.  The SPI transfers have to contain the 2 bit
>>>>>>>>>>>>> value that matches that or they are ignored.  Thus a single bus + 1 chip select can
>>>>>>>>>>>>> be used to talk to 4 devices.  Question is what that looks like in device tree + I guess
>>>>>>>>>>>>> longer term how to support it cleanly in SPI.
>>>>>>>>>>> I'd swear I have seen this before, from some Microchip devices. Let me
>>>>>>>>>>> see if I can find what I am thinking of...
>>>>>>>>>>
>>>>>>>>>> microchip,mcp3911 and microchip,mcp3564 both seem to do this with
>>>>>>>>>> slightly different properties.
>>>>>>>>>>
>>>>>>>>>>    microchip,device-addr:
>>>>>>>>>>      description: Device address when multiple MCP3911 chips are present on the same SPI bus.
>>>>>>>>>>      $ref: /schemas/types.yaml#/definitions/uint32
>>>>>>>>>>      enum: [0, 1, 2, 3]
>>>>>>>>>>      default: 0
>>>>>>>>>>
>>>>>>>>>> and
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    microchip,hw-device-address:
>>>>>>>>>>      $ref: /schemas/types.yaml#/definitions/uint32
>>>>>>>>>>      minimum: 0
>>>>>>>>>>      maximum: 3
>>>>>>>>>>      description:
>>>>>>>>>>        The address is set on a per-device basis by fuses in the factory,
>>>>>>>>>>        configured on request. If not requested, the fuses are set for 0x1.
>>>>>>>>>>        The device address is part of the device markings to avoid
>>>>>>>>>>        potential confusion. This address is coded on two bits, so four possible
>>>>>>>>>>        addresses are available when multiple devices are present on the same
>>>>>>>>>>        SPI bus with only one Chip Select line for all devices.
>>>>>>>>>>        Each device communication starts by a CS falling edge, followed by the
>>>>>>>>>>        clocking of the device address (BITS[7:6] - top two bits of COMMAND BYTE
>>>>>>>>>>        which is first one on the wire).
>>>>>>>>>>
>>>>>>>>>> This sounds exactly like the sort of feature that you're dealing with
>>>>>>>>>> here?
>>>>>>>>>>    
>>>>>>>>> The core idea yes but for this chip, things are a bit more annoying (but
>>>>>>>>> Janani can correct me if I'm wrong). Here, each device can, in theory,
>>>>>>>>> have its own supplies, pins and at the very least, channels with maybe
>>>>>>>>> different scales. That is why Janani is proposing dac nodes. Given I
>>>>>>>>> honestly don't like much of that "adi,ad5529r-bus" compatible I wondered
>>>>>>>>> about solving this at the spi level.
>>>>>>>>>
>>>>>>>>> Ah and to make it more annoying, we can also mix 12 and 16 bits variants
>>>>>>>>> together in the same bus.
>>>>>>>> I'm definitely missing something, because that property for the
>>>>>>>> microchip devices is not impacted by what else is on the bus. AFAICT, you
>>>>>>>> could have an mcp3911 and an mcp3564 on the same bus even though both
>>>>>>>> are completely different devices with different drivers. They have
>>>>>>>> individual device nodes and their own supplies etc etc. These aren't
>>>>>>>> per-channel properties on an adc or dac, they're per child device on a
>>>>>>>> spi bus.
>>>>>>> Maybe I'm the one missing something :). IIRC, spi would not allow two
>>>>>>> devices on the same CS right? Because for this chip we would need
>>>>>>> something like:
>>>>>>>
>>>>>>> spi {
>>>>>>> 	dac@0 {
>>>>>>> 		reg = <0>;
>>>>>>> 		adi,pin-id = <0>;
>>>>>>> 	};
>>>>>>>
>>>>>>> 	dac@1 {
>>>>>>> 		reg = <0>; // which seems already problematic?
>>>>>>> 		adi,pin-id = <1>;
>>>>>>> 	};
>>>>>>>
>>>>>>> 	...
>>>>>>>
>>>>>>> 	//up to 4
>>>>>>> };
>>>>>> Yeah. It's not clear to me how that works for the microchip devices
>>>>>> (I suspect it doesn't!)
>>>>>>
>>>>>> Just thinking as I type, but could we do something a bit nasty with
>>>>>> a gpio mux that doesn't actually switch but represents the GPIO being
>>>>>> shared?  Given this is all tied to the spi bus that should all happen
>>>>>> under serializing locks.
>>>>>>
>>>>>> Agreed though that this would be nicer as an SPI thing that let
>>>>>> us specify that a single CS is shared by multiple devices and there
>>>>>> is some other signal acting to select which one we are talking to.
>>>>>>
>>>>> If the device-addressing on the same chip-select is to be handled
>>>>> by the spi framework, wouldn't we lose device-specific features?
>>>>>
>>>>> I understand that this multi-device feature is there mostly to extend the
>>>>> channel count from 16 to 32, 48 or 64. I suppose the command:
>>>>>
>>>>> 	"MULTI DEVICE SW LDAC MODE"
>>>>>
>>>>> exists so that software can update channel values across multiple devices.
>>>> Right! You do have a point! I agree the main driver for a feature like
>>>> this is likely to extend the channel count and effectively "aggregate"
>>>> devices.
>>>>
>>>> But I would say that even with the spi solution the MULTI DEVICE stuff
>>>> should be doable (as we still need a sort of adi,pin-id property).
>>> I don't think we can have something like an IIO buffer shared by multiple
>>> devices. Synchronizing separate devices would be doable with proper hardware
>>> support for this (probably involving an FPGA).
>> True!
>>
>>>   
>>>> But yes, I do feel that the whole feature is for aggregation so seeing
>>>> one device with 32 channels is the expectation here? Rather than seeing
>>>> two devices with 16 channels.
>>> Yes, I think aggregation is the whole point there... so that the IIO driver
>>> is multi-device-aware.
>> Which makes me feel that different pins per device might be possible
>> from an HW point of view but does not make much sense. For example, for
>> the buffer example I would expect LDAC to be shared between all the
>> devices.
>>
>> - Nuno Sá
> I think I mentioned this on a previous revision, but I still think the
> simplest way to go about it would be to assume that all chips treated
> as an aggregate device have everything wired in parallel and just add
> support for per-chip wiring on an as-needed basis. This is how we have
> handled daisy-chained devices so far.

Hi David,

One limitation of this approach is that it does not cover a combination of 12-bit and 16-bit parts in the chain,
since the compatible string would be at the top level and apply to all chips. To handle this without per-chip child nodes or a per-chip compatible,
I propose an "adi,resolution" property as an integer array, indexed by device position:


dac@0 {
     compatible = "adi,ad5529r";
     reg = <0>;
     adi,device-addrs = <0 1>;
     adi,resolution   = <16 12>;   /* per-chip, indexed by position */
     reset-gpios = <&gpio0 87 GPIO_ACTIVE_LOW>;
     vdd-supply  = <&vdd_reg>;
     hvdd-supply = <&hvdd_reg>;

     channel@0  { reg = <0>;  adi,output-range-microvolt = <0 5000000>; };
     channel@16 { reg = <16>; adi,output-range-microvolt = <0 40000000>; };
};


1) This follows the daisy-chain/aggregated model as you suggested, exposing N*16 channels as a single IIO device.
2) Keeps the binding flat: no phantom compatible at a parent bus node, no per-chip child nodes.
3) Enables a 12-bit + 16-bit device combination in the chain, without needing a per-chip compatible.
4) adi,device-addrs specifies the HW address of each chip, allowing the driver to encode it into the SPI frame.
5) Supplies and GPIOs remain simple, assuming parallel wiring across all chips.
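
To make the indexing concrete, here is a minimal userspace C sketch of how a driver might resolve a flat channel number against the two arrays. The names (dac_chain, chan_target, chain_lookup) are hypothetical and exist only for illustration. It assumes 16 channels per chip and that 12 and 16 are the only valid resolutions. How the address is actually placed in the SPI frame is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHANNELS_PER_CHIP 16u /* assumption: each chip contributes 16 channels */

/* Hypothetical in-memory form of the two proposed properties. */
struct dac_chain {
	size_t num_chips;             /* number of entries in both arrays */
	const uint32_t *device_addrs; /* from adi,device-addrs */
	const uint32_t *resolution;   /* from adi,resolution, same order */
};

/* Where a flat channel number lands inside the chain. */
struct chan_target {
	uint32_t hw_addr;    /* ID-pin address of the owning chip */
	uint32_t local_chan; /* channel number within that chip (0..15) */
	uint32_t max_code;   /* largest raw code for that chip's resolution */
};

/* Returns 0 on success, -1 if the channel or resolution is not valid. */
int chain_lookup(const struct dac_chain *chain, uint32_t chan,
		 struct chan_target *out)
{
	size_t pos = chan / CHANNELS_PER_CHIP; /* position in both arrays */
	uint32_t res;

	assert(chain != NULL && out != NULL);

	if (pos >= chain->num_chips)
		return -1;

	res = chain->resolution[pos];
	if (res != 12 && res != 16)
		return -1;

	out->hw_addr = chain->device_addrs[pos];
	out->local_chan = chan % CHANNELS_PER_CHIP;
	out->max_code = (UINT32_C(1) << res) - 1;
	return 0;
}
```

With the example node above, channel 16 would resolve to the chip at position 1, i.e. hardware address 1 with a 12-bit code range.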

Jonathan, you had earlier suggested using separate compatibles
(adi,ad5529r-16 and adi,ad5529r-12) to handle the resolution difference.
However, with the aggregated flat binding model, separate per-chip
compatibles would require child nodes, which brings back the phantom
compatible problem at the parent level. The adi,resolution array is
intended as an alternative that achieves the same goal, expressing
per-chip resolution, without needing a per-chip compatible or child
node structure.

Does this look reasonable?

Best Regards,
Janani Sunil


^ permalink raw reply

* Re: Issue cloning kernel-doc-zh from HUST mirror
From: Siwei Chen @ 2026-06-23 10:04 UTC (permalink / raw)
  To: linux-doc, Dongliang Mu; +Cc: si.yanteng, wy
In-Reply-To: <b03f244b-46b8-47e8-b7f5-d98d714ae15c@hust.edu.cn>

On Tuesday, June 23, 2026 at 16:51:20 China Standard Time, Dongliang Mu wrote:
> Hello Siwei,
> 
> The long answer is as follows:
> 
> The curl 52 Empty reply from server error is not a Git or Ubuntu
> compatibility issue. It happens because the kernel-doc-zh repository is
> extremely large, and the HUST mirror server closes the HTTPS connection
> early due to timeout or proxy limits.
> 
> You can try the following commands:
> 
>       1. Shallow clone first (most reliable)
> 
>       git clone --depth 1 https://mirrors.hust.edu.cn/git/kernel-doc-zh.git linux
> 
>       Then fetch full history:
> 
>       git fetch --unshallow
> 
> If still failing, increase Git buffer like:
> 
> git config --global http.postBuffer 1073741824
> 
>       Finally, I will contact maintainers of HUST mirror site and try
>       some attempts to resolve this issue.
> 
> Dongliang Mu
> 

Hello, Dongliang

Thank you for the detailed explanation and suggestions.

I will try the shallow clone approach and the other workarounds you mentioned.

I also appreciate your willingness to contact the HUST mirror maintainers and 
investigate the issue further.

Thanks again for your help.

Best regards,
Siwei Chen



^ permalink raw reply

* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Binbin Wu @ 2026-06-23  9:48 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
	jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-18-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> @@ -606,12 +608,20 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
>  	next = start;
>  	while (safe && filemap_get_folios(mapping, &next, last, &fbatch)) {
>  
> -		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> +		for (i = 0; i < folio_batch_count(&fbatch);) {
>  			struct folio *folio = fbatch.folios[i];
>  
> -			if (folio_ref_count(folio) !=
> -			    folio_nr_pages(folio) + filemap_get_folios_refcount) {
> -				safe = false;
> +			safe = (folio_ref_count(folio) ==
> +				folio_nr_pages(folio) +
> +				filemap_get_folios_refcount);
> +
> +			if (safe) {
> +				++i;
> +			} else if (folio_may_be_lru_cached(folio) &&
> +				   !lru_drained) {
> +				lru_add_drain_all();

It seems unprivileged userspace is able to trigger lru_add_drain_all() repeatedly
by invoking KVM_SET_MEMORY_ATTRIBUTES2 in a loop, which could pose a DoS risk?

> +				lru_drained = true;
> +			} else {
>  				*err_index = max(start, folio->index);
>  				break;
>  			}
> 


^ permalink raw reply

* Re: [PATCH v8 21/46] KVM: guest_memfd: Zero page while getting pfn
From: Yan Zhao @ 2026-06-23  8:56 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com>

On Thu, Jun 18, 2026 at 05:31:58PM -0700, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> Move the folio initialization logic from kvm_gmem_get_pfn() into
> __kvm_gmem_get_pfn() to also zero pages if the page is to be used in
> kvm_gmem_populate().
> 
> With in-place conversion, the existing data in a guest_memfd page can be
> populated into guest memory through platform-specific ioctls.
> 
> Without first zeroing the page obtained using __kvm_gmem_get_pfn(), it
> might contain uninitialized host memory, which would leak to the guest if
> the populate completes.
> 
> guest_memfd pages are zeroed at most once in the page's entire lifetime
> with guest_memfd, and that is tracked using the uptodate flag.
> 
> Zeroing the page in __kvm_gmem_get_pfn() is chosen over zeroing in
> kvm_gmem_get_folio() since other flows, such as a future write() syscall,
> can get a page, write to the page and then set page uptodate without
> zeroing.
> 
> This aligns with the concept of zeroing before first use - the other place
> where zeroing happens is in kvm_gmem_fault_user_mapping().
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  virt/kvm/guest_memfd.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 90bc1a26512b6..86c9f5b0863cb 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -1137,6 +1137,11 @@ static struct folio *__kvm_gmem_get_pfn(struct file *file,
>  		return ERR_PTR(-EHWPOISON);
>  	}
>  
> +	if (!folio_test_uptodate(folio)) {
> +		clear_highpage(folio_page(folio, 0));
> +		folio_mark_uptodate(folio);
> +	}
Note:
In the __kvm_gmem_populate() path, this folio_mark_uptodate() call makes the
later one after post_populate() pointless.

__kvm_gmem_populate
    |1.__kvm_gmem_get_pfn
    |     |->folio = kvm_gmem_get_folio()
    |     |  if (!folio_test_uptodate(folio))
    |     |     folio_mark_uptodate(folio);
    |2. ret = post_populate()
    |3. if (!ret)
    |       folio_mark_uptodate(folio);

>  	*pfn = folio_file_pfn(folio, index);
>  	if (max_order)
>  		*max_order = 0;
> @@ -1166,11 +1171,6 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  		goto out;
>  	}
>  
> -	if (!folio_test_uptodate(folio)) {
> -		clear_highpage(folio_page(folio, 0));
> -		folio_mark_uptodate(folio);
> -	}
> -
>  	if (kvm_gmem_is_private_mem(inode, index))
>  		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>  
>


^ permalink raw reply

* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Yan Zhao @ 2026-06-23  8:41 UTC (permalink / raw)
  To: Sean Christopherson, ackerleytng, aik, andrew.jones, binbin.wu,
	brauner, chao.p.peng, david, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <ajoWngKaZ+wfIyR+@yzhao56-desk.sh.intel.com>

On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
> On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
> > On Mon, Jun 22, 2026, Yan Zhao wrote:
> > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
> > > > From: Ackerley Tng <ackerleytng@google.com>
> > > > 
> > > > Update tdx_gmem_post_populate() to handle cases where a source page is
> > > > not explicitly provided. Instead of returning -EOPNOTSUPP when src_page
> > > > is NULL, default to using the page associated with the destination PFN.
> > > > 
> > > > This change allows for in-place memory conversion where the data is
> > > > already present in the target PFN, ensuring the TDX module has a valid
> > > > source page reference for the TDH.MEM.PAGE.ADD operation.
> > > > 
> > > > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > > ---
> > > >  Documentation/virt/kvm/x86/intel-tdx.rst |  4 ++++
> > > >  arch/x86/kvm/vmx/tdx.c                   | 11 ++++++++---
> > > >  2 files changed, 12 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst
> > > > index 6a222e9d09541..74357fe87f9ec 100644
> > > > --- a/Documentation/virt/kvm/x86/intel-tdx.rst
> > > > +++ b/Documentation/virt/kvm/x86/intel-tdx.rst
> > > > @@ -158,6 +158,10 @@ KVM_TDX_INIT_MEM_REGION
> > > >  Initialize @nr_pages TDX guest private memory starting from @gpa with userspace
> > > >  provided data from @source_addr. @source_addr must be PAGE_SIZE-aligned.
> > > >  
> > > > +If guest_memfd in-place conversion is enabled, pass NULL for @source_addr to
> > > > +initialize the memory region using memory contents already populated in
> > > > +guest_memfd memory.
> > > > +
> > > >  Note, before calling this sub command, memory attribute of the range
> > > >  [gpa, gpa + nr_pages] needs to be private.  Userspace can use
> > > >  KVM_SET_MEMORY_ATTRIBUTES to set the attribute.
> > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > > index ffe9d0db58c59..56d10333c61a7 100644
> > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > > >  	if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
> > > >  		return -EIO;
> > > >  
> > > > -	if (!src_page)
> > > > -		return -EOPNOTSUPP;
> > > > +	if (!src_page) {
> > > > +		if (!gmem_in_place_conversion)
> > > When userspace turns on gmem_in_place_conversion while creating guest_memfd
> > > without the MMAP flag, the absence of src_page should still be treated as an
> > > error.
> > 
> > Why MMAP?
> Hmm, I was showing a scenario that in-place conversion couldn't occur.
> I didn't mean that with the MMAP flag, mmap() and user write must occur.
> 
> > Shouldn't this be a general "if (!src_page && !up-to-date)"?  Just
> > because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
> > and written memory.  And when write() lands, MMAP wouldn't be necessary to
> > initialize the memory.
> Do you mean using up-to-date flag as below?
> 
> if (!src_page) {
> 	src_page = pfn_to_page(pfn);
> 	if (!folio_test_uptodate(page_folio(src_page)))
> 		return -EOPNOTSUPP;
> }

Another concern with this fix is that:
commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the
folio uptodate before reaching post_populate().

[1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/
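
To illustrate the ordering problem, here is a toy userspace model in C. The toy_* names are made up for this sketch and only the uptodate flag is modelled; it is not kernel code, just the call order from __kvm_gmem_populate() reduced to its flag handling.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct folio; only the uptodate flag is modelled. */
struct toy_folio {
	bool uptodate;
};

/* Models the hunk added to __kvm_gmem_get_pfn(): zero once, then mark. */
void toy_get_pfn(struct toy_folio *folio)
{
	assert(folio != NULL);
	if (!folio->uptodate) {
		/* clear_highpage() would run here in the kernel */
		folio->uptodate = true;
	}
}

/* Models the suggested "!src_page && !up-to-date" rejection. */
int toy_post_populate(const struct toy_folio *folio, bool have_src_page)
{
	if (!have_src_page && !folio->uptodate)
		return -EOPNOTSUPP;
	return 0;
}

/* Models the ordering in the populate path: get pfn, then post-populate. */
int toy_populate(struct toy_folio *folio, bool have_src_page)
{
	toy_get_pfn(folio);
	return toy_post_populate(folio, have_src_page);
}
```

In this model a fresh folio with no source page still passes the check, because toy_get_pfn() has already set the flag before toy_post_populate() looks at it.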

> One concern is that TDX now does not much care about the up-to-date flag since
> TDX doesn't rely on the flag to clear pages on conversions.
> I'm not sure if the flag can be reliably checked in this case. e.g.,
> now the whole folio is marked up-to-date even if only part of it is faulted by
> user access.
> Ensuring that the up-to-date flag works correctly with huge page support seems
> to have more effort than introducing a dedicated flag for TDX.
> 
> > > Additionally, to properly enable in-place copying for the TDX initial memory
> > > region, userspace must not only specify source_addr to NULL, but also follow
> > > a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
> > > 1. create guest_memfd with MMAP flag
> > > 2. mmap the guest_memfd.
> > > 3. convert the initial memory range to shared.
> > > 4. copy initial content to the source page.
> > > 5. convert the initial memory range to private
> > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
> > > 7. do not unmap the source backend.
> > > 
> > > So, would it be reasonable to introduce a dedicated flag that allows userspace
> > > to explicitly opt into the in-place copy functionality? e.g.,
> > 
> > Why?  It's userspace's responsibility to get the above right.  If userspace fails
> > to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
> I mean if userspace specifies a NULL source_addr by mistake, it's better for
> kernel to detect this mistake, similar to how it validates whether source_addr
> is PAGE_ALIGNED.
> Since userspace already needs to perform additional steps to enable in-place
> copy, specifying a dedicated flag to indicate that the NULL source_addr is
> intentional seems like a reasonable burden.

^ permalink raw reply

* Re: [PATCH v8 17/46] KVM: guest_memfd: Advertise KVM_SET_MEMORY_ATTRIBUTES2 ioctl
From: Binbin Wu @ 2026-06-23  9:14 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
	jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-17-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> Introduce KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES to advertise the
> availability of the KVM_SET_MEMORY_ATTRIBUTES2 ioctl.
> 
> KVM_SET_MEMORY_ATTRIBUTES2 is a guest_memfd-scoped version of the existing
> KVM_SET_MEMORY_ATTRIBUTES VM ioctl. It allows userspace to manage memory
> attributes, such as KVM_MEMORY_ATTRIBUTE_PRIVATE, directly on a guest_memfd
> file descriptor.
> 
> This new version uses struct kvm_memory_attributes2, which adds an
> error_offset field to the output. This allows KVM to return the specific
> offset that triggered an error, which is especially useful for handling
> EAGAIN results caused by transient page reference counts during attribute
> conversions.
> 
> Update the KVM API documentation to define the new ioctl and its behavior,
> and add the necessary UAPI definitions and capability checks.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Suggested-by: Michael Roth <michael.roth@amd.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

Two nits below.


>  
> +4.145 KVM_SET_MEMORY_ATTRIBUTES2
> +---------------------------------
> +
> +:Capability: KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES
> +:Architectures: all
> +:Type: guest_memfd ioctl
> +:Parameters: struct kvm_memory_attributes2 (in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Errors:
> +
> +  ========== ===============================================================
> +  EINVAL     The specified `offset` or `size` were invalid (e.g. not
                                                   ^
                                                 was
> +             page aligned, causes an overflow, or size is zero).
> +  EFAULT     The parameter address was invalid.
> +  EAGAIN     Some page within requested range had unexpected refcounts. The
> +             offset of the page will be returned in `error_offset`.
> +  ENOMEM     Ran out of memory trying to track private/shared state
> +  ========== ===============================================================

[...]

> +
> +Set attributes for a range of offsets within a guest_memfd to
> +KVM_MEMORY_ATTRIBUTE_PRIVATE to limit the specified guest_memfd backed
> +memory range for guest_use. Even if KVM_CAP_GUEST_MEMFD_MMAP is
                         ^
                    guest use

> +supported, after a successful call to set
> +KVM_MEMORY_ATTRIBUTE_PRIVATE, the requested range will not be mappable
> +into host userspace and will only be mappable by the guest.
> +


^ permalink raw reply

* [PATCH v6 4/4] Documentation: PCI: Add documentation for DOE endpoint support
From: Aksh Garg @ 2026-06-23  9:07 UTC (permalink / raw)
  To: linux-pci, linux-doc, mani, kwilczynski, bhelgaas, corbet, kishon,
	skhan, lukas, cassel, alistair
  Cc: linux-arm-kernel, linux-kernel, rdunlap, Frank.Li, s-vadapalli,
	danishanwar, srk, a-garg7
In-Reply-To: <20260623090737.711656-1-a-garg7@ti.com>

Document the architecture and implementation details for the Data Object
Exchange (DOE) framework for PCIe Endpoint devices.

Co-developed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Aksh Garg <a-garg7@ti.com>
---

Changes from v5 to v6:
- Addressed the review comments provided by Bjorn Helgaas at v5

Changes from v4 to v5:
- Updated the DOE Abort handling section.

Changes from v3 to v4:
- Updated the maximum size of the DOE object from 256KB to 1MB,
  as per PCIe spec.
- Updated the DOE setup and cleanup sections.

Changes from v2 to v3:
- Rebased on 7.1-rc1.

Changes since v1:
- Squashed the patches [1] and [2], and moved the documentation file
  to Documentation/PCI/endpoint/pci-endpoint-doe.rst to match the existing
  naming scheme, as suggested by Niklas Cassel
- Updated the documentation as per the design and implementation changes
  made to previous patches in this series:
  * Updated for static protocol array instead of dynamic registration
  * Documented asynchronous callback model
  * Updated request/response flow with new callback signature
  * Updated memory ownership: DOE core frees request, driver frees response
  * Updated initialization and cleanup sections for new APIs

v5: https://lore.kernel.org/all/20260610100256.1889111-5-a-garg7@ti.com/
v4: https://lore.kernel.org/all/20260522052434.802034-5-a-garg7@ti.com/
v3: https://lore.kernel.org/all/20260427051725.223704-5-a-garg7@ti.com/
v2: https://lore.kernel.org/all/20260401073022.215805-5-a-garg7@ti.com/
v1: [1] https://lore.kernel.org/all/20260213123603.420941-2-a-garg7@ti.com/
    [2] https://lore.kernel.org/all/20260213123603.420941-5-a-garg7@ti.com/

 Documentation/PCI/endpoint/index.rst          |   1 +
 .../PCI/endpoint/pci-endpoint-doe.rst         | 352 ++++++++++++++++++
 2 files changed, 353 insertions(+)
 create mode 100644 Documentation/PCI/endpoint/pci-endpoint-doe.rst

diff --git a/Documentation/PCI/endpoint/index.rst b/Documentation/PCI/endpoint/index.rst
index dd1f62e731c9..7c03d5abd2ef 100644
--- a/Documentation/PCI/endpoint/index.rst
+++ b/Documentation/PCI/endpoint/index.rst
@@ -9,6 +9,7 @@ PCI Endpoint Framework
 
    pci-endpoint
    pci-endpoint-cfs
+   pci-endpoint-doe
    pci-test-function
    pci-test-howto
    pci-ntb-function
diff --git a/Documentation/PCI/endpoint/pci-endpoint-doe.rst b/Documentation/PCI/endpoint/pci-endpoint-doe.rst
new file mode 100644
index 000000000000..49bb1d8236f0
--- /dev/null
+++ b/Documentation/PCI/endpoint/pci-endpoint-doe.rst
@@ -0,0 +1,352 @@
+.. SPDX-License-Identifier: GPL-2.0-only OR MIT
+
+.. include:: <isonum.txt>
+
+=============================================
+Data Object Exchange (DOE) for PCIe Endpoint
+=============================================
+
+:Copyright: |copy| 2026 Texas Instruments Incorporated
+:Author: Aksh Garg <a-garg7@ti.com>
+:Co-Author: Siddharth Vadapalli <s-vadapalli@ti.com>
+
+Overview
+========
+
+DOE (Data Object Exchange) is a standard PCIe extended capability feature
+defined in PCI Express Base Specification Revision 7.0, Section 6.30.
+It is an optional mechanism for system firmware/software running on root
+complex (host) to perform :ref:`data object <data-object-term>` exchanges
+with an endpoint function. Each data object is uniquely identified by the
+Vendor ID of the vendor publishing the data object definition and a Data
+Object Type value assigned by that vendor.
+
+Think of DOE as a sophisticated mailbox system built into PCIe. The root
+complex can send structured requests to the endpoint device through DOE
+mailboxes, and the endpoint device responds with appropriate data. These
+mailboxes are implemented as PCIe Extended Capabilities in endpoint devices,
+allowing multiple mailboxes per function, each potentially supporting
+different data object protocols.
+
+The DOE support for root complex devices has already been implemented in
+``drivers/pci/doe.c``.
+
+How DOE Works
+=============
+
+The DOE mailbox operates through a simple request-response model:
+
+1. **Host sends request**: The root complex writes a data object (Vendor ID,
+   Type, and Payload) to the DOE Write Data Mailbox Register (one DWORD at
+   a time) of the endpoint function's DOE Capability and sets the DOE Go bit
+   in the DOE Control Register to indicate that a request is ready for
+   processing.
+2. **Endpoint processes**: The endpoint function reads the request from DOE
+   Write Data Mailbox Register, sets the DOE Busy bit in the DOE Status
+   Register, identifies the protocol of the data object, and executes the
+   appropriate handler.
+3. **Endpoint responds**: The endpoint function writes the response data
+   object to the DOE Read Data Mailbox Register (one DWORD at a time), and
+   sets the Data Object Ready bit in the DOE Status Register to indicate that
+   the response is ready. If an error occurs during request processing (such
+   as unsupported protocol or handler failure), the endpoint sets the DOE
+   Error bit in the DOE Status Register instead of the Data Object Ready bit.
+4. **Host reads response**: The root complex retrieves the response data from
+   the DOE Read Data Mailbox Register once the Data Object Ready bit is set
+   in the DOE Status Register, and then writes any value to this register to
+   indicate a successful read. If the DOE Error bit was set, the root complex
+   discards the response and performs error handling as needed.
+
+Each mailbox operates independently and can handle one transaction at a
+time. The DOE specification supports data objects of size up to 1MB
+(2\ :sup:`18` dwords).
+
+For complete DOE Capability details, refer to `PCI Express Base Specification
+Revision 7.0, Section 6.30 - Data Object Exchange (DOE)`.
+
+Key Terminologies
+=================
+
+.. _data-object-term:
+
+**Data Object**
+  A structured message, either vendor-defined or standard-defined, exchanged
+  between root complex and endpoint function via DOE Capability registers
+  in configuration space of the function.
+
+**Mailbox**
+  A DOE Capability instance on the endpoint device. Each physical function
+  can have multiple mailboxes.
+
+**Protocol**
+  A specific type of DOE communication data object identified by a Vendor ID
+  and Type.
+
+**Handler**
+  A function that processes DOE requests of a specific protocol and generates
+  responses.
+
+Architecture of DOE Implementation for Endpoint
+===============================================
+
+.. code-block:: text
+
+      +------------------+
+      |                  |
+      |   Root Complex   |
+      |                  |
+      +--------^---------+
+               |
+               | Config space access
+               |   over PCIe link
+               |
+    +----------v-----------+
+    |                      |
+    |    PCIe Controller   |
+    |      as Endpoint     |
+    |                      |
+    |  +-----------------+ |
+    |  |   DOE Mailbox   | |
+    |  +------^----------+ |
+    +---------|------------+
+   +----------|-------------------------------------------------------------+
+   |          |                                        +------------------+ |
+   | +--------v---------+           Allocate           | +--------------+ | |
+   | |                  |------------------------------->|   Request    | | |
+   | |   EP Controller  |                            +-->|    Buffer    | | |
+   | |      Driver      |             Free           | | +--------------+ | |
+   | |                  |--------------------------+ | |                  | |
+   | +--------^---------+                          | | |                  | |
+   |          |                                    | | |                  | |
+   |          |                                    | | |                  | |
+   |          | pci_ep_doe_process_request()       | | |                  | |
+   |          |                                    | | |                  | |
+   | +--------v---------+             Free         | | |                  | |
+   | |                  |----------------------------+ |        DDR       | |
+   | |    DOE EP Core   |<----+                    |   |                  | |
+   | |  (pci-ep-doe.c)  |     |     Discovery      |   |                  | |
+   | |                  |-----+  Protocol Handler  |   |                  | |
+   | +--------^---------+                          |   |                  | |
+   |          |                                    |   |                  | |
+   |          | protocol_handler()                 |   |                  | |
+   |          |                                    |   |                  | |
+   | +--------v---------+                          |   |                  | |
+   | |                  |                          |   | +--------------+ | |
+   | | Protocol Handler |                          +---->|   Response   | | |
+   | |      Module      |------------------------------->|    Buffer    | | |
+   | | (CMA/SPDM/Other) |           Allocate           | +--------------+ | |
+   | |                  |                              |                  | |
+   | +------------------+                              |                  | |
+   |                                                   +------------------+ |
+   +------------------------------------------------------------------------+
+
+Initialization and Cleanup
+--------------------------
+
+**Framework Initialization and DOE Setup**
+
+The EPC core automatically initializes and sets up DOE mailboxes through the
+``pci_epc_init_capabilities()`` internal function, which is invoked during
+``pci_epc_init_notify()`` when the controller driver calls this API.
+Controller drivers do not need to handle DOE initialization explicitly;
+the EPC core manages this transparently.
+
+DOE initialization only occurs when the EPC driver reports DOE Capability
+through the ``doe_capable`` flag in its ``pci_epc_features``.
+
+This internal function performs the following steps:
+
+1. Calls ``pci_ep_doe_init(epc)`` to initialize the xarray data structure
+   (a resizable array data structure defined in Linux) named ``doe_mbs`` that
+   stores metadata of DOE mailboxes for the controller in ``struct pci_epc``.
+2. Calls ``pci_epc_doe_setup(epc)`` to discover all DOE capabilities in the
+   endpoint function's configuration space for each function. For each
+   discovered DOE Capability, calls ``pci_ep_doe_add_mailbox(epc, func_no,
+   cap_offset)`` to register the mailbox.
+
+Each DOE mailbox structure created by ``pci_ep_doe_add_mailbox()`` gets an
+ordered workqueue allocated for processing DOE requests sequentially for that
+mailbox, enabling concurrent request handling across different mailboxes.
+Each mailbox is uniquely identified by the combination of physical function
+number and capability offset for that controller.
+
+**Cleanup**
+
+The EPC core automatically cleans up DOE mailboxes through the
+``pci_epc_deinit_capabilities()`` internal function, which is invoked during
+``pci_epc_deinit_notify()`` when the controller driver calls this API.
+Controller drivers do not need to handle DOE cleanup explicitly;
+the EPC core manages this transparently.
+
+DOE cleanup only occurs when the EPC driver reports DOE Capability
+through the ``doe_capable`` flag in its ``pci_epc_features``.
+
+This internal function calls ``pci_ep_doe_destroy(epc)``, which destroys all
+registered mailboxes, cancels any pending tasks, flushes and destroys the
+workqueues, and frees all memory allocated to the mailboxes.
+
+Protocol Handler Support
+------------------------
+
+Protocol implementations (such as CMA, SPDM, or vendor-specific protocols)
+are supported through a static array of protocol handlers.
+
+When a new DOE protocol library is introduced, its handler function
+is added to the static ``pci_doe_protocols`` array in
+``drivers/pci/endpoint/pci-ep-doe.c``. The discovery protocol
+(VID = 0x0001 (PCI-SIG Vendor ID), Type = 0x00 (discovery protocol)) is
+included in this static array and handled internally by the DOE EP core.
+
+Request Handling
+----------------
+
+The complete flow of a DOE request from the root complex to the response:
+
+**Step 1: Root Complex → EP Controller Driver**
+
+The root complex writes a DOE request (Vendor ID, Type, and Payload) to the
+DOE Write Data Mailbox Register in the endpoint function's DOE Capability
+and sets the DOE Go bit in the DOE Control Register, indicating that the
+request is ready for processing.
+
+**Step 2: EP Controller Driver → DOE EP Core**
+
+The controller driver reads the request header to determine the data object
+length. Based on this length field, it allocates a request buffer in memory
+(DDR) of the appropriate size. The driver then reads the complete request
+payload from the DOE Write Data Mailbox Register and converts the data from
+little-endian format (the format followed in the PCIe transactions over the
+link) to CPU-native format using ``le32_to_cpu()``. The driver defines a
+completion callback function with signature ``void (*complete)(struct pci_epc
+*epc, u8 func_no, u16 cap_offset, int status, u16 vendor, u8 type, void
+*response_pl, size_t response_pl_sz)`` to be invoked when the request
+processing completes. The driver then calls ``pci_ep_doe_process_request(epc,
+func_no, cap_offset, vendor, type, request, request_sz, complete)`` to
+hand off the request to the DOE EP core. This function returns immediately
+after queuing the work (without blocking), and the driver sets the DOE Busy
+bit in the DOE Status Register.
+
+**Step 3: DOE EP Core Processing**
+
+The DOE EP core creates a task structure and submits it to the mailbox's
+ordered workqueue. This ensures that requests for each mailbox are processed
+sequentially, one at a time, as required by the DOE specification. It looks
+for the protocol handler based on the Vendor ID and Type from the request
+header, and executes the handler function.
+
+**Step 4: Protocol Handler Execution**
+
+The workqueue executes the task by calling the registered protocol handler:
+``handler(request, request_sz, &response, &response_sz)``. The handler
+processes the request, allocates a response buffer in memory (DDR), builds
+the response data, and returns the response pointer and size. For the
+discovery protocol, the DOE EP core handles this directly without invoking
+an external handler.
+
+**Step 5: DOE EP Core → EP Controller Driver**
+
+After the protocol handler completes, the DOE EP core frees the request
+buffer, and invokes the completion callback provided by the controller
+driver asynchronously. The callback receives the struct pci_epc, function
+number, capability offset (to identify the mailbox), status code indicating
+the result of request processing, Vendor ID and Type of the data object,
+the response buffer, and its size.
+
+**Step 6: EP Controller Driver → Root Complex**
+
+The controller driver converts the response from CPU-native format to
+little-endian format using ``cpu_to_le32()``, writes the response to DOE
+Read Data Mailbox Register, and sets the Data Object Ready bit in the DOE
+Status Register. The root complex then reads the response from the DOE Read
+Data Mailbox Register. Finally, the controller driver frees the response
+buffer (which the handler allocated).
+
+Asynchronous Request Processing
+-------------------------------
+
+The DOE-EP framework implements asynchronous request processing because an
+endpoint function can have multiple instances of DOE mailboxes, and requests
+may be interleaved across these mailboxes. Request processing of one mailbox
+should not result in blocking request processing of other mailboxes. Hence,
+requests on different mailboxes need to be handled in parallel.
+
+For the EP controller driver to handle requests on multiple mailboxes in
+parallel, ``pci_ep_doe_process_request()`` must be asynchronous. The function
+returns immediately after submitting the request to the mailbox's workqueue,
+without waiting for the request to complete. A completion callback provided
+by the controller driver is invoked asynchronously when request processing
+finishes. This asynchronous design enables concurrent processing of requests
+across different mailboxes.
+
+Abort Handling
+--------------
+
+The DOE specification allows the root complex to abort ongoing DOE operations
+by setting the DOE Abort bit in the DOE Control Register.
+
+**Trigger**
+
+When the root complex sets the DOE Abort bit, the EP controller driver
+detects this condition (typically in an interrupt handler or register
+polling routine). The action taken depends on the timing of the abort:
+
+- **ABORT before request transfer**: If the DOE Abort bit is set before the
+  root complex transfers the request to the mailbox registers, the controller
+  driver should not call the ``pci_ep_doe_abort()`` API.
+
+- **ABORT during request transfer**: If the DOE Abort bit is set while the
+  root complex is still transferring the request to the mailbox registers,
+  the controller driver should discard the request, and should call neither
+  ``pci_ep_doe_abort()`` nor ``pci_ep_doe_process_request()`` in the
+  respective IRQ handlers.
+
+- **ABORT after request submission**: If the DOE Abort bit is set after
+  the request has been fully received and submitted to the DOE EP core via
+  ``pci_ep_doe_process_request()``, the controller driver must call
+  ``pci_ep_doe_abort(epc, func_no, cap_offset)`` for the affected mailbox
+  to perform the abort sequence in the DOE EP core.
+
+**Abort Sequence**
+
+The abort function sets the CANCEL flag on the mailbox to prevent queued
+requests from starting. Instead of waiting for the workqueue to flush,
+it returns immediately.
+
+The CANCEL flag gets cleared after invoking the completion callback,
+allowing the mailbox to accept new requests.
+
+Queued requests that have not started execution will be aborted with an
+error status. The currently executing request will complete normally,
+and the controller will reject the response if it arrives after the abort
+sequence has been triggered.
+
+.. note::
+   Independent of when the DOE Abort bit is triggered, the controller
+   driver must clear the DOE Error, Busy, and Ready bits in the DOE Status
+   Register after completing the abort operation to reset the mailbox to
+   an idle state.
+
+Error Handling
+--------------
+
+Errors can occur during DOE request processing for various reasons, such as
+unsupported protocols, handler failures, or memory allocation failures.
+
+**Error Detection**
+
+When an error occurs during DOE request processing, the DOE EP core
+propagates this error back to the controller driver either through the
+``pci_ep_doe_process_request()`` return value, or the status code passed
+to the completion callback.
+
+**Error Response**
+
+When the controller driver receives an error code, it sets the DOE Error bit
+in the DOE Status Register instead of writing a response to the DOE Read Data
+Mailbox Register, and frees the buffers.
+
+API Reference
+=============
+
+.. kernel-doc:: drivers/pci/endpoint/pci-ep-doe.c
+   :export:
-- 
2.34.1


* [PATCH v6 1/4] PCI/DOE: Move common definitions to the header file
From: Aksh Garg @ 2026-06-23  9:07 UTC (permalink / raw)
  To: linux-pci, linux-doc, mani, kwilczynski, bhelgaas, corbet, kishon,
	skhan, lukas, cassel, alistair
  Cc: linux-arm-kernel, linux-kernel, rdunlap, Frank.Li, s-vadapalli,
	danishanwar, srk, a-garg7
In-Reply-To: <20260623090737.711656-1-a-garg7@ti.com>

Move common macros and structures from drivers/pci/doe.c to
drivers/pci/pci.h to allow reuse across root complex and
endpoint DOE implementations.

The PCI_DOE_MAX_LENGTH macro is also useful outside the PCI core, so
move it to include/linux/pci-doe.h.

These changes prepare the groundwork for the DOE endpoint implementation
that will reuse these common definitions.

Co-developed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Aksh Garg <a-garg7@ti.com>
---

Changes from v5 to v6:
- None.

Changes from v4 to v5:
- None.

Changes from v3 to v4:
- None.

Changes from v2 to v3:
- Rebased on 7.1-rc1.

Changes since v1:
- Moved the common macros that need not be visible outside the PCI core
  to drivers/pci/pci.h instead of include/linux/pci-doe.h as suggested
  by Lukas Wunner
- Removed the redundant empty inlines guarded with CONFIG_PCI_DOE in
  include/linux/pci-doe.h.

v5: https://lore.kernel.org/all/20260610100256.1889111-2-a-garg7@ti.com/
v4: https://lore.kernel.org/all/20260522052434.802034-2-a-garg7@ti.com/
v3: https://lore.kernel.org/all/20260427051725.223704-2-a-garg7@ti.com/
v2: https://lore.kernel.org/all/20260401073022.215805-2-a-garg7@ti.com/
v1: https://lore.kernel.org/all/20260213123603.420941-3-a-garg7@ti.com/
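
For illustration only: a minimal userspace sketch of how PCI_DOE_MAX_LENGTH
relates to the length field of a data object header. The field masks are
assumed to match the PCI_DOE_DATA_OBJECT_HEADER_* definitions in
include/uapi/linux/pci_regs.h, and the treatment of an encoded length of 0
as the maximum is assumed from the existing handling in drivers/pci/doe.c.
The DOE_HDR* macros and the doe_object_*() helpers are hypothetical names
used only here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Max data object length is 2^18 dwords (the macro moved by this patch). */
#define PCI_DOE_MAX_LENGTH	(1 << 18)

/* Assumed header layout: DW0 holds Vendor ID and Type, DW1 holds Length. */
#define DOE_HDR1_VID(dw)	((uint16_t)((dw) & 0xffff))
#define DOE_HDR1_TYPE(dw)	((uint8_t)(((dw) >> 16) & 0xff))
#define DOE_HDR2_LENGTH(dw)	((dw) & 0x3ffff)

/* 2^18 dwords of 4 bytes each is exactly 1 MiB. */
static_assert((size_t)PCI_DOE_MAX_LENGTH * sizeof(uint32_t) == 1024 * 1024,
	      "maximum data object size is 1 MiB");

/* Hypothetical helper: total object length in dwords, header included. */
static uint32_t doe_object_dwords(uint32_t hdr2)
{
	uint32_t len = DOE_HDR2_LENGTH(hdr2);

	/* An encoded length of 0 denotes the maximum, 2^18 dwords. */
	return len ? len : PCI_DOE_MAX_LENGTH;
}

/* Hypothetical helper: the same length expressed in bytes. */
static size_t doe_object_bytes(uint32_t hdr2)
{
	return (size_t)doe_object_dwords(hdr2) * sizeof(uint32_t);
}
```

A consumer outside the PCI core could use such a bound to size or validate
a request buffer before reading the full object from the mailbox.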

 drivers/pci/doe.c       | 11 -----------
 drivers/pci/pci.h       |  9 +++++++++
 include/linux/pci-doe.h |  3 +++
 3 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
index 7b41da4ec11a..e8d9e95644b3 100644
--- a/drivers/pci/doe.c
+++ b/drivers/pci/doe.c
@@ -28,12 +28,6 @@
 #define PCI_DOE_TIMEOUT HZ
 #define PCI_DOE_POLL_INTERVAL	(PCI_DOE_TIMEOUT / 128)
 
-#define PCI_DOE_FLAG_CANCEL	0
-#define PCI_DOE_FLAG_DEAD	1
-
-/* Max data object length is 2^18 dwords */
-#define PCI_DOE_MAX_LENGTH	(1 << 18)
-
 /**
  * struct pci_doe_mb - State for a single DOE mailbox
  *
@@ -63,11 +57,6 @@ struct pci_doe_mb {
 #endif
 };
 
-struct pci_doe_feature {
-	u16 vid;
-	u8 type;
-};
-
 /**
  * struct pci_doe_task - represents a single query/response
  *
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 4a14f88e543a..5844deee2b5f 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -683,6 +683,15 @@ struct pci_sriov {
 	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
 };
 
+/* DOE Mailbox state flags */
+#define PCI_DOE_FLAG_CANCEL	0
+#define PCI_DOE_FLAG_DEAD	1
+
+struct pci_doe_feature {
+	u16 vid;
+	u8 type;
+};
+
 #ifdef CONFIG_PCI_DOE
 void pci_doe_init(struct pci_dev *pdev);
 void pci_doe_destroy(struct pci_dev *pdev);
diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
index bd4346a7c4e7..abb9b7ae8029 100644
--- a/include/linux/pci-doe.h
+++ b/include/linux/pci-doe.h
@@ -19,6 +19,9 @@ struct pci_doe_mb;
 #define PCI_DOE_FEATURE_CMA 1
 #define PCI_DOE_FEATURE_SSESSION 2
 
+/* Max data object length is 2^18 dwords */
+#define PCI_DOE_MAX_LENGTH		(1 << 18)
+
 struct pci_doe_mb *pci_find_doe_mailbox(struct pci_dev *pdev, u16 vendor,
 					u8 type);
 
-- 
2.34.1


* [PATCH v6 2/4] PCI: endpoint: Add DOE mailbox support for endpoint functions
From: Aksh Garg @ 2026-06-23  9:07 UTC (permalink / raw)
  To: linux-pci, linux-doc, mani, kwilczynski, bhelgaas, corbet, kishon,
	skhan, lukas, cassel, alistair
  Cc: linux-arm-kernel, linux-kernel, rdunlap, Frank.Li, s-vadapalli,
	danishanwar, srk, a-garg7
In-Reply-To: <20260623090737.711656-1-a-garg7@ti.com>

DOE (Data Object Exchange) is a standard PCIe extended capability
feature defined in PCI Express Base Specification Revision 7.0,
Section 6.30. It provides a communication mechanism primarily used for
implementing PCIe security features such as device authentication and
secure link establishment. Think of DOE as a sophisticated mailbox
system built into PCIe. The root complex can send structured requests
to the endpoint device through DOE mailboxes, and the endpoint device
responds with appropriate data.

Add DOE support for PCIe endpoint devices, enabling endpoint functions
to process DOE requests from the host. The implementation provides
framework APIs for the EPC core driver and controller drivers to
register mailboxes and to process requests. Per-mailbox ordered
workqueues ensure sequential handling within a mailbox and parallel
handling across mailboxes. The Discovery protocol is handled internally
by the DOE core.

This implementation complements the existing DOE implementation for
root complex in drivers/pci/doe.c.

Co-developed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Aksh Garg <a-garg7@ti.com>
---

Changes from v5 to v6:
- Addressed the review comments provided by Bjorn Helgaas on v5
- Used xa_lock_irqsave() instead of xa_lock() in pci_ep_doe_get_mailbox()
- Added a spinlock to protect the work_queue of the doe_mb struct from
  being destroyed while it is in use

Changes from v4 to v5:
- Addressed the review comments by Sashiko
- Added refcount per DOE Mailbox to fix Use-After-Free bug
- Change in the Abort Sequence:
  * Instead of waiting on flush_workqueue() to clear the CANCEL flag,
    return immediately after setting the CANCEL flag. The CANCEL flag
    gets cleared in signal_task_complete(), allowing the mailbox to
    accept new requests
  * Abort sequence handling in various scenarios is updated and explained
    in the documentation at PATCH 4/4

Changes from v3 to v4:
- Used 'Returns' instead of 'RETURNS' in the function docstrings to
  comply with kernel-doc format, as suggested by Manivannan Sadhasivam.
- In pci_ep_doe_process_request(), changed the type of request buffer
  from "const void *" to "void *", as the ownership is transferred to
  DOE-EP framework, which is responsible to free the buffer.
- Added "struct pci_epc *epc" to typedef "pci_ep_doe_complete_t", to be
  used by the EPC driver.

Changes from v2 to v3:
- Rebased on 7.1-rc1.

Changes since v1:
- Moved the DOE-EP core file to drivers/pci/endpoint/pci-ep-doe.c, and
  corresponding Kconfig and Makefile to match the existing naming scheme,
  as suggested by Niklas Cassel.
- Renamed the config from PCI_DOE_EP to PCI_ENDPOINT_DOE
- Moved the function declarations that need not be visible outside the
  PCI core to drivers/pci/pci.h instead of include/linux/pci-doe.h as
  suggested by Lukas Wunner
- Converted from synchronous to asynchronous request processing:
  * Removed wait_for_completion() from pci_ep_doe_process_request()
  * Function returns immediately after queuing to workqueue, hence
    removed private data for completion in the task structure
  * Added completion callback as an additional argument to
    pci_ep_doe_process_request(), which takes the response and status
    parameters as arguments (along with other required arguments), hence
    removed task_status in the task structure
  * Created a typedef pci_ep_doe_complete_t for completion callback
  * Removed the pci_ep_doe_task_complete() function, as it would not be
    required anymore with these changes
  * Moved from INIT_WORK_ONSTACK() to INIT_WORK(), to initialize the work
    on heap instead of stack
  * signal_task_complete() now invokes the completion callback, once the
    protocol handler completes its task
- Changed from dynamic xarray-based protocol registration to static array:
  * Removed the register/unregister protocol APIs
  * Replaced the dynamic xarray with static array of struct pci_doe_protocol
  * Added discovery protocol to static array, instead of treating it specially,
    hence removed the special handling for Discovery protocol in
    doe_ep_task_work()
  * Updated pci_ep_doe_handle_discovery() and pci_ep_doe_find_protocol()
    accordingly.
- Memory Management:
  * DOE core frees request buffer in signal_task_complete()
    or during error handling
  * pci_ep_doe_process_request() defines response_pl and response_pl_sz
    as NULL and 0 respectively, whose pointer is passed to the protocol
    handler, hence removed the arguments void **response, size_t *response_sz
    to this function.
- Task structure refactoring:
  * Response buffer: void **response_pl to void *response_pl
  * Response size: size_t *response_pl_sz to size_t response_pl_sz
  * Changed the completion callback to type pci_ep_doe_complete_t
  * Removed void *private and int task_status
- Updated documentation comments of the functions according to the changes

v5: https://lore.kernel.org/all/20260610100256.1889111-3-a-garg7@ti.com/
v4: https://lore.kernel.org/all/20260522052434.802034-3-a-garg7@ti.com/
v3: https://lore.kernel.org/all/20260427051725.223704-3-a-garg7@ti.com/
v2: https://lore.kernel.org/all/20260401073022.215805-3-a-garg7@ti.com/
v1: https://lore.kernel.org/all/20260213123603.420941-4-a-garg7@ti.com/
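
For illustration only: a minimal userspace sketch of the composite xarray
key that identifies a mailbox by physical function number and capability
offset. DOE_MB_KEY() uses the same packing as PCI_DOE_MB_KEY() in this
patch; the inverse helpers doe_mb_key_func() and doe_mb_key_offset() are
hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Same packing as PCI_DOE_MB_KEY(): function number in bits 16 and up. */
#define DOE_MB_KEY(func, offset) \
	(((unsigned long)(func) << 16) | (offset))

/* Function 1 with capability offset 0x100 packs to 0x10100. */
static_assert(DOE_MB_KEY(1, 0x100) == 0x10100UL,
	      "function number sits above the 16-bit offset");

/* Hypothetical inverse helper: recover the function number. */
static uint8_t doe_mb_key_func(unsigned long key)
{
	return (uint8_t)(key >> 16);
}

/* Hypothetical inverse helper: recover the capability offset. */
static uint16_t doe_mb_key_offset(unsigned long key)
{
	return (uint16_t)(key & 0xffff);
}
```

Because the offset is a u16 and occupies only the low 16 bits, two
mailboxes at the same offset in different functions never share a key.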

 drivers/pci/endpoint/Kconfig      |  14 +
 drivers/pci/endpoint/Makefile     |   1 +
 drivers/pci/endpoint/pci-ep-doe.c | 591 ++++++++++++++++++++++++++++++
 drivers/pci/pci.h                 |  42 +++
 include/linux/pci-doe.h           |   5 +
 include/linux/pci-epc.h           |   3 +
 6 files changed, 656 insertions(+)
 create mode 100644 drivers/pci/endpoint/pci-ep-doe.c

diff --git a/drivers/pci/endpoint/Kconfig b/drivers/pci/endpoint/Kconfig
index 8dad291be8b8..15ae16aaa58f 100644
--- a/drivers/pci/endpoint/Kconfig
+++ b/drivers/pci/endpoint/Kconfig
@@ -36,6 +36,20 @@ config PCI_ENDPOINT_MSI_DOORBELL
 	  doorbell. The RC can trigger doorbell in EP by writing data to a
 	  dedicated BAR, which the EP maps to the controller's message address.
 
+config PCI_ENDPOINT_DOE
+	bool "PCI Endpoint Data Object Exchange (DOE) support"
+	depends on PCI_ENDPOINT
+	help
+	  This enables support for Data Object Exchange (DOE) protocol
+	  on PCI Endpoint controllers. It provides a communication
+	  mechanism through mailboxes, primarily used for PCIe security
+	  features.
+
+	  Say Y here if you want to be able to communicate using PCIe DOE
+	  mailboxes.
+
+	  If unsure, say N.
+
 source "drivers/pci/endpoint/functions/Kconfig"
 
 endmenu
diff --git a/drivers/pci/endpoint/Makefile b/drivers/pci/endpoint/Makefile
index b4869d52053a..1fa176b6792b 100644
--- a/drivers/pci/endpoint/Makefile
+++ b/drivers/pci/endpoint/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_PCI_ENDPOINT_CONFIGFS)	+= pci-ep-cfs.o
 obj-$(CONFIG_PCI_ENDPOINT)		+= pci-epc-core.o pci-epf-core.o\
 					   pci-epc-mem.o functions/
 obj-$(CONFIG_PCI_ENDPOINT_MSI_DOORBELL)	+= pci-ep-msi.o
+obj-$(CONFIG_PCI_ENDPOINT_DOE)		+= pci-ep-doe.o
diff --git a/drivers/pci/endpoint/pci-ep-doe.c b/drivers/pci/endpoint/pci-ep-doe.c
new file mode 100644
index 000000000000..b2832253eaca
--- /dev/null
+++ b/drivers/pci/endpoint/pci-ep-doe.c
@@ -0,0 +1,591 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+/*
+ * Data Object Exchange for PCIe Endpoint
+ *	PCIe r7.0, sec 6.30 DOE
+ *
+ * Copyright (C) 2026 Texas Instruments Incorporated - https://www.ti.com
+ *	Aksh Garg <a-garg7@ti.com>
+ *	Siddharth Vadapalli <s-vadapalli@ti.com>
+ */
+
+#define dev_fmt(fmt) "DOE EP: " fmt
+
+#include <linux/bitfield.h>
+#include <linux/device.h>
+#include <linux/pci.h>
+#include <linux/pci-epc.h>
+#include <linux/pci-doe.h>
+#include <linux/refcount.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include <linux/xarray.h>
+
+#include "../pci.h"
+
+/* Forward declaration of discovery protocol handler */
+static int pci_ep_doe_handle_discovery(const void *request, size_t request_sz,
+				       void **response, size_t *response_sz);
+
+/**
+ * struct pci_doe_protocol - DOE protocol handler entry
+ * @vid: Vendor ID
+ * @type: Protocol Type
+ * @handler: Handler function pointer
+ */
+struct pci_doe_protocol {
+	u16 vid;
+	u8 type;
+	pci_doe_protocol_handler_t handler;
+};
+
+/**
+ * struct pci_ep_doe_mb - State for a single DOE mailbox on EP
+ *
+ * This state is used to manage a single DOE mailbox capability on the
+ * endpoint side.
+ *
+ * @epc: PCI endpoint controller this mailbox belongs to
+ * @func_no: Physical function number of the function this mailbox belongs to
+ * @cap_offset: Capability offset
+ * @work_queue: Queue of work items
+ * @flags: Bit array of PCI_DOE_FLAG_* flags
+ * @refs: Refcount to manage mailbox lifetime and ensure safe cleanup
+ * @lock: Spinlock protecting work_queue access and DEAD flag checks
+ */
+struct pci_ep_doe_mb {
+	struct pci_epc *epc;
+	u8 func_no;
+	u16 cap_offset;
+	struct workqueue_struct *work_queue;
+	unsigned long flags;
+	refcount_t refs;
+	spinlock_t lock;	/* Serialize work queue access */
+};
+
+/**
+ * struct pci_ep_doe_task - Represents a single DOE request/response task
+ *
+ * @feat: DOE feature (Vendor ID and Type)
+ * @request_pl: Request payload
+ * @request_pl_sz: Size of request payload in bytes
+ * @response_pl: Response buffer
+ * @response_pl_sz: Size of response buffer in bytes
+ * @complete: Completion callback
+ * @work: Work structure for workqueue
+ * @doe_mb: DOE mailbox handling this task
+ */
+struct pci_ep_doe_task {
+	struct pci_doe_feature feat;
+	const void *request_pl;
+	size_t request_pl_sz;
+	void *response_pl;
+	size_t response_pl_sz;
+	pci_ep_doe_complete_t complete;
+
+	/* Initialized by pci_ep_doe_submit_task() */
+	struct work_struct work;
+	struct pci_ep_doe_mb *doe_mb;
+};
+
+/*
+ * Global registry of protocol handlers.
+ * When a new DOE protocol library is added, add an entry to this array.
+ */
+static const struct pci_doe_protocol pci_doe_protocols[] = {
+	{
+		.vid = PCI_VENDOR_ID_PCI_SIG,
+		.type = PCI_DOE_FEATURE_DISCOVERY,
+		.handler = pci_ep_doe_handle_discovery,
+	},
+};
+
+/*
+ * Combine function number and capability offset into a unique lookup key
+ * for storing/retrieving DOE mailboxes in an xarray.
+ */
+#define PCI_DOE_MB_KEY(func, offset) \
+	(((unsigned long)(func) << 16) | (offset))
+#define PCI_DOE_PROTOCOL_COUNT        ARRAY_SIZE(pci_doe_protocols)
+
+/**
+ * pci_ep_doe_init() - Initialize the DOE framework for a controller in EP mode
+ * @epc: PCI endpoint controller
+ *
+ * Initialize the xarray that will hold the mailboxes.
+ */
+void pci_ep_doe_init(struct pci_epc *epc)
+{
+	xa_init(&epc->doe_mbs);
+}
+EXPORT_SYMBOL_GPL(pci_ep_doe_init);
+
+/**
+ * pci_ep_doe_add_mailbox() - Add a DOE mailbox for a physical function
+ * @epc: PCI endpoint controller
+ * @func_no: Physical function number
+ * @cap_offset: Offset of the DOE capability
+ *
+ * Create and register a DOE mailbox for the specified physical function
+ * and capability offset.
+ *
+ * EPC core driver calls this for each DOE capability discovered in the config
+ * space of each endpoint function if DOE support is available for the EPC.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+int pci_ep_doe_add_mailbox(struct pci_epc *epc, u8 func_no, u16 cap_offset)
+{
+	struct pci_ep_doe_mb *doe_mb;
+	unsigned long key;
+	int ret;
+
+	doe_mb = kzalloc_obj(*doe_mb, GFP_KERNEL);
+	if (!doe_mb)
+		return -ENOMEM;
+
+	doe_mb->epc = epc;
+	doe_mb->func_no = func_no;
+	doe_mb->cap_offset = cap_offset;
+
+	doe_mb->work_queue = alloc_ordered_workqueue("pci_ep_doe[%s:pf%d:offset%x]",
+						     0, dev_name(&epc->dev),
+						     func_no, cap_offset);
+	if (!doe_mb->work_queue) {
+		dev_err(epc->dev.parent,
+			"[pf%d:offset%x] failed to allocate work queue\n",
+			func_no, cap_offset);
+		ret = -ENOMEM;
+		goto err_free;
+	}
+
+	/* Initialize before the mailbox becomes visible in the xarray */
+	refcount_set(&doe_mb->refs, 1);
+	spin_lock_init(&doe_mb->lock);
+
+	key = PCI_DOE_MB_KEY(func_no, cap_offset);
+	ret = xa_insert(&epc->doe_mbs, key, doe_mb, GFP_KERNEL);
+	if (ret) {
+		dev_err(epc->dev.parent,
+			"[pf%d:offset%x] failed to insert mailbox: %d\n",
+			func_no, cap_offset, ret);
+		goto err_destroy;
+	}
+
+	dev_dbg(epc->dev.parent,
+		"DOE mailbox added: pf%d offset 0x%x\n",
+		func_no, cap_offset);
+
+	return 0;
+
+err_destroy:
+	destroy_workqueue(doe_mb->work_queue);
+err_free:
+	kfree(doe_mb);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_ep_doe_add_mailbox);
+
+/**
+ * pci_ep_doe_cancel_tasks() - Cancel all pending tasks
+ * @doe_mb: DOE mailbox
+ *
+ * Cancel all pending tasks in the mailbox. Mark the mailbox as dead
+ * so no new tasks can be submitted.
+ */
+static void pci_ep_doe_cancel_tasks(struct pci_ep_doe_mb *doe_mb)
+{
+	/* Mark the mailbox as dead */
+	set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags);
+
+	/* Stop all pending work items from starting */
+	set_bit(PCI_DOE_FLAG_CANCEL, &doe_mb->flags);
+}
+
+/**
+ * pci_ep_doe_get_mailbox() - Get DOE mailbox by function and offset
+ * @epc: PCI endpoint controller
+ * @func_no: Physical function number
+ * @cap_offset: Offset of the DOE capability
+ *
+ * Internal helper to look up a DOE mailbox by its function number and
+ * capability offset.
+ *
+ * Return: Pointer to the mailbox or NULL if not found
+ */
+static struct pci_ep_doe_mb *pci_ep_doe_get_mailbox(struct pci_epc *epc,
+						    u8 func_no, u16 cap_offset)
+{
+	struct pci_ep_doe_mb *doe_mb;
+	unsigned long key, flags;
+
+	key = PCI_DOE_MB_KEY(func_no, cap_offset);
+
+	xa_lock_irqsave(&epc->doe_mbs, flags);
+
+	doe_mb = xa_load(&epc->doe_mbs, key);
+	if (doe_mb && !refcount_inc_not_zero(&doe_mb->refs))
+		doe_mb = NULL;
+
+	xa_unlock_irqrestore(&epc->doe_mbs, flags);
+
+	return doe_mb;
+}
+
+/**
+ * pci_ep_doe_put_mailbox() - Release a reference to a DOE mailbox
+ * @doe_mb: The mailbox structure to release
+ *
+ * Drop the reference count. Free the memory allocated to the mailbox structure
+ * if this was the last active reference.
+ */
+static void pci_ep_doe_put_mailbox(struct pci_ep_doe_mb *doe_mb)
+{
+	if (refcount_dec_and_test(&doe_mb->refs))
+		kfree(doe_mb);
+}
+
+/**
+ * pci_ep_doe_find_protocol() - Find protocol handler in static array
+ * @vendor: Vendor ID
+ * @type: Protocol Type
+ *
+ * Look up a protocol handler in the static protocol array by matching
+ * Vendor ID and Protocol Type.
+ *
+ * Return: Handler function pointer or NULL if not found
+ */
+static pci_doe_protocol_handler_t pci_ep_doe_find_protocol(u16 vendor, u8 type)
+{
+	int i;
+
+	/* Search static protocol array */
+	for (i = 0; i < PCI_DOE_PROTOCOL_COUNT; i++) {
+		if (pci_doe_protocols[i].vid == vendor &&
+		    pci_doe_protocols[i].type == type)
+			return pci_doe_protocols[i].handler;
+	}
+
+	return NULL;
+}
+
+/**
+ * pci_ep_doe_handle_discovery() - Handle Discovery protocol request
+ * @request: Request payload
+ * @request_sz: Request size
+ * @response: Output pointer for response buffer
+ * @response_sz: Output pointer for response size
+ *
+ * Handle the DOE Discovery protocol. The request contains an index specifying
+ * which protocol to query. This function creates a response containing the
+ * Vendor ID and Protocol Type for the requested index, along with the next
+ * index value for further discovery:
+ *
+ * - next_index = 0: Signals this is the last protocol supported
+ * - next_index = n (non-zero): Signals more protocols available,
+ *   query index n next
+ *
+ * Return: 0 on success, -errno on failure
+ */
+static int pci_ep_doe_handle_discovery(const void *request, size_t request_sz,
+				       void **response, size_t *response_sz)
+{
+	struct pci_doe_protocol protocol;
+	u8 requested_index, next_index;
+	u32 *response_pl;
+	u32 request_pl;
+	u16 vendor;
+	u8 type;
+
+	if (request_sz != sizeof(u32))
+		return -EINVAL;
+
+	request_pl = *(u32 *)request;
+	requested_index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX,
+				    request_pl);
+
+	if (requested_index >= PCI_DOE_PROTOCOL_COUNT) {
+		/* No more protocols to report */
+		vendor = 0;
+		type = 0;
+	} else {
+		/* Get protocol from array at requested_index */
+		protocol = pci_doe_protocols[requested_index];
+		vendor = protocol.vid;
+		type = protocol.type;
+	}
+
+	/* Calculate next index */
+	next_index = (requested_index + 1 < PCI_DOE_PROTOCOL_COUNT) ?
+		      requested_index + 1 : 0;
+
+	response_pl = kzalloc_obj(*response_pl, GFP_KERNEL);
+	if (!response_pl)
+		return -ENOMEM;
+
+	/* Build response */
+	*response_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, vendor) |
+		       FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_RSP_3_TYPE, type) |
+		       FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
+				  next_index);
+
+	*response = response_pl;
+	*response_sz = sizeof(*response_pl);
+
+	return 0;
+}
+
+static void signal_task_complete(struct pci_ep_doe_task *task, int status)
+{
+	struct pci_ep_doe_mb *doe_mb = task->doe_mb;
+
+	task->complete(doe_mb->epc, doe_mb->func_no, doe_mb->cap_offset,
+		       status, task->feat.vid, task->feat.type,
+		       task->response_pl, task->response_pl_sz);
+
+	/* Clear the CANCEL flag for next DOE request */
+	clear_bit(PCI_DOE_FLAG_CANCEL, &doe_mb->flags);
+
+	kfree(task->request_pl);
+	kfree(task);
+
+	/* Release the mailbox reference acquired during process_request */
+	pci_ep_doe_put_mailbox(doe_mb);
+}
+
+/**
+ * doe_ep_task_work() - Work function for processing DOE EP tasks
+ * @work: Work structure
+ *
+ * Process a DOE request by calling the appropriate protocol handler.
+ */
+static void doe_ep_task_work(struct work_struct *work)
+{
+	struct pci_ep_doe_task *task = container_of(work,
+						    struct pci_ep_doe_task,
+						    work);
+	struct pci_ep_doe_mb *doe_mb = task->doe_mb;
+	pci_doe_protocol_handler_t handler;
+	int rc;
+
+	if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) {
+		signal_task_complete(task, -EIO);
+		return;
+	}
+
+	/* Check if request was aborted */
+	if (test_bit(PCI_DOE_FLAG_CANCEL, &doe_mb->flags)) {
+		signal_task_complete(task, -ECANCELED);
+		return;
+	}
+
+	handler = pci_ep_doe_find_protocol(task->feat.vid, task->feat.type);
+	if (!handler) {
+		dev_warn_ratelimited(doe_mb->epc->dev.parent,
+				     "[%d:%x] Unsupported protocol VID=%04x TYPE=%02x\n",
+				     doe_mb->func_no, doe_mb->cap_offset,
+				     task->feat.vid, task->feat.type);
+		signal_task_complete(task, -EOPNOTSUPP);
+		return;
+	}
+
+	rc = handler(task->request_pl, task->request_pl_sz,
+		     &task->response_pl, &task->response_pl_sz);
+
+	signal_task_complete(task, rc);
+}
+
+/**
+ * pci_ep_doe_submit_task() - Submit a task to be processed
+ * @doe_mb: DOE mailbox
+ * @task: Task to submit
+ *
+ * Submit a DOE task to the workqueue for asynchronous processing.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+static int pci_ep_doe_submit_task(struct pci_ep_doe_mb *doe_mb,
+				  struct pci_ep_doe_task *task)
+{
+	unsigned long flags;
+	int ret = 0;
+
+	spin_lock_irqsave(&doe_mb->lock, flags);
+	if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) {
+		ret = -EIO;
+		goto out;
+	}
+
+	task->doe_mb = doe_mb;
+	INIT_WORK(&task->work, doe_ep_task_work);
+	queue_work(doe_mb->work_queue, &task->work);
+
+out:
+	spin_unlock_irqrestore(&doe_mb->lock, flags);
+	return ret;
+}
+
+/**
+ * pci_ep_doe_process_request() - Process DOE request on endpoint
+ * @epc: PCI endpoint controller
+ * @func_no: Physical function number
+ * @cap_offset: DOE capability offset
+ * @vendor: Vendor ID from request header
+ * @type: Protocol Type from request header
+ * @request: Request payload in CPU-native format
+ * @request_sz: Size of request payload (bytes)
+ * @complete: Callback to invoke upon completion
+ *
+ * Asynchronously process a DOE request received on the endpoint. The request
+ * payload should not include the DOE header (vendor/type/length). Ownership
+ * of the request buffer is transferred to DOE EP core, which frees the buffer
+ * either on error or by signal_task_complete() after the completion callback
+ * fires. The protocol handler will allocate the response buffer, which the
+ * caller (controller driver) must free after use.
+ *
+ * This function returns immediately after queuing the request. The completion
+ * callback will be invoked asynchronously from workqueue context once the
+ * request is processed. The callback receives the function number and
+ * capability offset to identify the mailbox, along with a status code
+ * (0 on success, -errno on failure), and other required arguments.
+ *
+ * As per DOE specification, a mailbox processes one request at a time.
+ * Therefore, this function will never be called concurrently for the same
+ * mailbox by different callers.
+ *
+ * The caller is responsible for the conversion of the received DOE request
+ * with le32_to_cpu() before calling this function. Similarly, it is also
+ * responsible for converting the response payload with cpu_to_le32() before
+ * sending it back over the DOE mailbox.
+ *
+ * The caller is also responsible for ensuring that the request size is within
+ * the limits defined by PCI_DOE_MAX_LENGTH.
+ *
+ * Return: 0 if the request was successfully queued, -errno on failure
+ */
+int pci_ep_doe_process_request(struct pci_epc *epc, u8 func_no, u16 cap_offset,
+			       u16 vendor, u8 type, void *request,
+			       size_t request_sz,
+			       pci_ep_doe_complete_t complete)
+{
+	struct pci_ep_doe_mb *doe_mb;
+	struct pci_ep_doe_task *task;
+	int ret;
+
+	doe_mb = pci_ep_doe_get_mailbox(epc, func_no, cap_offset);
+	if (!doe_mb) {
+		kfree(request);
+		return -ENODEV;
+	}
+
+	task = kzalloc_obj(*task, GFP_ATOMIC);
+	if (!task) {
+		ret = -ENOMEM;
+		goto err_free;
+	}
+
+	task->feat.vid = vendor;
+	task->feat.type = type;
+	task->request_pl = request;
+	task->request_pl_sz = request_sz;
+	task->response_pl = NULL;
+	task->response_pl_sz = 0;
+	task->complete = complete;
+
+	ret = pci_ep_doe_submit_task(doe_mb, task);
+	if (ret)
+		goto err_task;
+
+	return 0;
+
+err_task:
+	kfree(task);
+err_free:
+	kfree(request);
+	pci_ep_doe_put_mailbox(doe_mb);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_ep_doe_process_request);
+
+/**
+ * pci_ep_doe_abort() - Abort DOE operations on a mailbox
+ * @epc: PCI endpoint controller
+ * @func_no: Physical function number
+ * @cap_offset: DOE capability offset
+ *
+ * Abort the queued or in-flight DOE operation for the specified mailbox.
+ * This function is called by the EP controller driver when the RC sets the
+ * DOE Abort bit in the DOE Control Register, and the DOE Busy bit is set in
+ * the DOE Status Register.
+ *
+ * Set the CANCEL flag on the mailbox to prevent queued requests
+ * from starting, and return immediately. The CANCEL flag gets cleared in
+ * signal_task_complete(), allowing the mailbox to accept new requests.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+int pci_ep_doe_abort(struct pci_epc *epc, u8 func_no, u16 cap_offset)
+{
+	struct pci_ep_doe_mb *doe_mb;
+
+	doe_mb = pci_ep_doe_get_mailbox(epc, func_no, cap_offset);
+	if (!doe_mb)
+		return -ENODEV;
+
+	/* Set CANCEL flag - worker will abort queued requests */
+	set_bit(PCI_DOE_FLAG_CANCEL, &doe_mb->flags);
+
+	dev_dbg_ratelimited(epc->dev.parent,
+			    "DOE mailbox abort initiated: PF%d offset 0x%x\n",
+			    func_no, cap_offset);
+
+	pci_ep_doe_put_mailbox(doe_mb);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pci_ep_doe_abort);
+
+/**
+ * pci_ep_doe_destroy_mb() - Destroy a single DOE mailbox
+ * @doe_mb: DOE mailbox to destroy
+ *
+ * Internal function to destroy a mailbox and free its resources.
+ */
+static void pci_ep_doe_destroy_mb(struct pci_ep_doe_mb *doe_mb)
+{
+	struct workqueue_struct *wq;
+	unsigned long flags;
+
+	pci_ep_doe_cancel_tasks(doe_mb);
+
+	spin_lock_irqsave(&doe_mb->lock, flags);
+	wq = doe_mb->work_queue;
+	doe_mb->work_queue = NULL;
+	spin_unlock_irqrestore(&doe_mb->lock, flags);
+
+	if (wq)
+		destroy_workqueue(wq);
+
+	pci_ep_doe_put_mailbox(doe_mb);
+}
+
+/**
+ * pci_ep_doe_destroy() - Destroy all DOE mailboxes
+ * @epc: PCI endpoint controller
+ *
+ * Destroy all DOE mailboxes and free associated resources.
+ *
+ * The EPC core driver calls this to free all DOE resources,
+ * if DOE support is available for the EPC.
+ */
+void pci_ep_doe_destroy(struct pci_epc *epc)
+{
+	struct pci_ep_doe_mb *doe_mb;
+	unsigned long index;
+
+	xa_for_each(&epc->doe_mbs, index, doe_mb) {
+		xa_erase(&epc->doe_mbs, index);
+		pci_ep_doe_destroy_mb(doe_mb);
+	}
+
+	xa_destroy(&epc->doe_mbs);
+}
+EXPORT_SYMBOL_GPL(pci_ep_doe_destroy);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5844deee2b5f..6d3b4b779d15 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -692,6 +692,13 @@ struct pci_doe_feature {
 	u8 type;
 };
 
+struct pci_epc;
+
+typedef void (*pci_ep_doe_complete_t)(struct pci_epc *epc, u8 func_no,
+				      u16 cap_offset, int status,
+				      u16 vendor, u8 type,
+				      void *response_pl, size_t response_pl_sz);
+
 #ifdef CONFIG_PCI_DOE
 void pci_doe_init(struct pci_dev *pdev);
 void pci_doe_destroy(struct pci_dev *pdev);
@@ -702,6 +709,41 @@ static inline void pci_doe_destroy(struct pci_dev *pdev) { }
 static inline void pci_doe_disconnected(struct pci_dev *pdev) { }
 #endif
 
+#ifdef CONFIG_PCI_ENDPOINT_DOE
+void pci_ep_doe_init(struct pci_epc *epc);
+int pci_ep_doe_add_mailbox(struct pci_epc *epc, u8 func_no, u16 cap_offset);
+int pci_ep_doe_process_request(struct pci_epc *epc, u8 func_no, u16 cap_offset,
+			       u16 vendor, u8 type, void *request,
+			       size_t request_sz,
+			       pci_ep_doe_complete_t complete);
+int pci_ep_doe_abort(struct pci_epc *epc, u8 func_no, u16 cap_offset);
+void pci_ep_doe_destroy(struct pci_epc *epc);
+#else
+static inline void pci_ep_doe_init(struct pci_epc *epc) { }
+static inline int pci_ep_doe_add_mailbox(struct pci_epc *epc, u8 func_no,
+					 u16 cap_offset)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int pci_ep_doe_process_request(struct pci_epc *epc,
+					     u8 func_no, u16 cap_offset,
+					     u16 vendor, u8 type,
+					     void *request, size_t request_sz,
+					     pci_ep_doe_complete_t complete)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int pci_ep_doe_abort(struct pci_epc *epc, u8 func_no,
+				   u16 cap_offset)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void pci_ep_doe_destroy(struct pci_epc *epc) { }
+#endif
+
 #ifdef CONFIG_PCI_NPEM
 void pci_npem_create(struct pci_dev *dev);
 void pci_npem_remove(struct pci_dev *dev);
diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
index abb9b7ae8029..c46e42f3ce78 100644
--- a/include/linux/pci-doe.h
+++ b/include/linux/pci-doe.h
@@ -22,6 +22,11 @@ struct pci_doe_mb;
 /* Max data object length is 2^18 dwords */
 #define PCI_DOE_MAX_LENGTH		(1 << 18)
 
+typedef int (*pci_doe_protocol_handler_t)(const void *request,
+					  size_t request_sz,
+					  void **response,
+					  size_t *response_sz);
+
 struct pci_doe_mb *pci_find_doe_mailbox(struct pci_dev *pdev, u16 vendor,
 					u8 type);
 
diff --git a/include/linux/pci-epc.h b/include/linux/pci-epc.h
index 1eca1264815b..dd26294c8175 100644
--- a/include/linux/pci-epc.h
+++ b/include/linux/pci-epc.h
@@ -182,6 +182,9 @@ struct pci_epc {
 	unsigned long			function_num_map;
 	int				domain_nr;
 	bool				init_complete;
+#ifdef CONFIG_PCI_ENDPOINT_DOE
+	struct xarray			doe_mbs;
+#endif
 };
 
 /**
-- 
2.34.1
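
For reference, the next_index convention described in the kernel-doc of
pci_ep_doe_handle_discovery() can be modelled in userspace as below. This is
only a minimal sketch: the protocol table, discovery_response() and
discovery_walk() are hypothetical names that stand in for pci_doe_protocols[]
and the handler, and the bit positions mirror the
PCI_DOE_DATA_OBJECT_DISC_RSP_3_* fields used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical protocol table, standing in for pci_doe_protocols[]. */
struct doe_protocol {
	uint16_t vid;
	uint8_t type;
};

static const struct doe_protocol protocols[] = {
	{ 0x0001, 0x00 },	/* PCI-SIG, DOE Discovery */
	{ 0x0001, 0x01 },	/* PCI-SIG, CMA/SPDM */
};

#define PROTOCOL_COUNT (sizeof(protocols) / sizeof(protocols[0]))

/* Response layout: VID in bits 15:0, type in 23:16, next index in 31:24. */
static uint32_t discovery_response(uint8_t index)
{
	uint16_t vid = 0;
	uint8_t type = 0;
	uint8_t next;

	if (index < PROTOCOL_COUNT) {
		vid = protocols[index].vid;
		type = protocols[index].type;
	}

	/* next_index == 0 tells the requester that this was the last entry */
	next = (index + 1u < PROTOCOL_COUNT) ? (uint8_t)(index + 1) : 0;

	return (uint32_t)vid | ((uint32_t)type << 16) | ((uint32_t)next << 24);
}

/* Requester side: follow next_index until it wraps to 0; return the count. */
static size_t discovery_walk(void)
{
	size_t seen = 0;
	uint8_t index = 0;

	do {
		uint32_t rsp = discovery_response(index);

		seen++;
		assert(seen <= 256);	/* an 8-bit index cannot exceed this */
		index = (uint8_t)(rsp >> 24);
	} while (index != 0);

	return seen;
}
```

With the two-entry table above, a requester starting at index 0 sees both
protocols and stops when the second response carries next_index = 0.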


^ permalink raw reply related

* [PATCH v6 3/4] PCI: endpoint: Add support for DOE initialization and setup in EPC core
From: Aksh Garg @ 2026-06-23  9:07 UTC (permalink / raw)
  To: linux-pci, linux-doc, mani, kwilczynski, bhelgaas, corbet, kishon,
	skhan, lukas, cassel, alistair
  Cc: linux-arm-kernel, linux-kernel, rdunlap, Frank.Li, s-vadapalli,
	danishanwar, srk, a-garg7
In-Reply-To: <20260623090737.711656-1-a-garg7@ti.com>

Add pci_epc_init_capabilities() to the EPC core driver to initialize and
set up the capabilities supported by the EPC driver. It calls
pci_epc_doe_setup() to set up the DOE framework for an endpoint controller:
this discovers the DOE capabilities (extended capability ID 0x2E) and
registers a DOE mailbox for each one found, across all functions of the
endpoint controller.

Add pci_epc_deinit_capabilities() to the EPC core driver to clean up the
resources used by the capabilities of the EPC driver. It calls
pci_ep_doe_destroy() to destroy all DOE mailboxes and free the associated
resources.

Co-developed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Aksh Garg <a-garg7@ti.com>
---

Changes from v5 to v6:
- Addressed the review comments provided by Bjorn Helgaas at v5

Changes from v4 to v5:
- Addressed the review comments by Sashiko

Changes from v3 to v4:
- Call DOE setup and destroy APIs directly within the EPC core, instead of
  relying on the EPC drivers to call them individually. EPC drivers do not
  need to explicitly handle DOE setup, rather the EPC core manages this
  transparently. (Suggested by Manivannan Sadhasivam).
- Removed pci_epc_doe_destroy() API, which was just calling pci_ep_doe_destroy().
  Instead, called pci_ep_doe_destroy() directly during cleanup.
- Called pci_ep_doe_init() before the "!epc->ops->find_ext_capability" check.
  Otherwise, with doe_capable=1 and no find_ext_capability() op defined, the
  epc->doe_mbs xarray would be left uninitialized, while the cleanup path,
  which does not repeat that check, would still try to destroy it.

Changes from v2 to v3:
- Rebased on 7.1-rc1.

Changes since v1:
- New patch added to v2 (not present in v1)

v5: https://lore.kernel.org/all/20260610100256.1889111-4-a-garg7@ti.com/
v4: https://lore.kernel.org/all/20260522052434.802034-4-a-garg7@ti.com/
v3: https://lore.kernel.org/all/20260427051725.223704-4-a-garg7@ti.com/
v2: https://lore.kernel.org/all/20260401073022.215805-4-a-garg7@ti.com/

This patch is introduced based on the feedback provided by Manivannan
Sadhasivam at [1].

[1]: https://lore.kernel.org/all/p57x6jleaim5w7t2k3v7tioujnaxuovfpj5euop5ogefvw23se@y5fw3che5p5d/
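
For reference, the lookup that pci_epc_doe_setup() delegates to the
find_ext_capability() op can be modelled in userspace as below. This is only
a minimal sketch, not the op of any real EPC driver: demo_cfg, cfg_read32(),
cfg_put_ext_cap() and demo_count_doe_mailboxes() are hypothetical helpers
working on a mock config space, and the "start == 0 means search from the
head of the list" convention is assumed from the way the setup loop calls
the op.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CFG_SPACE_SIZE	4096u
#define EXT_CAP_START	0x100u	/* first extended capability header */
#define EXT_CAP_ID_DOE	0x2Eu

/* Extended capability header: ID in bits 15:0, next offset in bits 31:20. */
#define EXT_CAP_ID(hdr)		((uint16_t)((hdr) & 0xffffu))
#define EXT_CAP_NEXT(hdr)	((uint16_t)(((hdr) >> 20) & 0xffcu))

/* Hypothetical mock of one function's config space (little-endian). */
static uint8_t demo_cfg[CFG_SPACE_SIZE];

static uint32_t cfg_read32(const uint8_t *cfg, uint16_t off)
{
	return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
	       ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

static void cfg_put_ext_cap(uint8_t *cfg, uint16_t off, uint16_t id,
			    uint16_t next)
{
	uint32_t hdr = id | (1u << 16) | ((uint32_t)next << 20);

	for (int i = 0; i < 4; i++)
		cfg[off + i] = (uint8_t)(hdr >> (8 * i));
}

/*
 * Return the offset of the next capability with ID @cap after @start,
 * or 0 if there is none. @start == 0 searches from the head of the list.
 */
static uint16_t find_ext_capability(const uint8_t *cfg, uint16_t start,
				    uint16_t cap)
{
	unsigned int ttl = (CFG_SPACE_SIZE - EXT_CAP_START) / 8; /* loop guard */
	uint16_t pos = EXT_CAP_START;

	assert(cfg != NULL);

	if (start) {
		if (start < EXT_CAP_START || start > CFG_SPACE_SIZE - 4)
			return 0;
		pos = EXT_CAP_NEXT(cfg_read32(cfg, start));
	}

	while (ttl-- > 0 && pos >= EXT_CAP_START && pos <= CFG_SPACE_SIZE - 4) {
		uint32_t hdr = cfg_read32(cfg, pos);

		if (hdr == 0)
			break;
		if (EXT_CAP_ID(hdr) == cap)
			return pos;
		pos = EXT_CAP_NEXT(hdr);
	}

	return 0;
}

/* Same loop shape as pci_epc_doe_setup(), for a single function. */
static unsigned int demo_count_doe_mailboxes(void)
{
	unsigned int count = 0;
	uint16_t off;

	memset(demo_cfg, 0, sizeof(demo_cfg));
	cfg_put_ext_cap(demo_cfg, 0x100, 0x0001, 0x150);	/* AER */
	cfg_put_ext_cap(demo_cfg, 0x150, EXT_CAP_ID_DOE, 0x200);
	cfg_put_ext_cap(demo_cfg, 0x200, EXT_CAP_ID_DOE, 0);

	off = find_ext_capability(demo_cfg, 0, EXT_CAP_ID_DOE);
	while (off) {
		count++;	/* pci_ep_doe_add_mailbox() would be called here */
		off = find_ext_capability(demo_cfg, off, EXT_CAP_ID_DOE);
	}

	return count;
}
```

In the mock layout above, the walk skips the AER capability at 0x100 and
reports the two DOE instances at 0x150 and 0x200, one mailbox each.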


 drivers/pci/endpoint/pci-epc-core.c | 101 ++++++++++++++++++++++++++++
 include/linux/pci-epc.h             |   6 ++
 2 files changed, 107 insertions(+)

diff --git a/drivers/pci/endpoint/pci-epc-core.c b/drivers/pci/endpoint/pci-epc-core.c
index 6c3c58185fc5..96bd624559f2 100644
--- a/drivers/pci/endpoint/pci-epc-core.c
+++ b/drivers/pci/endpoint/pci-epc-core.c
@@ -14,6 +14,8 @@
 #include <linux/pci-epf.h>
 #include <linux/pci-ep-cfs.h>
 
+#include "../pci.h"
+
 static const struct class pci_epc_class = {
 	.name = "pci_epc",
 };
@@ -842,6 +844,78 @@ void pci_epc_linkdown(struct pci_epc *epc)
 }
 EXPORT_SYMBOL_GPL(pci_epc_linkdown);
 
+/**
+ * pci_epc_doe_setup() - Discover and setup DOE mailboxes for all functions
+ * @epc: the EPC device on which DOE mailboxes have to be set up
+ *
+ * Discover DOE (Data Object Exchange) capabilities for all physical
+ * functions in the endpoint controller and register DOE mailboxes.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+static int pci_epc_doe_setup(struct pci_epc *epc)
+{
+	u8 func_no, vfunc_no = 0;
+	u16 cap_offset;
+	int ret;
+
+	if (!epc->ops || !epc->ops->find_ext_capability)
+		return -EINVAL;
+
+	/* Discover DOE capabilities for all functions */
+	for (func_no = 0; func_no < epc->max_functions; func_no++) {
+		mutex_lock(&epc->lock);
+		cap_offset = epc->ops->find_ext_capability(epc, func_no,
+							   vfunc_no, 0,
+							   PCI_EXT_CAP_ID_DOE);
+		mutex_unlock(&epc->lock);
+
+		while (cap_offset) {
+			/* Register this DOE mailbox */
+			ret = pci_ep_doe_add_mailbox(epc, func_no, cap_offset);
+			if (ret) {
+				dev_warn(&epc->dev,
+					 "[pf%d:offset %x] failed to add DOE mailbox\n",
+					 func_no, cap_offset);
+			}
+
+			mutex_lock(&epc->lock);
+			cap_offset = epc->ops->find_ext_capability(epc, func_no,
+								   vfunc_no,
+								   cap_offset,
+								   PCI_EXT_CAP_ID_DOE);
+			mutex_unlock(&epc->lock);
+		}
+	}
+
+	dev_dbg(&epc->dev, "DOE mailboxes setup complete\n");
+	return 0;
+}
+
+/**
+ * pci_epc_init_capabilities() - Initialize EPC capabilities
+ * @epc: the EPC device whose capabilities need to be initialized
+ *
+ * Initialize capabilities supported by the EPC device.
+ */
+static void pci_epc_init_capabilities(struct pci_epc *epc)
+{
+	const struct pci_epc_features *epc_features;
+	int ret;
+
+	epc_features = pci_epc_get_features(epc, 0, 0);
+	if (!epc_features)
+		return;
+
+	if (IS_ENABLED(CONFIG_PCI_ENDPOINT_DOE) && epc_features->doe_capable) {
+		pci_ep_doe_init(epc);
+
+		ret = pci_epc_doe_setup(epc);
+		if (ret)
+			dev_warn(&epc->dev, "DOE setup failed: %d\n", ret);
+	}
+}
+
 /**
  * pci_epc_init_notify() - Notify the EPF device that EPC device initialization
  *                         is completed.
@@ -857,6 +931,9 @@ void pci_epc_init_notify(struct pci_epc *epc)
 	if (IS_ERR_OR_NULL(epc))
 		return;
 
+	if (!epc->init_complete)
+		pci_epc_init_capabilities(epc);
+
 	mutex_lock(&epc->list_lock);
 	list_for_each_entry(epf, &epc->pci_epf, list) {
 		mutex_lock(&epf->lock);
@@ -890,6 +967,27 @@ void pci_epc_notify_pending_init(struct pci_epc *epc, struct pci_epf *epf)
 }
 EXPORT_SYMBOL_GPL(pci_epc_notify_pending_init);
 
+/**
+ * pci_epc_deinit_capabilities() - Clean up EPC capabilities
+ * @epc: the EPC device whose capabilities need to be cleaned up
+ *
+ * Clean up capabilities supported by the EPC device,
+ * and free the associated resources.
+ */
+static void pci_epc_deinit_capabilities(struct pci_epc *epc)
+{
+	const struct pci_epc_features *epc_features;
+
+	epc_features = pci_epc_get_features(epc, 0, 0);
+	if (!epc_features)
+		return;
+
+	if (IS_ENABLED(CONFIG_PCI_ENDPOINT_DOE) && epc_features->doe_capable) {
+		pci_ep_doe_destroy(epc);
+		dev_dbg(&epc->dev, "DOE mailboxes destroyed\n");
+	}
+}
+
 /**
  * pci_epc_deinit_notify() - Notify the EPF device about EPC deinitialization
  * @epc: the EPC device whose deinitialization is completed
@@ -903,6 +1001,9 @@ void pci_epc_deinit_notify(struct pci_epc *epc)
 	if (IS_ERR_OR_NULL(epc))
 		return;
 
+	if (epc->init_complete)
+		pci_epc_deinit_capabilities(epc);
+
 	mutex_lock(&epc->list_lock);
 	list_for_each_entry(epf, &epc->pci_epf, list) {
 		mutex_lock(&epf->lock);
diff --git a/include/linux/pci-epc.h b/include/linux/pci-epc.h
index dd26294c8175..11474e337db3 100644
--- a/include/linux/pci-epc.h
+++ b/include/linux/pci-epc.h
@@ -84,6 +84,8 @@ struct pci_epc_map {
  * @start: ops to start the PCI link
  * @stop: ops to stop the PCI link
  * @get_features: ops to get the features supported by the EPC
+ * @find_ext_capability: ops to find extended capability offset for a function
+ *			 in endpoint controller
  * @owner: the module owner containing the ops
  */
 struct pci_epc_ops {
@@ -115,6 +117,8 @@ struct pci_epc_ops {
 	void	(*stop)(struct pci_epc *epc);
 	const struct pci_epc_features* (*get_features)(struct pci_epc *epc,
 						       u8 func_no, u8 vfunc_no);
+	u16	(*find_ext_capability)(struct pci_epc *epc, u8 func_no,
+				       u8 vfunc_no, u16 start, u8 cap);
 	struct module *owner;
 };
 
@@ -270,6 +274,7 @@ struct pci_epc_bar_desc {
  * @msi_capable: indicate if the endpoint function has MSI capability
  * @msix_capable: indicate if the endpoint function has MSI-X capability
  * @intx_capable: indicate if the endpoint can raise INTx interrupts
+ * @doe_capable: indicate if the endpoint function has DOE capability
  * @bar: array specifying the hardware description for each BAR
  * @align: alignment size required for BAR buffer allocation
  */
@@ -280,6 +285,7 @@ struct pci_epc_features {
 	unsigned int	msi_capable : 1;
 	unsigned int	msix_capable : 1;
 	unsigned int	intx_capable : 1;
+	unsigned int	doe_capable : 1;
 	struct	pci_epc_bar_desc bar[PCI_STD_NUM_BARS];
 	size_t	align;
 };
-- 
2.34.1


^ permalink raw reply related

* [PATCH v6 0/4] PCI: Add DOE support for endpoint
From: Aksh Garg @ 2026-06-23  9:07 UTC (permalink / raw)
  To: linux-pci, linux-doc, mani, kwilczynski, bhelgaas, corbet, kishon,
	skhan, lukas, cassel, alistair
  Cc: linux-arm-kernel, linux-kernel, rdunlap, Frank.Li, s-vadapalli,
	danishanwar, srk, a-garg7

This patch series introduces the framework for supporting the Data
Object Exchange (DOE) feature for PCIe endpoint devices. Please refer
to the documentation added in patch 4 for details on the feature and
implementation architecture.

The implementation provides a common framework for all PCIe endpoint
controllers, not specific to any particular SoC vendor.

Currently, no EPC driver supports DOE, so the APIs introduced in this series
have no users. To avoid merging dead code into the kernel, this series is
not meant to be merged yet. I am posting it so that it can be reviewed by
the time the EPC driver gets submitted, as discussed at [1].
[1]: https://lore.kernel.org/all/fa3c59fa-cfa0-49ed-b656-2e9aaf45e440@ti.com/

The changes since v1 are documented in the respective patch descriptions.

v5: https://lore.kernel.org/all/20260610100256.1889111-1-a-garg7@ti.com/
v4: https://lore.kernel.org/all/20260522052434.802034-1-a-garg7@ti.com/
v3: https://lore.kernel.org/all/20260427051725.223704-1-a-garg7@ti.com/
v2: https://lore.kernel.org/all/20260401073022.215805-1-a-garg7@ti.com/
v1 (RFC): https://lore.kernel.org/all/20260213123603.420941-1-a-garg7@ti.com/

Below is a code demonstration showing the integration of DOE-EP APIs with
EPC drivers.

Note: The code below only illustrates how an EPC driver is expected to
      use the pci_ep_doe_process_request() and pci_ep_doe_abort() APIs; it
      might not cover all the corner cases. It also assumes that the EPC
      hardware provides memory buffers holding the data written to the DOE
      Write Data Mailbox register and the data to be returned through the
      DOE Read Data Mailbox register.

============================================================================

/* ========== DOE Completion Callback (invoked by DOE-EP core) ========== */

static void doe_completion_cb(struct pci_epc *epc, u8 func_no, u16 cap_offset,
			       int status, u16 vendor, u8 type,
			       void *response_pl, size_t response_pl_sz)
{
	struct epc_driver *drv = epc_get_drvdata(epc);
	u32 *response = (u32 *)response_pl;
	u32 header1, header2;
	int payload_dw, i;
	
	if (readl(drv->base + PF_DOE_CTRL_REG(func_no, cap_offset)) & DOE_CTRL_ABORT) {
		/* Aborted: do not send response */
		goto free;
	}

	if (status < 0) {
		/* Error: set ERROR bit in DOE Status register */
		writel(1 << DOE_STATUS_ERROR,
		       drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
		goto free;
	}

	/* Success: write DOE headers first, then response to the read memory */

	/* Header 1: Vendor ID (bits 15:0) | Type (bits 23:16) */
	header1 = (type << 16) | vendor;
	writel(header1, drv->base + PF_DOE_RD_MEMORY_WR_REG(func_no, cap_offset));

	/* Header 2: Length in DW (including 2 DW of headers + payload) */
	payload_dw = DIV_ROUND_UP(response_pl_sz, sizeof(u32));
	header2 = 2 + payload_dw;  /* 2 header DWs + payload */
	writel(header2, drv->base + PF_DOE_RD_MEMORY_WR_REG(func_no, cap_offset));
	
	/* Set READY bit to signal response ready */
	writel(1 << DOE_STATUS_READY,
	       drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));

	/* Write response payload DWORDs to Read memory */
	for (i = 0; i < payload_dw; i++)
		writel(response[i],
		       drv->base + PF_DOE_RD_MEMORY_WR_REG(func_no, cap_offset));

	/* Wait for the memory to empty before clearing the READY bit */
	while (!RD_MEMORY_EMPTY()) {/* wait */}

	writel(0 << DOE_STATUS_READY,
	       drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));

free:
	/* unset BUSY bit */
	writel(0 << DOE_STATUS_BUSY,
	       drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));

	kfree(response_pl);
}

/* ========== DOE Interrupt Handler (triggered on GO bit from root complex) ========== */

static irqreturn_t doe_interrupt_handler(int irq, void *priv)
{
	struct epc_driver *drv = priv;
	u16 cap_offset = extract_cap_offset_from_irq(irq);
	u8 func_no = extract_func_from_irq(irq);
	u32 header1, header2, length_dw, *request;
	u16 vendor;
	u8 type;
	int i, ret;

	/* Read first header DWORD: Vendor ID (bits 15:0) | Type (bits 23:16) */
	header1 = readl(drv->base + PF_DOE_WR_MEMORY_RD_REG(func_no, cap_offset));
	vendor = header1 & 0xFFFF;
	type = (header1 >> 16) & 0xFF;

	/* Read second header DWORD: Length in DW (includes 2 DW of headers) */
	header2 = readl(drv->base + PF_DOE_WR_MEMORY_RD_REG(func_no, cap_offset));
	length_dw = header2 & 0x3FFFF;  /* Bits 17:0 */

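	if (!length_dw)
		length_dw = PCI_DOE_MAX_LENGTH;

	/* A valid data object is at least the 2 header DWs long */
	if (length_dw < 2) {
		writel(1 << DOE_STATUS_ERROR,
		       drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
		return IRQ_HANDLED;
	}

	length_dw -= 2;  /* Subtract 2 DW of headers to get payload length */
	/* Allocate buffer for the payload only; the headers are passed separately */
	request = kcalloc(length_dw, sizeof(u32), GFP_ATOMIC);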
	if (!request) {
		writel(1 << DOE_STATUS_ERROR,
		       drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
		return IRQ_HANDLED;
	}

	/* Read remaining payload DWORDs from Write memory */
	for (i = 0; i < length_dw; i++) {
		while (WR_MEMORY_EMPTY()) { /* wait */ }
		request[i] = readl(drv->base + PF_DOE_WR_MEMORY_RD_REG(func_no, cap_offset));
	}
	
	mutex_lock(&lock);
	/* Check the ABORT bit, if set then return */
	if (readl(drv->base + PF_DOE_CTRL_REG(func_no, cap_offset)) & DOE_CTRL_ABORT) {
		kfree(request);
		mutex_unlock(&lock);
		return IRQ_HANDLED;
	}

	/* Set BUSY bit */
	writel(1 << DOE_STATUS_BUSY,
	       drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
	mutex_unlock(&lock);

	/* Hand off to DOE-EP core for asynchronous processing */
	ret = pci_ep_doe_process_request(drv->epc, func_no, cap_offset,
					 vendor, type, (void *)request,
					 length_dw * sizeof(u32),
					 doe_completion_cb);
	if (ret) {
		/* The DOE-EP core owns the request buffer and has already freed it */
		writel(1 << DOE_STATUS_ERROR,
		       drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
	}

	return IRQ_HANDLED;
}

/* ========== Abort Handler (triggered on ABORT bit from root complex) ========== */

static irqreturn_t doe_abort_handler(int irq, void *priv)
{
	struct epc_driver *drv = priv;
	u16 cap_offset = extract_cap_offset_from_irq(irq);
	u8 func_no = extract_func_from_irq(irq);
	
	mutex_lock(&lock);
	
	/* Call the abort API only if BUSY is set (pci_ep_doe_process_request() called) */
	if (readl(drv->base + PF_DOE_STATUS_REG(func_no, cap_offset)) &
	    (1 << DOE_STATUS_BUSY))
		pci_ep_doe_abort(drv->epc, func_no, cap_offset);
	
	mutex_unlock(&lock);

	/* Discard Write memory contents */
	writel(DOE_WR_MEMORY_CTRL_DISCARD,
	       drv->base + PF_DOE_WR_MEMORY_CTRL_REG(func_no, cap_offset));

	/* Clear status bits */
	writel((0 << DOE_STATUS_ERROR) | (0 << DOE_STATUS_READY),
	       drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));

	return IRQ_HANDLED;
}

====================================================================================
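
The length handling in the demonstration above can be checked in isolation
with the userspace sketch below. It is only an illustration under stated
assumptions: doe_payload_dwords() and doe_encode_length() are hypothetical
helper names, and the limits mirror PCI_DOE_MAX_LENGTH and the 2-DW header
used in the demonstration. A length field of 0 encodes 2^18 DWs, and a value
below 2 cannot describe a valid data object.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DOE_MAX_LENGTH_DW	(1L << 18)	/* mirrors PCI_DOE_MAX_LENGTH */
#define DOE_HEADER_DW		2L		/* two header DWs per object */
#define DOE_LENGTH_MASK		0x3ffffu	/* header 2, bits 17:0 */

/* Payload length in DWs for a received header 2, or -EINVAL. */
static long doe_payload_dwords(uint32_t header2)
{
	long length_dw = (long)(header2 & DOE_LENGTH_MASK);

	if (length_dw == 0)
		length_dw = DOE_MAX_LENGTH_DW;	/* 0 encodes 2^18 DWs */

	assert(length_dw <= DOE_MAX_LENGTH_DW);

	if (length_dw < DOE_HEADER_DW)
		return -EINVAL;	/* shorter than the headers themselves */

	return length_dw - DOE_HEADER_DW;
}

/* Header 2 value for a response carrying @payload_dw DWs, or -EINVAL. */
static long doe_encode_length(long payload_dw)
{
	long length_dw;

	if (payload_dw < 0 || payload_dw > DOE_MAX_LENGTH_DW - DOE_HEADER_DW)
		return -EINVAL;

	length_dw = payload_dw + DOE_HEADER_DW;

	/* The maximum length does not fit in 18 bits and is encoded as 0 */
	return length_dw == DOE_MAX_LENGTH_DW ? 0 : length_dw;
}
```

An interrupt handler could reject a request as soon as doe_payload_dwords()
returns a negative value, before allocating the payload buffer.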

Aksh Garg (4):
  PCI/DOE: Move common definitions to the header file
  PCI: endpoint: Add DOE mailbox support for endpoint functions
  PCI: endpoint: Add support for DOE initialization and setup in EPC
    core
  Documentation: PCI: Add documentation for DOE endpoint support

 Documentation/PCI/endpoint/index.rst          |   1 +
 .../PCI/endpoint/pci-endpoint-doe.rst         | 352 +++++++++++
 drivers/pci/doe.c                             |  11 -
 drivers/pci/endpoint/Kconfig                  |  14 +
 drivers/pci/endpoint/Makefile                 |   1 +
 drivers/pci/endpoint/pci-ep-doe.c             | 591 ++++++++++++++++++
 drivers/pci/endpoint/pci-epc-core.c           | 101 +++
 drivers/pci/pci.h                             |  51 ++
 include/linux/pci-doe.h                       |   8 +
 include/linux/pci-epc.h                       |   9 +
 10 files changed, 1128 insertions(+), 11 deletions(-)
 create mode 100644 Documentation/PCI/endpoint/pci-endpoint-doe.rst
 create mode 100644 drivers/pci/endpoint/pci-ep-doe.c

-- 
2.34.1


^ permalink raw reply

* Re: [PATCH v8 15/46] KVM: guest_memfd: Call arch invalidate hooks on conversion
From: Fuad Tabba @ 2026-06-23  8:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	willy, wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ajneQVLriUshjFIO@google.com>

Hi Sean,

On Tue, 23 Jun 2026 at 02:15, Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jun 19, 2026, Fuad Tabba wrote:
> > On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
> > <devnull+ackerleytng.google.com@kernel.org> wrote:
> > >
> > > From: Ackerley Tng <ackerleytng@google.com>
> > >
> > > When memory in guest_memfd is converted from private to shared, the
> > > platform-specific state associated with the guest-private pages must be
> > > invalidated or cleaned up.
> > >
> > > Iterate over the folios in the affected range and call the
> > > kvm_arch_gmem_invalidate() hook for each PFN range. This allows
> > > architectures to perform necessary teardown, such as updating hardware
> > > metadata or encryption states, before the pages are transitioned to the
> > > shared state.
> > >
> > > Invoke this helper after indicating to KVM's mmu code that an invalidation
> > > is in progress to stop in-flight page faults from succeeding.
> > >
> > > Reviewed-by: Fuad Tabba <tabba@google.com>
> > > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> >
> > Coming back to this after working through the arm64/pKVM side. My
> > Reviewed-by here is from the previous round and the patch hasn't
> > changed, but I missed an implication for arm64.
> >
> > kvm_arch_gmem_invalidate() is now called from two paths with the same
> > (start, end) signature: folio teardown (kvm_gmem_free_folio) and
> > private->shared conversion (here). For SNP/TDX that's fine, conversion is
> > destructive anyway. For pKVM the two need opposite content semantics:
> > conversion must preserve the page in place (same physical page, the point
> > of in-place conversion without encryption), while teardown must scrub it
> > before returning it to the host.
> >
> > The hook gets only a pfn range with no indication of which caller it's
> > serving, so arm64 can't give the two paths the behaviour they need. It
> > would help to signal intent on the conversion path: a reason/flag, a
> > separate hook, or not routing non-destructive conversion through the
> > teardown hook.
> >
> > arm64 isn't here yet, so this isn't urgent, but the hook is gaining a
> > second caller now, and it's cheaper to leave room for the distinction
> > than to change a generic contract other arches depend on later.
>
> Crud.  It may not be urgent for arm64, but it's urgent for other reasons that
> I "can't" describe in detail at the moment, and even if that weren't the case, I
> think we should clean things up now.  More below.

No problem on the parts you can't get into. Agreed it's worth cleaning up
now, and worth doing in this round rather than landing the overloaded
hook: reworking a generic contract once SNP/TDX (and eventually arm64)
depend on it is the expensive path.

>
> > >  virt/kvm/guest_memfd.c | 41 +++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 41 insertions(+)
> > >
> > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > index 433f79047b9d1..3c94442bc8131 100644
> > > --- a/virt/kvm/guest_memfd.c
> > > +++ b/virt/kvm/guest_memfd.c
> > > @@ -607,6 +607,42 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> > >         return safe;
> > >  }
> > >
> > > +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> > > +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
>
> Not your fault, but kvm_arch_gmem_invalidate() is badly misnamed.  It's not
> "invalidating" anything, it's much more of a "free" callback, as SNP uses it to
> put physical pages back into a shared state when a maybe-private folio is freed.
>
> As Fuad points out, (ab)using that hook for the private=>shared conversion case
> "works", but not broadly.  And it makes the bad name worse, because it's called
> from code that _is_ doing true invalidations.  For pKVM, it may not even need to
> do anything invalidation-like.

Agreed on the name and the overload, and for pKVM the split is more than
cosmetic. The free/teardown path is where pKVM has to scrub a page before
it goes back to the host; conversion has to leave the page in place with
its contents intact (no encryption, same physical page in both states).
Keeping scrub on the free callback and off the conversion path is what
preserves that, so this helps us, it isn't just tidying SNP.
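
As a rough illustration of that split (a toy userspace model, not kernel
code; every toy_* name below is hypothetical and only stands in for
whatever the free callback and the conversion path end up being called),
the two paths differ only in what happens to the page contents:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

/* Hypothetical stand-in for one guest page tracked by guest_memfd. */
struct toy_page {
	bool private;                      /* current memory attribute */
	unsigned char data[TOY_PAGE_SIZE]; /* page contents */
};

/* Free/teardown path: scrub before the page goes back to the host. */
static void toy_free_page(struct toy_page *p)
{
	assert(p != NULL);
	memset(p->data, 0, sizeof(p->data));
	p->private = false;
}

/* Conversion path: same page, contents intact; only the attribute flips. */
static void toy_make_shared(struct toy_page *p)
{
	assert(p != NULL);
	p->private = false;
}

/* Returns true if every byte of the page is zero. */
static bool toy_page_is_zeroed(const struct toy_page *p)
{
	assert(p != NULL);
	for (size_t i = 0; i < sizeof(p->data); i++)
		if (p->data[i])
			return false;
	return true;
}
```

A converted page keeps its bytes, while a freed page is zeroed before the
host sees it again. Routing both through one hook would force one of the
two behaviours onto the other path.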

>
> To avoid a conflict with patches that are going to have priority over this series,
> to set the stage for arm64 support, and to avoid bleeding vendor details
> into guest_memfd, as if they are core guest_memfd behavior (only SNP needs the
> "invalidation" on this specific transition), I think we should add an arch hook
> to do conversions straightaway.
>
> Unless there's a clever option I'm missing, it'll mean adding yet another
> HAVE_KVM_ARCH_GMEM_XXX flag?  Hmm, especially because IIUC, arm64/pKVM doesn't
> need a callback for this case, only the free_folio case.
>
> > > +{
> > > +       struct folio_batch fbatch;
> > > +       pgoff_t next = start;
> > > +       int i;
> > > +
> > > +       folio_batch_init(&fbatch);
> > > +       while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
> > > +               for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> > > +                       struct folio *folio = fbatch.folios[i];
> > > +                       pgoff_t start_index, end_index;
> > > +                       kvm_pfn_t start_pfn, end_pfn;
> > > +
> > > +                       start_index = max(start, folio->index);
> > > +                       end_index = min(end, folio_next_index(folio));
> > > +                       /*
> > > +                        * end_index is either in folio or points to
> > > +                        * the first page of the next folio. Hence,
> > > +                        * all pages in range [start_index, end_index)
> > > +                        * are contiguous.
> > > +                        */
> > > +                       start_pfn = folio_file_pfn(folio, start_index);
> > > +                       end_pfn = start_pfn + end_index - start_index;
> > > +
> > > +                       kvm_arch_gmem_invalidate(start_pfn, end_pfn);
> > > +               }
> > > +
> > > +               folio_batch_release(&fbatch);
> > > +               cond_resched();
> > > +       }
> > > +}
> > > +#else
> > > +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
> > > +#endif
> > > +
> > >  static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> > >                                      size_t nr_pages, uint64_t attrs,
> > >                                      pgoff_t *err_index)
> > > @@ -647,7 +683,12 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> > >          */
> > >
> > >         kvm_gmem_invalidate_start(inode, start, end);
> > > +
> > > +       if (!to_private)
> > > +               kvm_gmem_invalidate(inode, start, end);
>
> E.g. instead make this something like this?
>
>         kvm_gmem_set_pfn_attributes(...)
>
> Hrm, though that wastes folio lookups in the to_private case.  So maybe just this,
> assuming pKVM doesn't need to take additional action on conversions?

You're right, and we expect it to hold for both directions, not only
private->shared. pKVM conversions are driven by the guest's
share/unshare hypercall: EL2 makes the stage-2 ownership change (grant
or remove host access) on the hypercall and exits, and the host
records it via KVM_SET_MEMORY_ATTRIBUTES2 afterwards. So by the time
guest_memfd updates attributes the EL2 side is already done in either
direction, and the ioctl is host-side bookkeeping. The only arch
callback we expect to need is the free/teardown one, nothing on
convert, and we wouldn't want a make_private hook either.

>
>         if (!to_private)
>                 kvm_gmem_make_shared(...)
>
> Actually, if we do that, then we don't need a separate arch hook, just a separate
> config.  It'll still bleed SNP details into guest_memfd, but it'll at least be
> done in a way that's more explicitly arch specific (and it's no different than
> what we already do for PREPARE...).

Doing it config-only (no separate convert hook) works for us, and nothing
about it constrains arm64. If connecting pKVM conversion to gmem later
turns up something we need, we'd add it config-gated in parallel, not by
overloading the renamed callback.

Cheers,
/fuad

>
> E.g. this?  There will still be a looming rename conflict, but that's easy enough
> to handle.
>
> diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> index 9ce5be7843f2..8aead0abd788 100644
> --- virt/kvm/guest_memfd.c
> +++ virt/kvm/guest_memfd.c
> @@ -648,8 +648,8 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
>         return safe;
>  }
>
> -#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> -static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> +#ifdef CONFIG_KVM_ARCH_GMEM_FREE_ON_SHARED_CONVERSION
> +static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start, pgoff_t end)
>  {
>         struct folio_batch fbatch;
>         pgoff_t next = start;
> @@ -681,7 +681,7 @@ static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
>         }
>  }
>  #else
> -static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
> +static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start, pgoff_t end) { }
>  #endif
>
>  static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> @@ -729,7 +729,7 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>         kvm_gmem_invalidate_start(inode, start, end);
>
>         if (!to_private)
> -               kvm_gmem_invalidate(inode, start, end);
> +               kvm_gmem_make_shared(inode, start, end);
>
>         mas_store_prealloc(&mas, xa_mk_value(attrs));

^ permalink raw reply

* Re: Issue cloning kernel-doc-zh from HUST mirror
From: Dongliang Mu @ 2026-06-23  8:51 UTC (permalink / raw)
  To: Siwei Chen, linux-doc; +Cc: si.yanteng, wy
In-Reply-To: <4292BADB2022F3A5+5117009.JcJflTAXpt@anka-vmware20-1>


On 6/23/26 3:39 PM, Siwei Chen wrote:
> Hello,
>
> I am following the documentation at:
>
> https://docs.kernel.org/translations/zh_CN/how-to.html#id3
>
> When trying to clone the repository from the recommended mirror:
>
> git clone https://mirrors.hust.edu.cn/git/kernel-doc-zh.git linux
>
> I consistently get the following error:
>
> error: RPC failed; curl 52 Empty reply from server
> fatal: expected 'packfile'

Hello Siwei,

The long answer is as follows:

The "curl 52 Empty reply from server" error is not a Git or Ubuntu
compatibility issue. It happens because the kernel-doc-zh repository is
extremely large, and the HUST mirror server closes the HTTPS connection
early due to timeout or proxy limits.

You can try the following commands:

1. Shallow clone first (most reliable):

   git clone --depth 1 https://mirrors.hust.edu.cn/git/kernel-doc-zh.git linux

   Then, from inside the new linux directory, fetch the full history:

   git fetch --unshallow

2. If it still fails, increase the Git HTTP buffer:

   git config --global http.postBuffer 1073741824

Finally, I will contact the maintainers of the HUST mirror site and try
to resolve this issue.

Dongliang Mu

>
> My environment is:
>
> Ubuntu 26.04
> git version 2.53
>
> I have verified that the URL is reachable from my network, but the clone
> operation still fails.
>
> Could anyone help me understand whether this is a mirror-side issue, a Git
> compatibility issue, or something wrong with my setup?
>
> Thank you for your time.
>
> Best regards,
> Siwei Chen
>


^ permalink raw reply

* Re: [PATCH v5 04/34] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
From: David Woodhouse @ 2026-06-23  8:50 UTC (permalink / raw)
  To: Dongli Zhang, x86, kvm, linux-doc, linux-kernel, xen-devel,
	linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	Paul Durrant, Jonathan Cameron, Marc Zyngier, Sascha Bischoff,
	Jack Allister, joe.jin, Joey Gouly
In-Reply-To: <913f6048e1193e65278cb3f4b4dbf04a151e85f5.camel@infradead.org>

On Tue, 2026-06-16 at 12:13 +0100, David Woodhouse wrote:
> On Mon, 2026-06-15 at 23:47 -0700, Dongli Zhang wrote:
> > I tested patches 02, 03, 04, and 26 by customizing QEMU to support kexec live
> > updates (LUO and KHO), preserving the memfd across kexec.
> 
> Thank you.
> 
> > For my use case, I used KVM_[GS]ET_CLOCK_GUEST instead of the existing
> > KVM_[GS]ET_CLOCK. I didn't account for the downtime in my QEMU code, although host
> > TSC never resets across kexec.
> > 
> > Clock drift was zero, and I did not observe any unnecessary master clock updates
> > after KVM_SET_CLOCK_GUEST completed.
> 
> The kvmclock drift won't have been *zero*; it will have been a
> nanosecond or two. Which most people won't notice, but is annoying me.
> 
> I believe it comes from both pvclock_update_vm_gtod_copy() and
> kvm_vcpu_ioctl_set_clock_guest() rounding *down*. I think we should
> tweak the latter to round *up* so they're at least not biasing in the
> same direction.
> 
> We could also do better at picking a snapshot cycle count which
> *doesn't* lose in the rounding. But those are definitely improvements
> for another day; this series is long and complex enough and has already
> gained a dependency on fixes in core timekeeping snapshots.

I think some of that drift should be solved by the snapshot fixes from
https://lore.kernel.org/all/20260622211822.1056437-2-dwmw2@infradead.org/
and in fact we might be able to do even better...

Since ktime_get_snapshot_id() now calculates the ideal time with
sub-nanosecond precision, we could add that as a field in the snapshot,
and then KVM could track ->master_clock_ns_frac and ->kvmclock_offset_frac
and eliminate the rest of the rounding error too.

It might be more trouble than it's worth, but either way I'll look at
it some other time; it can be done incrementally.
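
To make the rounding argument concrete, here is a minimal sketch, assuming
a 32-bit binary fraction of a nanosecond and a hypothetical struct ns_frac.
It is not KVM's pvclock code, and the real ->master_clock_ns_frac and
->kvmclock_offset_frac fields may well look different:

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 32

/* Whole nanoseconds plus a 32-bit binary fraction of a nanosecond. */
struct ns_frac {
	uint64_t ns;
	uint32_t frac;
};

/*
 * Scale a cycle count to time. 'mult' is nanoseconds per cycle as a
 * 0.32 fixed-point value (so this toy only models clocks above 1 GHz).
 */
static struct ns_frac cycles_to_ns_frac(uint64_t cycles, uint32_t mult)
{
	/* Keeps cycles * mult within 64 bits; a toy limit, not a real one. */
	assert(cycles <= UINT32_MAX);

	uint64_t scaled = cycles * (uint64_t)mult;
	struct ns_frac t = {
		.ns = scaled >> FRAC_BITS,  /* what a round-down keeps */
		.frac = (uint32_t)scaled,   /* what a round-down throws away */
	};
	return t;
}

/* Add two times, carrying from the fraction into whole nanoseconds. */
static struct ns_frac ns_frac_add(struct ns_frac a, struct ns_frac b)
{
	uint64_t frac = (uint64_t)a.frac + b.frac;
	struct ns_frac sum = {
		.ns = a.ns + b.ns + (frac >> FRAC_BITS),
		.frac = (uint32_t)frac,
	};
	return sum;
}
```

With a 0.5 ns/cycle multiplier (0x80000000), two 3-cycle intervals each
truncate to 1 ns, 2 ns in total, whereas carrying the fractions through
the addition gives the exact 3 ns. Two round-downs bias in the same
direction; tracking the fraction removes that loss.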

^ permalink raw reply

* Re: [PATCH] crypto: af_alg - Add af_alg_restrict sysctl, defaulting to 1
From: Bastien Nocera @ 2026-06-23  8:42 UTC (permalink / raw)
  To: Eric Biggers, linux-crypto, Herbert Xu
  Cc: linux-kernel, linux-doc, linux-bluetooth, iwd, linux-hardening,
	Milan Broz, Demi Marie Obenour, Andy Lutomirski, ell
In-Reply-To: <20260622234803.6982-1-ebiggers@kernel.org>

Hello Eric,

On Mon, 2026-06-22 at 16:48 -0700, Eric Biggers wrote:
> AF_ALG is a frequent source of vulnerabilities and a maintenance
> nightmare.  It exposes far more functionality to userspace than ever
> should have been exposed, especially to unprivileged processes.  Recent
> exploits have targeted kernel internal implementation details like
> "authencesn" that have zero use case for userspace access.

You should also CC: ell@lists.linux.dev for AF_ALG related changes, as
ell uses AF_ALG extensively for crypto and checksumming.
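
For reference, a caller that already has a userspace fallback could treat
the two error codes named in the patch the same way. The sketch below is
only an assumption about how such a caller might look; classify_alg_errno()
and probe_alg() are hypothetical names, not ell's API:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/if_alg.h>

enum alg_probe { ALG_OK, ALG_USE_FALLBACK, ALG_ERROR };

/* Map an errno from socket()/bind() to an action for the caller. */
static enum alg_probe classify_alg_errno(int err)
{
	switch (err) {
	case 0:
		return ALG_OK;
	case ENOENT:       /* unknown algorithm, or disallowed at af_alg_restrict=1 */
	case EAFNOSUPPORT: /* AF_ALG unavailable, or af_alg_restrict=2 */
		return ALG_USE_FALLBACK;
	default:
		return ALG_ERROR;
	}
}

/* Try to bind an AF_ALG socket; on ALG_OK the bound fd is stored in *fd_out. */
static enum alg_probe probe_alg(const char *type, const char *name, int *fd_out)
{
	struct sockaddr_alg sa;
	int fd, err;

	assert(type && name && fd_out);

	/* Zero-fill so the truncating copies below stay NUL-terminated. */
	memset(&sa, 0, sizeof(sa));
	sa.salg_family = AF_ALG;
	strncpy((char *)sa.salg_type, type, sizeof(sa.salg_type) - 1);
	strncpy((char *)sa.salg_name, name, sizeof(sa.salg_name) - 1);

	fd = socket(AF_ALG, SOCK_SEQPACKET, 0);
	if (fd < 0)
		return classify_alg_errno(errno);

	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		err = errno;	/* save before close() can clobber it */
		close(fd);
		return classify_alg_errno(err);
	}

	*fd_out = fd;
	return ALG_OK;
}
```

A call such as probe_alg("hash", "sha256", &fd) would then return
ALG_USE_FALLBACK on a restricted system, so the caller can switch to its
own implementation without treating the failure as fatal.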

Cheers

> 
> Fortunately, AF_ALG is rarely used in practice, as userspace crypto
> libraries exist.  And when it is used, only some functionality is known
> to be used, and many users are known to hold capabilities already.
> iwd for example requires CAP_NET_ADMIN and has a known algorithm list
> (https://lore.kernel.org/linux-crypto/bcbbef00-5881-421b-8892-7be6c04b832d@gmail.com/).
> 
> Thus, let's restrict the set of allowed algorithms by default, depending
> on the capabilities held.
> 
> Add a sysctl /proc/sys/crypto/af_alg_restrict with meaning:
> 
>     0: unrestricted
>     1: limited functionality
>     2: completely disabled
> 
> Set the default value to 1, which enables an algorithm allowlist for
> unprivileged processes and a slightly longer allowlist for privileged
> processes.
> 
> Note that the list may be tweaked in the future.  However, the common
> use cases such as iwd and bluez are taken into account already.  I've
> tested that iwd still works with the default value of 1.
> 
> Signed-off-by: Eric Biggers <ebiggers@kernel.org>
> ---
>  Documentation/admin-guide/sysctl/crypto.rst | 36 +++++++++++
>  Documentation/crypto/userspace-if.rst       | 13 +++-
>  crypto/af_alg.c                             | 72 +++++++++++++++++++--
>  crypto/algif_aead.c                         | 11 ++++
>  crypto/algif_hash.c                         | 24 +++++++
>  crypto/algif_rng.c                          |  9 +++
>  crypto/algif_skcipher.c                     | 20 ++++++
>  include/crypto/if_alg.h                     |  8 +++
>  8 files changed, 184 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/admin-guide/sysctl/crypto.rst b/Documentation/admin-guide/sysctl/crypto.rst
> index b707bd314a64..9a1bd53287f4 100644
> --- a/Documentation/admin-guide/sysctl/crypto.rst
> +++ b/Documentation/admin-guide/sysctl/crypto.rst
> @@ -5,10 +5,46 @@
>  These files show up in ``/proc/sys/crypto/``, depending on the
>  kernel configuration:
>  
>  .. contents:: :local:
>  
> +.. _af_alg_restrict:
> +
> +af_alg_restrict
> +===============
> +
> +Controls the level of restriction of AF_ALG.
> +
> +AF_ALG is a deprecated and rarely-used userspace interface that is a
> +frequent source of vulnerabilities. It also unnecessarily exposes a
> +large number of kernel implementation details. For more information
> +about AF_ALG, see :ref:`Documentation/crypto/userspace-if.rst
> +<crypto_userspace_interface>`.
> +
> +Starting in Linux v7.3, AF_ALG supports only a limited set of
> +algorithms by default. This sysctl allows the system administrator to
> +remove this restriction when needed for compatibility reasons, or to
> +go further and disable AF_ALG entirely. The default value is 1.
> +
> +===  ==================================================================
> +0    AF_ALG is unrestricted.
> +
> +1    AF_ALG is supported with a limited list of algorithms. The list
> +     is designed for compatibility with known users such as iwd and
> +     bluez that haven't yet been fixed to use userspace crypto code.
> +
> +     Specifically, there is an allowlist for unprivileged processes
> +     and a somewhat longer allowlist for processes that hold
> +     CAP_SYS_ADMIN or CAP_NET_ADMIN in the initial user namespace.
> +
> +     Attempts to bind() an AF_ALG socket with a disallowed algorithm
> +     fail with ENOENT.
> +
> +2    AF_ALG is completely disabled. Attempts to create an AF_ALG
> +     socket fail with EAFNOSUPPORT.
> +===  ==================================================================
> +
>  fips_enabled
>  ============
>  
>  Read-only flag that indicates whether FIPS mode is enabled.
>  
> diff --git a/Documentation/crypto/userspace-if.rst b/Documentation/crypto/userspace-if.rst
> index ab93300c8e04..d6194346e366 100644
> --- a/Documentation/crypto/userspace-if.rst
> +++ b/Documentation/crypto/userspace-if.rst
> @@ -1,5 +1,7 @@
> +.. _crypto_userspace_interface:
> +
>  User Space Interface
>  ====================
>  
>  Introduction
>  ------------
> @@ -10,13 +12,18 @@ code.
>  
>  AF_ALG is insecure and is deprecated. Originally added to the kernel in 2010,
>  most kernel developers now consider it to be a mistake. Support for hardware
>  accelerators, which was the original purpose of AF_ALG, has been removed.
>  
> -AF_ALG continues to be supported only for backwards compatibility. On systems
> -where no programs using AF_ALG remain, the support for it should be disabled by
> -disabling ``CONFIG_CRYPTO_USER_API_*``.
> +AF_ALG continues to be supported only for backwards compatibility.
> +
> +Starting in Linux v7.3, the set of algorithms supported by AF_ALG is limited by
> +default. See :ref:`/proc/sys/crypto/af_alg_restrict <af_alg_restrict>`.
> +
> +On systems where no programs using AF_ALG remain, the support for it should be
> +disabled entirely by setting ``/proc/sys/crypto/af_alg_restrict`` to 2 or by
> +disabling ``CONFIG_CRYPTO_USER_API_*`` in the kernel configuration.
>  
>  Deprecation
>  -----------
>  
>  AF_ALG was originally intended to provide userspace programs access to crypto
> diff --git a/crypto/af_alg.c b/crypto/af_alg.c
> index cce000e8590e..34b801568fba 100644
> --- a/crypto/af_alg.c
> +++ b/crypto/af_alg.c
> @@ -6,10 +6,11 @@
>   *
>   * Copyright (c) 2010 Herbert Xu <herbert@gondor.apana.org.au>
>   */
>  
>  #include <linux/atomic.h>
> +#include <linux/capability.h>
>  #include <crypto/if_alg.h>
>  #include <linux/crypto.h>
>  #include <linux/init.h>
>  #include <linux/kernel.h>
>  #include <linux/key.h>
> @@ -20,14 +21,32 @@
>  #include <linux/rwsem.h>
>  #include <linux/sched.h>
>  #include <linux/sched/signal.h>
>  #include <linux/security.h>
>  #include <linux/string.h>
> +#include <linux/sysctl.h>
> +#include <linux/user_namespace.h>
>  #include <keys/user-type.h>
>  #include <keys/trusted-type.h>
>  #include <keys/encrypted-type.h>
>  
> +static int af_alg_restrict = 1;
> +
> +static const struct ctl_table af_alg_table[] = {
> +	{
> +		.procname       = "af_alg_restrict",
> +		.data           = &af_alg_restrict,
> +		.maxlen         = sizeof(int),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec_minmax,
> +		.extra1		= SYSCTL_ZERO,
> +		.extra2		= SYSCTL_TWO,
> +	},
> +};
> +
> +static struct ctl_table_header *af_alg_header;
> +
>  struct alg_type_list {
>  	const struct af_alg_type *type;
>  	struct list_head list;
>  };
>  
> @@ -108,10 +127,43 @@ int af_alg_unregister_type(const struct af_alg_type *type)
>  
>  	return err;
>  }
>  EXPORT_SYMBOL_GPL(af_alg_unregister_type);
>  
> +static bool af_alg_capable(void)
> +{
> +	return ns_capable_noaudit(&init_user_ns, CAP_NET_ADMIN) ||
> +	       capable(CAP_SYS_ADMIN);
> +}
> +
> +int af_alg_check_restriction(const char *name,
> +			     const struct af_alg_allowlist_entry allowlist[])
> +{
> +	int level = READ_ONCE(af_alg_restrict);
> +
> +	if (level == 0)
> +		return 0;
> +	if (level == 1) {
> +		for (const struct af_alg_allowlist_entry *ent = allowlist;
> +		     ent->name; ent++) {
> +			if (strcmp(name, ent->name) == 0 &&
> +			    (!ent->privileged || af_alg_capable()))
> +				return 0;
> +		}
> +	}
> +	/*
> +	 * Use -ENOENT (the error code for "algorithm not found") instead of
> +	 * -EACCES or -EPERM, for the highest chance of correctly triggering
> +	 * fallback code paths in userspace programs.
> +	 *
> +	 * Don't log a warning, since it would be noisy.  iwd tries to bind a
> +	 * bunch of algorithms that it never uses.
> +	 */
> +	return -ENOENT;
> +}
> +EXPORT_SYMBOL_GPL(af_alg_check_restriction);
> +
>  static void alg_do_release(const struct af_alg_type *type, void *private)
>  {
>  	if (!type)
>  		return;
>  
> @@ -504,10 +556,13 @@ static int alg_create(struct net *net, struct socket *sock, int protocol,
>  		      int kern)
>  {
>  	struct sock *sk;
>  	int err;
>  
> +	if (READ_ONCE(af_alg_restrict) == 2)
> +		return -EAFNOSUPPORT;
> +
>  	if (sock->type != SOCK_SEQPACKET)
>  		return -ESOCKTNOSUPPORT;
>  	if (protocol != 0)
>  		return -EPROTONOSUPPORT;
>  
> @@ -1220,31 +1275,36 @@ int af_alg_get_rsgl(struct sock *sk, struct msghdr *msg, int flags,
>  }
>  EXPORT_SYMBOL_GPL(af_alg_get_rsgl);
>  
>  static int __init af_alg_init(void)
>  {
> -	int err = proto_register(&alg_proto, 0);
> +	int err;
> +
> +	af_alg_header = register_sysctl("crypto", af_alg_table);
>  
> +	err = proto_register(&alg_proto, 0);
>  	if (err)
> -		goto out;
> +		goto out_unregister_sysctl;
>  
>  	err = sock_register(&alg_family);
> -	if (err != 0)
> +	if (err)
>  		goto out_unregister_proto;
>  
> -out:
> -	return err;
> +	return 0;
>  
>  out_unregister_proto:
>  	proto_unregister(&alg_proto);
> -	goto out;
> +out_unregister_sysctl:
> +	unregister_sysctl_table(af_alg_header);
> +	return err;
>  }
>  
>  static void __exit af_alg_exit(void)
>  {
>  	sock_unregister(PF_ALG);
>  	proto_unregister(&alg_proto);
> +	unregister_sysctl_table(af_alg_header);
>  }
>  
>  module_init(af_alg_init);
>  module_exit(af_alg_exit);
>  MODULE_DESCRIPTION("Crypto userspace interface");
> diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
> index 787aac8aeb24..b9217f9086aa 100644
> --- a/crypto/algif_aead.c
> +++ b/crypto/algif_aead.c
> @@ -32,10 +32,15 @@
>  #include <linux/mm.h>
>  #include <linux/module.h>
>  #include <linux/net.h>
>  #include <net/sock.h>
>  
> +static const struct af_alg_allowlist_entry aead_allowlist[] = {
> +	{ "ccm(aes)", true }, /* bluez */
> +	{},
> +};
> +
>  static inline bool aead_sufficient_data(struct sock *sk)
>  {
>  	struct alg_sock *ask = alg_sk(sk);
>  	struct sock *psk = ask->parent;
>  	struct alg_sock *pask = alg_sk(psk);
> @@ -342,10 +347,16 @@ static struct proto_ops algif_aead_ops_nokey = {
>  	.poll		=	af_alg_poll,
>  };
>  
>  static void *aead_bind(const char *name)
>  {
> +	int err;
> +
> +	err = af_alg_check_restriction(name, aead_allowlist);
> +	if (err)
> +		return ERR_PTR(err);
> +
>  	return crypto_alloc_aead(name, 0, AF_ALG_CRYPTOAPI_MASK);
>  }
>  
>  static void aead_release(void *private)
>  {
> diff --git a/crypto/algif_hash.c b/crypto/algif_hash.c
> index 5452ad6c1506..a8d958d51ece 100644
> --- a/crypto/algif_hash.c
> +++ b/crypto/algif_hash.c
> @@ -14,10 +14,28 @@
>  #include <linux/mm.h>
>  #include <linux/module.h>
>  #include <linux/net.h>
>  #include <net/sock.h>
>  
> +static const struct af_alg_allowlist_entry hash_allowlist[] = {
> +	{ "cmac(aes)", true }, /* iwd, bluez */
> +	{ "hmac(md5)", true }, /* iwd */
> +	{ "hmac(sha1)", true }, /* iwd */
> +	{ "hmac(sha224)", true }, /* iwd */
> +	{ "hmac(sha256)", true }, /* iwd */
> +	{ "hmac(sha384)", true }, /* iwd */
> +	{ "hmac(sha512)", true }, /* iwd, sha512hmac */
> +	{ "md4", true }, /* iwd */
> +	{ "md5", true }, /* iwd */
> +	{ "sha1", false }, /* iwd, iproute2 < 7.0 */
> +	{ "sha224", true }, /* iwd */
> +	{ "sha256", true }, /* iwd */
> +	{ "sha384", true }, /* iwd */
> +	{ "sha512", true }, /* iwd */
> +	{},
> +};
> +
>  struct hash_ctx {
>  	struct af_alg_sgl sgl;
>  
>  	u8 *result;
>  
> @@ -380,10 +398,16 @@ static struct proto_ops algif_hash_ops_nokey = {
>  	.accept		=	hash_accept_nokey,
>  };
>  
>  static void *hash_bind(const char *name)
>  {
> +	int err;
> +
> +	err = af_alg_check_restriction(name, hash_allowlist);
> +	if (err)
> +		return ERR_PTR(err);
> +
>  	return crypto_alloc_ahash(name, 0, AF_ALG_CRYPTOAPI_MASK);
>  }
>  
>  static void hash_release(void *private)
>  {
> diff --git a/crypto/algif_rng.c b/crypto/algif_rng.c
> index 4dfe7899f8fa..bd522915d56d 100644
> --- a/crypto/algif_rng.c
> +++ b/crypto/algif_rng.c
> @@ -48,10 +48,14 @@
>  
>  MODULE_LICENSE("GPL");
>  MODULE_AUTHOR("Stephan Mueller <smueller@chronox.de>");
>  MODULE_DESCRIPTION("User-space interface for random number generators");
>  
> +static const struct af_alg_allowlist_entry rng_allowlist[] = {
> +	{},
> +};
> +
>  struct rng_ctx {
>  #define MAXSIZE 128
>  	unsigned int len;
>  	struct crypto_rng *drng;
>  	u8 *addtl;
> @@ -199,10 +203,15 @@ static struct proto_ops __maybe_unused algif_rng_test_ops = {
>  
>  static void *rng_bind(const char *name)
>  {
>  	struct rng_parent_ctx *pctx;
>  	struct crypto_rng *rng;
> +	int err;
> +
> +	err = af_alg_check_restriction(name, rng_allowlist);
> +	if (err)
> +		return ERR_PTR(err);
>  
>  	pctx = kzalloc_obj(*pctx);
>  	if (!pctx)
>  		return ERR_PTR(-ENOMEM);
>  
> diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
> index df20bdfe1f1f..2b8069667974 100644
> --- a/crypto/algif_skcipher.c
> +++ b/crypto/algif_skcipher.c
> @@ -32,10 +32,24 @@
>  #include <linux/mm.h>
>  #include <linux/module.h>
>  #include <linux/net.h>
>  #include <net/sock.h>
>  
> +static const struct af_alg_allowlist_entry skcipher_allowlist[] = {
> +	{ "adiantum(xchacha12,aes)", false }, /* cryptsetup */
> +	{ "adiantum(xchacha20,aes)", false }, /* cryptsetup */
> +	{ "cbc(aes)", true }, /* iwd */
> +	{ "cbc(des)", true }, /* iwd */
> +	{ "cbc(des3_ede)", true }, /* iwd */
> +	{ "ctr(aes)", true }, /* iwd */
> +	{ "ecb(aes)", true }, /* iwd, bluez */
> +	{ "ecb(des)", true }, /* iwd */
> +	{ "hctr2(aes)", false }, /* cryptsetup */
> +	{ "xts(aes)", false }, /* cryptsetup benchmark */
> +	{},
> +};
> +
>  static int skcipher_sendmsg(struct socket *sock, struct msghdr *msg,
>  			    size_t size)
>  {
>  	struct sock *sk = sock->sk;
>  	struct alg_sock *ask = alg_sk(sk);
> @@ -307,10 +321,16 @@ static struct proto_ops algif_skcipher_ops_nokey = {
>  	.poll		=	af_alg_poll,
>  };
>  
>  static void *skcipher_bind(const char *name)
>  {
> +	int err;
> +
> +	err = af_alg_check_restriction(name, skcipher_allowlist);
> +	if (err)
> +		return ERR_PTR(err);
> +
>  	return crypto_alloc_skcipher(name, 0, AF_ALG_CRYPTOAPI_MASK);
>  }
>  
>  static void skcipher_release(void *private)
>  {
> diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h
> index 7643ba954125..4e9ed8e73403 100644
> --- a/include/crypto/if_alg.h
> +++ b/include/crypto/if_alg.h
> @@ -159,13 +159,21 @@ struct af_alg_ctx {
>  	unsigned int len;
>  
>  	unsigned int inflight;
>  };
>  
> +struct af_alg_allowlist_entry {
> +	const char *name;
> +	bool privileged;
> +};
> +
>  int af_alg_register_type(const struct af_alg_type *type);
>  int af_alg_unregister_type(const struct af_alg_type *type);
>  
> +int af_alg_check_restriction(const char *name,
> +			     const struct af_alg_allowlist_entry allowlist[]);
> +
>  int af_alg_release(struct socket *sock);
>  void af_alg_release_parent(struct sock *sk);
>  int af_alg_accept(struct sock *sk, struct socket *newsock,
>  		  struct proto_accept_arg *arg);
>  
> 
> base-commit: 1dc18801be29bc54709aa355b8acd80e183b03cd

^ permalink raw reply

* Re: Issue cloning kernel-doc-zh from HUST mirror
From: Dongliang Mu @ 2026-06-23  8:25 UTC (permalink / raw)
  To: Siwei Chen, linux-doc; +Cc: si.yanteng, wy
In-Reply-To: <4292BADB2022F3A5+5117009.JcJflTAXpt@anka-vmware20-1>


On 6/23/26 3:39 PM, Siwei Chen wrote:
> Hello,
>
> I am following the documentation at:
>
> https://docs.kernel.org/translations/zh_CN/how-to.html#id3
>
> When trying to clone the repository from the recommended mirror:
>
> git clone https://mirrors.hust.edu.cn/git/kernel-doc-zh.git linux
>
> I consistently get the following error:
>
> error: RPC failed; curl 52 Empty reply from server
> fatal: expected 'packfile'
>
> My environment is:
>
> Ubuntu 26.04
> git version 2.53
>
> I have verified that the URL is reachable from my network, but the clone
> operation still fails.
>
> Could anyone help me understand whether this is a mirror-side issue, a Git
> compatibility issue, or something wrong with my setup?

I confirmed this is a mirror-side issue. Please use the first git repo:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/alexs/linux.git

Dongliang Mu

>
> Thank you for your time.
>
> Best regards,
> Siwei Chen
>


^ permalink raw reply

* Re: [PATCH v8 13/46] KVM: guest_memfd: Add base support for KVM_SET_MEMORY_ATTRIBUTES2
From: Fuad Tabba @ 2026-06-23  8:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	willy, wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ajnRxuJ19OzZ8zJC@google.com>

On Tue, 23 Jun 2026 at 01:22, Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jun 19, 2026, Fuad Tabba wrote:
> > On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
> > <devnull+ackerleytng.google.com@kernel.org> wrote:
> > >
> > > From: Ackerley Tng <ackerleytng@google.com>
> > >
> > > Introduce base support for KVM_SET_MEMORY_ATTRIBUTES2 in guest_memfd, which
> > > just updates attributes tracked by guest_memfd.
> > >
> > > Validate input fields in general. Guard usage of KVM_SET_MEMORY_ATTRIBUTES2
> > > by making sure requested attributes are supported for this instance of kvm.
> > >
> > > A new KVM_SET_MEMORY_ATTRIBUTES2 is defined to support writes (unlike
> > > KVM_SET_MEMORY_ATTRIBUTES) in addition to reads so it can provide error
> > > details to userspace. This will be used in a later patch.
> > >
> > > The two ioctls use their corresponding structs with no overlap, but
> > > backward compatibility is baked in for future support of
> > > KVM_SET_MEMORY_ATTRIBUTES2 and struct kvm_memory_attributes2 in the VM
> > > ioctl.
> > >
> > > The process of setting memory attributes is set up such that the later half
> > > will not fail due to allocation. Any necessary checks are performed before
> > > the point of no return.
> > >
> > > Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> > > Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> > > Co-developed-by: Sean Christopherson <seanjc@google.com>
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > Reviewed-by: Fuad Tabba <tabba@google.com>
> > > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> >
> > Not sure if it's user error on my part, or if I'm applying this to the
> > wrong base, but I found a build break here on patch 13:
> > kvm_gmem_invalidate_start() doesn't exist in the base tree. The
> > function is kvm_gmem_invalidate_begin() here. The rename
> > (190cc5370a8b6) landed via a different merge path and isn't an
> > ancestor of the stated base.
> >
> > Patches 19 and 20 have the same mismatch. Fix for all three is
> > s/kvm_gmem_invalidate_start/kvm_gmem_invalidate_begin/.
>
> Ya, Ackerley used a slightly older kvm/next to send the patches.  I at least was
> testing against kvm-x86/next, which does have the rename.
>
> Other than noting that this should be applied against the current kvm/next, I
> don't think there's anything else to be done?

Agree. Sorry, didn't mean to be nit-picky, but this really threw me off :)

Cheers,
/fuad

^ permalink raw reply

* Re: [PATCH] docs/mm: clarify that we are not looking for LLM generated content
From: David Hildenbrand (Arm) @ 2026-06-23  8:15 UTC (permalink / raw)
  To: linux-doc
  Cc: Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jonathan Corbet,
	Shuah Khan, Matthew Wilcox, Harry Yoo, linux-mm, linux-kernel
In-Reply-To: <20260420-llmdoc-v1-1-47d2091177c4@kernel.org>

On 4/20/26 23:03, David Hildenbrand (Arm) wrote:
> Let's make it clear that we are not looking for LLM generated content
> from contributors not familiar with the details of MM, as it shifts the
> real work onto reviewers.
> 
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
> ---
>  Documentation/mm/index.rst | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> index 7aa2a8886908..13a79f5d092c 100644
> --- a/Documentation/mm/index.rst
> +++ b/Documentation/mm/index.rst
> @@ -7,6 +7,19 @@ of Linux.  If you are looking for advice on simply allocating memory,
>  see the :ref:`memory_allocation`.  For controlling and tuning guides,
>  see the :doc:`admin guide <../admin-guide/mm/index>`.
>  
> +.. note::
> +
> +  Unfortunately, parts of this guide are still incomplete or missing.
> +  While we appreciate contributions, documentation in this area is hard
> +  to get right and requires a lot of attention to detail.  New contributors
> +  should reach out to the relevant maintainers early.
> +
> +  This guide is expected to reflect reality, which requires contributors
> +  to have a detailed understanding.  Documentation generated with LLMs
> +  by contributors unfamiliar with these details shifts the real work onto
> +  reviewers, which is why such contributions will be rejected without
> +  further comment.
> +
>  .. toctree::
>     :maxdepth: 1
>  
> 
> ---
> base-commit: da6b5aae84beb0917ecb0c9fbc71169d145397ff
> change-id: 20260420-llmdoc-21bf5fadbd6f
> 
> Best regards,

I assume this was not picked up yet? (via documentation or mm tree?)

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v3 1/2] dt-bindings: iio: dac: Add AD5529R
From: Rodrigo Alencar @ 2026-06-23  8:09 UTC (permalink / raw)
  To: Nuno Sá, Rodrigo Alencar
  Cc: Jonathan Cameron, Conor Dooley, Janani Sunil, Janani Sunil,
	Lars-Peter Clausen, Michael Hennerich, David Lechner,
	Nuno Sá, Andy Shevchenko, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Philipp Zabel, Jonathan Corbet, Shuah Khan,
	linux-iio, devicetree, linux-kernel, linux-doc, Mark Brown
In-Reply-To: <ajklksIDLsj0BZul@nsa>

On 22/06/26 13:20, Nuno Sá wrote:
> On Mon, Jun 22, 2026 at 12:51:20PM +0100, Rodrigo Alencar wrote:
> > On 22/06/26 11:29, Nuno Sá wrote:
> > > On Mon, Jun 22, 2026 at 10:24:05AM +0100, Rodrigo Alencar wrote:
> > > > On 21/06/26 15:33, Jonathan Cameron wrote:
> > > > > On Fri, 19 Jun 2026 16:54:11 +0100
> > > > > Nuno Sá <noname.nuno@gmail.com> wrote:
> > > > > 
> > > > > > On Fri, Jun 19, 2026 at 03:12:07PM +0100, Conor Dooley wrote:
> > > > > > > On Fri, Jun 19, 2026 at 02:01:08PM +0100, Nuno Sá wrote:  
> > > > > > > > On Fri, Jun 19, 2026 at 12:40:54PM +0100, Conor Dooley wrote:  
> > > > > > > > > On Fri, Jun 19, 2026 at 12:36:55PM +0100, Conor Dooley wrote:  
> > > > > > > > > > On Fri, Jun 19, 2026 at 12:33:11PM +0200, Janani Sunil wrote:  
> > > > > > > > > > > 
> > > > > > > > > > > On 6/14/26 21:44, Jonathan Cameron wrote:  
> > > > > > > > > > > > On Tue, 9 Jun 2026 16:47:23 +0200
> > > > > > > > > > > > Janani Sunil <jan.sun97@gmail.com> wrote:
> > > > > > > > > > > >   
> > > > > > > > > > > > > On 5/26/26 15:11, Rodrigo Alencar wrote:  
> > > > > > > > > > > > > > On 26/05/19 05:42PM, Janani Sunil wrote:  
> > > > > > > > > > > > > > > Devicetree bindings for AD5529R 16 channel 12/16 bit high voltage,
> > > > > > > > > > > > > > > buffered voltage output digital-to-analog converter (DAC) with an
> > > > > > > > > > > > > > > integrated precision reference.  
> > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > Probably others may comment on that, but...
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > This parent node may support device addressing for multi-device support through
> > > > > > > > > > > > > > those ID pins. I suppose that each device may have its own power supplies or
> > > > > > > > > > > > > > other resources like the toggle pins or reset and enable.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > That way I suppose that an example would look like...  
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +patternProperties:
> > > > > > > > > > > > > > > +  "^channel@([0-9]|1[0-5])$":
> > > > > > > > > > > > > > > +    type: object
> > > > > > > > > > > > > > > +    description: Child nodes for individual channel configuration
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +    properties:
> > > > > > > > > > > > > > > +      reg:
> > > > > > > > > > > > > > > +        description: Channel number.
> > > > > > > > > > > > > > > +        minimum: 0
> > > > > > > > > > > > > > > +        maximum: 15
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +      adi,output-range-microvolt:
> > > > > > > > > > > > > > > +        description: |
> > > > > > > > > > > > > > > +          Output voltage range for this channel as [min, max] in microvolts.
> > > > > > > > > > > > > > > +          If not specified, defaults to 0V to 5V range.
> > > > > > > > > > > > > > > +        oneOf:
> > > > > > > > > > > > > > > +          - items:
> > > > > > > > > > > > > > > +              - const: 0
> > > > > > > > > > > > > > > +              - enum: [5000000, 10000000, 20000000, 40000000]
> > > > > > > > > > > > > > > +          - items:
> > > > > > > > > > > > > > > +              - const: -5000000
> > > > > > > > > > > > > > > +              - const: 5000000
> > > > > > > > > > > > > > > +          - items:
> > > > > > > > > > > > > > > +              - const: -10000000
> > > > > > > > > > > > > > > +              - const: 10000000
> > > > > > > > > > > > > > > +          - items:
> > > > > > > > > > > > > > > +              - const: -15000000
> > > > > > > > > > > > > > > +              - const: 15000000
> > > > > > > > > > > > > > > +          - items:
> > > > > > > > > > > > > > > +              - const: -20000000
> > > > > > > > > > > > > > > +              - const: 20000000
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +    required:
> > > > > > > > > > > > > > > +      - reg
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +    additionalProperties: false
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +required:
> > > > > > > > > > > > > > > +  - compatible
> > > > > > > > > > > > > > > +  - reg
> > > > > > > > > > > > > > > +  - vdd-supply
> > > > > > > > > > > > > > > +  - avdd-supply
> > > > > > > > > > > > > > > +  - hvdd-supply
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +dependencies:
> > > > > > > > > > > > > > > +  spi-cpha: [ spi-cpol ]
> > > > > > > > > > > > > > > +  spi-cpol: [ spi-cpha ]
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +allOf:
> > > > > > > > > > > > > > > +  - $ref: /schemas/spi/spi-peripheral-props.yaml#
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +unevaluatedProperties: false
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +examples:
> > > > > > > > > > > > > > > +  - |
> > > > > > > > > > > > > > > +    #include <dt-bindings/gpio/gpio.h>
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +    spi {
> > > > > > > > > > > > > > > +        #address-cells = <1>;
> > > > > > > > > > > > > > > +        #size-cells = <0>;
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +        dac@0 {
> > > > > > > > > > > > > > > +            compatible = "adi,ad5529r-16";
> > > > > > > > > > > > > > > +            reg = <0>;
> > > > > > > > > > > > > > > +            spi-max-frequency = <25000000>;
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +            vdd-supply = <&vdd_regulator>;
> > > > > > > > > > > > > > > +            avdd-supply = <&avdd_regulator>;
> > > > > > > > > > > > > > > +            hvdd-supply = <&hvdd_regulator>;
> > > > > > > > > > > > > > > +            hvss-supply = <&hvss_regulator>;
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +            reset-gpios = <&gpio0 87 GPIO_ACTIVE_LOW>;
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +            #address-cells = <1>;
> > > > > > > > > > > > > > > +            #size-cells = <0>;
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +            channel@0 {
> > > > > > > > > > > > > > > +                reg = <0>;
> > > > > > > > > > > > > > > +                adi,output-range-microvolt = <0 5000000>;
> > > > > > > > > > > > > > > +            };
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +            channel@1 {
> > > > > > > > > > > > > > > +                reg = <1>;
> > > > > > > > > > > > > > > +                adi,output-range-microvolt = <(-10000000) 10000000>;
> > > > > > > > > > > > > > > +            };
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +            channel@2 {
> > > > > > > > > > > > > > > +                reg = <2>;
> > > > > > > > > > > > > > > +                adi,output-range-microvolt = <0 40000000>;
> > > > > > > > > > > > > > > +            };
> > > > > > > > > > > > > > > +        };
> > > > > > > > > > > > > > > +    };  
> > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 	spi {
> > > > > > > > > > > > > > 		#address-cells = <1>;
> > > > > > > > > > > > > > 		#size-cells = <0>;
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 		multi-dac@0 {
> > > > > > > > > > > > > > 			compatible = "adi,ad5529r-16";
> > > > > > > > > > > > > > 			reg = <0>;
> > > > > > > > > > > > > > 			spi-max-frequency = <25000000>;
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 			#address-cells = <1>;
> > > > > > > > > > > > > > 			#size-cells = <0>;
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 			dac@0 {
> > > > > > > > > > > > > > 				reg = <0>;
> > > > > > > > > > > > > > 				vdd-supply = <&vdd_regulator>;
> > > > > > > > > > > > > > 				avdd-supply = <&avdd_regulator>;
> > > > > > > > > > > > > > 				hvdd-supply = <&hvdd_regulator>;
> > > > > > > > > > > > > > 				hvss-supply = <&hvss_regulator>;
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 				reset-gpios = <&gpio0 87 GPIO_ACTIVE_LOW>;
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 				#address-cells = <1>;
> > > > > > > > > > > > > > 				#size-cells = <0>;
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 				channel@0 {
> > > > > > > > > > > > > > 					reg = <0>;
> > > > > > > > > > > > > > 					adi,output-range-microvolt = <0 5000000>;
> > > > > > > > > > > > > > 				};
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 				channel@1 {
> > > > > > > > > > > > > > 					reg = <1>;
> > > > > > > > > > > > > > 					adi,output-range-microvolt = <(-10000000) 10000000>;
> > > > > > > > > > > > > > 				};
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 				channel@2 {
> > > > > > > > > > > > > > 					reg = <2>;
> > > > > > > > > > > > > > 					adi,output-range-microvolt = <0 40000000>;
> > > > > > > > > > > > > > 				};
> > > > > > > > > > > > > > 			}
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 			dac@1 {
> > > > > > > > > > > > > > 				reg = <1>;
> > > > > > > > > > > > > > 				vdd-supply = <&vdd_regulator>;
> > > > > > > > > > > > > > 				avdd-supply = <&avdd_regulator>;
> > > > > > > > > > > > > > 				hvdd-supply = <&hvdd_regulator>;
> > > > > > > > > > > > > > 				hvss-supply = <&hvss_regulator>;
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 				reset-gpios = <&gpio0 88 GPIO_ACTIVE_LOW>;
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 				#address-cells = <1>;
> > > > > > > > > > > > > > 				#size-cells = <0>;
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 				channel@0 {
> > > > > > > > > > > > > > 					reg = <0>;
> > > > > > > > > > > > > > 					adi,output-range-microvolt = <0 5000000>;
> > > > > > > > > > > > > > 				};
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 				channel@1 {
> > > > > > > > > > > > > > 					reg = <1>;
> > > > > > > > > > > > > > 					adi,output-range-microvolt = <(-10000000) 10000000>;
> > > > > > > > > > > > > > 				};
> > > > > > > > > > > > > > 			}
> > > > > > > > > > > > > > 		};
> > > > > > > > > > > > > > 	};
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > then you might need something like:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 	patternProperties:
> > > > > > > > > > > > > > 		"^dac@[0-3]$":
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > and put most of the things under this node pattern.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > So the main driver that you're putting together might need to handle up to four instances.
> > > > > > > > > > > > > > > > Even if your current driver cannot handle this, the dt-bindings might need to cover that.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Need to double check if each dac node needs a separate compatible, so you would maybe populate
> > > > > > > > > > > > > > a platform data to be shared with the child nodes, which would be a separate driver.
> > > > > > > > > > > > > > (not sure if it would make sense to mix and match ad5529r-16 and ad5529r-12).  
> > > > > > > > > > > > > Hi Rodrigo,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Thank you for looking at this.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > For now, I would prefer to keep the binding scoped to a single AD5529R device instance. The current
> > > > > > > > > > > > > hardware/use case we have only needs one device node and the driver is written around that model as well.
> > > > > > > > > > > > > While the device addressing pins could allow multi-device topology, we do not have an actual platform using
> > > > > > > > > > > > > that configuration at the moment, so I would prefer not to introduce an extra parent/child binding structure
> > > > > > > > > > > > > speculatively without a validating use case.  
> > > > > > > > > > > > Interesting feature - kind of similar to address control on a typical i2c bus device, or
> > > > > > > > > > > > looking at it another way a kind of distributed SPI mux.
> > > > > > > > > > > > 
> > > > > > > > > > > > Challenge of a binding is we need to anticipate the future.  So I think we do need something
> > > > > > > > > > > > like Rodrigo is suggesting even if we only (for now) support a single instance in the driver.
> > > > > > > > > > > > That would leave the path open to supporting the addressing at a later date.
> > > > > > > > > > > > An alternative might be to look at it like a chained device setup. In those we pretend there
> > > > > > > > > > > > is just one device with a lot of channels etc.  The snag is that here things are more loosely
> > > > > > > > > > > > coupled whereas for those devices it tends to be you have to read / write the same register
> > > > > > > > > > > > in all devices in the chain as one big SPI message.
> > > > > > > > > > > > 
> > > > > > > > > > > > +CC Mark Brown as he may know of some precedent for this feature. For his reference..
> > > > > > > > > > > > - Each of these devices has 2 ID pins.  The SPI transfers have to contain the 2 bit
> > > > > > > > > > > > value that matches that or they are ignored.  Thus a single bus + 1 chip select can
> > > > > > > > > > > > be used to talk to 4 devices.  Question is what that looks like in device tree + I guess
> > > > > > > > > > > > longer term how to support it cleanly in SPI.  
> > > > > > > > > > 
> > > > > > > > > > I'd swear I have seen this before, from some Microchip devices. Let me
> > > > > > > > > > see if I can find what I am thinking of...  
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > microchip,mcp3911 and microchip,mcp3564 both seem to do this with
> > > > > > > > > slightly different properties.
> > > > > > > > > 
> > > > > > > > >   microchip,device-addr:
> > > > > > > > >     description: Device address when multiple MCP3911 chips are present on the same SPI bus.
> > > > > > > > >     $ref: /schemas/types.yaml#/definitions/uint32
> > > > > > > > >     enum: [0, 1, 2, 3]
> > > > > > > > >     default: 0
> > > > > > > > > 
> > > > > > > > > and
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > >   microchip,hw-device-address:
> > > > > > > > >     $ref: /schemas/types.yaml#/definitions/uint32
> > > > > > > > >     minimum: 0
> > > > > > > > >     maximum: 3
> > > > > > > > >     description:
> > > > > > > > >       The address is set on a per-device basis by fuses in the factory,
> > > > > > > > >       configured on request. If not requested, the fuses are set for 0x1.
> > > > > > > > >       The device address is part of the device markings to avoid
> > > > > > > > >       potential confusion. This address is coded on two bits, so four possible
> > > > > > > > >       addresses are available when multiple devices are present on the same
> > > > > > > > >       SPI bus with only one Chip Select line for all devices.
> > > > > > > > >       Each device communication starts by a CS falling edge, followed by the
> > > > > > > > >       clocking of the device address (BITS[7:6] - top two bits of COMMAND BYTE
> > > > > > > > >       which is first one on the wire).
> > > > > > > > > 
> > > > > > > > > This sounds exactly like the sort of feature that you're dealing with
> > > > > > > > > here?
> > > > > > > > >   
> > > > > > > > 
> > > > > > > > The core idea yes but for this chip, things are a bit more annoying (but
> > > > > > > > Janani can correct me if I'm wrong). Here, each device can, in theory,
> > > > > > > > have its own supplies, pins and at the very least, channels with maybe
> > > > > > > > different scales. That is why Janani is proposing dac nodes. Given I
> > > > > > > > honestly don't like much of that "adi,ad5529r-bus" compatible I wondered
> > > > > > > > about solving this at the spi level.
> > > > > > > > 
> > > > > > > > Ah and to make it more annoying, we can also mix 12 and 16 bits variants
> > > > > > > > together in the same bus.  
> > > > > > > 
> > > > > > > I'm definitely missing something, because that property for the
> > > > > > > microchip devices is not impacted by what else is on the bus. AFAICT, you
> > > > > > > could have an mcp3911 and an mcp3564 on the same bus even though both
> > > > > > > are completely different devices with different drivers. They have
> > > > > > > individual device nodes and their own supplies etc etc. These aren't
> > > > > > > per-channel properties on an adc or dac, they're per child device on a
> > > > > > > spi bus.  
> > > > > > 
> > > > > > Maybe I'm the one missing something :). IIRC, spi would not allow two
> > > > > > devices on the same CS right? Because for this chip we would need
> > > > > > something like:
> > > > > > 
> > > > > > spi {
> > > > > > 	dac@0 {
> > > > > > 		reg = <0>;
> > > > > > 		adi,pin-id = <0>;
> > > > > > 	};
> > > > > > 
> > > > > > 	dac@1 {
> > > > > > 		reg = <0>; // which seems already problematic?
> > > > > > 		adi,pin-id = <1>;
> > > > > > 	};
> > > > > > 
> > > > > > 	...
> > > > > > 
> > > > > > 	//up to 4
> > > > > > };
> > > > > Yeah. It's not clear to me how that works for the microchip devices
> > > > > (I suspect it doesn't!)
> > > > > 
> > > > > Just thinking as I type, but could we do something a bit nasty with
> > > > > a gpio mux that doesn't actually switch but represents the GPIO being
> > > > > shared?  Given this is all tied to the spi bus that should all happen
> > > > > under serializing locks. 
> > > > > 
> > > > > Agreed though that this would be nicer as an SPI thing that let
> > > > > us specify that a single CS is shared by multiple devices and there
> > > > > is some other signal acting to select which one we are talking to.
> > > > > 
> > > > 
> > > > If the device-addressing on the same chip-select is to be handled
> > > > by the spi framework, wouldn't we lose device-specific features?
> > > > 
> > > > I understand that this multi-device feature is there mostly to extend the
> > > > channel count from 16 to 32, 48 or 64. I suppose the command:
> > > > 
> > > > 	"MULTI DEVICE SW LDAC MODE"
> > > > 
> > > > exists so that software can update channel values across multiple devices.
> > > 
> > > Right! You do have a point! I agree the main driver for a feature like
> > > this is likely to extend the channel count and effectively "aggregate"
> > > devices.
> > > 
> > > But I would say that even with the spi solution the MULTI DEVICE stuff
> > > should be doable (as we still need a sort of adi,pin-id property). 
> > 
> > I don't think we can have something like an IIO buffer shared by multiple
> > devices. Synchronizing separate devices would be doable with proper hardware
> > support for this (probably involving an FPGA).
> 
> True!
> 
> >  
> > > But yes, I do feel that the whole feature is for aggregation so seeing
> > > one device with 32 channels is the expectation here? Rather than seeing
> > > two devices with 16 channels.
> > 
> > Yes, I think aggregation is the whole point there... so that the IIO driver
> > is multi-device-aware.
> 
> Which makes me feel that different pins per device might be possible
> from an HW point of view but does not make much sense. For example, for
> the buffer example I would expect LDAC to be shared between all the
> devices.

That is why I would still suggest the multi-dac node in the middle...
the parent node can hold shared resources, while the dac children can
have their own, overriding or inheriting stuff.
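
For reference, the addressing scheme quoted earlier in the thread (a 2-bit
device address carried in BITS[7:6] of the first command byte, as the
microchip,hw-device-address description puts it) can be sketched roughly
as below. This is only an illustration under assumptions: the helper names,
the macros and the meaning of the low six bits are hypothetical, and the
actual AD5529R frame layout is not taken from a datasheet here.

```c
#include <stdint.h>

/* Assumed layout: device address in bits [7:6], remaining command in [5:0]. */
#define DEV_ADDR_SHIFT	6
#define DEV_ADDR_MASK	0x3u
#define CMD_LOW_MASK	0x3fu

/* Hypothetical helper: pack a 2-bit device address into a command byte. */
static uint8_t build_command_byte(unsigned int dev_addr, unsigned int low_bits)
{
	return (uint8_t)(((dev_addr & DEV_ADDR_MASK) << DEV_ADDR_SHIFT) |
			 (low_bits & CMD_LOW_MASK));
}

/*
 * Hypothetical model of the device side: a chip only reacts when the
 * address in the command byte matches the value strapped on its ID pins.
 */
static int command_is_for_device(uint8_t cmd, unsigned int id_pins)
{
	return ((unsigned int)cmd >> DEV_ADDR_SHIFT) == (id_pins & DEV_ADDR_MASK);
}
```

With this layout, up to four devices sharing one chip select each see every
transfer, and only the one whose ID pins match the top two bits responds,
which is why the per-device address has to live somewhere in the binding.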

-- 
Kind regards,

Rodrigo Alencar

^ permalink raw reply

* [PATCH v5 2/2] cpufreq: CPPC: add autonomous mode boot parameter support
From: Sumit Gupta @ 2026-06-23  8:06 UTC (permalink / raw)
  To: rafael, viresh.kumar, pierre.gondois, ionela.voinescu,
	zhenglifeng1, zhanjie9, corbet, skhan, rdunlap, mario.limonciello,
	linux-kernel, linux-pm, linux-doc, linux-tegra
  Cc: treding, jonathanh, vsethi, ksitaraman, sanjayc, mochs, bbasu,
	sumitg
In-Reply-To: <20260623080652.3353386-1-sumitg@nvidia.com>

Add a kernel boot parameter 'cppc_cpufreq.auto_sel_mode' to enable
CPPC autonomous performance selection on all CPUs at system startup.
When autonomous mode is enabled, the hardware automatically adjusts
CPU performance based on workload demands using Energy Performance
Preference (EPP) hints.

When the parameter is set:
- Configure all CPUs for autonomous operation on first init
- Use HW min/max_perf when available; otherwise initialize from caps
- Initialize desired_perf to max_perf as a starting hint
- Hardware controls frequency instead of the OS governor
- EPP behavior depends on parameter value:
  - performance (or 1):         override EPP to performance (0x0)
  - balance_performance (or 2): override EPP to balance_performance
                                (0x80)
  - default_epp (or 3):         preserve EPP value programmed by
                                BIOS/firmware

Unset, "0"/"disabled", or an unrecognized value leaves autonomous
selection disabled.
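
As a rough illustration of the value mapping described above, here is a
userspace sketch that can be compiled and tested outside the kernel. It is
not the code added by this patch: streq_nl() is a simplified stand-in for
sysfs_streq(), and parse_auto_sel_mode() is a hypothetical helper name used
only for this example.

```c
#include <stdbool.h>
#include <string.h>

/* Mode values mirror the enum introduced by the patch. */
enum auto_sel_mode {
	AUTO_SEL_DISABLED = 0,
	AUTO_SEL_PERFORMANCE = 1,
	AUTO_SEL_BALANCE_PERFORMANCE = 2,
	AUTO_SEL_DEFAULT_EPP = 3,
};

/*
 * Simplified userspace stand-in for sysfs_streq(): the strings compare
 * equal if they match after dropping one trailing newline from each.
 */
static bool streq_nl(const char *a, const char *b)
{
	size_t la = strlen(a), lb = strlen(b);

	if (la && a[la - 1] == '\n')
		la--;
	if (lb && b[lb - 1] == '\n')
		lb--;
	return la == lb && memcmp(a, b, la) == 0;
}

/* Hypothetical helper: anything unrecognized falls back to disabled. */
static enum auto_sel_mode parse_auto_sel_mode(const char *val)
{
	if (streq_nl(val, "performance") || streq_nl(val, "1"))
		return AUTO_SEL_PERFORMANCE;
	if (streq_nl(val, "balance_performance") || streq_nl(val, "2"))
		return AUTO_SEL_BALANCE_PERFORMANCE;
	if (streq_nl(val, "default_epp") || streq_nl(val, "3"))
		return AUTO_SEL_DEFAULT_EPP;
	return AUTO_SEL_DISABLED;
}
```

Falling back to disabled for unknown strings keeps a mistyped command line
from silently switching the platform into autonomous mode.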

The boot parameter is applied only during first policy initialization.
Skip applying it on CPU hotplug to preserve runtime sysfs configuration.

This relies on commit 8c83947c5dbb ("cpufreq: Use policy->min/max init as
QoS request") so that the policy->min/max set in cppc_cpufreq_cpu_init()
are used as the policy's QoS requests and not overridden by
cpufreq_set_policy() during init.

Signed-off-by: Sumit Gupta <sumitg@nvidia.com>
---
 .../admin-guide/kernel-parameters.txt         |  22 +++
 drivers/cpufreq/cppc_cpufreq.c                | 151 +++++++++++++++++-
 include/acpi/cppc_acpi.h                      |   1 +
 3 files changed, 169 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b5493a7f8f22..88820d34d516 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1019,6 +1019,28 @@ Kernel parameters
 			policy to use. This governor must be registered in the
 			kernel before the cpufreq driver probes.
 
+	cppc_cpufreq.auto_sel_mode=
+			[CPU_FREQ] Enable ACPI CPPC autonomous performance
+			selection. When enabled, hardware automatically adjusts
+			CPU frequency on all CPUs based on workload demands.
+			In Autonomous mode, Energy Performance Preference (EPP)
+			hints guide hardware toward performance (0x0) or energy
+			efficiency (0xff).
+			Requires ACPI CPPC autonomous selection register
+			support.
+			Accepts:
+			  disabled, 0:
+				  cpufreq governors are used (auto_sel disabled)
+			  performance, 1:
+				  enable auto_sel + set EPP to performance (0x0)
+			  balance_performance, 2:
+				  enable auto_sel + set EPP to
+				  balance_performance (0x80)
+			  default_epp, 3:
+				  enable auto_sel, preserve EPP value programmed
+				  by BIOS/firmware
+			Unset or an unrecognized value is treated as disabled.
+
 	cpu_init_udelay=N
 			[X86,EARLY] Delay for N microsec between assert and de-assert
 			of APIC INIT to start processors.  This delay occurs
diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index f7a47576717a..efa673e3830c 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -28,6 +28,55 @@
 
 static struct cpufreq_driver cppc_cpufreq_driver;
 
+/* Autonomous Selection boot parameter modes */
+enum {
+	AUTO_SEL_DISABLED = 0,
+	AUTO_SEL_PERFORMANCE = 1,
+	AUTO_SEL_BALANCE_PERFORMANCE = 2,
+	AUTO_SEL_DEFAULT_EPP = 3,
+};
+
+static int auto_sel_mode;
+
+static int auto_sel_mode_set(const char *val, const struct kernel_param *kp)
+{
+	int *mode = kp->arg;
+
+	*mode = AUTO_SEL_DISABLED;
+
+	if (sysfs_streq(val, "performance") || sysfs_streq(val, "1"))
+		*mode = AUTO_SEL_PERFORMANCE;
+	else if (sysfs_streq(val, "balance_performance") || sysfs_streq(val, "2"))
+		*mode = AUTO_SEL_BALANCE_PERFORMANCE;
+	else if (sysfs_streq(val, "default_epp") || sysfs_streq(val, "3"))
+		*mode = AUTO_SEL_DEFAULT_EPP;
+	else if (!sysfs_streq(val, "disabled") && !sysfs_streq(val, "0"))
+		pr_warn("Invalid auto_sel_mode \"%s\", disable auto select\n", val);
+
+	return 0;
+}
+
+static int auto_sel_mode_get(char *buffer, const struct kernel_param *kp)
+{
+	int *mode = kp->arg;
+
+	switch (*mode) {
+	case AUTO_SEL_PERFORMANCE:
+		return sysfs_emit(buffer, "performance\n");
+	case AUTO_SEL_BALANCE_PERFORMANCE:
+		return sysfs_emit(buffer, "balance_performance\n");
+	case AUTO_SEL_DEFAULT_EPP:
+		return sysfs_emit(buffer, "default_epp\n");
+	default:
+		return sysfs_emit(buffer, "disabled\n");
+	}
+}
+
+static const struct kernel_param_ops auto_sel_mode_ops = {
+	.set = auto_sel_mode_set,
+	.get = auto_sel_mode_get,
+};
+
 #ifdef CONFIG_ACPI_CPPC_CPUFREQ_FIE
 static enum {
 	FIE_UNSET = -1,
@@ -645,7 +694,9 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
 	unsigned int cpu = policy->cpu;
 	struct cppc_cpudata *cpu_data;
 	struct cppc_perf_caps *caps;
+	bool set_epp = true;
 	int ret;
+	u32 epp;
 
 	cpu_data = cppc_cpufreq_get_cpu_data(cpu);
 	if (!cpu_data) {
@@ -715,11 +766,87 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
 	policy->cur = cppc_perf_to_khz(caps, caps->highest_perf);
 	cpu_data->perf_ctrls.desired_perf =  caps->highest_perf;
 
-	ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
-	if (ret) {
-		pr_debug("Err setting perf value:%d on CPU:%d. ret:%d\n",
-			 caps->highest_perf, cpu, ret);
-		goto out;
+	/*
+	 * Enable autonomous mode on first init if boot param is set.
+	 * Check last_governor to detect first init and skip if auto_sel
+	 * is already enabled.
+	 */
+	if (auto_sel_mode && policy->last_governor[0] == '\0' &&
+	    !cpu_data->perf_ctrls.auto_sel) {
+		/* Init min/max_perf from caps if not already set by HW. */
+		if (!cpu_data->perf_ctrls.min_perf)
+			cpu_data->perf_ctrls.min_perf = caps->lowest_nonlinear_perf;
+		if (!cpu_data->perf_ctrls.max_perf)
+			cpu_data->perf_ctrls.max_perf = policy->boost_enabled ?
+				caps->highest_perf : caps->nominal_perf;
+
+		/*
+		 * In autonomous mode desired_perf is only a hint; EPP and
+		 * the platform drive actual selection within [min, max].
+		 * Initialize it to max_perf so HW starts at the upper bound.
+		 */
+		cpu_data->perf_ctrls.desired_perf = cpu_data->perf_ctrls.max_perf;
+
+		policy->cur = cppc_perf_to_khz(caps,
+					       cpu_data->perf_ctrls.desired_perf);
+
+		/*
+		 * Set EPP per mode. 'default_epp' preserves the BIOS/firmware
+		 * programmed EPP value. EPP is optional - some platforms may
+		 * not support it.
+		 */
+		switch (auto_sel_mode) {
+		case AUTO_SEL_PERFORMANCE:
+			epp = CPPC_EPP_PERFORMANCE_PREF;
+			break;
+		case AUTO_SEL_BALANCE_PERFORMANCE:
+			epp = CPPC_EPP_BALANCE_PERFORMANCE_PREF;
+			break;
+		default:
+			set_epp = false;
+			break;
+		}
+
+		if (set_epp) {
+			ret = cppc_set_epp(cpu, epp);
+			if (ret && ret != -EOPNOTSUPP)
+				pr_warn("Failed to set EPP for CPU%d (%d)\n", cpu, ret);
+			else if (!ret)
+				cpu_data->perf_ctrls.energy_perf = epp;
+		}
+
+		/* Program min/max/desired into CPPC regs (non-fatal on failure). */
+		ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
+		if (ret)
+			pr_warn("set_perf failed CPU%d (%d); using HW values\n",
+				cpu, ret);
+
+		ret = cppc_set_auto_sel(cpu, true);
+		if (ret && ret != -EOPNOTSUPP)
+			pr_warn("auto_sel CPU%d failed (%d); using OS mode\n",
+				cpu, ret);
+		else if (!ret)
+			cpu_data->perf_ctrls.auto_sel = true;
+	}
+
+	if (cpu_data->perf_ctrls.auto_sel) {
+		/* Sync policy limits from HW when autonomous mode is active */
+		policy->min = cppc_perf_to_khz(caps,
+					       cpu_data->perf_ctrls.min_perf ?:
+					       caps->lowest_nonlinear_perf);
+		policy->max = cppc_perf_to_khz(caps,
+					       cpu_data->perf_ctrls.max_perf ?:
+					       (policy->boost_enabled ?
+						caps->highest_perf :
+						caps->nominal_perf));
+	} else {
+		/* Normal mode: governors control frequency */
+		ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
+		if (ret) {
+			pr_debug("Err setting perf value:%d on CPU:%d. ret:%d\n",
+				 caps->highest_perf, cpu, ret);
+			goto out;
+		}
 	}
 
 	cppc_cpufreq_cpu_fie_init(policy);
@@ -1066,10 +1193,24 @@ static int __init cppc_cpufreq_init(void)
 
 static void __exit cppc_cpufreq_exit(void)
 {
+	unsigned int cpu;
+
+	for_each_present_cpu(cpu)
+		cppc_set_auto_sel(cpu, false);
+
 	cpufreq_unregister_driver(&cppc_cpufreq_driver);
 	cppc_freq_invariance_exit();
 }
 
+module_param_cb(auto_sel_mode, &auto_sel_mode_ops, &auto_sel_mode, 0444);
+MODULE_PARM_DESC(auto_sel_mode,
+		 "Enable CPPC autonomous performance selection at boot: "
+		 "disabled or 0 (use cpufreq governors), "
+		 "performance or 1 (EPP=performance), "
+		 "balance_performance or 2 (EPP=balance_performance), "
+		 "default_epp or 3 (preserve BIOS/firmware EPP); "
+		 "an unrecognized value is treated as disabled");
+
 module_exit(cppc_cpufreq_exit);
 MODULE_AUTHOR("Ashwin Chaugule");
 MODULE_DESCRIPTION("CPUFreq driver based on the ACPI CPPC v5.0+ spec");
diff --git a/include/acpi/cppc_acpi.h b/include/acpi/cppc_acpi.h
index 8693890a7275..9b18fb9aab7c 100644
--- a/include/acpi/cppc_acpi.h
+++ b/include/acpi/cppc_acpi.h
@@ -42,6 +42,7 @@
 #define CPPC_AUTO_ACT_WINDOW_SIG_CARRY_THRESH 129
 
 #define CPPC_EPP_PERFORMANCE_PREF		0x00
+#define CPPC_EPP_BALANCE_PERFORMANCE_PREF	0x80
 #define CPPC_EPP_ENERGY_EFFICIENCY_PREF		0xFF
 
 #define CPPC_PERF_LIMITED_DESIRED_EXCURSION	BIT(0)
-- 
2.34.1
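
For readers following the thread, the mode handling described in MODULE_PARM_DESC above can be sketched in plain userspace C. This is only an illustration, not the driver code: parse_auto_sel_mode() and auto_sel_mode_to_epp() are hypothetical helpers, AUTO_SEL_DISABLED and the numeric enum values are assumed from the parameter description, and strcmp() stands in for whatever string matching the real .set handler uses. The EPP constants are copied from the cppc_acpi.h hunk.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* EPP encodings as defined in include/acpi/cppc_acpi.h by this patch. */
#define CPPC_EPP_PERFORMANCE_PREF		0x00
#define CPPC_EPP_BALANCE_PERFORMANCE_PREF	0x80

/* Numeric values assumed from the MODULE_PARM_DESC text (0..3). */
enum auto_sel_mode {
	AUTO_SEL_DISABLED = 0,		/* assumed name, not shown in this hunk */
	AUTO_SEL_PERFORMANCE = 1,
	AUTO_SEL_BALANCE_PERFORMANCE = 2,
	AUTO_SEL_DEFAULT_EPP = 3,
};

/* Hypothetical parser: accepts a name or its number, else "disabled". */
static enum auto_sel_mode parse_auto_sel_mode(const char *s)
{
	if (!strcmp(s, "performance") || !strcmp(s, "1"))
		return AUTO_SEL_PERFORMANCE;
	if (!strcmp(s, "balance_performance") || !strcmp(s, "2"))
		return AUTO_SEL_BALANCE_PERFORMANCE;
	if (!strcmp(s, "default_epp") || !strcmp(s, "3"))
		return AUTO_SEL_DEFAULT_EPP;
	/* "disabled", "0" and anything unrecognized all map to disabled. */
	return AUTO_SEL_DISABLED;
}

/*
 * Hypothetical helper mirroring the switch in cppc_cpufreq_cpu_init():
 * returns true and fills *epp when the mode dictates an EPP value,
 * returns false (leaving *epp untouched) when the firmware value is kept.
 */
static bool auto_sel_mode_to_epp(enum auto_sel_mode mode, uint8_t *epp)
{
	switch (mode) {
	case AUTO_SEL_PERFORMANCE:
		*epp = CPPC_EPP_PERFORMANCE_PREF;
		return true;
	case AUTO_SEL_BALANCE_PERFORMANCE:
		*epp = CPPC_EPP_BALANCE_PERFORMANCE_PREF;
		return true;
	default:
		return false;
	}
}
```

With this sketch, "default_epp" enables autonomous selection without writing EPP at all, which is the same outcome as the set_epp = false path in the patch.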



* [PATCH v5 1/2] cpufreq: CPPC: Set CPPC Enable register in cpu_init
From: Sumit Gupta @ 2026-06-23  8:06 UTC (permalink / raw)
  To: rafael, viresh.kumar, pierre.gondois, ionela.voinescu,
	zhenglifeng1, zhanjie9, corbet, skhan, rdunlap, mario.limonciello,
	linux-kernel, linux-pm, linux-doc, linux-tegra
  Cc: treding, jonathanh, vsethi, ksitaraman, sanjayc, mochs, bbasu,
	sumitg
In-Reply-To: <20260623080652.3353386-1-sumitg@nvidia.com>

As per ACPI 6.x, Section 8.4.6.1.4 (CPPC Enable register):
  "If supported by the platform, OSPM writes a one to this register
   to enable CPPC on this processor. If not implemented, OSPM assumes
   the platform always has CPPC enabled."

Call cppc_set_enable() at the start of cppc_cpufreq_cpu_init() so
this is done for both OS-driven and autonomous CPPC control modes.
Errors are logged but non-fatal as the register is optional.

Signed-off-by: Sumit Gupta <sumitg@nvidia.com>
---
 drivers/cpufreq/cppc_cpufreq.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index f6cea0c54dd9..f7a47576717a 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -655,6 +655,14 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
 	caps = &cpu_data->perf_caps;
 	policy->driver_data = cpu_data;
 
+	/*
+	 * Enable CPPC for both OS-driven and autonomous modes.
+	 * The Enable register is optional; some platforms may not support it.
+	 */
+	ret = cppc_set_enable(cpu, true);
+	if (ret && ret != -EOPNOTSUPP)
+		pr_warn("Failed to enable CPPC for CPU%d (%d)\n", cpu, ret);
+
 	/*
 	 * Set min to lowest nonlinear perf to avoid any efficiency penalty (see
 	 * Section 8.4.7.1.1.5 of ACPI 6.1 spec)
-- 
2.34.1
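
The error policy in this patch (warn on real failures, stay silent when the optional register is absent, never abort init) can be sketched outside the kernel as follows. This is a minimal illustration under stated assumptions: enable_cppc_optional() and the three stub callbacks are hypothetical, and a function pointer stands in for cppc_set_enable() so the behaviour can be exercised in userspace.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for cppc_set_enable(); returns 0 or a negative errno. */
typedef int (*set_enable_fn)(unsigned int cpu, bool enable);

/*
 * Hypothetical wrapper: try to enable CPPC and report whether a warning
 * was emitted. -EOPNOTSUPP means the Enable register is not implemented,
 * in which case the platform is assumed to have CPPC always enabled.
 * No return value is treated as fatal by the caller.
 */
static bool enable_cppc_optional(unsigned int cpu, set_enable_fn set_enable)
{
	int ret = set_enable(cpu, true);

	if (ret && ret != -EOPNOTSUPP) {
		fprintf(stderr, "Failed to enable CPPC for CPU%u (%d)\n",
			cpu, ret);
		return true;
	}
	return false;
}

/* Illustrative stubs covering the three outcomes. */
static int stub_ok(unsigned int cpu, bool enable)
{
	(void)cpu; (void)enable;
	return 0;
}

static int stub_unsupported(unsigned int cpu, bool enable)
{
	(void)cpu; (void)enable;
	return -EOPNOTSUPP;
}

static int stub_io_error(unsigned int cpu, bool enable)
{
	(void)cpu; (void)enable;
	return -EIO;
}
```

Only the stub_io_error case produces a warning; both success and -EOPNOTSUPP proceed quietly, matching the "logged but non-fatal" wording of the commit message.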

