* Re: [RFC PATCH v2 06/10] kvm: guest_memfd: Add support for freezing and unfreezing mappings
From: tarunsahu @ 2026-06-23 14:03 UTC
To: Sean Christopherson, Ackerley Tng
Cc: Jonathan Corbet, vannapurve, fvdl, Pasha Tatashin, Shuah Khan,
sagis, aneesh.kumar, skhawaja, vipinsh, Pratyush Yadav, david,
dmatlack, mark.rutland, Paolo Bonzini, Mike Rapoport,
Alexander Graf, axelrasmussen, linux-kselftest, kexec,
linux-kernel, linux-doc, kvm, linux-mm
In-Reply-To: <ajnOnzdknfwbuJ9g@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Mon, Jun 22, 2026, Ackerley Tng wrote:
>> Tarun Sahu <tarunsahu@google.com> writes:
>>
>> > This patch introduces the freeze on gmem_inode which prevents
>>
>> Can't find the reference now, but commit messages should take the
>> imperative mood and avoid "this patch" [*]
>
> From Documentation/process/submitting-patches.rst:
>
> Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
> instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
> to do frotz", as if you are giving orders to the codebase to change
> its behaviour.
>
> Documentation/process/maintainer-tip.rst and Documentation/process/maintainer-kvm-x86.rst
> elaborate more on the preferred style (I do most of the guest_memfd maintenance,
> and so for all intents and purposes it's bound by KVM x86 "rules").
Thanks! Will take care of that.
* Re: [RFC PATCH v2 06/10] kvm: guest_memfd: Add support for freezing and unfreezing mappings
From: tarunsahu @ 2026-06-23 14:02 UTC
To: Ackerley Tng, Jonathan Corbet, vannapurve, fvdl, Pasha Tatashin,
Shuah Khan, sagis, aneesh.kumar, skhawaja, vipinsh,
Pratyush Yadav, david, dmatlack, mark.rutland, Paolo Bonzini,
Mike Rapoport, Alexander Graf, seanjc, axelrasmussen
Cc: linux-kselftest, kexec, linux-kernel, linux-doc, kvm, linux-mm
In-Reply-To: <CAEvNRgFEHciT3T9y+qEYRvXhDwfrggoU7Rm=f9hT3OrV+wgpNQ@mail.gmail.com>
Thanks for reviewing!
Ackerley Tng <ackerleytng@google.com> writes:
> Tarun Sahu <tarunsahu@google.com> writes:
>
>> This patch introduces the freeze on gmem_inode which prevents
>
> Can't find the reference now, but commit messages should take the
> imperative mood and avoid "this patch" [*]
>
> [*] https://lore.kernel.org/all/YKRWNaqzo4GVDxHP@google.com/
>
ACK. Will take care of it.
>> the fallocate call and any new page fault allocation. This will avoid
>> gmem file modification when it is being preserved
>>
>> Use an SRCU lock to synchronise the freeze call: the writer blocks
>> until all readers have finished, and readers are re-entrant.
>>
>> In case a fault fails, it returns -EPERM and VM_EXITs to userspace. Userspace
>> must handle this properly, as every new fault will fail.
>>
>> Signed-off-by: Tarun Sahu <tarunsahu@google.com>
>>
>> [...snip...]
>>
>> @@ -105,12 +108,20 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>> if (!IS_ERR(folio))
>> return folio;
>>
>> + idx = srcu_read_lock(&kvm_gmem_freeze_srcu);
>> + if (kvm_gmem_is_frozen(inode)) {
>> + srcu_read_unlock(&kvm_gmem_freeze_srcu, idx);
>> + return ERR_PTR(-EPERM);
>> + }
>> +
>> policy = mpol_shared_policy_lookup(&GMEM_I(inode)->policy, index);
>> folio = __filemap_get_folio_mpol(inode->i_mapping, index,
>> FGP_LOCK | FGP_CREAT,
>> mapping_gfp_mask(inode->i_mapping), policy);
>> mpol_cond_put(policy);
>>
>> + srcu_read_unlock(&kvm_gmem_freeze_srcu, idx);
>> +
>> /*
>> * External interfaces like kvm_gmem_get_pfn() support dealing
>> * with hugepages to a degree, but internally, guest_memfd currently
>> @@ -273,16 +284,30 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
>> static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
>> loff_t len)
>> {
>> + struct inode *inode = file_inode(file);
>> int ret;
>> + int idx;
>>
>> - if (!(mode & FALLOC_FL_KEEP_SIZE))
>> - return -EOPNOTSUPP;
>> + idx = srcu_read_lock(&kvm_gmem_freeze_srcu);
>> + if (kvm_gmem_is_frozen(inode)) {
>> + srcu_read_unlock(&kvm_gmem_freeze_srcu, idx);
>> + return -EPERM;
>> + }
>
> fallocate may eventually go to kvm_gmem_get_folio(), so that would check
> kvm_gmem_is_frozen() twice. Is this meant to catch the punch hole case?
>
Right, this is to catch the punch hole case. Since the SRCU read lock is
re-entrant, I blocked the fallocate call completely.
>>
>> - if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
>> - return -EOPNOTSUPP;
>> + if (!(mode & FALLOC_FL_KEEP_SIZE)) {
>> + ret = -EOPNOTSUPP;
>> + goto out;
>> + }
>>
>> - if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
>> - return -EINVAL;
>> + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) {
>> + ret = -EOPNOTSUPP;
>> + goto out;
>> + }
>> +
>> + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) {
>> + ret = -EINVAL;
>> + goto out;
>> + }
>
> There's some reordering here. Why not let the validation happen like
> before, then check kvm_gmem_is_frozen()?
>
>>
>> if (mode & FALLOC_FL_PUNCH_HOLE)
>> ret = kvm_gmem_punch_hole(file_inode(file), offset, len);
>>
>> [...snip...]
>>
>> +
>> +/**
>> + * kvm_gmem_freeze - Freeze or unfreeze a guest_memfd inode mapping.
>> + * @inode: The guest_memfd inode.
>> + * @freeze: True to freeze, false to unfreeze.
>> + *
>> + * This API is used strictly during the live update / preservation transition
>> + * window to prevent host userspace and guest-side faults from making any
>> + * mapping modifications (such as fallocate or page fault allocation)
>> + * to the guest_memfd page cache.
>> + *
>> + * Synchronization Strategy (Sleepable RCU):
>> + * To avoid high-contention VFS locks (like inode_lock or
>> + * filemap_invalidate_lock) on the vCPU page fault hot paths, this subsystem
>> + * implements a lightweight, system-wide Sleepable RCU (SRCU) mechanism
>> + * (`kvm_gmem_freeze_srcu`):
>> + *
>> + * Global vs. Per-Inode SRCU
>> + * ======================
>> + * A single system-wide global static `srcu_struct` is used instead of a
>> + * per-inode SRCU structure to completely prevent unprivileged users from
>> + * exhausting the host's per-CPU memory allocator. Because
>> + * `init_srcu_struct()` allocates per-CPU memory via `alloc_percpu()`, which
>> + * is not accounted by memory cgroups (memcg),
>> + * a per-inode SRCU structure would allow a tenant to bypass cgroup limits and
>> + * trigger a system-wide Out-of-Memory (OOM) crash simply by spawning a large
>> + * number of guest_memfd file descriptors (bounded only by RLIMIT_NOFILE).
>> + *
>> + * Flag Modification Note:
>> + * Since `GUEST_MEMFD_F_MAPPING_FROZEN` is the ONLY flag in
>> + * `GMEM_I(inode)->flags` that is mutated dynamically at runtime (all other
>> + * flags are creation-time flags which remain strictly read-only), there is
>> + * no possibility of concurrent bit-modification races. Therefore, a standard
>> + * `WRITE_ONCE` is fully safe and does not require complex `cmpxchg`
>> + * synchronization loops.
>> + */
>> +void kvm_gmem_freeze(struct inode *inode, bool freeze)
>> +{
>> + u64 flags = READ_ONCE(GMEM_I(inode)->flags);
>> +
>> + if (freeze)
>> + flags |= GUEST_MEMFD_F_MAPPING_FROZEN;
>> + else
>> + flags &= ~GUEST_MEMFD_F_MAPPING_FROZEN;
>> +
>> + WRITE_ONCE(GMEM_I(inode)->flags, flags);
>> +
>> + if (freeze)
>> + synchronize_srcu(&kvm_gmem_freeze_srcu);
>
> Why only synchronize on freeze but not unfreeze?
It is not needed on unfreeze, because:

Freeze => True
  When a user sets freeze to true, preservation is stalled until all
  ongoing allocations have finished, and future allocations are
  already blocked.

Freeze => False
  When a user unfreezes, in-flight allocation/fallocate calls that
  already observed the frozen flag return -EPERM, and future ones
  succeed because freeze is now false. Synchronizing here would only
  stall the caller; the behaviour does not change.

That is, unless the user expects unfreeze to wait for all in-flight
calls to drain.
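To make that ordering concrete, here is a minimal userspace model, for
illustration only. It is not kernel code and does not use SRCU; the
model_* names are hypothetical, and the reader counter simply stands in
for in-flight SRCU read-side sections.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the patch. */
struct gmem_model {
	bool frozen;	/* stands in for GUEST_MEMFD_F_MAPPING_FROZEN */
	int readers;	/* stands in for in-flight SRCU read-side sections */
};

/* Reader entry: models the read lock plus the frozen check. */
static int model_alloc_begin(struct gmem_model *g)
{
	if (g->frozen)
		return -EPERM;	/* new allocations are refused while frozen */
	g->readers++;
	return 0;
}

/* Reader exit: models dropping the read lock. */
static void model_alloc_end(struct gmem_model *g)
{
	assert(g->readers > 0);
	g->readers--;
}

/* Models only the flag update, without any blocking wait. */
static void model_set_frozen(struct gmem_model *g, bool freeze)
{
	g->frozen = freeze;
}

/*
 * Models the condition a freezer has to wait for: every reader that
 * started before the flag was set has finished. An unfreezer has
 * nothing comparable to wait for, since late readers just see -EPERM.
 */
static bool model_quiesced(const struct gmem_model *g)
{
	return g->readers == 0;
}
```

In this model, a reader that entered before the freeze keeps
model_quiesced() false until it exits, which is the only case where the
caller has to wait.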
>
>> +}
>> +
>>
>> [...snip...]
>>
* Re: [PATCH] docs/mm: clarify that we are not looking for LLM generated content
From: David Hildenbrand (Arm) @ 2026-06-23 13:56 UTC
To: Jonathan Corbet, linux-doc
Cc: Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
Matthew Wilcox, Harry Yoo, linux-mm, linux-kernel
In-Reply-To: <87wlvpct0b.fsf@trenco.lwn.net>
On 6/23/26 14:59, Jonathan Corbet wrote:
> "David Hildenbrand (Arm)" <david@kernel.org> writes:
>
>> On 4/20/26 23:03, David Hildenbrand (Arm) wrote:
>>> Let's make it clear that we are not looking for LLM generated content
>>> from contributors not familiar with the details of MM, as it shifts the
>>> real work onto reviewers.
>>>
>>> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
>>> ---
>>> Documentation/mm/index.rst | 13 +++++++++++++
>>> 1 file changed, 13 insertions(+)
>>>
>>> diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
>>> index 7aa2a8886908..13a79f5d092c 100644
>>> --- a/Documentation/mm/index.rst
>>> +++ b/Documentation/mm/index.rst
>>> @@ -7,6 +7,19 @@ of Linux. If you are looking for advice on simply allocating memory,
>>> see the :ref:`memory_allocation`. For controlling and tuning guides,
>>> see the :doc:`admin guide <../admin-guide/mm/index>`.
>>>
>>> +.. note::
>>> +
>>> + Unfortunately, parts of this guide are still incomplete or missing.
>>> + While we appreciate contributions, documentation in this area is hard
>>> + to get right and requires a lot of attention to detail. New contributors
>>> + should reach out to the relevant maintainers early.
>>> +
>>> + This guide is expected to reflect reality, which requires contributors
>>> + to have a detailed understanding. Documentation generated with LLMs
>>> + by contributors unfamiliar with these details shifts the real work onto
>>> + reviewers, which is why such contributions will be rejected without
>>> + further comment.
>>> +
>>> .. toctree::
>>> :maxdepth: 1
>>>
>>>
>>> ---
>>> base-commit: da6b5aae84beb0917ecb0c9fbc71169d145397ff
>>> change-id: 20260420-llmdoc-21bf5fadbd6f
>>>
>>> Best regards,
>>
>> I assume this was not picked up yet? (via documentation or mm tree?)
>
> I had figured Andrew would grab it; I can certainly do so if you'd like.
Yes please. I guess I'll soon start grabbing stuff myself. Stay tuned. :)
--
Cheers,
David
* Re: [PATCH v7 00/10] tracing/probes: Add more typecast features
From: Masami Hiramatsu @ 2026-06-23 13:54 UTC
To: Masami Hiramatsu (Google)
Cc: Steven Rostedt, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178217904992.643090.15726197350652241270.stgit@devnote2>
On Tue, 23 Jun 2026 10:44:10 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> Hi,
>
> Here is the 7th version of series to introduce more typecast features
> to probe events. The previous version is here:
>
> https://lore.kernel.org/all/178201238795.570818.15573963115625446598.stgit@devnote2/
>
> In this version, I added two new patches (a fix and a cleanup) and made
> updates according to Sashiko's review. [1/10] fixes a long-lived issue
> with @+FOFFS, where the offset was wrongly added twice. [2/10] is a
> cleanup patch that renames a fetch_op (good to dump it).
> This is applicable against the probes/core branch of the linux-trace tree.
I'll take the first 2 patches to probes/core, since those
are an obvious fix and cleanup.
Thanks,
>
> Steve introduced BTF typecast feature for eprobe[1].
> This series extends it and add more options:
>
> 1. Expanding BTF typecast to kprobe and fprobe.
> (currently only function entry/exit)
>
> 2. Introduce container_of like typecast. This adds an "assigned
> member" option to the typecast.
>
> (STRUCT,MEMBER)VAR->ANOTHER_MEMBER
>
> This casts VAR to the STRUCT type, treating VAR as the address
> of STRUCT.MEMBER. In C, it is:
>
> container_of(VAR, STRUCT, MEMBER)->ANOTHER_MEMBER
>
> 3. Support nested typecast, e.g.
>
> (STRUCT)((STRUCT2)VAR->MEMBER2)->MEMBER
>
> the nest level must be smaller than 3.
>
> 4. Add $current variable to point "current" task_struct.
> This is useful with typecast, e.g.
>
> (task_struct)$current->pid
>
> 5. per-cpu dereference support.
>
> Introduce this_cpu_read(VAR) and this_cpu_ptr(VAR) to
> access per-cpu data on the current CPU (accessing other CPU
> data is not stable, because it can be changed.)
>
> You can access the member of per-cpu data structure using
> typecast like:
>
> (STRUCT)this_cpu_ptr(VAR)->MEMBER
>
> And added fetcharg dump feature (for debug) and updated test scripts
> to test part of them.
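As an aside on item 2 above, the container_of equivalence can be
sketched in plain userspace C. The struct, macro and helper names here
are made up for illustration and are not taken from the series:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(), without type checks. */
#define ex_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical embedded member, in the style of a list node. */
struct ex_list_node {
	struct ex_list_node *next;
};

/* Hypothetical outer structure that embeds the node. */
struct ex_task {
	int pid;
	struct ex_list_node link;
};

/* Given only the address of the embedded member, read another member. */
static int ex_pid_from_link(struct ex_list_node *link)
{
	assert(link);
	return ex_container_of(link, struct ex_task, link)->pid;
}
```

With these definitions, the probe-side form from the cover letter would
presumably read (ex_task,link)VAR->pid, where VAR holds the address of
the link member.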
>
> Thanks,
>
> ---
> base-commit: 3ec75d0067f30eb5e0730f033766d6ab2feca7ae
>
> Masami Hiramatsu (Google) (10):
> tracing/probes: Fix double addition of offset for @+FOFFSET
> tracing/probes: Rename FETCH_OP_DATA to FETCH_OP_IMMSTR
> tracing/probes: Support dumping fetcharg program for debugging dynamic events
> tracing/probes: Support typecast for various probe events
> tracing/probes: Support nested typecast
> tracing/probes: Type casting always involves nested calls
> tracing/probes: Support field specifier option for typecast
> tracing/probes: Add $current variable support
> tracing/probes: Add this_cpu_read() and this_cpu_ptr() dereference method to fetcharg
> tracing/probes: Add a new testcase for BTF typecasts
>
>
> Documentation/trace/eprobetrace.rst | 9
> Documentation/trace/fprobetrace.rst | 10
> Documentation/trace/kprobetrace.rst | 11
> kernel/trace/Kconfig | 11
> kernel/trace/trace.c | 8
> kernel/trace/trace_eprobe.c | 2
> kernel/trace/trace_fprobe.c | 2
> kernel/trace/trace_kprobe.c | 2
> kernel/trace/trace_probe.c | 582 ++++++++++++++++----
> kernel/trace/trace_probe.h | 98 ++-
> kernel/trace/trace_probe_tmpl.h | 27 +
> kernel/trace/trace_uprobe.c | 3
> samples/trace_events/trace-events-sample.c | 40 +
> samples/trace_events/trace-events-sample.h | 34 +
> .../ftrace/test.d/dynevent/btf_probe_event.tc | 51 ++
> .../ftrace/test.d/dynevent/fprobe_syntax_errors.tc | 11
> .../ftrace/test.d/kprobe/kprobe_syntax_errors.tc | 11
> .../ftrace/test.d/kprobe/uprobe_syntax_errors.tc | 5
> 18 files changed, 756 insertions(+), 161 deletions(-)
> create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [PATCH v4 3/4] KVM: PPC: Book3S HV: Add support for compat CPU capabilities for KVM on PowerNV
From: Amit Machhiwal @ 2026-06-23 13:31 UTC
To: Vaibhav Jain
Cc: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan,
Anushree Mathur, Paolo Bonzini, Nicholas Piggin, Michael Ellerman,
Christophe Leroy (CS GROUP), Jonathan Corbet, Shuah Khan, kvm,
linux-kernel, linux-doc, lkp
In-Reply-To: <875x3fcb3x.fsf@vajain21.in.ibm.com>
Hi Vaibhav,
Thanks for reviewing this patch. Please find my response inline.
On 2026/06/19 11:42 AM, Vaibhav Jain wrote:
> Hi Amit.
>
> Thanks for the patch and incorporating V3 review comments. Further
> review comments inline below:
>
> Amit Machhiwal <amachhiw@linux.ibm.com> writes:
>
> > Currently, when booting a compatibility-mode KVM guest (L1) on a PowerNV
> > hypervisor (L0), the guest runs with the expected processor
> > compatibility level. However, when booting a nested KVM guest (L2)
> > inside the L1, QEMU derives the CPU model from the raw host PVR and
> > attempts to run the nested guest at that level, instead of honoring the
> > compatibility mode of the L1.
> >
> > Extend host CPU compatibility capability reporting to support nested
> > virtualization on PowerNV systems (PAPR nested API v1).
> >
> > For nested API v2 (PowerVM), compatibility capabilities are obtained
> > from the hypervisor via the H_GUEST_GET_CAPABILITIES hcall. This
> > information is not available on PowerNV systems.
> >
> > For nested API v1, derive the compatibility capabilities from the L1
> > guest by reading the "cpu-version" property from the device tree, which
> > reflects the effective (logical) processor compatibility level. Map this
> > value to the corresponding compatibility capability bitmap using
> > KVM-specific constants.
> >
> > Introduce a helper to translate CPU version values into KVM_PPC_COMPAT_CAP
> > bits and integrate it into kvmppc_get_compat_caps(). The implementation
> > applies masking to ensure only supported processor modes are exposed.
> >
> > This allows userspace to query host CPU compatibility modes on both
> > PowerVM and PowerNV platforms via the KVM_PPC_GET_COMPAT_CAPS ioctl.
> >
> > Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> > Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> > ---
> > arch/powerpc/kvm/book3s_hv.c | 37 +++++++++++++++++++++++++++++++++++-
> > 1 file changed, 36 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index f674386df62c..375e7a7fa9f8 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -6523,15 +6523,50 @@ static bool kvmppc_hash_v3_possible(void)
> > return true;
> > }
> >
> > +static int kvmppc_map_compat_capabilities(const __be32 cpu_version,
> > + unsigned long *capabilities)
> > +{
> > + switch (cpu_version) {
> > + case PVR_ARCH_31_P11:
> > + *capabilities |= KVM_PPC_COMPAT_CAP_POWER11;
> Do you need to do 'break' here instead of falling through? A P11
> host can support P10 and P9 compat modes.
I had addressed a similar comment from Harsh in v1 of the series here:
https://lore.kernel.org/all/20260507202740.96fb259f-22-amachhiw@linux.ibm.com/
The current implementation with break statements is intentional. This
function (kvmppc_map_compat_capabilities()) is called only when booting
a nested KVM guest (L2) on **KVM on PowerNV**.
When the L1 KVM guest is booted in a compat mode, L2 is supposed to boot
with the **same PVR version** as that of the L1, which is already taken
care of with the current changes. If L2 needs to boot with a different
*lower* compat mode, it would use max-cpu-compat, which takes a
different code path for setting the compat.
Even if I included all lower compat modes in the compat caps for **APIv1**,
I don't think we would use those lower compat bits unless we wanted to
block a specific older compat mode for a given PVR level, which we are
neither doing in this series nor likely to want as a restriction for
APIv1.
Please let me know if you think otherwise.
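For comparison, the two behaviours under discussion can be sketched side
by side in userspace C. The EX_* names and values below are hypothetical
stand-ins; the real PVR_ARCH_* and KVM_PPC_COMPAT_CAP_* values are not
assumed here.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the logical CPU versions. */
enum ex_cpu_version { EX_ARCH_300 = 1, EX_ARCH_31, EX_ARCH_31_P11 };

/* Hypothetical stand-ins for the capability bits. */
#define EX_CAP_POWER9	(1UL << 0)
#define EX_CAP_POWER10	(1UL << 1)
#define EX_CAP_POWER11	(1UL << 2)

/* Exact-level mapping: every case breaks, as in the posted patch. */
static int ex_map_exact(enum ex_cpu_version v, unsigned long *capabilities)
{
	assert(capabilities);

	switch (v) {
	case EX_ARCH_31_P11:
		*capabilities |= EX_CAP_POWER11;
		break;
	case EX_ARCH_31:
		*capabilities |= EX_CAP_POWER10;
		break;
	case EX_ARCH_300:
		*capabilities |= EX_CAP_POWER9;
		break;
	default:
		return -EINVAL;
	}

	return 0;
}

/* Cumulative mapping: falls through so lower levels are reported too. */
static int ex_map_cumulative(enum ex_cpu_version v, unsigned long *capabilities)
{
	assert(capabilities);

	switch (v) {
	case EX_ARCH_31_P11:
		*capabilities |= EX_CAP_POWER11;
		/* fall through */
	case EX_ARCH_31:
		*capabilities |= EX_CAP_POWER10;
		/* fall through */
	case EX_ARCH_300:
		*capabilities |= EX_CAP_POWER9;
		break;
	default:
		return -EINVAL;
	}

	return 0;
}
```

The exact-level variant reports only the level the L1 is running at,
which matches the intent described above; the cumulative variant is the
alternative raised in review.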
>
> > + break;
> > + case PVR_ARCH_31:
> > + *capabilities |= KVM_PPC_COMPAT_CAP_POWER10;
> > + break;
> > + case PVR_ARCH_300:
> > + *capabilities |= KVM_PPC_COMPAT_CAP_POWER9;
> > + break;
> > + default:
> > + return -EINVAL;
> > + }
> > +
> > + return 0;
> > +}
> >
> > static int kvmppc_get_compat_caps(struct kvm_ppc_compat_caps *host_caps)
> > {
> > + struct device_node *np;
> > unsigned long capabilities = 0;
> > + const __be32 *prop = NULL;
> > long rc = -EINVAL;
> > + u32 cpu_version;
> >
> > if (kvmhv_on_pseries()) {
> > - if (kvmhv_is_nestedv2())
> > + if (kvmhv_is_nestedv2()) {
> > rc = plpar_guest_get_capabilities(0, &capabilities);
> > + } else {
> > + for_each_node_by_type(np, "cpu") {
> > + prop = of_get_property(np, "cpu-version", NULL);
> > + if (prop) {
> > + cpu_version = be32_to_cpup(prop);
> > + break;
> > + }
> > + }
> > + if (!prop)
> > + return -EINVAL;
> > + rc = kvmppc_map_compat_capabilities(cpu_version,
> > + &capabilities);
> > + }
> should you check for 'rc' error here before assigning 'capabilities' to
> 'host_caps->compat_capabilities' . I understand it will be set to '0'
> due to its initialization at the top of the function. But would be
> better to make it more explicit
Sure. The return value rc is checked by the caller but more error
checking is always good I guess. :)
I'll add a check for rc something like this (or something similar):
	if (rc) {
		return -EINVAL;
	}
	host_caps->compat_capabilities = capabilities &
					 KVM_PPC_COMPAT_BITMASK;
Thanks,
Amit
>
> > host_caps->compat_capabilities = capabilities &
> > KVM_PPC_COMPAT_BITMASK;
> > }
> > --
> > 2.50.1 (Apple Git-155)
> >
> >
>
> --
> Cheers
> ~ Vaibhav
* Re: [PATCH 1/4] nfs: store the full NFS fileid in inode->i_ino
From: Mark Brown @ 2026-06-23 13:25 UTC
To: Jeff Layton
Cc: Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
linux-nfs, linux-kernel, linux-doc
In-Reply-To: <e5ebc36c9a7e356c8d1b98ce3a9d1f3420177334.camel@kernel.org>
On Tue, Jun 23, 2026 at 07:04:47AM -0400, Jeff Layton wrote:
> On Mon, 2026-06-22 at 18:38 -0400, Jeff Layton wrote:
> > Note that it's trying to stuff the inode number field into an unsigned
> > long. Before this patch, the maps file would have printed the old
> > (hashed) inode number on 32-bit. Now, it prints the full 64-bit inode
> > number.
...
> > We could argue that this is a bug in the testcase. It assumes that the
> > maps file will never print a value larger than ULONG_MAX in that field,
> > and I don't see why it would make that assumption in this day and age.
It wouldn't be the first LTP test that had a bug in it.
> > Are there actual programs in the field that scrape the maps file that
> > might be affected by this change?
Not to my knowledge.
> This testcase patch should fix it. I'll plan to send this to the LTP
> list, but it would be nice if someone could confirm the fix on arm32:
I'll try to give it a spin, though my test setup for LTP makes that very
awkward (it's embedded into a rootfs image and built as part of that) so
I wouldn't wait for me.
* Re: [PATCH v4 1/5] mm/zswap: Extend shrink_memcg() writeback capability
From: Hao Jia @ 2026-06-23 13:22 UTC
To: Yosry Ahmed
Cc: akpm, tj, hannes, shakeel.butt, mhocko, mkoutny, nphamcs,
chengming.zhou, muchun.song, roman.gushchin, linux-mm,
linux-kernel, linux-doc, Hao Jia
In-Reply-To: <ajnB8IZrFZwbIr9P@google.com>
On 2026/6/23 07:33, Yosry Ahmed wrote:
> On Thu, Jun 18, 2026 at 12:48:53PM +0800, Hao Jia wrote:
>> From: Hao Jia <jiahao1@lixiang.com>
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 761cd699e0a3..d7d031dee4cd 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -160,6 +160,11 @@ struct zswap_pool {
>> char tfm_name[CRYPTO_MAX_ALG_NAME];
>> };
>>
>> +struct zswap_shrink_walk_arg {
>> + unsigned long bytes_written;
>> + bool encountered_page_in_swapcache;
>> +};
>> +
>> /* Global LRU lists shared by all zswap pools. */
>> static struct list_lru zswap_list_lru;
>>
>> @@ -1089,8 +1094,9 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>> void *arg)
>> {
>> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>> - bool *encountered_page_in_swapcache = (bool *)arg;
>> + struct zswap_shrink_walk_arg *walk_arg = arg;
>> swp_entry_t swpentry;
>> + unsigned int length;
>> enum lru_status ret = LRU_REMOVED_RETRY;
>> int writeback_result;
>>
>> @@ -1135,8 +1141,13 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>> * Once the lru lock is dropped, the entry might get freed. The
>> * swpentry is copied to the stack, and entry isn't deref'd again
>> * until the entry is verified to still be alive in the tree.
>> + *
>> + * entry->length is also copied while the lock is held, because
>> + * zswap_writeback_entry() frees the entry on success and we still
>> + * need its compressed size to account for writeback.
>
> Hmm that's unnecessary, just update "The swpentry is copied to the
> stack.." above to "Copy needed fields to the stack.." or something.
I'll do this, thanks.
>
>> */
>> swpentry = entry->swpentry;
>> + length = entry->length;
>>
>> /*
>> * It's safe to drop the lock here because we return either
>> @@ -1155,12 +1166,13 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>> * into the warmer region. We should terminate shrinking (if we're in the dynamic
>> * shrinker context).
>> */
>> - if (writeback_result == -EEXIST && encountered_page_in_swapcache) {
>> + if (writeback_result == -EEXIST) {
>> ret = LRU_STOP;
>> - *encountered_page_in_swapcache = true;
>> + walk_arg->encountered_page_in_swapcache = true;
>> }
>> } else {
>> zswap_written_back_pages++;
>> + walk_arg->bytes_written += length;
>> }
>>
>> return ret;
>> @@ -1169,8 +1181,11 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>> static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
>> struct shrink_control *sc)
>> {
>> + struct zswap_shrink_walk_arg walk_arg = {
>> + .bytes_written = 0,
>> + .encountered_page_in_swapcache = false,
>> + };
>> unsigned long shrink_ret;
>> - bool encountered_page_in_swapcache = false;
>>
>> if (!zswap_shrinker_enabled ||
>> !mem_cgroup_zswap_writeback_enabled(sc->memcg)) {
>> @@ -1179,9 +1194,9 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
>> }
>>
>> shrink_ret = list_lru_shrink_walk(&zswap_list_lru, sc, &shrink_memcg_cb,
>> - &encountered_page_in_swapcache);
>> + &walk_arg);
>>
>> - if (encountered_page_in_swapcache)
>> + if (walk_arg.encountered_page_in_swapcache)
>> return SHRINK_STOP;
>>
>> return shrink_ret ? shrink_ret : SHRINK_STOP;
>> @@ -1275,10 +1290,32 @@ static struct shrinker *zswap_alloc_shrinker(void)
>> return shrinker;
>> }
>>
>> -static int shrink_memcg(struct mem_cgroup *memcg)
>> -{
>> - int nid, shrunk = 0, scanned = 0;
>> +/*
>> + * The maximum acceptable scan cost factor for writing back
>> + * PAGE_SIZE bytes of compressed data.
>> + */
>> +#define ZSWAP_WB_SCAN_FACTOR 16UL
>> +#define NR_ZSWAP_WB_BATCH 64UL
>>
>> +/*
>> + * Iterate over the per-node zswap LRUs of @memcg in batches, writing back
>> + * up to @nr_to_writeback * PAGE_SIZE bytes of compressed data.
>> + *
>> + * Return: The number of bytes written back, or -ENOENT if @memcg has
>> + * writeback disabled, is a zombie cgroup, or has empty zswap LRUs.
>> + */
>> +static long shrink_memcg(struct mem_cgroup *memcg,
>> + unsigned long nr_to_writeback)
>
>
> Is nr_to_writeback supposed to be the number of pages we want to
> writeback (regardless of their compressed size), or the compressed bytes
> we want to writeback divided by PAGE_SIZE?
>
> The way it's being used below seems like it's the latter, but the batch
> size should be in terms of scanned pages (i.e. uncompressed pages). So
> this is confusing.
>
> The zswap_store() path expects to reclaim one uncompressed page, but
> this will reclaim PAGE_SIZE worth of compressed memory when passing 1
> IIUC (actually maybe more, see below).
>
>> +{
>> + struct zswap_shrink_walk_arg walk_arg = {
>> + .bytes_written = 0,
>> + .encountered_page_in_swapcache = false,
>> + };
>> + u64 bytes_to_writeback = nr_to_writeback << PAGE_SHIFT;
>> + bool memcg_list_is_empty = true;
>> + int nid;
>> +
>> + /* Memcg with zswap writeback disabled are not candidates. */
>
> The comment is unnecessary here, it should be obvious.
I'll do this, thanks.
>
>> if (!mem_cgroup_zswap_writeback_enabled(memcg))
>> return -ENOENT;
>>
>> @@ -1290,24 +1327,65 @@ static int shrink_memcg(struct mem_cgroup *memcg)
>> return -ENOENT;
>>
>> for_each_node_state(nid, N_NORMAL_MEMORY) {
>> - unsigned long nr_to_walk = 1;
>> + unsigned long nr_to_scan, nr_scanned = 0;
>> + unsigned long remain;
>> + walk_arg.encountered_page_in_swapcache = false;
>> + /*
>> + * Cap by LRU length: bounds rewalks when referenced
>> + * entries keep rotating to the tail.
>> + */
>> + nr_to_scan = list_lru_count_one(&zswap_list_lru, nid, memcg);
>> + if (!nr_to_scan)
>> + continue;
>
> Hmm generally if we are running out of pages to scan then we should scan
> the rotated entries, and reclaim them on the second pass, right? So this
> should be working as intended. But I guess this doesn't work well when
> iterating multiple memcgs, as we don't want to drain referenced entries
> in one memcg before reclaiming already rotated entries on another.
>
> So I think the assumption here is that the caller will retry if needed,
> handling balancing scanning between multiple memcgs if needed. Maybe we
> should document this in the function doc above? We should explain that
> referenced entries will be rotated but not reclaimed as part of the same
> call.
>
>> + memcg_list_is_empty = false;
>> +
>> + /*
>> + * Cap by SCAN_FACTOR * remain budget: bounds scan cost
>> + * to the remaining writeback budget.
>> + */
>> + remain = DIV_ROUND_UP(bytes_to_writeback - walk_arg.bytes_written, PAGE_SIZE);
>> + nr_to_scan = min(nr_to_scan,
>> + remain * ZSWAP_WB_SCAN_FACTOR);
>
> For the zswap_store() path bytes_to_writeback=PAGE_SIZE, so remain will
> initially be 1. But then we multiply by this factor and now scan 16
> pages? Also, where did this factor and equation come from?
>
> We'll also loop over nodes, so we may end up scanning 32 or more pages
> depending on the number of nodes in the system.
>
> If this is just a heuristic, we should really just start simple and add
> heuristics later as needed. The caller should probably pass in the
> number of pages to scan (i.e. uncompressed pages), and leave it to the
> caller to decide when to retry if the actual memory savings are
> realized.
>
>>
>> - shrunk += list_lru_walk_one(&zswap_list_lru, nid, memcg,
>> - &shrink_memcg_cb, NULL, &nr_to_walk);
>> - scanned += 1 - nr_to_walk;
>> + while (nr_scanned < nr_to_scan) {
>> + unsigned long nr_to_walk = min(NR_ZSWAP_WB_BATCH,
>> + nr_to_scan - nr_scanned);
>> +
>> + /*
>> + * Account for the committed budget rather than the walker's
>> + * actual delta. If the list is emptied concurrently, the
>> + * walker visits nothing and nr_scanned would never advance.
>> + */
>> + nr_scanned += nr_to_walk;
>> +
>> + list_lru_walk_one(&zswap_list_lru, nid, memcg,
>> + &shrink_memcg_cb,
>> + &walk_arg,
>> + &nr_to_walk);
>> +
>> + if (walk_arg.bytes_written >= bytes_to_writeback)
>> + return walk_arg.bytes_written;
>> +
>> + if (walk_arg.encountered_page_in_swapcache)
>> + break;
>> +
>> + cond_resched();
>> + }
>
> If the caller is expected to have a retry loop anyway, should we
> simplify this and just scan each per-node LRU once?
>
> We should also probably bail early if the number of scanned pages has
> already been reached? Currently shrink_memcg() scans one page at a time,
> so if it scans a bit more to balance between the nodes it's probably
> fine.
>
> But with batching, we could end up scanning hundreds of extra pages just
> to balance between all nodes. Is node imbalance a real issue?
>
My initial thought was that if cold memory is evenly distributed across
nodes and we are doing a large writeback, it would be better to balance
the zswap entry writeback across all nodes rather than just draining
node 0 first. However, since we currently lack a proper metric to
represent hot/cold memory (such as age-based tracking), doing this
probably doesn't make much sense right now.
So, perhaps we want something like this? Please correct me if I'm wrong.
static long shrink_memcg(struct mem_cgroup *memcg,
			 unsigned long nr_to_scan)
{
	struct zswap_shrink_walk_arg walk_arg = {
		.bytes_written = 0,
		.encountered_page_in_swapcache = false,
	};
	unsigned long nr_remaining = nr_to_scan;
	bool memcg_list_is_empty = true;
	int nid;

	if (!mem_cgroup_zswap_writeback_enabled(memcg))
		return -ENOENT;

	if (memcg && !mem_cgroup_online(memcg))
		return -ENOENT;

	for_each_node_state(nid, N_NORMAL_MEMORY) {
		unsigned long nr_to_walk;

		/*
		 * Cap the per-node scan by the current LRU length. A referenced
		 * entry is only rotated to the tail (second chance) and may be
		 * revisited within a single walk; without this cap those rotated
		 * entries could drain the shared scan budget on one node.
		 */
		nr_to_walk = min(nr_remaining,
				 list_lru_count_one(&zswap_list_lru, nid, memcg));
		if (!nr_to_walk)
			continue;

		memcg_list_is_empty = false;
		nr_remaining -= nr_to_walk;

		list_lru_walk_one(&zswap_list_lru, nid, memcg,
				  &shrink_memcg_cb, &walk_arg, &nr_to_walk);

		/* Return the unused share of the budget to the pool. */
		nr_remaining += nr_to_walk;

		/* Bail out once the whole scan budget has been spent. */
		if (!nr_remaining)
			break;

		cond_resched();
	}

	if (memcg_list_is_empty)
		return -ENOENT;

	return walk_arg.bytes_written;
}
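To make the budget accounting above easier to check, here is a rough userspace sketch of the same single-pass loop. It is not kernel code: struct node_lru, walk_node() and scan_nodes() are hypothetical stand-ins invented for illustration, and walk_node() only models the way list_lru_walk_one() decrements *nr_to_walk for each visited entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one node's LRU (not a kernel structure). */
struct node_lru {
	unsigned long len;        /* entries currently on the list */
	unsigned long stop_after; /* the walk gives up after this many visits */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Stand-in for list_lru_walk_one(): visit entries and decrement
 * *nr_to_walk once per visited entry. Returns the number visited.
 */
static unsigned long walk_node(const struct node_lru *lru,
			       unsigned long *nr_to_walk)
{
	unsigned long visited = min_ul(*nr_to_walk, lru->stop_after);

	*nr_to_walk -= visited;
	return visited;
}

/*
 * Spend one shared scan budget across all nodes in a single pass.
 * scanned[nid] receives the per-node count; the total is returned.
 */
static unsigned long scan_nodes(const struct node_lru *lrus, size_t nr_nodes,
				unsigned long nr_to_scan,
				unsigned long *scanned)
{
	unsigned long nr_remaining = nr_to_scan;
	size_t nid;

	assert(lrus && scanned);

	for (nid = 0; nid < nr_nodes; nid++)
		scanned[nid] = 0;

	for (nid = 0; nid < nr_nodes; nid++) {
		/* Cap the per-node walk by the current list length. */
		unsigned long nr_to_walk = min_ul(nr_remaining, lrus[nid].len);

		if (!nr_to_walk)
			continue;

		nr_remaining -= nr_to_walk;
		scanned[nid] = walk_node(&lrus[nid], &nr_to_walk);

		/* Return the unused share of the budget to the pool. */
		nr_remaining += nr_to_walk;

		/* Bail out once the whole scan budget has been spent. */
		if (!nr_remaining)
			break;
	}

	return nr_to_scan - nr_remaining;
}
```

In this model a node whose walk stops early hands its leftover share to the next node, and the loop never scans more than nr_to_scan entries in total, which is the "bail early" behaviour asked about above.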
Thanks,
Hao
^ permalink raw reply
* Re: [RFC PATCH] reserve_mem: add support for static memory
From: Pratyush Yadav @ 2026-06-23 13:10 UTC (permalink / raw)
To: Shyam Saini
Cc: linux-mm, linux-doc, linux-kernel, rppt, akpm, kees, tony.luck,
gpiccoli, bp, rdunlap, peterz, feng.tang, dapeng1.mi, elver,
enelsonmoore, kuba, lirongqing, ebiggers
In-Reply-To: <20260618224018.117978-1-shyamsaini@linux.microsoft.com>
On Thu, Jun 18 2026, Shyam Saini wrote:
> reserve_mem relies on dynamic memory allocation, this limits the
> usecase where memory and its address is required to be preserved
> across the boots. Eg: ramoops memory reservation on ACPI platforms
>
> So add support to pass a pre-determined static address and reserve
> memory at this specified address. This enables use case like ramoops
> on ACPI platforms to reliably access ramoops region across the boots.
Doesn't memmap= do exactly this? How is this different?
I always thought the point of reserve_mem was that you _don't_ have to
provide an explicit address, one is chosen for your machine
automatically.
>
> Also skip parsing of "align" parameter when static address is passed.
>
> Example syntax for static address
> reserve_mem=4M@0x1E0000000:oops ramoops.mem_name=oops
>
> Signed-off-by: Shyam Saini <shyamsaini@linux.microsoft.com>
[...]
--
Regards,
Pratyush Yadav
^ permalink raw reply
* Re: [PATCH v4 2/4] KVM: PPC: Book3S HV: Implement compat CPU capability retrieval for KVM on PowerVM
From: Amit Machhiwal @ 2026-06-23 13:01 UTC (permalink / raw)
To: Vaibhav Jain
Cc: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan,
Anushree Mathur, Paolo Bonzini, Nicholas Piggin, Michael Ellerman,
Christophe Leroy (CS GROUP), Jonathan Corbet, Shuah Khan, kvm,
linux-kernel, linux-doc, lkp
In-Reply-To: <878q8bcbh6.fsf@vajain21.in.ibm.com>
Hi Vaibhav,
Thanks for reviewing this patch. My response is inline.
On 2026/06/19 11:34 AM, Vaibhav Jain wrote:
> Hi Amit.
>
> Thanks for the patch and incorporating V3 review comments. Further
> review comments inline below:
>
> Amit Machhiwal <amachhiw@linux.ibm.com> writes:
>
> > On POWER systems, the host CPU may run in a compatibility mode (e.g., a
> > Power11 processor operating in Power10 compatibility mode). In such
> > cases, the effective CPU level exposed to guests differs from the
> > physical processor generation.
> >
> > When running nested KVM guests, QEMU derives the host CPU type using
> > mfpvr(), which reflects the physical processor version. This can result
> > in a mismatch between the CPU model selected by QEMU and the
> > compatibility mode enforced by the host, leading to guest boot failures.
> >
> > For example, booting a nested guest on a Power11 LPAR configured in
> > Power10 compatibility mode fails with:
> >
> > KVM-NESTEDv2: couldn't set guest wide elements
> > [..KVM reg dump..]
> >
> > This occurs because QEMU selects a CPU model corresponding to the
> > physical processor (via mfpvr()), while the host operates in a lower
> > compatibility mode. As a result, KVM rejects the requested compatibility
> > level during guest initialization.
> >
> > Add support for retrieving host CPU compatibility capabilities for
> > nested guests on PowerVM (PAPR nested API v2). The hypervisor provides
> > the effective compatibility levels via the H_GUEST_GET_CAPABILITIES
> > hcall, which reflects the processor modes negotiated between the Power
> > hypervisor (L0) and the host partition (L1).
> >
> > On pseries systems, obtain the capability bitmap using
> > plpar_guest_get_capabilities() and return it via struct
> > kvm_ppc_compat_caps. The implementation defines KVM-specific capability
> > constants (KVM_PPC_COMPAT_CAP_POWER9/10/11) and applies masking to ensure
> > only supported processor modes are exposed to userspace. This information
> > is then exposed through the KVM_PPC_GET_COMPAT_CAPS ioctl.
> >
> > Hook the implementation into the Book3S HV kvmppc_ops so that it can be
> > invoked by the generic KVM ioctl handling code.
> >
> > Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> > Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> > ---
> > arch/powerpc/include/uapi/asm/kvm.h | 11 ++++++++++-
> > arch/powerpc/kvm/book3s_hv.c | 17 +++++++++++++++++
> > 2 files changed, 27 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> > index 8a38be6c3b03..730488681443 100644
> > --- a/arch/powerpc/include/uapi/asm/kvm.h
> > +++ b/arch/powerpc/include/uapi/asm/kvm.h
> > @@ -443,7 +443,16 @@ struct kvm_ppc_compat_caps {
> > __u64 size; /* Size of this structure */
> > __u64 compat_capabilities; /* Capabilities supported by the host */
> > };
> > -
> > +/*
> > + * Capability bits for compat_capabilities field in kvm_ppc_compat_caps.
> > + * These bits indicate which processor compatibility modes are supported.
> > + */
> > +#define KVM_PPC_COMPAT_CAP_POWER9 (1ULL << 62)
> > +#define KVM_PPC_COMPAT_CAP_POWER10 (1ULL << 61)
> > +#define KVM_PPC_COMPAT_CAP_POWER11 (1ULL << 60)
> > +#define KVM_PPC_COMPAT_BITMASK (KVM_PPC_COMPAT_CAP_POWER9 | \
> > + KVM_PPC_COMPAT_CAP_POWER10 | \
> > + KVM_PPC_COMPAT_CAP_POWER11)
> > /*
> > * Values for character and character_mask.
> > * These are identical to the values used by H_GET_CPU_CHARACTERISTICS.
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index f9380ef65750..f674386df62c 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -6523,6 +6523,22 @@ static bool kvmppc_hash_v3_possible(void)
> > return true;
> > }
> >
> > +
> > +static int kvmppc_get_compat_caps(struct kvm_ppc_compat_caps *host_caps)
> > +{
> > + unsigned long capabilities = 0;
> > + long rc = -EINVAL;
> > +
> > + if (kvmhv_on_pseries()) {
> > + if (kvmhv_is_nestedv2())
> > + rc = plpar_guest_get_capabilities(0,
> > &capabilities);
> I think instead of making the hcall you should use the
> 'nested_capabilities' extern symbol as it would already hold the same
> value. This symbol is already accessible in 'book3s_hv.c'
Agreed! Will change to use nested_capabilities directly instead of
making the hcall. This is more efficient: it avoids the hcall
overhead, since the value is already cached during module
initialization (in kvmhv_nested_init()).
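A small userspace sketch of what that change amounts to, under stated assumptions: the three capability bits and the mask are copied from the patch, while fake_get_capabilities_hcall(), nested_init(), cached_caps and get_compat_caps() are hypothetical names that only model "query once at init, then read the cached value". The real code would read nested_capabilities instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit definitions as proposed in the patch. */
#define KVM_PPC_COMPAT_CAP_POWER9	(1ULL << 62)
#define KVM_PPC_COMPAT_CAP_POWER10	(1ULL << 61)
#define KVM_PPC_COMPAT_CAP_POWER11	(1ULL << 60)
#define KVM_PPC_COMPAT_BITMASK		(KVM_PPC_COMPAT_CAP_POWER9 | \
					 KVM_PPC_COMPAT_CAP_POWER10 | \
					 KVM_PPC_COMPAT_CAP_POWER11)

/* Hypothetical state: counts how often the expensive query runs. */
static int hcall_count;
static uint64_t cached_caps;
static bool caps_valid;

/* Hypothetical stand-in for the hypervisor call. */
static uint64_t fake_get_capabilities_hcall(void)
{
	hcall_count++;
	/* Power9 and Power10 modes, plus an unrelated low bit. */
	return KVM_PPC_COMPAT_CAP_POWER9 | KVM_PPC_COMPAT_CAP_POWER10 | 0x1ULL;
}

/* Models module init: query once and cache the result. */
static void nested_init(void)
{
	cached_caps = fake_get_capabilities_hcall();
	caps_valid = true;
}

/* Models the ioctl backend: mask the cached value, no new query. */
static int get_compat_caps(uint64_t *out)
{
	assert(out);

	if (!caps_valid)
		return -EINVAL;

	*out = cached_caps & KVM_PPC_COMPAT_BITMASK;
	return 0;
}
```

However many times get_compat_caps() is called in this model, the query runs only once, and bits outside the mask never reach the caller.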
Thanks,
Amit
>
> > + host_caps->compat_capabilities = capabilities &
> > + KVM_PPC_COMPAT_BITMASK;
> > + }
> > +
> > + return rc;
> > +}
> > +
> > static struct kvmppc_ops kvm_ops_hv = {
> > .get_sregs = kvm_arch_vcpu_ioctl_get_sregs_hv,
> > .set_sregs = kvm_arch_vcpu_ioctl_set_sregs_hv,
> > @@ -6565,6 +6581,7 @@ static struct kvmppc_ops kvm_ops_hv = {
> > .hash_v3_possible = kvmppc_hash_v3_possible,
> > .create_vcpu_debugfs = kvmppc_arch_create_vcpu_debugfs_hv,
> > .create_vm_debugfs = kvmppc_arch_create_vm_debugfs_hv,
> > + .get_compat_caps = kvmppc_get_compat_caps,
> > };
> >
> > static int kvm_init_subcore_bitmap(void)
> > --
> > 2.50.1 (Apple Git-155)
> >
> >
>
> --
> Cheers
> ~ Vaibhav
^ permalink raw reply
* Re: [PATCH] docs/mm: clarify that we are not looking for LLM generated content
From: Jonathan Corbet @ 2026-06-23 12:59 UTC (permalink / raw)
To: David Hildenbrand (Arm), linux-doc
Cc: Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
Matthew Wilcox, Harry Yoo, linux-mm, linux-kernel
In-Reply-To: <d421c081-8686-4d46-8452-e543401b0503@kernel.org>
"David Hildenbrand (Arm)" <david@kernel.org> writes:
> On 4/20/26 23:03, David Hildenbrand (Arm) wrote:
>> Let's make it clear that we are not looking for LLM generated content
>> from contributors not familiar with the details of MM, as it shifts the
>> real work onto reviewers.
>>
>> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
>> ---
>> Documentation/mm/index.rst | 13 +++++++++++++
>> 1 file changed, 13 insertions(+)
>>
>> diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
>> index 7aa2a8886908..13a79f5d092c 100644
>> --- a/Documentation/mm/index.rst
>> +++ b/Documentation/mm/index.rst
>> @@ -7,6 +7,19 @@ of Linux. If you are looking for advice on simply allocating memory,
>> see the :ref:`memory_allocation`. For controlling and tuning guides,
>> see the :doc:`admin guide <../admin-guide/mm/index>`.
>>
>> +.. note::
>> +
>> + Unfortunately, parts of this guide are still incomplete or missing.
>> + While we appreciate contributions, documentation in this area is hard
>> + to get right and requires a lot of attention to detail. New contributors
>> + should reach out to the relevant maintainers early.
>> +
>> + This guide is expected to reflect reality, which requires contributors
>> + to have a detailed understanding. Documentation generated with LLMs
>> + by contributors unfamiliar with these details shifts the real work onto
>> + reviewers, which is why such contributions will be rejected without
>> + further comment.
>> +
>> .. toctree::
>> :maxdepth: 1
>>
>>
>> ---
>> base-commit: da6b5aae84beb0917ecb0c9fbc71169d145397ff
>> change-id: 20260420-llmdoc-21bf5fadbd6f
>>
>> Best regards,
>
> I assume this was not picked up yet? (via documentation or mm tree?)
I had figured Andrew would grab it; I can certainly do so if you'd like.
jon
^ permalink raw reply
* Re: [RFC PATCH v2 03/10] kvm: Prepare core VM structs and helpers for LUO support
From: tarunsahu @ 2026-06-23 12:48 UTC (permalink / raw)
To: Ackerley Tng, Jonathan Corbet, vannapurve, fvdl, Pasha Tatashin,
Shuah Khan, sagis, aneesh.kumar, skhawaja, vipinsh,
Pratyush Yadav, david, dmatlack, mark.rutland, Paolo Bonzini,
Mike Rapoport, Alexander Graf, seanjc, axelrasmussen
Cc: linux-kselftest, kexec, linux-kernel, linux-doc, kvm, linux-mm
In-Reply-To: <CAEvNRgGharGxs9s_ow0Z4iiQ9PCzdghch-4Fk6UMjiPP9tX-5g@mail.gmail.com>
Hi,
Thanks for reviewing the patch.
Ackerley Tng <ackerleytng@google.com> writes:
> Tarun Sahu <tarunsahu@google.com> writes:
>
>> Introduce core infrastructure to support VM preservation with LUO.
>>
>> First two changes are just refactoring, no functional change, third
>> change introduces a new member in struct kvm.
>> - Move ITOA_MAX_LEN to kvm_mm.h for reuse by upcoming kvm_luo code.
>> - Add a public kvm_create_vm_file() helper wrapping kvm_create_vm()
>> and anon_inode_getfile() to provide a unified VM file creation API.
>> - Track a weak reference to the backing file in struct kvm under
>> CONFIG_LIVEUPDATE_GUEST_MEMFD to enable reverse file resolution
>> without circular lifetime dependencies.
>>
>
> Given the above, I think this should be separate patches.
Okay.
>
>> Signed-off-by: Tarun Sahu <tarunsahu@google.com>
>> ---
>> include/linux/kvm_host.h | 14 +++++++
>> virt/kvm/kvm_main.c | 79 +++++++++++++++++++++++++++++-----------
>> virt/kvm/kvm_mm.h | 3 ++
>> 3 files changed, 75 insertions(+), 21 deletions(-)
>>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 4c14aee1fb06..9111a28637af 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -874,6 +874,18 @@ struct kvm {
>> #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
>> /* Protected by slots_lock (for writes) and RCU (for reads) */
>> struct xarray mem_attr_array;
>> +#endif
>> +#ifdef CONFIG_LIVEUPDATE_GUEST_MEMFD
>> + /*
>> + * Weak reference to the VFS file backing this KVM instance. Stored
>> + * without incrementing the file refcount to prevent a circular lifetime
>> + * dependency (since file->private_data already pins this struct kvm).
>> + * Used exclusively to resolve the file pointer back from struct kvm.
>> + *
>> + * Written/cleared via rcu_assign_pointer() and read locklessly under
>> + * RCU (e.g. via get_file_active() to prevent ABA races).
>> + */
>> + struct file *vm_file;
>> #endif
>
> We didn't really talk about this during the calls, but it seems weird to
> preserve a vm_file with pretty much nothing other than the vm type. The
> entire VM is re-created, which means it could potentially be a
> completely different VM?
>
> In some sense it's more flexible since the guest_memfd can be restored
> with some completely different VM, but it seems like it could introduce
> other issues.
>
> I think other KVM folks would probably have more thoughts here.
IIUC, you are asking "Why preserve vm_fd with guest_memfd when we only
preserve vm_type?"
We discussed this before. It is also explained here (and summarized below):
[RFC PATCH v2 04/10] kvm: kvm_luo: Allow kvm preservation with LUO
https://lore.kernel.org/all/8730c0e11acbd0d645a8b7187cd5cd7de373380e.1780676742.git.tarunsahu@google.com/
and
https://lore.kernel.org/all/cover.1780667929.git.tarunsahu@google.com/
(This cover letter was sent separately from the patches due to a problem
in my automated script.)
vm_fd is needed for guest_memfd retrieval, because guest_memfd cannot
be retrieved without struct kvm and there is no other way to pass
that. (We talked about alternatives such as a LINK IOCTL, or splitting
the CREATE_GUEST_MEMFD IOCTL into two IOCTLs: one that just creates the
GUEST_MEMFD and another that attaches it to the vm_file (struct kvm).)
We discarded those alternatives because they change the guest_memfd
design.
This patch also sets up the infrastructure to preserve the vm_fd, which
will be extended in the future when we introduce private support,
where TDX-related data (sPTE) might be preserved via struct kvm. vCPU
state, the IRQ routing table, etc. can also be preserved if needed.
>> + struct file *vm_file;
If you are asking about the diff above (why vm_file is there):
there is otherwise no way to get the vm_file from struct kvm, which is
needed in guest_memfd preservation during the freeze call to preserve the
token of vm_fd. The token is used at retrieval time.
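For readers following along, the weak back-pointer idea can be sketched in userspace C roughly as below. This is only an illustration under assumptions: struct file_obj, struct vm, file_get_active(), vm_get_file() and file_put() are hypothetical names, C11 atomics stand in for rcu_assign_pointer() and the RCU read side, and the sketch does not model the RCU grace period that keeps the memory valid for lockless readers in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct vm;

/* Hypothetical refcounted file object that owns the VM. */
struct file_obj {
	atomic_long refcount;
	struct vm *vm;
};

/* Hypothetical VM object holding a weak (uncounted) back-pointer. */
struct vm {
	_Atomic(struct file_obj *) vm_file;
};

/* Take a reference only if the file is still live (count > 0). */
static bool file_get_active(struct file_obj *f)
{
	long old = atomic_load(&f->refcount);

	while (old > 0) {
		/* On failure, old is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Resolve the weak pointer into a counted reference, or NULL. */
static struct file_obj *vm_get_file(struct vm *vm)
{
	struct file_obj *f = atomic_load(&vm->vm_file);

	assert(vm);

	if (f && file_get_active(f))
		return f;
	return NULL;
}

/* Drop a reference; the final put clears the weak pointer. */
static void file_put(struct file_obj *f)
{
	if (atomic_fetch_sub(&f->refcount, 1) == 1)
		atomic_store(&f->vm->vm_file, (struct file_obj *)NULL);
}
```

The point of the pattern is that the VM never pins the file, so there is no reference cycle, yet a reader can still turn the back-pointer into a real reference as long as the file has not started to go away.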
I have sent V3 as well here:
https://lore.kernel.org/all/20260622184851.2309827-1-tarunsahu@google.com/
V3 includes a few minor fixes suggested by sashiko.
We can continue reviewing on V2/V3. I will include all of the
suggestions in V4.
>
>> char stats_id[KVM_STATS_NAME_SIZE];
>> };
>> @@ -1074,7 +1086,9 @@ void kvm_get_kvm(struct kvm *kvm);
>> bool kvm_get_kvm_safe(struct kvm *kvm);
>> void kvm_put_kvm(struct kvm *kvm);
>> bool file_is_kvm(struct file *file);
>> +struct file *kvm_create_vm_file(unsigned long type, const char *fdname);
>> void kvm_put_kvm_no_destroy(struct kvm *kvm);
>> +void kvm_uevent_notify_vm_create(struct kvm *kvm);
>>
>> static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
>> {
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 89489996fbc1..65f0c5fb353e 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -67,9 +67,6 @@
>> #include <linux/kvm_dirty_ring.h>
>>
>>
>> -/* Worst case buffer size needed for holding an integer. */
>> -#define ITOA_MAX_LEN 12
>> -
>> MODULE_AUTHOR("Qumranet");
>> MODULE_DESCRIPTION("Kernel-based Virtual Machine (KVM) Hypervisor");
>> MODULE_LICENSE("GPL");
>> @@ -1349,6 +1346,19 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
>> {
>> struct kvm *kvm = filp->private_data;
>>
>> +#ifdef CONFIG_LIVEUPDATE_GUEST_MEMFD
>> + /*
>> + * Clear the weak reference of the vm file.
>> + * In case vm file is closed by userspace, but kvm still has
>> + * other users like vCPUs, clearing this pointer ensures
>> + * that we don't have a dangling pointer to a closed file.
>> + *
>> + * Cleared via rcu_assign_pointer() to ensure proper memory visibility
>> + * for concurrent lockless readers under RCU.
>> + */
>> + rcu_assign_pointer(kvm->vm_file, NULL);
>> +#endif
>> +
>> kvm_irqfd_release(kvm);
>>
>> kvm_put_kvm(kvm);
>> @@ -5476,11 +5486,47 @@ bool file_is_kvm(struct file *file)
>> }
>> EXPORT_SYMBOL_FOR_KVM_INTERNAL(file_is_kvm);
>>
>> +struct file *kvm_create_vm_file(unsigned long type, const char *fdname)
>> +{
>> + struct kvm *kvm = kvm_create_vm(type, fdname);
>> + struct file *file;
>> +
>> + if (IS_ERR(kvm))
>> + return ERR_CAST(kvm);
>> +
>> + file = anon_inode_getfile("kvm-vm", &kvm_vm_fops, kvm, O_RDWR);
>> + if (IS_ERR(file)) {
>> + kvm_put_kvm(kvm);
>> + return file;
>> + }
>> +
>> +#ifdef CONFIG_LIVEUPDATE_GUEST_MEMFD
>> + /*
>> + * Weak reference to the file (without get_file()) to prevent a circular
>> + * dependency. Safe because the file's release path clears this pointer
>> + * and drops its reference to the VM.
>> + *
>> + * Written via rcu_assign_pointer() because the pointer can be read
>> + * locklessly under RCU (e.g., in kvm_gmem_luo_preserve() via
>> + * get_file_active() to prevent lockless ABA races).
>> + */
>> + rcu_assign_pointer(kvm->vm_file, file);
>> +#endif
>> +
>> + /*
>> + * Don't call kvm_put_kvm anymore at this point; file->f_op is
>> + * already set, with ->release() being kvm_vm_release(). In error
>> + * cases it will be called by the final fput(file) and will take
>> + * care of doing kvm_put_kvm(kvm).
>> + */
>> +
>> + return file;
>> +}
>> +
>> static int kvm_dev_ioctl_create_vm(unsigned long type)
>> {
>> char fdname[ITOA_MAX_LEN + 1];
>> int r, fd;
>> - struct kvm *kvm;
>> struct file *file;
>>
>> fd = get_unused_fd_flags(O_CLOEXEC);
>> @@ -5489,31 +5535,17 @@ static int kvm_dev_ioctl_create_vm(unsigned long type)
>>
>> snprintf(fdname, sizeof(fdname), "%d", fd);
>>
>> - kvm = kvm_create_vm(type, fdname);
>> - if (IS_ERR(kvm)) {
>> - r = PTR_ERR(kvm);
>> - goto put_fd;
>> - }
>> -
>> - file = anon_inode_getfile("kvm-vm", &kvm_vm_fops, kvm, O_RDWR);
>> + file = kvm_create_vm_file(type, fdname);
>> if (IS_ERR(file)) {
>> r = PTR_ERR(file);
>> - goto put_kvm;
>> + goto put_fd;
>> }
>>
>> - /*
>> - * Don't call kvm_put_kvm anymore at this point; file->f_op is
>> - * already set, with ->release() being kvm_vm_release(). In error
>> - * cases it will be called by the final fput(file) and will take
>> - * care of doing kvm_put_kvm(kvm).
>> - */
>> - kvm_uevent_notify_change(KVM_EVENT_CREATE_VM, kvm);
>> + kvm_uevent_notify_change(KVM_EVENT_CREATE_VM, file->private_data);
>
> Notifying with file->private_data threw me off... I would rather inline
> the rcu_assign_pointer() in this function and have this line read
> notify(..., kvm) like before.
>
>>
>> fd_install(fd, file);
>> return fd;
>>
>> -put_kvm:
>> - kvm_put_kvm(kvm);
>> put_fd:
>> put_unused_fd(fd);
>> return r;
>> @@ -6341,6 +6373,11 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm)
>> kfree(env);
>> }
>>
>> +void kvm_uevent_notify_vm_create(struct kvm *kvm)
>> +{
>> + kvm_uevent_notify_change(KVM_EVENT_CREATE_VM, kvm);
>> +}
>> +
>> static void kvm_init_debug(void)
>> {
>> const struct file_operations *fops;
>> diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
>> index 9fcc5d5b7f8d..7aa1d65c3d46 100644
>> --- a/virt/kvm/kvm_mm.h
>> +++ b/virt/kvm/kvm_mm.h
>> @@ -3,6 +3,9 @@
>> #ifndef __KVM_MM_H__
>> #define __KVM_MM_H__ 1
>>
>> +/* Worst case buffer size needed for holding an integer as a string. */
>> +#define ITOA_MAX_LEN 12
>> +
>> /*
>> * Architectures can choose whether to use an rwlock or spinlock
>> * for the mmu_lock. These macros, for use in common code
>> --
>> 2.54.0.1032.g2f8565e1d1-goog
^ permalink raw reply
* Re: [PATCH] crypto: af_alg - Document the deprecation of AF_ALG
From: Bastien Nocera @ 2026-06-23 12:44 UTC (permalink / raw)
To: Eric Biggers, linux-crypto, Herbert Xu, Marcel Holtmann,
Luiz Augusto von Dentz
Cc: linux-doc, linux-api, linux-kernel, netdev, Linus Torvalds,
linux-bluetooth, ell
In-Reply-To: <20260430011544.31823-1-ebiggers@kernel.org>
Hey,
Replying to this older patch.
On Wed, 2026-04-29 at 18:15 -0700, Eric Biggers wrote:
<snip>
> This isn't intended to change anything overnight. After all, most Linux
> distros won't be able to disable the kconfig options quite yet, mainly
> because of iwd. But this should create a bit more impetus for these
> userspace programs to be fixed, and the documentation update should also
> help prevent more users from appearing.
There are 2 other users that I know of: bluez, and the ell library
(used by iwd and bluez).
From what I could tell, bluetoothd uses AF_ALG for cryptography:
https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/src/shared/crypto.c
https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/tools/mesh-gatt/crypto.c
It uses "ecb(aes)" and "cmac(aes)" as algorithms.
Finally, it also uses them both again:
https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/mesh/crypto.c
through ell:
https://git.kernel.org/pub/scm/libs/ell/ell.git/tree/ell/cipher.c
Because that's a question that also came up, bluetoothd also uses the
CAP_NET_ADMIN capability.
I'll let Luiz and Marcel take it over from here.
Cheers
^ permalink raw reply
* [PATCH][v2] mm/dmapool: Untangle CONFIG_SLUB_DEBUG_ON abuse and switch to static key
From: lirongqing @ 2026-06-23 12:12 UTC (permalink / raw)
To: Jonathan Corbet, Shuah Khan, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-doc, linux-kernel,
linux-mm
Cc: Li RongQing
From: Li RongQing <lirongqing@baidu.com>
The dmapool subsystem historically wrapped its debugging logic inside an
This approach is fundamentally flawed because CONFIG_SLUB_DEBUG_ON
merely defines compile-time defaults for SLUB and caused two flaws:
On production kernels where CONFIG_SLUB_DEBUG=y but
CONFIG_SLUB_DEBUG_ON=n, dmapool debugging was completely compiled out
at compile time, leaving no way to enable it without rebuilding the
kernel.
On kernels with CONFIG_SLUB_DEBUG_ON=y, dmapool debugging stayed
unconditionally active even if a user explicitly disabled slub debugging
at boot time.
Clean up this mess by removing the #ifdef and switching to a runtime
static key (dmapool_debug_enabled), allowing dmapool debugging to be
toggled cleanly via its own boot parameter: dmapool_debug
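As a rough illustration of the resulting free-path behaviour (a userspace sketch, not the kernel implementation): a plain bool stands in for the static key, cmdline_has_flag() is a hypothetical stand-in for the __setup() handler, init_on_free models want_init_on_free(), and the poison value is illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define POISON_FREED 0xa7	/* illustrative poison byte */

static bool debug_enabled;	/* stands in for the static key */
static bool init_on_free;	/* stands in for want_init_on_free() */

/* Return true if the space-separated command line contains flag. */
static bool cmdline_has_flag(const char *cmdline, const char *flag)
{
	size_t flen = strlen(flag);
	const char *p = cmdline;

	while (*p) {
		size_t len;

		while (*p == ' ')
			p++;
		len = strcspn(p, " ");
		if (len == flen && memcmp(p, flag, flen) == 0)
			return true;
		p += len;
	}
	return false;
}

/*
 * Free path: poison when debugging is on, otherwise zero the block
 * only if init-on-free is requested. The real debug path also
 * validates the block; that part is omitted here.
 */
static void free_block(void *block, size_t size)
{
	assert(block);

	if (debug_enabled)
		memset(block, POISON_FREED, size);
	else if (init_on_free)
		memset(block, 0, size);
}
```

The decision is made at run time from the boot parameter alone, independent of how SLUB debugging was configured, which is the decoupling the patch is after.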
Suggested-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
---
Diff with v1: Move the static key check out of pool_init_page etc
Documentation/admin-guide/kernel-parameters.txt | 5 +++
mm/dmapool.c | 57 ++++++++++++++-----------
2 files changed, 38 insertions(+), 24 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 19c9a19..66d853c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1304,6 +1304,11 @@ Kernel parameters
dis_ucode_ldr [X86] Disable the microcode loader.
+ dmapool_debug [MM]
+ Enable DMA pool debugging. This enables memory
+ poisoning and validation for DMA pool allocations.
+ Useful for debugging DMA API misuse.
+
dma_debug=off If the kernel is compiled with DMA_API_DEBUG support,
this option disables the debugging code at boot.
diff --git a/mm/dmapool.c b/mm/dmapool.c
index 5d8af6e..7bd037a 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -35,10 +35,23 @@
#include <linux/string.h>
#include <linux/types.h>
#include <linux/wait.h>
+#include <linux/static_key.h>
+#include <linux/init.h>
-#ifdef CONFIG_SLUB_DEBUG_ON
-#define DMAPOOL_DEBUG 1
-#endif
+/*
+ * Debugging support for dmapool using static key.
+ *
+ * This allows enabling dmapool debug at boot time via:
+ * dmapool_debug
+ */
+static DEFINE_STATIC_KEY_FALSE(dmapool_debug_enabled);
+
+static int __init dmapool_debug_setup(char *str)
+{
+ static_branch_enable(&dmapool_debug_enabled);
+ return 1;
+}
+__setup("dmapool_debug", dmapool_debug_setup);
struct dma_block {
struct dma_block *next_block;
@@ -92,7 +105,6 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha
static DEVICE_ATTR_RO(pools);
-#ifdef DMAPOOL_DEBUG
static void pool_check_block(struct dma_pool *pool, struct dma_block *block,
gfp_t mem_flags)
{
@@ -161,23 +173,6 @@ static void pool_init_page(struct dma_pool *pool, struct dma_page *page)
{
memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
}
-#else
-static void pool_check_block(struct dma_pool *pool, struct dma_block *block,
- gfp_t mem_flags)
-{
-}
-
-static bool pool_block_err(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
-{
- if (want_init_on_free())
- memset(vaddr, 0, pool->size);
- return false;
-}
-
-static void pool_init_page(struct dma_pool *pool, struct dma_page *page)
-{
-}
-#endif
static struct dma_block *pool_block_pop(struct dma_pool *pool)
{
@@ -305,7 +300,9 @@ static void pool_initialise_page(struct dma_pool *pool, struct dma_page *page)
unsigned int next_boundary = pool->boundary, offset = 0;
struct dma_block *block, *first = NULL, *last = NULL;
- pool_init_page(pool, page);
+ if (static_branch_unlikely(&dmapool_debug_enabled))
+ pool_init_page(pool, page);
+
while (offset + pool->size <= pool->allocation) {
if (offset + pool->size > next_boundary) {
offset = next_boundary;
@@ -433,7 +430,10 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
spin_unlock_irqrestore(&pool->lock, flags);
*handle = block->dma;
- pool_check_block(pool, block, mem_flags);
+
+ if (static_branch_unlikely(&dmapool_debug_enabled))
+ pool_check_block(pool, block, mem_flags);
+
if (want_init_on_alloc(mem_flags))
memset(block, 0, pool->size);
@@ -454,9 +454,18 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
{
struct dma_block *block = vaddr;
unsigned long flags;
+ bool err = false;
spin_lock_irqsave(&pool->lock, flags);
- if (!pool_block_err(pool, vaddr, dma)) {
+
+ if (static_branch_unlikely(&dmapool_debug_enabled))
+ err = pool_block_err(pool, vaddr, dma);
+ else {
+ if (want_init_on_free())
+ memset(vaddr, 0, pool->size);
+ }
+
+ if (!err) {
pool_block_push(pool, block, dma);
pool->nr_active--;
}
--
2.9.4
^ permalink raw reply related
* Re: Issue cloning kernel-doc-zh from HUST mirror
From: Weijie Yuan @ 2026-06-23 12:01 UTC (permalink / raw)
To: Dongliang Mu; +Cc: Siwei Chen, linux-doc, si.yanteng
In-Reply-To: <b03f244b-46b8-47e8-b7f5-d98d714ae15c@hust.edu.cn>
On Tue, Jun 23, 2026 at 04:51:20PM +0800, Dongliang Mu wrote:
> The curl 52 Empty reply from server error is not a Git or Ubuntu
> compatibility issue. It happens because the kernel-doc-zh repository is
> extremely large, and the HUST mirror server closes the HTTPS connection
> early due to timeout or proxy limits.
>
> You can try the following commands:
>
> 1. Shallow clone first (most reliable)
>
> git clone --depth 1
> https://mirrors.hust.edu.cn/git/kernel-doc-zh.git linux
>
> Then fetch full history:
>
> git fetch --unshallow
>
> If still failing, increase Git buffer like:
>
> git config --global http.postBuffer 1073741824
>
> Finally, I will contact maintainers of HUST mirror site and try
> some attempts to resolve this issue.
Thanks, and yes, the shallow clone works:
user@debian:~$ git clone --depth 1 https://mirrors.hust.edu.cn/git/kernel-doc-zh.git linux
Cloning into 'linux'...
remote: Enumerating objects: 93130, done.
remote: Counting objects: 100% (93130/93130), done.
remote: Compressing objects: 100% (90511/90511), done.
remote: Total 93130 (delta 7145), reused 20322 (delta 1615), pack-reused 0
Receiving objects: 100% (93130/93130), 259.15 MiB | 4.71 MiB/s, done.
Resolving deltas: 100% (7145/7145), done.
Updating files: 100% (87897/87897), done.
But:
user@debian:~$ cd linux
user@debian:~/linux$ git fetch --unshallow
(after ~10 minutes or so)
remote: Enumerating objects: 10638034, done.
remote: Counting objects: 100% (10638019/10638019), done.
remote: Compressing objects: 100% (1819658/1819658), done.
error: RPC failed; curl 56 GnuTLS recv error (-9): Error decoding the received TLS packet.
error: 5476 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
And on a Mac (26.5.1) with Git 2.54, probably the same failure:
$ git clone https://mirrors.hust.edu.cn/git/kernel-doc-zh.git linux
Cloning into 'linux'...
remote: Enumerating objects: 11406904, done.
error: RPC failed; curl 18 transfer closed with outstanding read data remaining
error: 7537 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
After setting http.postBuffer 1073741824 on Debian:
user@debian:~/linux$ git config --global http.postBuffer 1073741824
user@debian:~/linux$ git fetch --unshallow
(after ~10 minutes)
remote: Enumerating objects: 10638034, done.
remote: Counting objects: 100% (10638019/10638019), done.
remote: Compressing objects: 100% (1819658/1819658), done.
error: RPC failed; curl 56 GnuTLS recv error (-9): Error decoding the received TLS packet.
error: 5917 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
Apparently the above is of limited value, as it is straightforward to
reproduce.
> Please use the first git repo:
>
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/alexs/linux.git
The average speed of cloning from git.kernel.org is less than 10 KiB/s
from Chengdu ;-)
But of course, a shallow clone is enough for translation work ;-)
Thanks!
^ permalink raw reply
* Re: [PATCH v7 06/10] tracing/probes: Type casting always involves nested calls
From: Masami Hiramatsu @ 2026-06-23 11:50 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Steven Rostedt, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178217910709.643090.6379198143797177881.stgit@devnote2>
Sashiko commented:
> > + }
> > + *close = '\0';
> >
> > - ctx->offset += 1; /* for the '(' */
> > - /* We need to parse the nested one */
> > - ret = parse_probe_arg(tmp + 1, find_fetch_type(NULL, ctx->flags),
> > - pcode, end, ctx);
> > - if (ret < 0)
> > - return ret;
> > - ctx->nested_level--;
> > - clear_struct_btf(ctx);
> > + /* We need to parse the nested one */
> > + ret = parse_probe_arg(tmp, find_fetch_type(NULL, ctx->flags),
> > + pcode, end, ctx);
>
> Does routing the nested argument through parse_probe_arg() instead of
> parse_btf_arg() break the documented eprobe syntax (STRUCT)FIELD->MEMBER?
> Since this patch removes the TPARG_FL_TEVENT handling block from
> parse_btf_arg(), and handle_typecast() now recursively invokes
> parse_probe_arg() on the extracted FIELD name, a bare eprobe FIELD name
> that does not start with a '$' prefix will hit the default case in
> parse_probe_arg().
> This causes it to be unconditionally rejected with -EINVAL (NOSUP_BTFARG)
> because eprobe flags (TPARG_FL_TEVENT) do not satisfy
> tparg_is_function_entry() or tparg_is_function_return(). This acts as an
> ABI breakage for existing user-space scripts relying on this eprobe syntax.
Wait... Did the type-casting support patch accidentally allow access to
event fields without the "$"?
Hmm, if so, it should be documented, and it needs to be supported
correctly both with and without a typecast.
Thank you,
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v4 1/4] KVM: PPC: Introduce KVM_CAP_PPC_COMPAT_CAPS and wire up ioctl
From: Amit Machhiwal @ 2026-06-23 11:11 UTC (permalink / raw)
To: Vaibhav Jain
Cc: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan,
Anushree Mathur, Paolo Bonzini, Nicholas Piggin, Michael Ellerman,
Christophe Leroy (CS GROUP), Jonathan Corbet, Shuah Khan, kvm,
linux-kernel, linux-doc, lkp
In-Reply-To: <871pe3cazk.fsf@vajain21.in.ibm.com>
Hi Vaibhav,
Thanks for the detailed review. My responses are inline below.
On 2026/06/19 11:44 AM, Vaibhav Jain wrote:
> Hi Amit.
>
> Thanks for the patch and incorporating V3 review comments. Further
> review comments inline below:
>
> Amit Machhiwal <amachhiw@linux.ibm.com> writes:
>
> > Introduce a new capability and ioctl to expose CPU compatibility modes
> > supported by the host processor for nested guests.
> >
> > On IBM POWER systems, newer processor generations (N) can operate in
> > compatibility modes corresponding to earlier generations, like (N-1) and
> > (N-2). This is particularly relevant for nested virtualization, where
> > nested KVM guests may need to run with a specific processor compatibility
> > level.
> >
> > Introduce KVM_CAP_PPC_COMPAT_CAPS capability and the corresponding
> > KVM_PPC_GET_COMPAT_CAPS vm ioctl. The ioctl returns a bitmap describing
> > the compatibility modes supported by the host in respective bit numbers,
> > allowing userspace (e.g., QEMU) to select an appropriate compatibility
> > level when configuring nested KVM guests.
> >
> > The ioctl handling is added in kvm_arch_vm_ioctl() and retrieves host
> > CPU compatibility capabilities via a PowerPC-specific backend
> > implementation when available. The implementation validates the structure
> > size from userspace to ensure forward compatibility and returns
> > appropriate error codes (EINVAL for invalid size, EFAULT for copy
> > failures, ENOTTY if backend is not implemented). The struct
> > kvm_ppc_compat_caps includes a size field to support future ABI
> > extensions.
> >
> > Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> > Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> > ---
> > arch/powerpc/include/asm/kvm_ppc.h | 1 +
> > arch/powerpc/include/uapi/asm/kvm.h | 7 ++++++
> > arch/powerpc/kvm/powerpc.c | 35 +++++++++++++++++++++++++++++
> > include/uapi/linux/kvm.h | 4 ++++
> > 4 files changed, 47 insertions(+)
> >
> > diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> > index 0953f2daa466..169ea6a7fbad 100644
> > --- a/arch/powerpc/include/asm/kvm_ppc.h
> > +++ b/arch/powerpc/include/asm/kvm_ppc.h
> > @@ -319,6 +319,7 @@ struct kvmppc_ops {
> > bool (*hash_v3_possible)(void);
> > int (*create_vm_debugfs)(struct kvm *kvm);
> > int (*create_vcpu_debugfs)(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry);
> > + int (*get_compat_caps)(struct kvm_ppc_compat_caps *host_caps);
> > };
> >
> > extern struct kvmppc_ops *kvmppc_hv_ops;
> > diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> > index 077c5437f521..8a38be6c3b03 100644
> > --- a/arch/powerpc/include/uapi/asm/kvm.h
> > +++ b/arch/powerpc/include/uapi/asm/kvm.h
> > @@ -437,6 +437,13 @@ struct kvm_ppc_cpu_char {
> > __u64 behaviour_mask; /* valid bits in behaviour */
> > };
> >
> > +/* For KVM_PPC_GET_COMPAT_CAPS */
> > +struct kvm_ppc_compat_caps {
> > + __u64 flags; /* Reserved for future use */
> > + __u64 size; /* Size of this structure */
> Suggest moving 'size' to be the first member of the struct. That way
> copying the struct from userspace becomes a bit easier.
Yeah, I think it would make more sense and will simplify the
copy_from_user() call. I will make the change in v5. I will change to:
struct kvm_ppc_compat_caps {
	__u64 size;
	__u64 flags;
	__u64 compat_capabilities;
};
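
To illustrate why the size-first layout simplifies the copy, here is a
minimal userspace sketch, not the kernel code. The names compat_caps_model
and peek_declared_size() are hypothetical, uint64_t stands in for __u64, and
memcpy() stands in for copy_from_user(). With size at offset 0, the declared
size can be read from the start of the buffer without computing a member
address:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace mirror of the proposed v5 layout. */
struct compat_caps_model {
	uint64_t size;                /* first member, so it sits at offset 0 */
	uint64_t flags;
	uint64_t compat_capabilities;
};

/*
 * Hypothetical helper: read the caller-declared size from the start of a
 * raw buffer. memcpy() stands in for copy_from_user() in this model.
 */
static uint64_t peek_declared_size(const void *buf)
{
	uint64_t size;

	assert(buf != NULL);
	memcpy(&size, buf, sizeof(size));
	return size;
}
```

Calling peek_declared_size() on a buffer holding such a struct returns its
size field directly, which is the simplification being discussed.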
>
> > + __u64 compat_capabilities; /* Capabilities supported by the host */
> > +};
> > +
> > /*
> > * Values for character and character_mask.
> > * These are identical to the values used by H_GET_CPU_CHARACTERISTICS.
> > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> > index 98de68379b18..9153b0034b45 100644
> > --- a/arch/powerpc/kvm/powerpc.c
> > +++ b/arch/powerpc/kvm/powerpc.c
> > @@ -701,6 +701,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > }
> > }
> > break;
> > +#if defined(CONFIG_KVM_BOOK3S_HV_POSSIBLE)
> > + case KVM_CAP_PPC_COMPAT_CAPS:
> > + r = 0;
> > + if (kvmhv_on_pseries())
> > + r = 1;
> > + break;
> > +#endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
> > default:
> > r = 0;
> > break;
> > @@ -2467,6 +2474,34 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> > r = kvm->arch.kvm_ops->svm_off(kvm);
> > break;
> > }
> > + case KVM_PPC_GET_COMPAT_CAPS: {
> > + struct kvm_ppc_compat_caps host_caps;
> > + u64 user_size;
> > +
> > + r = -EFAULT;
> > + /* First, get the size field from userspace to validate */
> > + if (copy_from_user(&user_size, &((struct kvm_ppc_compat_caps
> > + __user *)argp)->size, sizeof(user_size))) {
> Move the struct size member to the first field. That way the
> copy_from_user() call is simplified and you won't have to do some weird
> pointer arithmetic.
Will do as mentioned above.
>
>
> > + goto out;
> > + }
> > +
> > + /* Validate size - must be at least the current structure size */
> > + r = -EINVAL;
> > + if (user_size < sizeof(host_caps))
> > + goto out;
> Check should be strengthened to
> if (user_size != sizeof(host_caps))
> So that in case userspace sends a struct larger than what the kernel knows
> about, it will be rejected. This will prevent surprises in future in case
> a VMM sends a larger struct expecting the kernel to know about it, but an
> older kernel only knows about the older, smaller struct. Also look at the
> review comment below.
Agreed. I'll change the validation to use strict equality. This is
simpler and clearer - userspace must provide exactly the size the kernel
expects.
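
As a rough sketch of the difference between the two checks (a userspace
model only; compat_caps_model, check_size_at_least() and check_size_exact()
are hypothetical names, and the struct mirrors the three __u64 fields
discussed above), the v4 check accepts any size that is large enough, while
the strict check accepts exactly one size:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the struct: three 64-bit fields. */
struct compat_caps_model {
	uint64_t size;
	uint64_t flags;
	uint64_t compat_capabilities;
};

/* Models the v4 check: reject only sizes smaller than the known struct. */
static int check_size_at_least(uint64_t user_size)
{
	return user_size < sizeof(struct compat_caps_model) ? -EINVAL : 0;
}

/* Models the strict check planned for v5: reject any size mismatch. */
static int check_size_exact(uint64_t user_size)
{
	return user_size != sizeof(struct compat_caps_model) ? -EINVAL : 0;
}
```

In this model a caller declaring a larger struct passes the first check but
gets -EINVAL from the second, so it cannot be left with trailing fields the
kernel never filled in.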
>
> > +
> > + r = -ENOTTY;
> > + memset(&host_caps, 0, sizeof(host_caps));
> > + if (!kvm->arch.kvm_ops->get_compat_caps)
> > + goto out;
> > +
> > + r = kvm->arch.kvm_ops->get_compat_caps(&host_caps);
> > + /* Set the actual size of the structure we're returning */
> > + host_caps.size = sizeof(host_caps);
> > + if (!r && copy_to_user(argp, &host_caps, sizeof(host_caps)))
> > + r = -EFAULT;
> You are allowing a future userspace VMM to potentially send a larger
> 'struct kvm_ppc_compat_caps' than what the kernel knows about. This makes
> error handling in userspace a bit involved, since some fields in the
> 'struct kvm_ppc_compat_caps' given from userspace may remain
> uninitialized when userspace sees it. So please mention this subtle
> behaviour in the patch description and also update the doc in the
> later patch.
With the strict equality check (user_size != sizeof(host_caps)), this
concern should be addressed - we won't accept larger structs from
userspace. However, I'll still improve the documentation to:
1. In the commit message:
- Explain the size field validation
- Document that exact size match is required
- Clarify error handling behavior
2. In Documentation/virt/kvm/api.rst:
- Add improved documentation for KVM_PPC_GET_COMPAT_CAPS
- Document the size field requirement and validation
Thanks,
Amit
>
> > + break;
> > + }
> > default: {
> > struct kvm *kvm = filp->private_data;
> > r = kvm->arch.kvm_ops->arch_vm_ioctl(filp, ioctl, arg);
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 6c8afa2047bf..1788a0068662 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -996,6 +996,7 @@ struct kvm_enable_cap {
> > #define KVM_CAP_S390_USER_OPEREXEC 246
> > #define KVM_CAP_S390_KEYOP 247
> > #define KVM_CAP_S390_VSIE_ESAMODE 248
> > +#define KVM_CAP_PPC_COMPAT_CAPS 249
> >
> > struct kvm_irq_routing_irqchip {
> > __u32 irqchip;
> > @@ -1349,6 +1350,9 @@ struct kvm_s390_keyop {
> > #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct kvm_device_attr)
> > #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct kvm_device_attr)
> >
> > +/* Available with KVM_CAP_PPC_COMPAT_CAPS */
> > +#define KVM_PPC_GET_COMPAT_CAPS _IOR(KVMIO, 0xe4, struct kvm_ppc_compat_caps)
> > +
> > /*
> > * ioctls for vcpu fds
> > */
> > --
> > 2.50.1 (Apple Git-155)
> >
> >
>
> --
> Cheers
> ~ Vaibhav
^ permalink raw reply
* Re: [PATCH 1/4] nfs: store the full NFS fileid in inode->i_ino
From: Jeff Layton @ 2026-06-23 11:04 UTC (permalink / raw)
To: Mark Brown
Cc: Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
linux-nfs, linux-kernel, linux-doc
In-Reply-To: <655d0d2a5f8203c52c78d37462328449e49b7feb.camel@kernel.org>
On Mon, 2026-06-22 at 18:38 -0400, Jeff Layton wrote:
> On Mon, 2026-06-22 at 22:05 +0100, Mark Brown wrote:
> > On Tue, May 12, 2026 at 12:12:42PM -0400, Jeff Layton wrote:
> > > Now that inode->i_ino is a 64-bit value, store the full NFS fileid in
> > > it directly instead of an XOR-folded hash. This makes NFS_FILEID() and
> > > set_nfs_fileid() operate on inode->i_ino rather than the separate
> > > nfsi->fileid field.
> >
> > > This patch is in -next now and is triggering a failure in the LTP
> > ioctl10.c test for me on arm:
> >
> > tst_buffers.c:57: TINFO: Test is using guarded buffers
> > tst_test.c:2047: TINFO: LTP version: 20260130
> > tst_test.c:2050: TINFO: Tested kernel: 7.1.0-next-20260622 #1 SMP @1782128788 armv7l
> >
> > ...
> >
> > ioctl10.c:111: TFAIL: q->inode (11493907226) != entry.vm_inode (4294967295)
> >
>
> Note that the vm_inode value is arm32's ULONG_MAX.
>
> > arm64 seems unaffected, I didn't really investigate but I'll note that
> > unsigned long is 32 bit on arm.
> >
> > Full log:
> >
> > https://lava.sirena.org.uk/scheduler/job/2904745#L3852
> >
> > bisect log with more test job links:
> >
>
>
> The testcase does this:
>
> static void parse_maps_file(const char *filename, const char *keyword, struct map_entry *entry)
> {
> FILE *fp = SAFE_FOPEN(filename, "r");
>
> char line[1024];
>
> while (fgets(line, sizeof(line), fp) != NULL) {
> if (fnmatch(keyword, line, 0) == 0) {
> if (sscanf(line, "%lx-%lx %s %lx %x:%x %lu %s",
> &entry->vm_start, &entry->vm_end, entry->vm_flags_str,
> &entry->vm_pgoff, &entry->vm_major, &entry->vm_minor,
> &entry->vm_inode, entry->vm_name) < 7)
> tst_brk(TFAIL, "parse maps file /proc/self/maps failed");
>
> entry->vm_flags = parse_vm_flags(entry->vm_flags_str);
>
> SAFE_FCLOSE(fp);
> return;
> }
> }
>
> SAFE_FCLOSE(fp);
> tst_brk(TFAIL, "parse maps file /proc/self/maps failed");
> }
>
> Note that it's trying to stuff the inode number field into an unsigned
> long. Before this patch, the maps file would have printed the old
> (hashed) inode number on 32-bit. Now, it prints the full 64-bit inode
> number.
>
> I asked The Big Pickle and it says:
>
> "In glibc (userspace): The C standard says this is undefined behavior.
> In practice, glibc's scanf internally uses strtoul/strtoull, which on
> overflow store ULONG_MAX/ULLONG_MAX and set errno = ERANGE. However,
> scanf itself does not propagate ERANGE to the caller — it still returns
> 1 (success). So you'd silently get ULONG_MAX stored."
>
> We could argue that this is a bug in the testcase. It assumes that the
> maps file will never print a value larger than ULONG_MAX in that field,
> and I don't see why it would make that assumption in this day and age.
>
> Are there actual programs in the field that scrape the maps file that
> might be affected by this change?
This testcase patch should fix it. I'll plan to send this to the LTP
list, but it would be nice if someone could confirm the fix on arm32:
-----------------------8<---------------------
[PATCH LTP] ioctl10: fix the sscanf() call to handle 64-bit inode on 32-bit arch
This test started failing recently on arm32, when we switched the
kernel to displaying the full 64-bit inode number in the maps file.
Change the testcase to allow for a full 64-bit inode number on all
arches. The value it's compared to is already 64 bits wide, so widening this
field is all that is necessary.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
testcases/kernel/syscalls/ioctl/ioctl10.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/testcases/kernel/syscalls/ioctl/ioctl10.c b/testcases/kernel/syscalls/ioctl/ioctl10.c
index b668c9e93889..d7e40f3c8643 100644
--- a/testcases/kernel/syscalls/ioctl/ioctl10.c
+++ b/testcases/kernel/syscalls/ioctl/ioctl10.c
@@ -35,7 +35,7 @@ struct map_entry {
unsigned long vm_pgoff;
unsigned int vm_major;
unsigned int vm_minor;
- unsigned long vm_inode;
+ uint64_t vm_inode;
char vm_name[256];
unsigned int vm_flags;
};
@@ -68,7 +68,7 @@ static void parse_maps_file(const char *filename, const char *keyword, struct ma
while (fgets(line, sizeof(line), fp) != NULL) {
if (fnmatch(keyword, line, 0) == 0) {
- if (sscanf(line, "%lx-%lx %s %lx %x:%x %lu %s",
+ if (sscanf(line, "%lx-%lx %s %lx %x:%x %llu %s",
&entry->vm_start, &entry->vm_end, entry->vm_flags_str,
&entry->vm_pgoff, &entry->vm_major, &entry->vm_minor,
&entry->vm_inode, entry->vm_name) < 7)
--
2.54.0
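
One caveat with the patch as posted: on LP64 targets glibc defines uint64_t
as unsigned long, so passing a uint64_t * to "%llu" is a type mismatch that
-Wformat warns about. Below is a minimal sketch of an alternative using the
SCNu64 macro from <inttypes.h>, which matches uint64_t on both 32-bit and
64-bit arches. parse_maps_inode() is a hypothetical helper for illustration
and is not part of ioctl10.c; declaring the field as unsigned long long
would be another way to keep "%llu".

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse the inode column of one /proc/<pid>/maps
 * line into a 64-bit value. Returns 0 on success, -1 on a parse failure.
 */
static int parse_maps_inode(const char *line, uint64_t *inode)
{
	unsigned long start, end, pgoff;
	unsigned int major, minor;
	char perms[8];

	assert(line != NULL && inode != NULL);

	/* SCNu64 expands to the conversion that matches uint64_t. */
	if (sscanf(line, "%lx-%lx %7s %lx %x:%x %" SCNu64,
		   &start, &end, perms, &pgoff, &major, &minor, inode) != 7)
		return -1;

	return 0;
}
```

With this conversion, an inode number such as the 11493907226 from the
failing test run is stored in full rather than clamped to the 32-bit
ULONG_MAX.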
^ permalink raw reply related
* Re: [PATCH v2 05/11] hugetlb: Convert the vmf->pgoff to PAGE_SIZE granularity
From: XIAO WU @ 2026-06-23 10:54 UTC (permalink / raw)
To: Jane Chu, akpm
Cc: willy, jack, viro, brauner, muchun.song, osalvador, david, hughd,
baolin.wang, linmiaohe, nao.horiguchi, lorenzo, rppt, peterx,
corbet, linux-doc, linux-mm, linux-kernel, linux-fsdevel
In-Reply-To: <20260617172534.1740152-6-jane.chu@oracle.com>
Hi Jane,
Thanks for this series — the conversion to PAGE-granularity indexing is a
nice cleanup.
I came across a Sashiko AI review of this patch series, which flagged
several issues, one of which I was able to confirm triggers a real kernel
crash:
https://sashiko.dev/#/patchset/20260617172534.1740152-1-jane.chu@oracle.com
> +++ b/mm/hugetlb.c
> @@ -5952,8 +5955,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> .address = address & huge_page_mask(h),
> .real_address = address,
> .flags = flags,
> - .pgoff = vma_hugecache_offset(h, vma,
> - address & huge_page_mask(h)),
> + .pgoff = linear_page_index(vma, address),
This change sets vmf.pgoff to linear_page_index(vma, address), but
`address` here is the raw unaligned fault address, not the huge-page-aligned
address. Previously, vma_hugecache_offset() used
`address & huge_page_mask(h)`, which produced a huge-page-aligned index.
When a page fault occurs at a non-huge-page-aligned address within a hugetlb
mapping (e.g., vm_start + 0x1000 for a 2MB page), the resulting pgoff is not
a multiple of pages_per_huge_page (512 for 2MB). This unaligned index
propagates through:
hugetlb_fault() → hugetlb_no_page() → hugetlb_add_to_page_cache()
→ __filemap_add_folio()
where this assertion fires (mm/filemap.c:862):
VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
With CONFIG_DEBUG_VM=y, this becomes a BUG() and panics the kernel.
I was able to reproduce this in a QEMU VM. The fix should be trivial:
pass the aligned address to linear_page_index().
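
The arithmetic can be sketched in userspace as follows. This is a model
under stated assumptions, not kernel code: model_linear_page_index() is a
hypothetical stand-in that computes ((address - vm_start) >> PAGE_SHIFT) +
vm_pgoff, and the constants assume 4K base pages and 2MB huge pages.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_4K   12
#define HPAGE_SIZE_2M   (UINT64_C(2) << 20)
#define PAGES_PER_HPAGE (HPAGE_SIZE_2M >> PAGE_SHIFT_4K) /* 512 */

/* Hypothetical model of the base-page index of an address within a VMA. */
static uint64_t model_linear_page_index(uint64_t vm_start, uint64_t vm_pgoff,
					uint64_t address)
{
	assert(address >= vm_start);
	return ((address - vm_start) >> PAGE_SHIFT_4K) + vm_pgoff;
}

/* Mirrors the condition that the VM_BUG_ON_FOLIO() above checks. */
static int index_is_hpage_aligned(uint64_t index)
{
	return (index & (PAGES_PER_HPAGE - 1)) == 0;
}
```

For a fault at vm_start + 0x1000 the model yields index 1, which fails the
alignment check. Masking the address with ~(HPAGE_SIZE_2M - 1) first yields
index 0, which passes.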
=== Reproduction ===
Kernel: 7.1.0-rc5-g7ba451f8a24f #1 SMP PREEMPT_DYNAMIC x86_64
Config: CONFIG_HUGETLBFS=y, CONFIG_DEBUG_VM=y, CONFIG_KASAN=y
Trigger: mmap a hugetlbfs file, then access an address at offset 0x1000
(one 4K page) into the mapping, which is unaligned relative to the 2MB
huge page boundary.
=== Full PoC ===
Compile with: gcc -o poc poc.c -static
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <fcntl.h>
#include <errno.h>
#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000
#endif
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
/*
* Bug: hugetlb_fault() sets vmf.pgoff = linear_page_index(vma, address)
* using the raw unaligned fault address. This unaligned pgoff reaches
* __filemap_add_folio() which VM_BUG_ON_FOLIO's on it.
*/
static long get_hugepage_size(void)
{
FILE *f;
char line[256];
long size = 2 * 1024 * 1024;
f = fopen("/proc/meminfo", "r");
if (!f)
return size;
while (fgets(line, sizeof(line), f)) {
if (sscanf(line, "Hugepagesize: %ld kB", &size) == 1)
size *= 1024;
}
fclose(f);
return size;
}
int main(void)
{
void *addr;
size_t hpage_size;
const char *hugetlbfs_path = "/mnt/huge/testfile";
int fd;
int ret;
hpage_size = get_hugepage_size();
printf("[+] Huge page size: %zu bytes\n", hpage_size);
/* Mount hugetlbfs */
mkdir("/mnt/huge", 0755);
ret = syscall(__NR_mount, "hugetlbfs", "/mnt/huge", "hugetlbfs", 0, NULL);
if (ret < 0 && errno != EBUSY && errno != ENOENT)
perror("mount hugetlbfs");
/* Reserve 1 huge page */
{
FILE *f = fopen("/proc/sys/vm/nr_hugepages", "w");
if (f) { fprintf(f, "1"); fclose(f); }
}
/* Create hugetlbfs file and mmap it */
fd = open(hugetlbfs_path, O_CREAT | O_RDWR, 0644);
if (fd < 0) {
perror("open hugetlbfs");
printf("[!] Trying anonymous MAP_HUGETLB\n");
addr = mmap(NULL, hpage_size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (addr == MAP_FAILED) {
perror("mmap MAP_HUGETLB");
return 1;
}
} else {
ftruncate(fd, hpage_size);
addr = mmap(NULL, hpage_size, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
close(fd);
if (addr == MAP_FAILED) {
perror("mmap hugetlbfs file");
return 1;
}
}
printf("[+] Mapping at %p\n", addr);
/*
* Trigger: access address at offset 0x1000 into the huge page.
* vm_start is huge-page-aligned, but vm_start + 0x1000 is not.
* hugetlb_fault() sets vmf.pgoff = linear_page_index(vma, address)
* with the unaligned address, producing an unaligned pgoff.
*/
printf("[+] Triggering fault at unaligned offset (%p + 0x1000)...\n", addr);
fflush(stdout);
volatile char *trigger = (volatile char *)addr + 0x1000;
*trigger = 0x41;
printf("[+] Survived: value = 0x%02x\n", *trigger);
return 0;
}
=== Crash Log ===
Linux syzkaller 7.1.0-rc5-g7ba451f8a24f #1 SMP PREEMPT_DYNAMIC x86_64
[ 527.288433][ T9873] page dumped because: VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1))
[ 527.300642][ T9873] kernel BUG at mm/filemap.c:862!
[ 527.301090][ T9873] Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
[ 527.301640][ T9873] CPU: 0 UID: 0 PID: 9873 Comm: poc Not tainted
[ 527.303803][ T9873] RIP: 0010:__filemap_add_folio+0xf39/0x1200
[ 527.311913][ T9873] Call Trace:
[ 527.312345][ T9873] <TASK>
[ 527.312676][ T9873] hugetlb_add_to_page_cache+0xe3/0x240
[ 527.313414][ T9873] hugetlb_no_page+0x1301/0x21b0
[ 527.314402][ T9873] hugetlb_fault+0x531/0x1570
[ 527.315259][ T9873] handle_mm_fault+0x970/0xaf0
[ 527.316565][ T9873] do_user_addr_fault+0x60b/0x14c0
[ 527.317434][ T9873] asm_exc_page_fault+0x26/0x30
[ 527.318733][ T9873] RIP: 0033:0x401fa2
[ 527.326921][ T9873] </TASK>
[ 527.327245][ T9873] RIP: 0010:__filemap_add_folio+0xf39/0x1200
[ 527.335300][ T9873] Kernel panic - not syncing: Fatal exception
The Sashiko review also flagged a few other pre-existing issues in
this series that I haven't verified yet:
1. [Critical] remove_inode_hugepages() in patch 9: passing folio->index
(base-page index) to hugetlb_unmap_file_folio() which multiplies by
pages_per_huge_page(h), effectively squaring the offset and causing
the interval tree search to miss VMAs (potential UAF).
2. [High] hugetlbfs_zero_partial_page() in patch 7: Usama already
pointed out the start >> PAGE_SHIFT question — `start` is a byte
offset but filemap_lock_folio() expects a page index.
3. [Critical] filemap_get_pages() in patch 4: the `if (is_hugetlbfs)
goto done` path returns 0 with an empty batch, which could cause
filemap_read() to loop forever when reading a hole in a hugetlbfs
file.
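
For item 1, the double scaling can be shown with plain arithmetic. This is
only a model of the claim as reported, which I have not verified against
the patch: model_to_base_pgoff() is a hypothetical stand-in for a callee
that expects a huge-page index and converts it to base-page units, assuming
512 base pages per 2MB huge page.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGES_PER_HPAGE UINT64_C(512) /* 2MB huge page, 4K base pages */

/*
 * Hypothetical model of a callee that expects an index in huge-page units
 * and scales it to base-page units.
 */
static uint64_t model_to_base_pgoff(uint64_t hpage_index)
{
	return hpage_index * MODEL_PAGES_PER_HPAGE;
}
```

Passing huge-page index 3 gives the expected base-page offset 1536. Passing
the already-scaled value 1536 gives 786432, so a lookup keyed on that
offset would search the wrong range.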
Thanks,
Xiao
^ permalink raw reply
* Re: [PATCH v4 0/2] cpufreq: CPPC: add autonomous mode boot parameter support
From: Sumit Gupta @ 2026-06-23 10:17 UTC (permalink / raw)
To: Pierre Gondois, Viresh Kumar
Cc: rafael, ionela.voinescu, zhenglifeng1, zhanjie9, corbet, skhan,
rdunlap, mario.limonciello, linux-pm, linux-doc, linux-kernel,
linux-tegra, treding, jonathanh, vsethi, ksitaraman, sanjayc,
mochs, bbasu, sumitg
In-Reply-To: <f269fbc4-8b8f-4829-97bc-cf4cc9246aec@nvidia.com>
On 22/06/26 14:58, Sumit Gupta wrote:
>
> On 19/06/26 14:59, Pierre Gondois wrote:
>>
>> On 6/18/26 07:28, Viresh Kumar wrote:
>>> On 16-06-26, 18:22, Sumit Gupta wrote:
>>>> The dependency it was waiting on, the "cpufreq: Set policy->min and
>>>> max as real QoS constraints" series, is now in linux-pm (linux-next).
>>>> I rebased on top and verified autonomous mode works as expected, and
>>>> it applies cleanly on the current linux-next.
>>>>
>>>> The [1] reference in patch 2/2 points to v2 of that series; the merged
>>>> version is v3 [2].
>>>>
>>>> If there are no further comments, please consider acking and queuing
>>>> this for the next cycle.
>>> I was waiting for CPPC reviewers to provide some feedback.
>>>
>>> Jie / Lifeng / Pierre ?
>>>
>> I think the patchset has the same issue described at:
>>
>> https://lore.kernel.org/all/86780f97-29ee-4a72-b311-38c89434b707@arm.com/
>>
>>
>> I don't know if this is important to other persons,
>> but IMO it would be preferable to have a solution to this issue
>> before adding more functionalities relying on registers that are left
>> in an unknown state.
>>
>> Are there any other opinions?
>>
>
> The concern is valid, but this isn't a new gap. The registers the boot
> parameter programs are already writable via existing sysfs:
> - auto_sel via auto_select
> - EPP via energy_performance_preference_val
> So userspace can already leave these in a non-default state across
> unload / CPU hotplug in mainline. The boot parameter just sets the
> same registers at boot via the same paths.
>
> I am already working on the save/restore change we discussed on
> the ospm_nominal_perf thread, as a dedicated follow-up grouping
> all OSPM-set registers (ospm_nominal_perf, auto_sel, EPP) together.
> I think doing it once uniformly is cleaner.
>
> Both features are already under review, so my preference is to take
> them first and add the save/restore on top, rather than merging it
> first and respinning both features under it. Either order works for me
> if you and the maintainers prefer infra-first.
>
> Thanks,
> Sumit
>
>
I have sent v5 of the autonomous mode series [1] with a small fix.
Also posted patch [3] to preserve OSPM set regs across hotplug/unload.
It applies on top of [1] & [2] (both not yet merged).
[1]
[PATCH v5 0/2] cpufreq: CPPC: add autonomous mode boot parameter support
https://lore.kernel.org/lkml/20260623080652.3353386-1-sumitg@nvidia.com/
[2]
[PATCH v5] ACPI: CPPC: Add ospm_nominal_perf support
https://lore.kernel.org/lkml/20260615185934.2383514-1-sumitg@nvidia.com/
[3]
[PATCH] cpufreq: CPPC: Preserve OSPM-set registers across hotplug and unload
https://lore.kernel.org/lkml/20260623095403.3407436-1-sumitg@nvidia.com/
Thanks,
Sumit
^ permalink raw reply
* Re: [PATCH v3 1/2] dt-bindings: iio: dac: Add AD5529R
From: Janani Sunil @ 2026-06-23 10:07 UTC (permalink / raw)
To: David Lechner, Nuno Sá, Rodrigo Alencar
Cc: Jonathan Cameron, Conor Dooley, Janani Sunil, Lars-Peter Clausen,
Michael Hennerich, Nuno Sá, Andy Shevchenko, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Philipp Zabel, Jonathan Corbet,
Shuah Khan, linux-iio, devicetree, linux-kernel, linux-doc,
Mark Brown
In-Reply-To: <c72fb508-05a4-429a-9ca7-86e42a115fa8@baylibre.com>
On 6/22/26 17:36, David Lechner wrote:
> On 6/22/26 7:20 AM, Nuno Sá wrote:
>> On Mon, Jun 22, 2026 at 12:51:20PM +0100, Rodrigo Alencar wrote:
>>> On 22/06/26 11:29, Nuno Sá wrote:
>>>> On Mon, Jun 22, 2026 at 10:24:05AM +0100, Rodrigo Alencar wrote:
>>>>> On 21/06/26 15:33, Jonathan Cameron wrote:
>>>>>> On Fri, 19 Jun 2026 16:54:11 +0100
>>>>>> Nuno Sá <noname.nuno@gmail.com> wrote:
>>>>>>
>>>>>>> On Fri, Jun 19, 2026 at 03:12:07PM +0100, Conor Dooley wrote:
>>>>>>>> On Fri, Jun 19, 2026 at 02:01:08PM +0100, Nuno Sá wrote:
>>>>>>>>> On Fri, Jun 19, 2026 at 12:40:54PM +0100, Conor Dooley wrote:
>>>>>>>>>> On Fri, Jun 19, 2026 at 12:36:55PM +0100, Conor Dooley wrote:
>>>>>>>>>>> On Fri, Jun 19, 2026 at 12:33:11PM +0200, Janani Sunil wrote:
>>>>>>>>>>>> On 6/14/26 21:44, Jonathan Cameron wrote:
>>>>>>>>>>>>> On Tue, 9 Jun 2026 16:47:23 +0200
>>>>>>>>>>>>> Janani Sunil <jan.sun97@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 5/26/26 15:11, Rodrigo Alencar wrote:
>>>>>>>>>>>>>>> On 26/05/19 05:42PM, Janani Sunil wrote:
>>>>>>>>>>>>>>>> Devicetree bindings for AD5529R 16 channel 12/16 bit high voltage,
>>>>>>>>>>>>>>>> buffered voltage output digital-to-analog converter (DAC) with an
>>>>>>>>>>>>>>>> integrated precision reference.
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>> Probably others may comment on that, but...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This parent node may support device addressing for multi-device support through
>>>>>>>>>>>>>>> those ID pins. I suppose that each device may have its own power supplies or
>>>>>>>>>>>>>>> other resources like the toggle pins or reset and enable.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That way I suppose that an example would look like...
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +patternProperties:
>>>>>>>>>>>>>>>> + "^channel@([0-9]|1[0-5])$":
>>>>>>>>>>>>>>>> + type: object
>>>>>>>>>>>>>>>> + description: Child nodes for individual channel configuration
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + properties:
>>>>>>>>>>>>>>>> + reg:
>>>>>>>>>>>>>>>> + description: Channel number.
>>>>>>>>>>>>>>>> + minimum: 0
>>>>>>>>>>>>>>>> + maximum: 15
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + adi,output-range-microvolt:
>>>>>>>>>>>>>>>> + description: |
>>>>>>>>>>>>>>>> + Output voltage range for this channel as [min, max] in microvolts.
>>>>>>>>>>>>>>>> + If not specified, defaults to 0V to 5V range.
>>>>>>>>>>>>>>>> + oneOf:
>>>>>>>>>>>>>>>> + - items:
>>>>>>>>>>>>>>>> + - const: 0
>>>>>>>>>>>>>>>> + - enum: [5000000, 10000000, 20000000, 40000000]
>>>>>>>>>>>>>>>> + - items:
>>>>>>>>>>>>>>>> + - const: -5000000
>>>>>>>>>>>>>>>> + - const: 5000000
>>>>>>>>>>>>>>>> + - items:
>>>>>>>>>>>>>>>> + - const: -10000000
>>>>>>>>>>>>>>>> + - const: 10000000
>>>>>>>>>>>>>>>> + - items:
>>>>>>>>>>>>>>>> + - const: -15000000
>>>>>>>>>>>>>>>> + - const: 15000000
>>>>>>>>>>>>>>>> + - items:
>>>>>>>>>>>>>>>> + - const: -20000000
>>>>>>>>>>>>>>>> + - const: 20000000
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + required:
>>>>>>>>>>>>>>>> + - reg
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + additionalProperties: false
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +required:
>>>>>>>>>>>>>>>> + - compatible
>>>>>>>>>>>>>>>> + - reg
>>>>>>>>>>>>>>>> + - vdd-supply
>>>>>>>>>>>>>>>> + - avdd-supply
>>>>>>>>>>>>>>>> + - hvdd-supply
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +dependencies:
>>>>>>>>>>>>>>>> + spi-cpha: [ spi-cpol ]
>>>>>>>>>>>>>>>> + spi-cpol: [ spi-cpha ]
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +allOf:
>>>>>>>>>>>>>>>> + - $ref: /schemas/spi/spi-peripheral-props.yaml#
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +unevaluatedProperties: false
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +examples:
>>>>>>>>>>>>>>>> + - |
>>>>>>>>>>>>>>>> + #include <dt-bindings/gpio/gpio.h>
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + spi {
>>>>>>>>>>>>>>>> + #address-cells = <1>;
>>>>>>>>>>>>>>>> + #size-cells = <0>;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + dac@0 {
>>>>>>>>>>>>>>>> + compatible = "adi,ad5529r-16";
>>>>>>>>>>>>>>>> + reg = <0>;
>>>>>>>>>>>>>>>> + spi-max-frequency = <25000000>;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + vdd-supply = <&vdd_regulator>;
>>>>>>>>>>>>>>>> + avdd-supply = <&avdd_regulator>;
>>>>>>>>>>>>>>>> + hvdd-supply = <&hvdd_regulator>;
>>>>>>>>>>>>>>>> + hvss-supply = <&hvss_regulator>;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + reset-gpios = <&gpio0 87 GPIO_ACTIVE_LOW>;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + #address-cells = <1>;
>>>>>>>>>>>>>>>> + #size-cells = <0>;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + channel@0 {
>>>>>>>>>>>>>>>> + reg = <0>;
>>>>>>>>>>>>>>>> + adi,output-range-microvolt = <0 5000000>;
>>>>>>>>>>>>>>>> + };
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + channel@1 {
>>>>>>>>>>>>>>>> + reg = <1>;
>>>>>>>>>>>>>>>> + adi,output-range-microvolt = <(-10000000) 10000000>;
>>>>>>>>>>>>>>>> + };
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + channel@2 {
>>>>>>>>>>>>>>>> + reg = <2>;
>>>>>>>>>>>>>>>> + adi,output-range-microvolt = <0 40000000>;
>>>>>>>>>>>>>>>> + };
>>>>>>>>>>>>>>>> + };
>>>>>>>>>>>>>>>> + };
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> spi {
>>>>>>>>>>>>>>> #address-cells = <1>;
>>>>>>>>>>>>>>> #size-cells = <0>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> multi-dac@0 {
>>>>>>>>>>>>>>> compatible = "adi,ad5529r-16";
>>>>>>>>>>>>>>> reg = <0>;
>>>>>>>>>>>>>>> spi-max-frequency = <25000000>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> #address-cells = <1>;
>>>>>>>>>>>>>>> #size-cells = <0>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> dac@0 {
>>>>>>>>>>>>>>> reg = <0>;
>>>>>>>>>>>>>>> vdd-supply = <&vdd_regulator>;
>>>>>>>>>>>>>>> avdd-supply = <&avdd_regulator>;
>>>>>>>>>>>>>>> hvdd-supply = <&hvdd_regulator>;
>>>>>>>>>>>>>>> hvss-supply = <&hvss_regulator>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> reset-gpios = <&gpio0 87 GPIO_ACTIVE_LOW>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> #address-cells = <1>;
>>>>>>>>>>>>>>> #size-cells = <0>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> channel@0 {
>>>>>>>>>>>>>>> reg = <0>;
>>>>>>>>>>>>>>> adi,output-range-microvolt = <0 5000000>;
>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> channel@1 {
>>>>>>>>>>>>>>> reg = <1>;
>>>>>>>>>>>>>>> adi,output-range-microvolt = <(-10000000) 10000000>;
>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> channel@2 {
>>>>>>>>>>>>>>> reg = <2>;
>>>>>>>>>>>>>>> adi,output-range-microvolt = <0 40000000>;
>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> dac@1 {
>>>>>>>>>>>>>>> reg = <1>;
>>>>>>>>>>>>>>> vdd-supply = <&vdd_regulator>;
>>>>>>>>>>>>>>> avdd-supply = <&avdd_regulator>;
>>>>>>>>>>>>>>> hvdd-supply = <&hvdd_regulator>;
>>>>>>>>>>>>>>> hvss-supply = <&hvss_regulator>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> reset-gpios = <&gpio0 88 GPIO_ACTIVE_LOW>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> #address-cells = <1>;
>>>>>>>>>>>>>>> #size-cells = <0>;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> channel@0 {
>>>>>>>>>>>>>>> reg = <0>;
>>>>>>>>>>>>>>> adi,output-range-microvolt = <0 5000000>;
>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> channel@1 {
>>>>>>>>>>>>>>> reg = <1>;
>>>>>>>>>>>>>>> adi,output-range-microvolt = <(-10000000) 10000000>;
>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> then you might need something like:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> patternProperties:
>>>>>>>>>>>>>>> "^dac@[0-3]$":
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and put most of the things under this node pattern.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So the main driver that you're putting together might need to handle up to four instances.
>>>>>>>>>>>>>>> Even if your current driver cannot handle this, the dt-bindings might need to cover that.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Need to double check if each dac node needs a separate compatible, so you would maybe populate
>>>>>>>>>>>>>>> a platform data to be shared with the child nodes, which would be a separate driver.
>>>>>>>>>>>>>>> (not sure if it would make sense to mix and match ad5529r-16 and ad5529r-12).
>>>>>>>>>>>>>> Hi Rodrigo,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you for looking at this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For now, I would prefer to keep the binding scoped to a single AD5529R device instance. The current
>>>>>>>>>>>>>> hardware/use case we have only needs one device node and the driver is written around that model as well.
>>>>>>>>>>>>>> While the device addressing pins could allow multi-device topology, we do not have an actual platform using
>>>>>>>>>>>>>> that configuration at the moment, so I would prefer not to introduce an extra parent/child binding structure
>>>>>>>>>>>>>> speculatively without a validating use case.
>>>>>>>>>>>>> Interesting feature - kind of similar to address control on a typical i2c bus device, or
>>>>>>>>>>>>> looking at it another way a kind of distributed SPI mux.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Challenge of a binding is we need to anticipate the future. So I think we do need something
>>>>>>>>>>>>> like Rodrigo is suggesting even if we only (for now) support a single instance in the driver.
>>>>>>>>>>>>> That would leave the path open to supporting the addressing at a later date.
>>>>>>>>>>>>> An alternative might be to look at it like a chained device setup. In those we pretend there
>>>>>>>>>>>>> is just one device with a lot of channels etc. The snag is that here things are more loosely
>>>>>>>>>>>>> coupled whereas for those devices it tends to be you have to read / write the same register
>>>>>>>>>>>>> in all devices in the chain as one big SPI message.
>>>>>>>>>>>>>
>>>>>>>>>>>>> +CC Mark Brown as he may know of some precedent for this feature. For his reference:
>>>>>>>>>>>>> - Each of these devices has 2 ID pins. The SPI transfers have to contain the 2-bit
>>>>>>>>>>>>> value that matches that or they are ignored. Thus a single bus + 1 chip select can
>>>>>>>>>>>>> be used to talk to 4 devices. Question is what that looks like in device tree + I guess
>>>>>>>>>>>>> longer term how to support it cleanly in SPI.
>>>>>>>>>>> I'd swear I have seen this before, from some Microchip devices. Let me
>>>>>>>>>>> see if I can find what I am thinking of...
>>>>>>>>>>
>>>>>>>>>> microchip,mcp3911 and microchip,mcp3564 both seem to do this with
>>>>>>>>>> slightly different properties.
>>>>>>>>>>
>>>>>>>>>> microchip,device-addr:
>>>>>>>>>> description: Device address when multiple MCP3911 chips are present on the same SPI bus.
>>>>>>>>>> $ref: /schemas/types.yaml#/definitions/uint32
>>>>>>>>>> enum: [0, 1, 2, 3]
>>>>>>>>>> default: 0
>>>>>>>>>>
>>>>>>>>>> and
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> microchip,hw-device-address:
>>>>>>>>>> $ref: /schemas/types.yaml#/definitions/uint32
>>>>>>>>>> minimum: 0
>>>>>>>>>> maximum: 3
>>>>>>>>>> description:
>>>>>>>>>> The address is set on a per-device basis by fuses in the factory,
>>>>>>>>>> configured on request. If not requested, the fuses are set for 0x1.
>>>>>>>>>> The device address is part of the device markings to avoid
>>>>>>>>>> potential confusion. This address is coded on two bits, so four possible
>>>>>>>>>> addresses are available when multiple devices are present on the same
>>>>>>>>>> SPI bus with only one Chip Select line for all devices.
>>>>>>>>>> Each device communication starts by a CS falling edge, followed by the
>>>>>>>>>> clocking of the device address (BITS[7:6] - top two bits of COMMAND BYTE
>>>>>>>>>> which is first one on the wire).
>>>>>>>>>>
>>>>>>>>>> This sounds exactly like the sort of feature that you're dealing with
>>>>>>>>>> here?
>>>>>>>>>>
>>>>>>>>> The core idea yes but for this chip, things are a bit more annoying (but
>>>>>>>>> Janani can correct me if I'm wrong). Here, each device can, in theory,
>>>>>>>>> have its own supplies, pins and at the very least, channels with maybe
>>>>>>>>> different scales. That is why Janani is proposing dac nodes. Given I
>>>>>>>>> honestly don't like much of that "adi,ad5529r-bus" compatible I wondered
>>>>>>>>> about solving this at the spi level.
>>>>>>>>>
>>>>>>>>> Ah and to make it more annoying, we can also mix 12 and 16 bits variants
>>>>>>>>> together in the same bus.
>>>>>>>> I'm definitely missing something, because that property for the
>>>>>>>> microchip devices is not impacted by what else is on the bus. AFAICT, you
>>>>>>>> could have an mcp3911 and an mcp3564 on the same bus even though both
>>>>>>>> are completely different devices with different drivers. They have
>>>>>>>> individual device nodes and their own supplies etc etc. These aren't
>>>>>>>> per-channel properties on an adc or dac, they're per child device on a
>>>>>>>> spi bus.
>>>>>>> Maybe I'm the one missing something :). IIRC, spi would not allow two
>>>>>>> devices on the same CS right? Because for this chip we would need
>>>>>>> something like:
>>>>>>>
>>>>>>> spi {
>>>>>>> dac@0 {
>>>>>>> reg = <0>;
>>>>>>> adi,pin-id = <0>;
>>>>>>> };
>>>>>>>
>>>>>>> dac@1 {
>>>>>>> reg = <0>; // which seems already problematic?
>>>>>>> adi,pin-id = <1>;
>>>>>>> };
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>> //up to 4
>>>>>>> };
>>>>>> Yeah. It's not clear to me how that works for the microchip devices
>>>>>> (I suspect it doesn't!)
>>>>>>
>>>>>> Just thinking as I type, but could we do something a bit nasty with
>>>>>> a gpio mux that doesn't actually switch but represents the GPIO being
>>>>>> shared? Given this is all tied to the spi bus that should all happen
>>>>>> under serializing locks.
>>>>>>
>>>>>> Agreed though that this would be nicer as an SPI thing that let
>>>>>> us specify that a single CS is shared by multiple devices and there
>>>>>> is some other signal acting to select which one we are talking to.
>>>>>>
>>>>> If the device-addressing on the same chip-select is to be handled
>>>>> by the spi framework, wouldn't we lose device-specific features?
>>>>>
>>>>> I understand that this multi-device feature is there mostly to extend the
>>>>> channel count from 16 to 32, 48 or 64. I suppose the command:
>>>>>
>>>>> "MULTI DEVICE SW LDAC MODE"
>>>>>
>>>>> exists so that software can update channel values across multiple devices.
>>>> Right! You do have a point! I agree the main driver for a feature like
>>>> this is likely to extend the channel count and effectively "aggregate"
>>>> devices.
>>>>
>>>> But I would say that even with the spi solution the MULTI DEVICE stuff
>>>> should be doable (as we still need a sort of adi,pin-id property).
>>> I don't think we can have something like an IIO buffer shared by multiple
>>> devices. Synchronizing separate devices would be doable with proper hardware
>>> support for this (probably involving an FPGA).
>> True!
>>
>>>
>>>> But yes, I do feel that the whole feature is for aggregation so seeing
>>>> one device with 32 channels is the expectation here? Rather than seeing
>>>> two devices with 16 channels.
>>> Yes, I think aggregation is the whole point there... so that the IIO driver
>>> is multi-device-aware.
>> Which makes me feel that different pins per device might be possible
>> from an HW point of view but does not make much sense. For example, for
>> the buffer example I would expect LDAC to be shared between all the
>> devices.
>>
>> - Nuno Sá
> I think I mentioned this on a previous revision, but I still think the
> simplest way to go about it would be to assume that all chips treated
> as an aggregate device have everything wired in parallel and just add
> support for per-chip wiring on an as-needed basis. This is how we have
> handled daisy-chained devices so far.
Hi David,
One limitation of this approach is that it does not cover a combination of 12-bit and 16-bit parts in the chain,
since the compatible string would be at the top level and apply to all chips. To handle this without per-chip child nodes or per-chip compatibles,
I propose an "adi,resolution" property as an integer array, indexed by device position:
dac@0 {
	compatible = "adi,ad5529r";
	reg = <0>;
	adi,device-addrs = <0 1>;
	adi,resolution = <16 12>; /* per-chip, indexed by position */
	reset-gpios = <&gpio0 87 GPIO_ACTIVE_LOW>;
	vdd-supply = <&vdd_reg>;
	hvdd-supply = <&hvdd_reg>;
	channel@0 { reg = <0>; adi,output-range-microvolt = <0 5000000>; };
	channel@16 { reg = <16>; adi,output-range-microvolt = <0 40000000>; };
};
1) This follows the daisy-chain/aggregated model as you suggested, exposing N*16 channels as a single IIO device.
2) It keeps the binding flat: no phantom compatible at a parent bus node, no per-chip child nodes.
3) It enables a 12-bit + 16-bit device combination in the chain without needing a per-chip compatible.
4) adi,device-addrs specifies the HW address, allowing the driver to encode it into the SPI frame.
5) Supplies and GPIOs remain simple, assuming parallel wiring across all chips.
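As a rough illustration of points 1), 3) and 4), the lookup from a flat channel index to a chip could be sketched as below. This is only a sketch under assumptions: struct agg_dac, struct chan_route and agg_dac_route() are hypothetical names that mirror the proposed properties, and nothing here is taken from an existing driver or from the AD5529R SPI frame layout.

```c
#include <stddef.h>
#include <stdint.h>

#define AD5529R_CH_PER_CHIP 16 /* from the thread: each chip adds 16 channels */

/* Hypothetical mirror of the proposed properties; not taken from any driver. */
struct agg_dac {
	size_t num_chips;
	const uint8_t *device_addrs; /* adi,device-addrs, e.g. <0 1> */
	const uint8_t *resolution;   /* adi,resolution, e.g. <16 12> */
};

struct chan_route {
	uint8_t hw_addr;    /* value strapped on the chip's ID pins */
	uint8_t local_chan; /* 0..15 within that chip */
	uint8_t bits;       /* resolution of that chip, 12 or 16 */
};

/* Resolve a flat channel index (the channel@N reg value) to a chip. */
static int agg_dac_route(const struct agg_dac *dac, unsigned int chan,
			 struct chan_route *out)
{
	size_t pos = chan / AD5529R_CH_PER_CHIP;

	if (pos >= dac->num_chips)
		return -1; /* channel lies beyond the last chip in the arrays */

	out->hw_addr = dac->device_addrs[pos];
	out->local_chan = chan % AD5529R_CH_PER_CHIP;
	out->bits = dac->resolution[pos];
	return 0;
}
```

With the example node above, channel@16 would resolve to hardware address 1, local channel 0 and 12-bit resolution, since both arrays are indexed by the same chip position.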
Jonathan, you had earlier suggested using separate compatibles
(adi,ad5529r-16 and adi,ad5529r-12) to handle the resolution difference.
However, with the aggregated flat binding model, separate per-chip
compatibles would require child nodes, which brings back the phantom
compatible problem at the parent level. The adi,resolution array is
intended as an alternative that achieves the same goal, expressing
per-chip resolution without needing a per-chip compatible or child node
structure.
Does this look reasonable?
Best Regards,
Janani Sunil
^ permalink raw reply
* Re: Issue cloning kernel-doc-zh from HUST mirror
From: Siwei Chen @ 2026-06-23 10:04 UTC (permalink / raw)
To: linux-doc, Dongliang Mu; +Cc: si.yanteng, wy
In-Reply-To: <b03f244b-46b8-47e8-b7f5-d98d714ae15c@hust.edu.cn>
On Tuesday, June 23, 2026 at 16:51:20 China Standard Time, Dongliang Mu wrote:
> Hello Siwei,
>
> The long answer is as follows:
>
> The curl 52 Empty reply from server error is not a Git or Ubuntu
> compatibility issue. It happens because the kernel-doc-zh repository is
> extremely large, and the HUST mirror server closes the HTTPS connection
> early due to timeout or proxy limits.
>
> You can try the following commands:
>
> 1. Shallow clone first (most reliable)
>
> git clone --depth 1 https://mirrors.hust.edu.cn/git/kernel-doc-zh.git linux
>
> Then fetch full history:
>
> git fetch --unshallow
>
> If still failing, increase Git buffer like:
>
> git config --global http.postBuffer 1073741824
>
> Finally, I will contact maintainers of HUST mirror site and try
> some attempts to resolve this issue.
>
> Dongliang Mu
>
Hello, Dongliang
Thank you for the detailed explanation and suggestions.
I will try the shallow clone approach and the other workarounds you mentioned.
I also appreciate your willingness to contact the HUST mirror maintainers and
investigate the issue further.
Thanks again for your help.
Best regards,
Siwei Chen
^ permalink raw reply
* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Binbin Wu @ 2026-06-23 9:48 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-18-9d2959357853@google.com>
On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> @@ -606,12 +608,20 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> next = start;
> while (safe && filemap_get_folios(mapping, &next, last, &fbatch)) {
>
> - for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> + for (i = 0; i < folio_batch_count(&fbatch);) {
> struct folio *folio = fbatch.folios[i];
>
> - if (folio_ref_count(folio) !=
> - folio_nr_pages(folio) + filemap_get_folios_refcount) {
> - safe = false;
> + safe = (folio_ref_count(folio) ==
> + folio_nr_pages(folio) +
> + filemap_get_folios_refcount);
> +
> + if (safe) {
> + ++i;
> + } else if (folio_may_be_lru_cached(folio) &&
> + !lru_drained) {
> + lru_add_drain_all();
It seems that unprivileged userspace can trigger lru_add_drain_all() repeatedly
by invoking KVM_SET_MEMORY_ATTRIBUTES2 in a loop. Could that be a DoS risk?
> + lru_drained = true;
> + } else {
> *err_index = max(start, folio->index);
> break;
> }
>
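To make the concern concrete, here is a small userspace model of the loop in the hunk above. It is only a sketch: struct fake_folio, fake_drain_all() and range_is_safe() are hypothetical stand-ins, not kernel code. The lru_drained flag bounds the drain to once per call, but nothing in the loop bounds it across calls.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio: only what the sketch needs. */
struct fake_folio {
	int refcount;            /* current reference count */
	int expected;            /* count that is considered safe */
	bool may_be_lru_cached;  /* models folio_may_be_lru_cached() */
	int batch_refs;          /* extra refs held by per-CPU LRU batches */
};

static unsigned long drain_calls; /* how often the expensive drain ran */

/* Stand-in for lru_add_drain_all(): drops all batch references. */
static void fake_drain_all(struct fake_folio *f, size_t n)
{
	drain_calls++;
	for (size_t i = 0; i < n; i++) {
		f[i].refcount -= f[i].batch_refs;
		f[i].batch_refs = 0;
	}
}

/* Returns true if every folio is safe; else stores the failing index. */
static bool range_is_safe(struct fake_folio *f, size_t n, size_t *err_index)
{
	bool drained = false;
	size_t i = 0;

	while (i < n) {
		if (f[i].refcount == f[i].expected) {
			i++;                  /* advance only when this one is safe */
		} else if (f[i].may_be_lru_cached && !drained) {
			fake_drain_all(f, n); /* at most once per call */
			drained = true;       /* then recheck the same index */
		} else {
			*err_index = i;
			return false;
		}
	}
	return true;
}
```

In this model, a folio whose extra reference is not held by an LRU batch (for example a long-lived pin) makes every call pay for one drain and then fail, so a caller looping on the failing request triggers one drain per iteration.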
^ permalink raw reply
* Re: [PATCH v8 21/46] KVM: guest_memfd: Zero page while getting pfn
From: Yan Zhao @ 2026-06-23 8:56 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com>
On Thu, Jun 18, 2026 at 05:31:58PM -0700, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
>
> Move the folio initialization logic from kvm_gmem_get_pfn() into
> __kvm_gmem_get_pfn() to also zero pages if the page is to be used in
> kvm_gmem_populate().
>
> With in-place conversion, the existing data in a guest_memfd page can be
> populated into guest memory through platform-specific ioctls.
>
> Without first zeroing the page obtained using __kvm_gmem_get_pfn(), it
> might contain uninitialized host memory, which would leak to the guest if
> the populate completes.
>
> guest_memfd pages are zeroed at most once in the page's entire lifetime
> with guest_memfd, and that is tracked using the uptodate flag.
>
> Zeroing the page in __kvm_gmem_get_pfn() is chosen over zeroing in
> kvm_gmem_get_folio() since other flows, such as a future write() syscall,
> can get a page, write to the page and then set page uptodate without
> zeroing.
>
> This aligns with the concept of zeroing before first use - the other place
> where zeroing happens is in kvm_gmem_fault_user_mapping().
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> virt/kvm/guest_memfd.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 90bc1a26512b6..86c9f5b0863cb 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -1137,6 +1137,11 @@ static struct folio *__kvm_gmem_get_pfn(struct file *file,
> return ERR_PTR(-EHWPOISON);
> }
>
> + if (!folio_test_uptodate(folio)) {
> + clear_highpage(folio_page(folio, 0));
> + folio_mark_uptodate(folio);
> + }
Note:
In the __kvm_gmem_populate() path, this folio_mark_uptodate() call makes the
later one after post_populate() pointless.
__kvm_gmem_populate
|1.__kvm_gmem_get_pfn
| |->folio = kvm_gmem_get_folio()
| | if (!folio_test_uptodate(folio))
| | folio_mark_uptodate(folio);
|2. ret = post_populate()
|3. if (!ret)
| folio_mark_uptodate(folio);
> *pfn = folio_file_pfn(folio, index);
> if (max_order)
> *max_order = 0;
> @@ -1166,11 +1171,6 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> goto out;
> }
>
> - if (!folio_test_uptodate(folio)) {
> - clear_highpage(folio_page(folio, 0));
> - folio_mark_uptodate(folio);
> - }
> -
> if (kvm_gmem_is_private_mem(inode, index))
> r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>
>
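The observation in the note can be checked with a small userspace model of the populate path. This is a sketch only: struct model_folio and the model_*() helpers are hypothetical stand-ins for the kernel functions named in the call tree, and the page size is arbitrary.

```c
#include <stdbool.h>
#include <string.h>

#define MODEL_PAGE_SIZE 64 /* arbitrary size for the sketch */

struct model_folio {
	bool uptodate;
	unsigned char data[MODEL_PAGE_SIZE];
};

static unsigned int zero_count;      /* times the page was actually cleared */
static unsigned int redundant_marks; /* marks that found the flag already set */

static void model_mark_uptodate(struct model_folio *f)
{
	if (f->uptodate)
		redundant_marks++;
	f->uptodate = true;
}

/* Models the hunk moved into __kvm_gmem_get_pfn(): zero on first use only. */
static void model_get_pfn(struct model_folio *f)
{
	if (!f->uptodate) {
		memset(f->data, 0, sizeof(f->data));
		zero_count++;
		model_mark_uptodate(f);
	}
}

/* Models the populate path: get the pfn, run the callback, mark on success. */
static int model_populate(struct model_folio *f, int post_populate_ret)
{
	model_get_pfn(f);
	if (!post_populate_ret)
		model_mark_uptodate(f); /* the flag is already set here */
	return post_populate_ret;
}
```

In the model, the mark after a successful post_populate() always finds the flag set, and a folio is cleared at most once no matter how often the pfn is requested.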
^ permalink raw reply
* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Yan Zhao @ 2026-06-23 8:41 UTC (permalink / raw)
To: Sean Christopherson, ackerleytng, aik, andrew.jones, binbin.wu,
brauner, chao.p.peng, david, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <ajoWngKaZ+wfIyR+@yzhao56-desk.sh.intel.com>
On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
> On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
> > On Mon, Jun 22, 2026, Yan Zhao wrote:
> > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
> > > > From: Ackerley Tng <ackerleytng@google.com>
> > > >
> > > > Update tdx_gmem_post_populate() to handle cases where a source page is
> > > > not explicitly provided. Instead of returning -EOPNOTSUPP when src_page
> > > > is NULL, default to using the page associated with the destination PFN.
> > > >
> > > > This change allows for in-place memory conversion where the data is
> > > > already present in the target PFN, ensuring the TDX module has a valid
> > > > source page reference for the TDH.MEM.PAGE.ADD operation.
> > > >
> > > > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > > ---
> > > > Documentation/virt/kvm/x86/intel-tdx.rst | 4 ++++
> > > > arch/x86/kvm/vmx/tdx.c | 11 ++++++++---
> > > > 2 files changed, 12 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst
> > > > index 6a222e9d09541..74357fe87f9ec 100644
> > > > --- a/Documentation/virt/kvm/x86/intel-tdx.rst
> > > > +++ b/Documentation/virt/kvm/x86/intel-tdx.rst
> > > > @@ -158,6 +158,10 @@ KVM_TDX_INIT_MEM_REGION
> > > > Initialize @nr_pages TDX guest private memory starting from @gpa with userspace
> > > > provided data from @source_addr. @source_addr must be PAGE_SIZE-aligned.
> > > >
> > > > +If guest_memfd in-place conversion is enabled, pass NULL for @source_addr to
> > > > +initialize the memory region using memory contents already populated in
> > > > +guest_memfd memory.
> > > > +
> > > > Note, before calling this sub command, memory attribute of the range
> > > > [gpa, gpa + nr_pages] needs to be private. Userspace can use
> > > > KVM_SET_MEMORY_ATTRIBUTES to set the attribute.
> > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > > index ffe9d0db58c59..56d10333c61a7 100644
> > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
> > > > return -EIO;
> > > >
> > > > - if (!src_page)
> > > > - return -EOPNOTSUPP;
> > > > + if (!src_page) {
> > > > + if (!gmem_in_place_conversion)
> > > When userspace turns on gmem_in_place_conversion while creating guest_memfd
> > > without the MMAP flag, the absence of src_page should still be treated as an
> > > error.
> >
> > Why MMAP?
> Hmm, I was showing a scenario that in-place conversion couldn't occur.
> I didn't mean that with the MMAP flag, mmap() and user write must occur.
>
> > Shouldn't this be a general "if (!src_page && !up-to-date)"? Just
> > because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
> > and written memory. And when write() lands, MMAP wouldn't be necessary to
> > initialize the memory.
> Do you mean using up-to-date flag as below?
>
> if (!src_page) {
> src_page = pfn_to_page(pfn);
> if (!folio_test_uptodate(page_folio(src_page)))
> return -EOPNOTSUPP;
> }
Another concern with this fix is that:
commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the
folio uptodate before reaching post_populate().
[1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/
> One concern is that TDX now does not much care about the up-to-date flag since
> TDX doesn't rely on the flag to clear pages on conversions.
> I'm not sure if the flag can be reliably checked in this case. e.g.,
> now the whole folio is marked up-to-date even if only part of it is faulted by
> user access.
> Ensuring that the up-to-date flag works correctly with huge page support seems
> to have more effort than introducing a dedicated flag for TDX.
>
> > > Additionally, to properly enable in-place copying for the TDX initial memory
> > > region, userspace must not only specify source_addr to NULL, but also follow
> > > a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
> > > 1. create guest_memfd with MMAP flag
> > > 2. mmap the guest_memfd.
> > > 3. convert the initial memory range to shared.
> > > 4. copy initial content to the source page.
> > > 5. convert the initial memory range to private
> > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
> > > 7. do not unmap the source backend.
> > >
> > > So, would it be reasonable to introduce a dedicated flag that allows userspace
> > > to explicitly opt into the in-place copy functionality? e.g.,
> >
> > Why? It's userspace's responsibility to get the above right. If userspace fails
> > to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
> I mean if userspace specifies a NULL source_addr by mistake, it's better for
> kernel to detect this mistake, similar to how it validates whether source_addr
> is PAGE_ALIGNED.
> Since userspace already needs to perform additional steps to enable in-place
> copy, specifying a dedicated flag to indicate that the NULL source_addr is
> intentional seems like a reasonable burden.
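For illustration, the kind of check suggested here could look roughly like the sketch below. Everything in it is an assumption: SKETCH_FLAG_IN_PLACE is a hypothetical flag name, the page size is assumed to be 4096, rejecting a non-NULL source together with the flag is one possible policy, and check_init_mem_region() is not an existing KVM function.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL          /* assumed page size */
#define SKETCH_FLAG_IN_PLACE (1ULL << 0)  /* hypothetical opt-in flag */

/* Validate the source address against the (hypothetical) opt-in flag. */
static int check_init_mem_region(uint64_t source_addr, uint64_t flags)
{
	if (flags & ~SKETCH_FLAG_IN_PLACE)
		return -EINVAL; /* unknown flag bits */

	if (flags & SKETCH_FLAG_IN_PLACE)
		/* In-place copy requested: a source buffer is contradictory. */
		return source_addr ? -EINVAL : 0;

	if (!source_addr)
		return -EINVAL; /* NULL without the opt-in is treated as a mistake */

	if (source_addr & (SKETCH_PAGE_SIZE - 1))
		return -EINVAL; /* existing rule: must be page aligned */

	return 0;
}
```

The point of the explicit flag is that a NULL source_addr on its own stays an error, in the same way an unaligned address is, so an accidental NULL cannot silently select the in-place path.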
^ permalink raw reply
* Re: [PATCH v8 17/46] KVM: guest_memfd: Advertise KVM_SET_MEMORY_ATTRIBUTES2 ioctl
From: Binbin Wu @ 2026-06-23 9:14 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-17-9d2959357853@google.com>
On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
>
> Introduce KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES to advertise the
> availability of the KVM_SET_MEMORY_ATTRIBUTES2 ioctl.
>
> KVM_SET_MEMORY_ATTRIBUTES2 is a guest_memfd-scoped version of the existing
> KVM_SET_MEMORY_ATTRIBUTES VM ioctl. It allows userspace to manage memory
> attributes, such as KVM_MEMORY_ATTRIBUTE_PRIVATE, directly on a guest_memfd
> file descriptor.
>
> This new version uses struct kvm_memory_attributes2, which adds an
> error_offset field to the output. This allows KVM to return the specific
> offset that triggered an error, which is especially useful for handling
> EAGAIN results caused by transient page reference counts during attribute
> conversions.
>
> Update the KVM API documentation to define the new ioctl and its behavior,
> and add the necessary UAPI definitions and capability checks.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Suggested-by: Michael Roth <michael.roth@amd.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Two nits below.
>
> +4.145 KVM_SET_MEMORY_ATTRIBUTES2
> +---------------------------------
> +
> +:Capability: KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES
> +:Architectures: all
> +:Type: guest_memfd ioctl
> +:Parameters: struct kvm_memory_attributes2 (in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Errors:
> +
> + ========== ===============================================================
> + EINVAL The specified `offset` or `size` were invalid (e.g. not
s/were/was/
> + page aligned, causes an overflow, or size is zero).
> + EFAULT The parameter address was invalid.
> + EAGAIN Some page within requested range had unexpected refcounts. The
> + offset of the page will be returned in `error_offset`.
> + ENOMEM Ran out of memory trying to track private/shared state
> + ========== ===============================================================
[...]
> +
> +Set attributes for a range of offsets within a guest_memfd to
> +KVM_MEMORY_ATTRIBUTE_PRIVATE to limit the specified guest_memfd backed
> +memory range for guest_use. Even if KVM_CAP_GUEST_MEMFD_MMAP is
s/guest_use/guest use/
> +supported, after a successful call to set
> +KVM_MEMORY_ATTRIBUTE_PRIVATE, the requested range will not be mappable
> +into host userspace and will only be mappable by the guest.
> +
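The EINVAL conditions listed in the documentation hunk (not page aligned, overflow, zero size) can be sketched as a standalone check. This is an illustration, not the code from the patch: check_attr_range() is a hypothetical name and the page size is assumed to be 4096.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL /* assumed page size */

/* Reject ranges that match one of the documented EINVAL conditions. */
static int check_attr_range(uint64_t offset, uint64_t size)
{
	if (size == 0)
		return -EINVAL; /* empty range */

	if ((offset | size) & (SKETCH_PAGE_SIZE - 1))
		return -EINVAL; /* offset or size is not page aligned */

	if (offset > UINT64_MAX - size)
		return -EINVAL; /* offset + size would wrap around */

	return 0;
}
```

The overflow test is written as a comparison against UINT64_MAX - size so that the addition itself is never performed on values that would wrap.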
^ permalink raw reply