* Re: [PATCH RFC v4 08/44] KVM: Introduce KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-03-31 22:53 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jroedel, jthoughton, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm
In-Reply-To: <20260326-gmem-inplace-conversion-v4-8-e202fe950ffd@google.com>
On Thu, Mar 26, 2026 at 03:24:17PM -0700, Ackerley Tng wrote:
> Introduce a "version 2" of KVM_SET_MEMORY_ATTRIBUTES to support returning
> information back to userspace.
Hi Ackerley,
Not trying to bikeshed below, but I'm working on getting related QEMU
patches cleaned up to post soon and was working through some of the new
uAPI bits, and plumbing some of these capabilities in seems a little
awkward in a couple places so wondering if we should revisit how some of
this API is defined...
>
> This new ioctl and structure will, in a later patch, be shared as a
> guest_memfd ioctl, where the padding in the new kvm_memory_attributes2
> structure will be for writing the response from the guest_memfd ioctl to
> userspace.
>
> A new ioctl is necessary for these reasons:
>
> 1. KVM_SET_MEMORY_ATTRIBUTES is currently a write-only ioctl and does not
> allow userspace to read fields. There's nothing in code (yet?) that
> validates this, but using _IOWR for consistency would be prudent.
>
> 2. KVM_SET_MEMORY_ATTRIBUTES, when used as a guest_memfd ioctl, will need
> an additional field to provide userspace with more error details.
>
> Alternatively, a completely new ioctl could be defined, unrelated to
> KVM_SET_MEMORY_ATTRIBUTES, but using the same ioctl number and struct for
> the vm and guest_memfd ioctls streamlines the interface for userspace. In
> addition, any memory attributes, implemented on the vm or guest_memfd
> ioctl, can be easily shared with the other.
>
> Add KVM_CAP_MEMORY_ATTRIBUTES2 to indicate that struct
> kvm_memory_attributes2 exists and can be used either with
> KVM_SET_MEMORY_ATTRIBUTES2 via the vm or guest_memfd ioctl.
The guest_memfd support for the KVM_SET_MEMORY_ATTRIBUTES2 ioctl isn't
added until patch #10, and to scan for it you sort of need to infer it
via KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES reporting non-zero (i.e.
KVM_MEMORY_ATTRIBUTE_PRIVATE), so it's confusing to state that
KVM_CAP_MEMORY_ATTRIBUTES2 means you can use the struct via a guest_memfd
ioctl.
I think the above is trying to simply say that the corresponding struct
exists, and remain agnostic about how it can be used. But if that were
the case, there would be no way to know when KVM_SET_MEMORY_ATTRIBUTES2 is
available in the first place, so in the case of KVM ioctls at least,
KVM_CAP_MEMORY_ATTRIBUTES2 is advertising both the struct and the ioctl,
whereas for guest_memfd it's only advertising the struct and not saying
anything about whether a similar gmem ioctl is available to use it.
Instead, maybe they should both have the same semantics:
KVM_CAP_MEMORY_ATTRIBUTES2: *SET_ATTRIBUTES* ioctl exists for KVM that utilizes
struct kvm_memory_attributes2
KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES: *SET_ATTRIBUTES* ioctl exists for
guest_memfd that utilizes struct kvm_memory_attributes2
In which case you would leave out any mention of guest_memfd here as far as
the documentation does, and then in patch #10 you could modify it to be
something like:
4.145 KVM_SET_MEMORY_ATTRIBUTES2
---------------------------------
-:Capability: KVM_CAP_MEMORY_ATTRIBUTES2
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES2, KVM_GUEST_MEMFD_CAP_MEMORY_ATTRIBUTES
-:Architectures: x86
+:Architectures: all
-:Type: vm ioctl
+:Type: vm, guest_memfd ioctl
:Parameters: struct kvm_memory_attributes2 (in/out)
:Returns: 0 on success, <0 on error
and *then* add in your mentions of how the usage/fields differ for
guest_memfd/KVM_GUEST_MEMFD_CAP_MEMORY_ATTRIBUTES case vs. KVM ioctls.
This avoids needing to issue 2 checks for the guest_memfd variant vs. 1
for KVM, but more importantly avoids subtle differences in how these
similarly-named capabilities are used/documented that might cause
unecessary confusion.
Thanks,
Mike
>
> Handle KVM_CAP_MEMORY_ATTRIBUTES2 and return the same supported attributes
> as would be returned for KVM_CAP_MEMORY_ATTRIBUTES - the supported
> attributes are the same for now, regardless of the CAP requested.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> Documentation/virt/kvm/api.rst | 32 ++++++++++++++++++++++++++++++++
> include/uapi/linux/kvm.h | 12 ++++++++++++
> virt/kvm/kvm_main.c | 40 +++++++++++++++++++++++++++++++++++++---
> 3 files changed, 81 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 032516783e962..0b61e2579e1d8 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6359,6 +6359,8 @@ S390:
> Returns -EINVAL if the VM has the KVM_VM_S390_UCONTROL flag set.
> Returns -EINVAL if called on a protected VM.
>
> +.. _KVM_SET_MEMORY_ATTRIBUTES:
> +
> 4.141 KVM_SET_MEMORY_ATTRIBUTES
> -------------------------------
>
> @@ -6551,6 +6553,36 @@ KVM_S390_KEYOP_SSKE
> Sets the storage key for the guest address ``guest_addr`` to the key
> specified in ``key``, returning the previous value in ``key``.
>
> +4.145 KVM_SET_MEMORY_ATTRIBUTES2
> +---------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES2
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_memory_attributes2 (in/out)
> +:Returns: 0 on success, <0 on error
> +
> +KVM_SET_MEMORY_ATTRIBUTES2 is an extension to
> +KVM_SET_MEMORY_ATTRIBUTES that supports returning (writing) values to
> +userspace. The original (pre-extension) fields are shared with
> +KVM_SET_MEMORY_ATTRIBUTES identically.
> +
> +Attribute values are shared with KVM_SET_MEMORY_ATTRIBUTES.
> +
> +::
> +
> + struct kvm_memory_attributes2 {
> + __u64 address;
> + __u64 size;
> + __u64 attributes;
> + __u64 flags;
> + __u64 reserved[12];
> + };
> +
> + #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
> +
> +See also: :ref: `KVM_SET_MEMORY_ATTRIBUTES`.
> +
> .. _kvm_run:
>
> 5. The kvm_run structure
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 80364d4dbebb0..16567d4a769e5 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -989,6 +989,7 @@ struct kvm_enable_cap {
> #define KVM_CAP_ARM_SEA_TO_USER 245
> #define KVM_CAP_S390_USER_OPEREXEC 246
> #define KVM_CAP_S390_KEYOP 247
> +#define KVM_CAP_MEMORY_ATTRIBUTES2 248
>
> struct kvm_irq_routing_irqchip {
> __u32 irqchip;
> @@ -1637,6 +1638,17 @@ struct kvm_memory_attributes {
> __u64 flags;
> };
>
> +/* Available with KVM_CAP_MEMORY_ATTRIBUTES2 */
> +#define KVM_SET_MEMORY_ATTRIBUTES2 _IOWR(KVMIO, 0xd2, struct kvm_memory_attributes2)
> +
> +struct kvm_memory_attributes2 {
> + __u64 address;
> + __u64 size;
> + __u64 attributes;
> + __u64 flags;
> + __u64 reserved[12];
> +};
> +
> #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
>
> #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 70b594dafc5cc..3c261904322f0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2621,9 +2621,10 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> return r;
> }
> static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> - struct kvm_memory_attributes *attrs)
> + struct kvm_memory_attributes2 *attrs)
> {
> gfn_t start, end;
> + int i;
>
> /* flags is currently not used. */
> if (attrs->flags)
> @@ -2634,6 +2635,10 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> return -EINVAL;
> if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> return -EINVAL;
> + for (i = 0; i < ARRAY_SIZE(attrs->reserved); i++) {
> + if (attrs->reserved[i])
> + return -EINVAL;
> + }
>
> start = attrs->address >> PAGE_SHIFT;
> end = (attrs->address + attrs->size) >> PAGE_SHIFT;
> @@ -4966,6 +4971,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> case KVM_CAP_DEVICE_CTRL:
> return 1;
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> + case KVM_CAP_MEMORY_ATTRIBUTES2:
> case KVM_CAP_MEMORY_ATTRIBUTES:
> if (!vm_memory_attributes)
> return 0;
> @@ -5191,6 +5197,14 @@ do { \
> sizeof_field(struct kvm_userspace_memory_region2, field)); \
> } while (0)
>
> +#define SANITY_CHECK_MEMORY_ATTRIBUTES_FIELD(field) \
> +do { \
> + BUILD_BUG_ON(offsetof(struct kvm_memory_attributes, field) != \
> + offsetof(struct kvm_memory_attributes2, field)); \
> + BUILD_BUG_ON(sizeof_field(struct kvm_memory_attributes, field) != \
> + sizeof_field(struct kvm_memory_attributes2, field)); \
> +} while (0)
> +
> static long kvm_vm_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> @@ -5373,15 +5387,35 @@ static long kvm_vm_ioctl(struct file *filp,
> }
> #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> + case KVM_SET_MEMORY_ATTRIBUTES2:
> case KVM_SET_MEMORY_ATTRIBUTES: {
> - struct kvm_memory_attributes attrs;
> + struct kvm_memory_attributes2 attrs;
> + unsigned long size;
> +
> + if (ioctl == KVM_SET_MEMORY_ATTRIBUTES) {
> + /*
> + * Fields beyond struct kvm_memory_attributes shouldn't
> + * be accessed, but avoid leaking kernel memory in case
> + * of a bug.
> + */
> + memset(&attrs, 0, sizeof(attrs));
> + size = sizeof(struct kvm_memory_attributes);
> + } else {
> + size = sizeof(struct kvm_memory_attributes2);
> + }
> +
> + /* Ensure the common parts of the two structs are identical. */
> + SANITY_CHECK_MEMORY_ATTRIBUTES_FIELD(address);
> + SANITY_CHECK_MEMORY_ATTRIBUTES_FIELD(size);
> + SANITY_CHECK_MEMORY_ATTRIBUTES_FIELD(attributes);
> + SANITY_CHECK_MEMORY_ATTRIBUTES_FIELD(flags);
>
> r = -ENOTTY;
> if (!vm_memory_attributes)
> goto out;
>
> r = -EFAULT;
> - if (copy_from_user(&attrs, argp, sizeof(attrs)))
> + if (copy_from_user(&attrs, argp, size))
> goto out;
>
> r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
>
> --
> 2.53.0.1018.g2bb0e51243-goog
>
^ permalink raw reply
* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-03-31 23:31 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jroedel, jthoughton, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm
In-Reply-To: <20260326-gmem-inplace-conversion-v4-10-e202fe950ffd@google.com>
On Thu, Mar 26, 2026 at 03:24:19PM -0700, Ackerley Tng wrote:
> For shared to private conversions, if refcounts on any of the folios
> within the range are elevated, fail the conversion with -EAGAIN.
>
> At the point of shared to private conversion, all folios in range are
> also unmapped. The filemap_invalidate_lock() is held, so no faulting
> can occur. Hence, from that point on, only transient refcounts can be
> taken on the folios associated with that guest_memfd.
>
> Hence, it is safe to do the conversion from shared to private.
>
> After conversion is complete, refcounts may become elevated, but that
> is fine since users of transient refcounts don't actually access
> memory.
>
> For private to shared conversions, there are no refcount checks, since
> the guest is the only user of private pages, and guest_memfd will be the
> only holder of refcounts on private pages.
I think KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES deserves some mention in
the commit log.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> Documentation/virt/kvm/api.rst | 48 +++++++-
> include/linux/kvm_host.h | 10 ++
> include/uapi/linux/kvm.h | 9 +-
> virt/kvm/Kconfig | 1 +
> virt/kvm/guest_memfd.c | 245 ++++++++++++++++++++++++++++++++++++++---
> virt/kvm/kvm_main.c | 17 ++-
> 6 files changed, 300 insertions(+), 30 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 0b61e2579e1d8..15148c80cfdb6 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -117,7 +117,7 @@ description:
> x86 includes both i386 and x86_64.
>
> Type:
> - system, vm, or vcpu.
> + system, vm, vcpu or guest_memfd.
>
> Parameters:
> what parameters are accepted by the ioctl.
> @@ -6557,11 +6557,22 @@ KVM_S390_KEYOP_SSKE
> ---------------------------------
>
> :Capability: KVM_CAP_MEMORY_ATTRIBUTES2
> -:Architectures: x86
> -:Type: vm ioctl
> +:Architectures: all
> +:Type: vm, guest_memfd ioctl
> :Parameters: struct kvm_memory_attributes2 (in/out)
> :Returns: 0 on success, <0 on error
>
> +Errors:
> +
> + ========== ===============================================================
> + EINVAL The specified `offset` or `size` were invalid (e.g. not
> + page aligned, causes an overflow, or size is zero).
> + EFAULT The parameter address was invalid.
> + EAGAIN Some page within requested range had unexpected refcounts. The
> + offset of the page will be returned in `error_offset`.
> + ENOMEM Ran out of memory trying to track private/shared state
> + ========== ===============================================================
> +
> KVM_SET_MEMORY_ATTRIBUTES2 is an extension to
> KVM_SET_MEMORY_ATTRIBUTES that supports returning (writing) values to
> userspace. The original (pre-extension) fields are shared with
> @@ -6572,15 +6583,42 @@ Attribute values are shared with KVM_SET_MEMORY_ATTRIBUTES.
> ::
>
> struct kvm_memory_attributes2 {
> - __u64 address;
> + /* in */
> + union {
> + __u64 address;
> + __u64 offset;
> + };
> __u64 size;
> __u64 attributes;
> __u64 flags;
> - __u64 reserved[12];
> + /* out */
> + __u64 error_offset;
> + __u64 reserved[11];
> };
>
> #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
>
> +Set attributes for a range of offsets within a guest_memfd to
> +KVM_MEMORY_ATTRIBUTE_PRIVATE to limit the specified guest_memfd backed
> +memory range for guest_use. Even if KVM_CAP_GUEST_MEMFD_MMAP is
> +supported, after a successful call to set
> +KVM_MEMORY_ATTRIBUTE_PRIVATE, the requested range will not be mappable
> +into host userspace and will only be mappable by the guest.
> +
> +To allow the range to be mappable into host userspace again, call
> +KVM_SET_MEMORY_ATTRIBUTES2 on the guest_memfd again with
> +KVM_MEMORY_ATTRIBUTE_PRIVATE unset.
> +
> +If this ioctl returns -EAGAIN, the offset of the page with unexpected
> +refcounts will be returned in `error_offset`. This can occur if there
> +are transient refcounts on the pages, taken by other parts of the
> +kernel.
That's only true for the guest_memfd ioctl, for KVM ioctl these new
fields and r/w behavior are basically ignored. So you might need to be
clearer on which fields/behavior are specific to guest_memfd like in
the preceeding paragraphs..
..or maybe it's better to do the opposite and just have a blanket 'for
now, all newly-described behavior pertains only to usage via a
guest_memfd ioctl, and for KVM ioctls only the fields/behaviors common
with KVM_SET_MEMORY_ATTRIBUTES are applicable.', since it doesn't seem
like vm_memory_attributes=1 is long for this world and that's the only
case where KVM memory attribute ioctls seem relevant.
But then it makes me wonder, if we adopt the semantics I mentioned
earlier and have KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES advertise both
the gmem ioctl support as well as the struct kvm_memory_attributes2
support, if we should even advertise KVM_CAP_MEMORY_ATTRIBUTES2 at all
as part of this series.
> +
> +Userspace is expected to figure out how to remove all known refcounts
> +on the shared pages, such as refcounts taken by get_user_pages(), and
> +try the ioctl again. A possible source of these long term refcounts is
> +if the guest_memfd memory was pinned in IOMMU page tables.
One might read this to mean error_offset is used purely for the EAGAIN
case, so it might be worth touching on the other cases as well.
-Mike
> +
> See also: :ref: `KVM_SET_MEMORY_ATTRIBUTES`.
>
> .. _kvm_run:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 19f026f8de390..1ea14c66fc82e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2514,6 +2514,16 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
> }
>
> #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> +static inline u64 kvm_supported_mem_attributes(struct kvm *kvm)
> +{
> +#ifdef kvm_arch_has_private_mem
> + if (!kvm || kvm_arch_has_private_mem(kvm))
> + return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +#endif
> +
> + return 0;
> +}
> +
> typedef unsigned long (kvm_get_memory_attributes_t)(struct kvm *kvm, gfn_t gfn);
> DECLARE_STATIC_CALL(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 16567d4a769e5..29baaa60de35a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -990,6 +990,7 @@ struct kvm_enable_cap {
> #define KVM_CAP_S390_USER_OPEREXEC 246
> #define KVM_CAP_S390_KEYOP 247
> #define KVM_CAP_MEMORY_ATTRIBUTES2 248
> +#define KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES 249
>
> struct kvm_irq_routing_irqchip {
> __u32 irqchip;
> @@ -1642,11 +1643,15 @@ struct kvm_memory_attributes {
> #define KVM_SET_MEMORY_ATTRIBUTES2 _IOWR(KVMIO, 0xd2, struct kvm_memory_attributes2)
>
> struct kvm_memory_attributes2 {
> - __u64 address;
> + union {
> + __u64 address;
> + __u64 offset;
> + };
> __u64 size;
> __u64 attributes;
> __u64 flags;
> - __u64 reserved[12];
> + __u64 error_offset;
> + __u64 reserved[11];
> };
>
> #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 3fea89c45cfb4..e371e079e2c50 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -109,6 +109,7 @@ config KVM_VM_MEMORY_ATTRIBUTES
>
> config KVM_GUEST_MEMFD
> select XARRAY_MULTI
> + select KVM_MEMORY_ATTRIBUTES
> bool
>
> config HAVE_KVM_ARCH_GMEM_PREPARE
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index d414ebfcb4c19..0cff9a85a4c53 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -183,10 +183,12 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>
> static enum kvm_gfn_range_filter kvm_gmem_get_invalidate_filter(struct inode *inode)
> {
> - if (GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED)
> - return KVM_FILTER_SHARED;
> -
> - return KVM_FILTER_PRIVATE;
> + /*
> + * TODO: Limit invalidations based on the to-be-invalidated range, i.e.
> + * invalidate shared/private if and only if there can possibly be
> + * such mappings.
> + */
> + return KVM_FILTER_SHARED | KVM_FILTER_PRIVATE;
> }
>
> static void __kvm_gmem_invalidate_begin(struct gmem_file *f, pgoff_t start,
> @@ -552,11 +554,235 @@ unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> }
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_memory_attributes);
>
> +static bool kvm_gmem_range_has_attributes(struct maple_tree *mt,
> + pgoff_t index, size_t nr_pages,
> + u64 attributes)
> +{
> + pgoff_t end = index + nr_pages - 1;
> + void *entry;
> +
> + lockdep_assert(mt_lock_is_held(mt));
> +
> + mt_for_each(mt, entry, index, end) {
> + if (xa_to_value(entry) != attributes)
> + return false;
> + }
> +
> + return true;
> +}
> +
> +static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> + size_t nr_pages, pgoff_t *err_index)
> +{
> + struct address_space *mapping = inode->i_mapping;
> + const int filemap_get_folios_refcount = 1;
> + pgoff_t last = start + nr_pages - 1;
> + struct folio_batch fbatch;
> + bool safe = true;
> + int i;
> +
> + folio_batch_init(&fbatch);
> + while (safe && filemap_get_folios(mapping, &start, last, &fbatch)) {
> +
> + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> + struct folio *folio = fbatch.folios[i];
> +
> + if (folio_ref_count(folio) !=
> + folio_nr_pages(folio) + filemap_get_folios_refcount) {
> + safe = false;
> + *err_index = folio->index;
> + break;
> + }
> + }
> +
> + folio_batch_release(&fbatch);
> + cond_resched();
> + }
> +
> + return safe;
> +}
> +
> +/*
> + * Preallocate memory for attributes to be stored on a maple tree, pointed to
> + * by mas. Adjacent ranges with attributes identical to the new attributes
> + * will be merged. Also sets mas's bounds up for storing attributes.
> + *
> + * This maintains the invariant that ranges with the same attributes will
> + * always be merged.
> + */
> +static int kvm_gmem_mas_preallocate(struct ma_state *mas, u64 attributes,
> + pgoff_t start, size_t nr_pages)
> +{
> + pgoff_t end = start + nr_pages;
> + pgoff_t last = end - 1;
> + void *entry;
> +
> + /* Try extending range. entry is NULL on overflow/wrap-around. */
> + mas_set_range(mas, end, end);
> + entry = mas_find(mas, end);
> + if (entry && xa_to_value(entry) == attributes)
> + last = mas->last;
> +
> + if (start > 0) {
> + mas_set_range(mas, start - 1, start - 1);
> + entry = mas_find(mas, start - 1);
> + if (entry && xa_to_value(entry) == attributes)
> + start = mas->index;
> + }
> +
> + mas_set_range(mas, start, last);
> + return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
> +}
> +
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> + struct folio_batch fbatch;
> + pgoff_t next = start;
> + int i;
> +
> + folio_batch_init(&fbatch);
> + while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
> + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> + struct folio *folio = fbatch.folios[i];
> + unsigned long pfn = folio_pfn(folio);
> +
> + kvm_arch_gmem_invalidate(pfn, pfn + folio_nr_pages(folio));
> + }
> +
> + folio_batch_release(&fbatch);
> + cond_resched();
> + }
> +}
> +#else
> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
> +#endif
> +
> +static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> + size_t nr_pages, uint64_t attrs,
> + pgoff_t *err_index)
> +{
> + bool to_private = attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> + struct address_space *mapping = inode->i_mapping;
> + struct gmem_inode *gi = GMEM_I(inode);
> + pgoff_t end = start + nr_pages;
> + struct maple_tree *mt;
> + struct ma_state mas;
> + int r;
> +
> + mt = &gi->attributes;
> +
> + filemap_invalidate_lock(mapping);
> +
> + mas_init(&mas, mt, start);
> +
> + if (kvm_gmem_range_has_attributes(mt, start, nr_pages, attrs)) {
> + r = 0;
> + goto out;
> + }
> +
> + r = kvm_gmem_mas_preallocate(&mas, attrs, start, nr_pages);
> + if (r) {
> + *err_index = start;
> + goto out;
> + }
> +
> + if (to_private) {
> + unmap_mapping_pages(mapping, start, nr_pages, false);
> +
> + if (!kvm_gmem_is_safe_for_conversion(inode, start, nr_pages,
> + err_index)) {
> + mas_destroy(&mas);
> + r = -EAGAIN;
> + goto out;
> + }
> + }
> +
> + /*
> + * From this point on guest_memfd has performed necessary
> + * checks and can proceed to do guest-breaking changes.
> + */
> +
> + kvm_gmem_invalidate_begin(inode, start, end);
> +
> + if (!to_private)
> + kvm_gmem_invalidate(inode, start, end);
> +
> + mas_store_prealloc(&mas, xa_mk_value(attrs));
> +
> + kvm_gmem_invalidate_end(inode, start, end);
> +out:
> + filemap_invalidate_unlock(mapping);
> + return r;
> +}
> +
> +static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
> +{
> + struct gmem_file *f = file->private_data;
> + struct inode *inode = file_inode(file);
> + struct kvm_memory_attributes2 attrs;
> + pgoff_t err_index;
> + size_t nr_pages;
> + pgoff_t index;
> + int i, r;
> +
> + if (copy_from_user(&attrs, argp, sizeof(attrs)))
> + return -EFAULT;
> +
> + if (attrs.flags)
> + return -EINVAL;
> + if (attrs.error_offset)
> + return -EINVAL;
> + for (i = 0; i < ARRAY_SIZE(attrs.reserved); i++) {
> + if (attrs.reserved[i])
> + return -EINVAL;
> + }
> + if (attrs.attributes & ~kvm_supported_mem_attributes(f->kvm))
> + return -EINVAL;
> + if (attrs.size == 0 || attrs.offset + attrs.size < attrs.offset)
> + return -EINVAL;
> + if (!PAGE_ALIGNED(attrs.offset) || !PAGE_ALIGNED(attrs.size))
> + return -EINVAL;
> +
> + if (attrs.offset >= inode->i_size ||
> + attrs.offset + attrs.size > inode->i_size)
> + return -EINVAL;
> +
> + nr_pages = attrs.size >> PAGE_SHIFT;
> + index = attrs.offset >> PAGE_SHIFT;
> + r = __kvm_gmem_set_attributes(inode, index, nr_pages, attrs.attributes,
> + &err_index);
> + if (r) {
> + attrs.error_offset = ((uint64_t)err_index) << PAGE_SHIFT;
> +
> + if (copy_to_user(argp, &attrs, sizeof(attrs)))
> + return -EFAULT;
> + }
> +
> + return r;
> +}
> +
> +static long kvm_gmem_ioctl(struct file *file, unsigned int ioctl,
> + unsigned long arg)
> +{
> + switch (ioctl) {
> + case KVM_SET_MEMORY_ATTRIBUTES2:
> + if (vm_memory_attributes)
> + return -ENOTTY;
> +
> + return kvm_gmem_set_attributes(file, (void __user *)arg);
> + default:
> + return -ENOTTY;
> + }
> +}
> +
> +
> static struct file_operations kvm_gmem_fops = {
> .mmap = kvm_gmem_mmap,
> .open = generic_file_open,
> .release = kvm_gmem_release,
> .fallocate = kvm_gmem_fallocate,
> + .unlocked_ioctl = kvm_gmem_ioctl,
> };
>
> static int kvm_gmem_migrate_folio(struct address_space *mapping,
> @@ -942,20 +1168,13 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
> static bool kvm_gmem_range_is_private(struct gmem_inode *gi, pgoff_t index,
> size_t nr_pages, struct kvm *kvm, gfn_t gfn)
> {
> - pgoff_t end = index + nr_pages - 1;
> - void *entry;
> -
> if (vm_memory_attributes)
> return kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + nr_pages,
> KVM_MEMORY_ATTRIBUTE_PRIVATE,
> KVM_MEMORY_ATTRIBUTE_PRIVATE);
>
> - mt_for_each(&gi->attributes, entry, index, end) {
> - if (xa_to_value(entry) != KVM_MEMORY_ATTRIBUTE_PRIVATE)
> - return false;
> - }
> -
> - return true;
> + return kvm_gmem_range_has_attributes(&gi->attributes, index, nr_pages,
> + KVM_MEMORY_ATTRIBUTE_PRIVATE);
> }
>
> static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 3c261904322f0..85c14197587d4 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2435,16 +2435,6 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>
> #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> -static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> -{
> -#ifdef kvm_arch_has_private_mem
> - if (!kvm || kvm_arch_has_private_mem(kvm))
> - return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> -#endif
> -
> - return 0;
> -}
> -
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
> {
> @@ -2635,6 +2625,8 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> return -EINVAL;
> if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> return -EINVAL;
> + if (attrs->error_offset)
> + return -EINVAL;
> for (i = 0; i < ARRAY_SIZE(attrs->reserved); i++) {
> if (attrs->reserved[i])
> return -EINVAL;
> @@ -4983,6 +4975,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> return 1;
> case KVM_CAP_GUEST_MEMFD_FLAGS:
> return kvm_gmem_get_supported_flags(kvm);
> + case KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES:
> + if (vm_memory_attributes)
> + return 0;
> +
> + return kvm_supported_mem_attributes(kvm);
> #endif
> default:
> break;
>
> --
> 2.53.0.1018.g2bb0e51243-goog
>
^ permalink raw reply
* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Masami Hiramatsu @ 2026-04-01 1:02 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Breno Leitao, Jonathan Corbet, Shuah Khan, linux-kernel,
linux-trace-kernel, linux-doc, oss, paulmck, rostedt, kernel-team
In-Reply-To: <acueCFv4neO7zQGI@thinkstation>
On Tue, 31 Mar 2026 11:18:33 +0100
Kiryl Shutsemau <kirill@shutemov.name> wrote:
> On Wed, Mar 25, 2026 at 11:22:04PM +0900, Masami Hiramatsu wrote:
> > > diff --git a/init/main.c b/init/main.c
> > > index 453ac9dff2da0..14a04c283fa48 100644
> > > --- a/init/main.c
> > > +++ b/init/main.c
> > > @@ -416,9 +416,64 @@ static int __init warn_bootconfig(char *str)
> > > return 0;
> > > }
> > >
> > > +/*
> > > + * do_early_param() is defined later in this file but called from
> > > + * bootconfig_apply_early_params() below, so we need a forward declaration.
> > > + */
> > > +static int __init do_early_param(char *param, char *val,
> > > + const char *unused, void *arg);
> > > +
> > > +/*
> > > + * bootconfig_apply_early_params - dispatch kernel.* keys from the embedded
> > > + * bootconfig as early_param() calls.
> > > + *
> > > + * early_param() handlers must run before most of the kernel initialises
> > > + * (e.g. before the GIC driver reads irqchip.gicv3_pseudo_nmi). A bootconfig
> > > + * attached to the initrd arrives too late for this because the initrd is not
> > > + * mapped yet when early params are processed. The embedded bootconfig lives
> > > + * in the kernel image itself (.init.data), so it is always reachable.
> > > + *
> > > + * This function is called from setup_boot_config() which runs in
> > > + * start_kernel() before parse_early_param(), making the timing correct.
> > > + */
> > > +static void __init bootconfig_apply_early_params(void)
> >
> > [sashiko comment]
> > | Does this run early enough for architectural parameters?
> > | While setup_boot_config() runs before parse_early_param() in start_kernel(),
> > | it runs after setup_arch(). setup_boot_config() relies on xbc_init() which
> > | uses the memblock allocator, requiring setup_arch() to have already
> > | initialized it.
> > | However, the kernel expects many early parameters (like mem=, earlycon,
> > | noapic, and iommu) to be parsed during setup_arch() via the architecture's
> > | call to parse_early_param(). Since setup_arch() completes before
> > | setup_boot_config() runs, will these architectural early parameters be
> > | silently ignored because the decisions they influence were already
> > | finalized?
> >
> > This is the major reason that I did not support early parameter
> > in bootconfig. Some archs initialize kernel_cmdline in setup_arch()
> > and setup early parameters in it.
> > To fix this, we need to change setup_arch() for each architecture so
> > that it calls this bootconfig_apply_early_params().
>
> Hi Masami,
>
> I have a question about bootconfig design. Is it necessary to parse the
> bootconfig at boot time?
>
> We could avoid a lot of complexity if we flattened the bootconfig into a
> simple key-value list before embedding it in the kernel image or
> attaching it to the initrd. That would eliminate the need for memory
> allocations and allow the config to be used earlier during boot.
Hi Kiryl,
Thanks for a good question.
If it is embedded in the kernel, yes, we can compile it. But if it is
attached to initrd, the kernel needs to verify it. So anyway we have to
parse the key-value. Of course we can make it a binary data structure,
but it still needs verification. Moreover, memory allocation is not
required by design (the first design uses a statically allocated data).
Even if it is normalized as key-value, we can not trust the contents
without outer verification system, like verified boot. Therefore, our
basic strategy is to parse the text.
However, if the content can be verified externally, for example, if a
verified boot program verifies the compiled boot configuration, then
it would be possible to use that directly.
Thank you,
>
> What am I missing?
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [PATCH 0/2] Fix trace remotes read with an offline CPU
From: Vincent Donnefort @ 2026-04-01 2:50 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel, maz
Cc: kernel-team, linux-kernel, Vincent Donnefort
This small series is fixing non-consuming read of a trace remote when the
trace_buffer is created after a CPU is offline.
It also extends hotplug testing coverage to include this test case.
I have based this series on top of kvmarm/next which contains the hypervisor
tracing patches.
Vincent Donnefort (2):
tracing: Non-consuming read for trace remotes with an offline CPU
tracing: selftests: Extend hotplug testing for trace remotes
kernel/trace/trace_remote.c | 21 ++++-
.../selftests/ftrace/test.d/remotes/functions | 15 +++-
.../ftrace/test.d/remotes/hotplug.tc | 86 +++++++++++++++++++
.../test.d/remotes/hypervisor/hotplug.tc | 11 +++
.../selftests/ftrace/test.d/remotes/trace.tc | 27 +-----
.../ftrace/test.d/remotes/trace_pipe.tc | 25 ------
6 files changed, 129 insertions(+), 56 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc
base-commit: 5ad2ff071b5980f072a85c8114649218971c586e
--
2.53.0.1118.gaef5881109-goog
^ permalink raw reply
* [PATCH 1/2] tracing: Non-consuming read for trace remotes with an offline CPU
From: Vincent Donnefort @ 2026-04-01 2:50 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel, maz
Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260401025003.3258729-1-vdonnefort@google.com>
When a trace_buffer is created while a CPU is offline, this CPU is
cleared from the trace_buffer CPU mask, preventing the creation of a
non-consuming iterator (ring_buffer_iter). For trace remotes, it means
the iterator fails to be allocated (-ENOMEM) even though there are
available ring buffers in the trace_buffer.
For non-consuming reads of trace remotes, skip missing ring_buffer_iter
to allow reading the available ring buffers.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/kernel/trace/trace_remote.c b/kernel/trace/trace_remote.c
index 0d78e5f5fe98..39d5414c1723 100644
--- a/kernel/trace/trace_remote.c
+++ b/kernel/trace/trace_remote.c
@@ -282,6 +282,14 @@ static void trace_remote_put(struct trace_remote *remote)
trace_remote_try_unload(remote);
}
+static bool trace_remote_has_cpu(struct trace_remote *remote, int cpu)
+{
+ if (cpu == RING_BUFFER_ALL_CPUS)
+ return true;
+
+ return ring_buffer_poll_remote(remote->trace_buffer, cpu) == 0;
+}
+
static void __poll_remote(struct work_struct *work)
{
struct delayed_work *dwork = to_delayed_work(work);
@@ -324,6 +332,10 @@ static int __alloc_ring_buffer_iter(struct trace_remote_iterator *iter, int cpu)
iter->rb_iters[cpu] = ring_buffer_read_start(iter->remote->trace_buffer, cpu,
GFP_KERNEL);
if (!iter->rb_iters[cpu]) {
+ /* This CPU isn't part of trace_buffer. Skip it */
+ if (!trace_remote_has_cpu(iter->remote, cpu))
+ continue;
+
__free_ring_buffer_iter(iter, RING_BUFFER_ALL_CPUS);
return -ENOMEM;
}
@@ -347,10 +359,10 @@ static struct trace_remote_iterator
if (ret)
return ERR_PTR(ret);
- /* Test the CPU */
- ret = ring_buffer_poll_remote(remote->trace_buffer, cpu);
- if (ret)
+ if (!trace_remote_has_cpu(remote, cpu)) {
+ ret = -ENODEV;
goto err;
+ }
iter = kzalloc_obj(*iter);
if (iter) {
@@ -476,6 +488,9 @@ __peek_event(struct trace_remote_iterator *iter, int cpu, u64 *ts, unsigned long
return ring_buffer_peek(iter->remote->trace_buffer, cpu, ts, lost_events);
case TRI_NONCONSUMING:
rb_iter = __get_rb_iter(iter, cpu);
+ if (!rb_iter)
+ return NULL;
+
rb_evt = ring_buffer_iter_peek(rb_iter, ts);
if (!rb_evt)
return NULL;
--
2.53.0.1118.gaef5881109-goog
^ permalink raw reply related
* [PATCH 2/2] tracing: selftests: Extend hotplug testing for trace remotes
From: Vincent Donnefort @ 2026-04-01 2:50 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel, maz
Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260401025003.3258729-1-vdonnefort@google.com>
The hotplug testing only tries reading a trace remote buffer, loaded
before a CPU is offline. Extend this testing to cover:
* A trace remote buffer loaded after a CPU is offline.
* A trace remote buffer loaded before a CPU is online.
Because of these added test cases, move the hotplug testing into a
separate hotplug.tc file.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/functions b/tools/testing/selftests/ftrace/test.d/remotes/functions
index 97a09d564a34..ca531a0336dc 100644
--- a/tools/testing/selftests/ftrace/test.d/remotes/functions
+++ b/tools/testing/selftests/ftrace/test.d/remotes/functions
@@ -32,6 +32,15 @@ assert_unloaded()
grep -q "(unloaded)" buffer_size_kb
}
+reload_remote()
+{
+ echo 0 > tracing_on
+ clear_trace
+ assert_unloaded
+ echo 1 > tracing_on
+ assert_loaded
+}
+
dump_trace_pipe()
{
output=$(mktemp $TMPDIR/remote_test.XXXXXX)
@@ -79,10 +88,12 @@ get_cpu_ids()
sed -n 's/^processor\s*:\s*\([0-9]\+\).*/\1/p' /proc/cpuinfo
}
-get_page_size() {
+get_page_size()
+{
sed -ne 's/^.*data.*size:\([0-9][0-9]*\).*/\1/p' events/header_page
}
-get_selftest_event_size() {
+get_selftest_event_size()
+{
sed -ne 's/^.*field:.*;.*size:\([0-9][0-9]*\);.*/\1/p' events/*/selftest/format | awk '{s+=$1} END {print s}'
}
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc b/tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc
new file mode 100644
index 000000000000..86b55824d631
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc
@@ -0,0 +1,86 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test trace remote read with an offline CPU
+# requires: remotes/test
+
+. $TEST_DIR/remotes/functions
+
+hotunplug_one_cpu()
+{
+ [ "$(get_cpu_ids | wc -l)" -ge 2 ] || return 1
+
+ for cpu in $(get_cpu_ids); do
+ echo 0 > /sys/devices/system/cpu/cpu$cpu/online || return 1
+ break
+ done
+
+ echo $cpu
+}
+
+# Check non-consuming and consuming read
+check_read()
+{
+ for i in $(seq 1 8); do
+ echo $i > write_event
+ done
+
+ check_trace 1 8 trace
+
+ output=$(dump_trace_pipe)
+ check_trace 1 8 $output
+ rm $output
+}
+
+test_hotplug()
+{
+ echo 0 > trace
+ assert_loaded
+
+ #
+ # Test a trace buffer containing an offline CPU
+ #
+
+ cpu=$(hotunplug_one_cpu) || exit_unsupported
+
+ check_read
+
+ #
+ # Test a trace buffer with a missing CPU
+ #
+
+ reload_remote
+
+ check_read
+
+ #
+ # Test a trace buffer with a CPU added later
+ #
+
+ echo 1 > /sys/devices/system/cpu/cpu$cpu/online
+ assert_loaded
+
+ check_read
+
+ # Test if the ring-buffer for the newly added CPU is both writable and
+ # readable
+ for i in $(seq 1 8); do
+ taskset -c $cpu echo $i > write_event
+ done
+
+ cd per_cpu/cpu$cpu/
+
+ check_trace 1 8 trace
+
+ output=$(dump_trace_pipe)
+ check_trace 1 8 $output
+ rm $output
+
+ cd -
+}
+
+if [ -z "$SOURCE_REMOTE_TEST" ]; then
+ set -e
+
+ setup_remote_test
+ test_hotplug
+fi
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc
new file mode 100644
index 000000000000..580ec32c8f81
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc
@@ -0,0 +1,11 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test hypervisor trace read with an offline CPU
+# requires: remotes/hypervisor/write_event
+
+SOURCE_REMOTE_TEST=1
+. $TEST_DIR/remotes/hotplug.tc
+
+set -e
+setup_remote "hypervisor"
+test_hotplug
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/trace.tc b/tools/testing/selftests/ftrace/test.d/remotes/trace.tc
index 170f7648732a..bc9377a70e8d 100644
--- a/tools/testing/selftests/ftrace/test.d/remotes/trace.tc
+++ b/tools/testing/selftests/ftrace/test.d/remotes/trace.tc
@@ -58,11 +58,7 @@ test_trace()
#
# Ensure the writer is not on the reader page by reloading the buffer
- echo 0 > tracing_on
- echo 0 > trace
- assert_unloaded
- echo 1 > tracing_on
- assert_loaded
+ reload_remote
# Ensure ring-buffer overflow by emitting events from the same CPU
for cpu in $(get_cpu_ids); do
@@ -96,27 +92,6 @@ test_trace()
cd - > /dev/null
done
-
- #
- # Test with hotplug
- #
-
- [ "$(get_cpu_ids | wc -l)" -ge 2 ] || return 0
-
- echo 0 > trace
-
- for cpu in $(get_cpu_ids); do
- echo 0 > /sys/devices/system/cpu/cpu$cpu/online || return 0
- break
- done
-
- for i in $(seq 1 8); do
- echo $i > write_event
- done
-
- check_trace 1 8 trace
-
- echo 1 > /sys/devices/system/cpu/cpu$cpu/online
}
if [ -z "$SOURCE_REMOTE_TEST" ]; then
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc b/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc
index 669a7288ed7c..7f7b7b79c490 100644
--- a/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc
+++ b/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc
@@ -92,31 +92,6 @@ test_trace_pipe()
rm $output
cd - > /dev/null
done
-
- #
- # Test interaction with hotplug
- #
-
- [ "$(get_cpu_ids | wc -l)" -ge 2 ] || return 0
-
- echo 0 > trace
-
- for cpu in $(get_cpu_ids); do
- echo 0 > /sys/devices/system/cpu/cpu$cpu/online || return 0
- break
- done
-
- for i in $(seq 1 8); do
- echo $i > write_event
- done
-
- output=$(dump_trace_pipe)
-
- check_trace 1 8 $output
-
- rm $output
-
- echo 1 > /sys/devices/system/cpu/cpu$cpu/online
}
if [ -z "$SOURCE_REMOTE_TEST" ]; then
--
2.53.0.1118.gaef5881109-goog
^ permalink raw reply related
* Re: [PATCH v9 3/3] tracing/Documentation: Add a section about backup instance
From: Masami Hiramatsu @ 2026-04-01 3:15 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260331172105.609d59aa@gandalf.local.home>
On Tue, 31 Mar 2026 17:21:05 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Wed, 1 Apr 2026 01:32:41 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
>
> > From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> >
> > Add a section about backup instance to the debugging.rst.
> >
> > Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > ---
> > Changes in v6:
> > - Fix typos.
> > ---
> > Documentation/trace/debugging.rst | 19 +++++++++++++++++++
> > 1 file changed, 19 insertions(+)
> >
> > diff --git a/Documentation/trace/debugging.rst b/Documentation/trace/debugging.rst
> > index 4d88c346fc38..15857951b506 100644
> > --- a/Documentation/trace/debugging.rst
> > +++ b/Documentation/trace/debugging.rst
> > @@ -159,3 +159,22 @@ If setting it from the kernel command line, it is recommended to also
> > disable tracing with the "traceoff" flag, and enable tracing after boot up.
> > Otherwise the trace from the most recent boot will be mixed with the trace
> > from the previous boot, and may make it confusing to read.
> > +
> > +Using a backup instance for keeping previous boot data
> > +------------------------------------------------------
> > +
> > +It is also possible to record trace data at system boot time by specifying
> > +events with the persistent ring buffer, but in this case the data before the
> > +reboot will be lost before it can be read. This problem can be solved by a
> > +backup instance. From the kernel command line::
> > +
> > + reserve_mem=12M:4096:trace trace_instance=boot_map@trace,sched,irq trace_instance=backup=boot_map
>
> The above didn't actually work well without my patch to enable events on
> the persistent ring buffer with the backup. But keep it, as it will work
> with my patch ;-)
Ah, thanks!
Let me rebase on it.
>
> -- Steve
>
>
> > +
> > +On boot up, the previous data in the "boot_map" is copied to the "backup"
> > +instance, and the "sched:*" and "irq:*" events for the current boot are traced
> > +in the "boot_map". Thus the user can read the previous boot data from the "backup"
> > +instance without stopping the trace.
> > +
> > +Note that this "backup" instance is readonly, and will be removed automatically
> > +if you clear the trace data or read out all trace data from the "trace_pipe"
> > +or the "trace_pipe_raw" files.
> > \ No newline at end of file
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v9 2/3] tracing: Remove the backup instance automatically after read
From: Masami Hiramatsu @ 2026-04-01 3:19 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260331171936.6f84e357@gandalf.local.home>
On Tue, 31 Mar 2026 17:19:36 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Wed, 1 Apr 2026 01:32:33 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
>
> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > index 8cec7bd70438..1d73400a01c7 100644
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -539,8 +539,65 @@ void trace_set_ring_buffer_expanded(struct trace_array *tr)
> > tr->ring_buffer_expanded = true;
> > }
> >
> > +static int __remove_instance(struct trace_array *tr);
> > +
> > +static void trace_array_autoremove(struct work_struct *work)
> > +{
> > + struct trace_array *tr = container_of(work, struct trace_array, autoremove_work);
> > +
> > + guard(mutex)(&event_mutex);
> > + guard(mutex)(&trace_types_lock);
>
> Hmm, should we do a check if the tr still exists? Couldn't the user delete
> this via a rmdir after the last file closed and this was kicked?
>
> CPU 0 CPU 1
> ----- -----
> open(trace_pipe);
> read(..);
> close(trace_pipe);
> kick the work queue to delete it....
> rmdir();
> [instance deleted]
I thought this requires trace_types_lock, and after kicked the queue,
can rmdir() gets the tr? (__trace_array_get() return error if
tr->free_on_close is set)
>
> __remove_instance();
>
> [ now the tr is freed, and the remove will crash!]
>
>
> What would prevent this is this is to use trace_array_destroy() that checks
> this and also adds the proper locking:
>
> static void trace_array_autoremove(struct work_struct *work)
> {
> struct trace_array *tr = container_of(work, struct trace_array, autoremove_work);
>
> trace_array_destroy(tr);
> }
OK, let's use it.
>
>
> > +
> > + /*
> > + * This can be fail if someone gets @tr before starting this
> > + * function, but in that case, this will be kicked again when
> > + * putting it. So we don't care about the result.
> > + */
> > + __remove_instance(tr);
> > +}
> > +
> > +static struct workqueue_struct *autoremove_wq;
> > +
> > +static void trace_array_kick_autoremove(struct trace_array *tr)
> > +{
> > + if (autoremove_wq && !work_pending(&tr->autoremove_work))
> > + queue_work(autoremove_wq, &tr->autoremove_work);
>
> Doesn't queue_work() check if it's pending? Do we really need to check it
> twice?
Indeed, it checked the flag.
>
> > +}
> > +
> > +static void trace_array_cancel_autoremove(struct trace_array *tr)
> > +{
> > + if (autoremove_wq && work_pending(&tr->autoremove_work))
> > + cancel_work(&tr->autoremove_work);
>
> Same here, as can't this be racy?
Yeah, and this should use cancel_work_sync().
>
> > +}
> > +
> > +static void trace_array_init_autoremove(struct trace_array *tr)
> > +{
> > + INIT_WORK(&tr->autoremove_work, trace_array_autoremove);
> > +}
> > +
> > +static void trace_array_start_autoremove(void)
> > +{
> > + if (autoremove_wq)
> > + return;
> > +
> > + autoremove_wq = alloc_workqueue("tr_autoremove_wq",
> > + WQ_UNBOUND | WQ_HIGHPRI, 0);
> > + if (!autoremove_wq)
> > + pr_warn("Unable to allocate tr_autoremove_wq. autoremove
> > disabled.\n"); +}
> > +
> > LIST_HEAD(ftrace_trace_arrays);
>
> -- Steve
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [PATCH v2 0/2] Fix trace remotes read with an offline CPU
From: Vincent Donnefort @ 2026-04-01 4:50 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel, maz
Cc: kernel-team, linux-kernel, Vincent Donnefort
This small series is fixing non-consuming read of a trace remote when the
trace_buffer is created after a CPU is offline.
It also extends hotplug testing coverage to include this test case.
I have based this series on top of kvmarm/next which contains the hypervisor
tracing patches.
Changes in v2:
* Add a trap to restore offlined CPU on test failure in hotplug.tc.
* Fix assert_loaded/assert_unloaded error propagation.
* Ensure we probe for existing events when setting up consuming read
iterator.
v1: https://lore.kernel.org/all/20260401025003.3258729-1-vdonnefort@google.com/
Vincent Donnefort (2):
tracing: Non-consuming read for trace remotes with an offline CPU
tracing: selftests: Extend hotplug testing for trace remotes
kernel/trace/trace_remote.c | 22 ++++-
.../selftests/ftrace/test.d/remotes/functions | 19 +++-
.../ftrace/test.d/remotes/hotplug.tc | 88 +++++++++++++++++++
.../test.d/remotes/hypervisor/hotplug.tc | 11 +++
.../selftests/ftrace/test.d/remotes/trace.tc | 27 +-----
.../ftrace/test.d/remotes/trace_pipe.tc | 25 ------
6 files changed, 134 insertions(+), 58 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc
base-commit: 5ad2ff071b5980f072a85c8114649218971c586e
--
2.53.0.1118.gaef5881109-goog
^ permalink raw reply
* [PATCH v2 1/2] tracing: Non-consuming read for trace remotes with an offline CPU
From: Vincent Donnefort @ 2026-04-01 4:50 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel, maz
Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260401045100.3394299-1-vdonnefort@google.com>
When a trace_buffer is created while a CPU is offline, this CPU is
cleared from the trace_buffer CPU mask, preventing the creation of a
non-consuming iterator (ring_buffer_iter). For trace remotes, it means
the iterator fails to be allocated (-ENOMEM) even though there are
available ring buffers in the trace_buffer.
For non-consuming reads of trace remotes, skip missing ring_buffer_iter
to allow reading the available ring buffers.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/kernel/trace/trace_remote.c b/kernel/trace/trace_remote.c
index 0d78e5f5fe98..d6c3f94d67cd 100644
--- a/kernel/trace/trace_remote.c
+++ b/kernel/trace/trace_remote.c
@@ -282,6 +282,14 @@ static void trace_remote_put(struct trace_remote *remote)
trace_remote_try_unload(remote);
}
+static bool trace_remote_has_cpu(struct trace_remote *remote, int cpu)
+{
+ if (cpu == RING_BUFFER_ALL_CPUS)
+ return true;
+
+ return ring_buffer_poll_remote(remote->trace_buffer, cpu) == 0;
+}
+
static void __poll_remote(struct work_struct *work)
{
struct delayed_work *dwork = to_delayed_work(work);
@@ -324,6 +332,10 @@ static int __alloc_ring_buffer_iter(struct trace_remote_iterator *iter, int cpu)
iter->rb_iters[cpu] = ring_buffer_read_start(iter->remote->trace_buffer, cpu,
GFP_KERNEL);
if (!iter->rb_iters[cpu]) {
+ /* This CPU isn't part of trace_buffer. Skip it */
+ if (!trace_remote_has_cpu(iter->remote, cpu))
+ continue;
+
__free_ring_buffer_iter(iter, RING_BUFFER_ALL_CPUS);
return -ENOMEM;
}
@@ -347,10 +359,10 @@ static struct trace_remote_iterator
if (ret)
return ERR_PTR(ret);
- /* Test the CPU */
- ret = ring_buffer_poll_remote(remote->trace_buffer, cpu);
- if (ret)
+ if (!trace_remote_has_cpu(remote, cpu)) {
+ ret = -ENODEV;
goto err;
+ }
iter = kzalloc_obj(*iter);
if (iter) {
@@ -361,6 +373,7 @@ static struct trace_remote_iterator
switch (type) {
case TRI_CONSUMING:
+ ring_buffer_poll_remote(remote->trace_buffer, cpu);
INIT_DELAYED_WORK(&iter->poll_work, __poll_remote);
schedule_delayed_work(&iter->poll_work, msecs_to_jiffies(remote->poll_ms));
break;
@@ -476,6 +489,9 @@ __peek_event(struct trace_remote_iterator *iter, int cpu, u64 *ts, unsigned long
return ring_buffer_peek(iter->remote->trace_buffer, cpu, ts, lost_events);
case TRI_NONCONSUMING:
rb_iter = __get_rb_iter(iter, cpu);
+ if (!rb_iter)
+ return NULL;
+
rb_evt = ring_buffer_iter_peek(rb_iter, ts);
if (!rb_evt)
return NULL;
--
2.53.0.1118.gaef5881109-goog
^ permalink raw reply related
* [PATCH v2 2/2] tracing: selftests: Extend hotplug testing for trace remotes
From: Vincent Donnefort @ 2026-04-01 4:51 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel, maz
Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260401045100.3394299-1-vdonnefort@google.com>
The hotplug testing only tries reading a trace remote buffer, loaded
before a CPU is offline. Extend this testing to cover:
* A trace remote buffer loaded after a CPU is offline.
* A trace remote buffer loaded before a CPU is online.
Because of these added test cases, move the hotplug testing into a
separate hotplug.tc file.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/functions b/tools/testing/selftests/ftrace/test.d/remotes/functions
index 97a09d564a34..05224fac3653 100644
--- a/tools/testing/selftests/ftrace/test.d/remotes/functions
+++ b/tools/testing/selftests/ftrace/test.d/remotes/functions
@@ -24,12 +24,21 @@ setup_remote_test()
assert_loaded()
{
- grep -q "(loaded)" buffer_size_kb
+ grep -q "(loaded)" buffer_size_kb || return 1
}
assert_unloaded()
{
- grep -q "(unloaded)" buffer_size_kb
+ grep -q "(unloaded)" buffer_size_kb || return 1
+}
+
+reload_remote()
+{
+ echo 0 > tracing_on
+ clear_trace
+ assert_unloaded
+ echo 1 > tracing_on
+ assert_loaded
}
dump_trace_pipe()
@@ -79,10 +88,12 @@ get_cpu_ids()
sed -n 's/^processor\s*:\s*\([0-9]\+\).*/\1/p' /proc/cpuinfo
}
-get_page_size() {
+get_page_size()
+{
sed -ne 's/^.*data.*size:\([0-9][0-9]*\).*/\1/p' events/header_page
}
-get_selftest_event_size() {
+get_selftest_event_size()
+{
sed -ne 's/^.*field:.*;.*size:\([0-9][0-9]*\);.*/\1/p' events/*/selftest/format | awk '{s+=$1} END {print s}'
}
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc b/tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc
new file mode 100644
index 000000000000..145617eb8061
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc
@@ -0,0 +1,88 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test trace remote read with an offline CPU
+# requires: remotes/test
+
+. $TEST_DIR/remotes/functions
+
+hotunplug_one_cpu()
+{
+ [ "$(get_cpu_ids | wc -l)" -ge 2 ] || return 1
+
+ for cpu in $(get_cpu_ids); do
+ echo 0 > /sys/devices/system/cpu/cpu$cpu/online || return 1
+ break
+ done
+
+ echo $cpu
+}
+
+# Check non-consuming and consuming read
+check_read()
+{
+ for i in $(seq 1 8); do
+ echo $i > write_event
+ done
+
+ check_trace 1 8 trace
+
+ output=$(dump_trace_pipe)
+ check_trace 1 8 $output
+ rm $output
+}
+
+test_hotplug()
+{
+ echo 0 > trace
+ assert_loaded
+
+ #
+ # Test a trace buffer containing an offline CPU
+ #
+
+ cpu=$(hotunplug_one_cpu) || exit_unsupported
+ trap "echo 1 > /sys/devices/system/cpu/cpu$cpu/online" EXIT
+
+ check_read
+
+ #
+ # Test a trace buffer with a missing CPU
+ #
+
+ reload_remote
+
+ check_read
+
+ #
+ # Test a trace buffer with a CPU added later
+ #
+
+ echo 1 > /sys/devices/system/cpu/cpu$cpu/online
+ trap "" EXIT
+ assert_loaded
+
+ check_read
+
+ # Test if the ring-buffer for the newly added CPU is both writable and
+ # readable
+ for i in $(seq 1 8); do
+ taskset -c $cpu echo $i > write_event
+ done
+
+ cd per_cpu/cpu$cpu/
+
+ check_trace 1 8 trace
+
+ output=$(dump_trace_pipe)
+ check_trace 1 8 $output
+ rm $output
+
+ cd -
+}
+
+if [ -z "$SOURCE_REMOTE_TEST" ]; then
+ set -e
+
+ setup_remote_test
+ test_hotplug
+fi
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc
new file mode 100644
index 000000000000..580ec32c8f81
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc
@@ -0,0 +1,11 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test hypervisor trace read with an offline CPU
+# requires: remotes/hypervisor/write_event
+
+SOURCE_REMOTE_TEST=1
+. $TEST_DIR/remotes/hotplug.tc
+
+set -e
+setup_remote "hypervisor"
+test_hotplug
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/trace.tc b/tools/testing/selftests/ftrace/test.d/remotes/trace.tc
index 170f7648732a..bc9377a70e8d 100644
--- a/tools/testing/selftests/ftrace/test.d/remotes/trace.tc
+++ b/tools/testing/selftests/ftrace/test.d/remotes/trace.tc
@@ -58,11 +58,7 @@ test_trace()
#
# Ensure the writer is not on the reader page by reloading the buffer
- echo 0 > tracing_on
- echo 0 > trace
- assert_unloaded
- echo 1 > tracing_on
- assert_loaded
+ reload_remote
# Ensure ring-buffer overflow by emitting events from the same CPU
for cpu in $(get_cpu_ids); do
@@ -96,27 +92,6 @@ test_trace()
cd - > /dev/null
done
-
- #
- # Test with hotplug
- #
-
- [ "$(get_cpu_ids | wc -l)" -ge 2 ] || return 0
-
- echo 0 > trace
-
- for cpu in $(get_cpu_ids); do
- echo 0 > /sys/devices/system/cpu/cpu$cpu/online || return 0
- break
- done
-
- for i in $(seq 1 8); do
- echo $i > write_event
- done
-
- check_trace 1 8 trace
-
- echo 1 > /sys/devices/system/cpu/cpu$cpu/online
}
if [ -z "$SOURCE_REMOTE_TEST" ]; then
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc b/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc
index 669a7288ed7c..7f7b7b79c490 100644
--- a/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc
+++ b/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc
@@ -92,31 +92,6 @@ test_trace_pipe()
rm $output
cd - > /dev/null
done
-
- #
- # Test interaction with hotplug
- #
-
- [ "$(get_cpu_ids | wc -l)" -ge 2 ] || return 0
-
- echo 0 > trace
-
- for cpu in $(get_cpu_ids); do
- echo 0 > /sys/devices/system/cpu/cpu$cpu/online || return 0
- break
- done
-
- for i in $(seq 1 8); do
- echo $i > write_event
- done
-
- output=$(dump_trace_pipe)
-
- check_trace 1 8 $output
-
- rm $output
-
- echo 1 > /sys/devices/system/cpu/cpu$cpu/online
}
if [ -z "$SOURCE_REMOTE_TEST" ]; then
--
2.53.0.1118.gaef5881109-goog
^ permalink raw reply related
* [PATCH] ring-buffer: Enforce read ordering of trace_buffer cpumask and buffers
From: Vincent Donnefort @ 2026-04-01 5:36 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
Cc: kernel-team, linux-kernel, Vincent Donnefort
On CPU hotplug, if it is the first time a trace_buffer sees a CPU, a
ring_buffer_per_cpu will be allocated and its corresponding bit toggled
in the cpumask. Many readers check this cpumask to know if they can
safely read the ring_buffer_per_cpu but they are doing so without memory
ordering and may observe the cpumask bit set while having NULL buffer
pointer.
Enforce the memory read ordering by sending an IPI to all online CPUs.
The hotplug path is a slow-path anyway and it saves us from adding read
barriers in numerous call sites.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 170170bd83bd..10d2d0404434 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -7468,6 +7468,12 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
return 0;
}
+static void rb_cpu_sync(void *data)
+{
+ /* Not really needed, but documents what is happening */
+ smp_rmb();
+}
+
/*
* We only allocate new buffers, never free them if the CPU goes down.
* If we were to free the buffer, then the user would lose any trace that was in
@@ -7506,7 +7512,18 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node)
cpu);
return -ENOMEM;
}
- smp_wmb();
+
+ /*
+ * Ensure trace_buffer readers observe the newly allocated
+ * ring_buffer_per_cpu before they check the cpumask. Instead of using a
+ * read barrier for all readers, send an IPI.
+ */
+ if (unlikely(system_state == SYSTEM_RUNNING)) {
+ on_each_cpu(rb_cpu_sync, NULL, 1);
+ /* Not really needed, but documents what is happening */
+ smp_wmb();
+ }
+
cpumask_set_cpu(cpu, buffer->cpumask);
return 0;
}
base-commit: 7aaa8047eafd0bd628065b15757d9b48c5f9c07d
--
2.53.0.1118.gaef5881109-goog
^ permalink raw reply related
* [PATCH v10 0/3] tracing: Remove backup instance after read all
From: Masami Hiramatsu (Google) @ 2026-04-01 6:37 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel
Hi,
Here is the v10 of the series to improve backup instances of
the persistent ring buffer. The previous version is here:
https://lore.kernel.org/all/177497473558.569199.6527680985537865638.stgit@mhiramat.tok.corp.google.com/
In this version, I fixed to use trace_array_destroy() and
cancel_work_sync() to avoid possible race. And remove unneeded
work_pending() check in trace_array_kick_autoremove()
(for trace_array_cancel_autoremove() to avoid deadlock of
workqueue, it still use work_pending() check) [2/3] and
add a newline at EOF [3/3]
Series Description
------------------
Since backup instances are a kind of snapshot of the persistent
ring buffer, it should be readonly. And if it is readonly
there is no reason to keep it after reading all data via trace_pipe
because the data has been consumed. But user should be able to remove
the readonly instance by rmdir or truncating `trace` file.
Thus, [1/3] makes backup instances readonly (not able to write any
events, cleanup trace, change buffer size). Also, [2/3] removes the
backup instance after consuming all data via trace_pipe.
With this improvements, even if we makes a backup instance (using
the same amount of memory of the persistent ring buffer), it will
be removed after reading the data automatically.
Thanks,
---
Masami Hiramatsu (Google) (3):
tracing: Make the backup instance non-reusable
tracing: Remove the backup instance automatically after read
tracing/Documentation: Add a section about backup instance
Documentation/trace/debugging.rst | 19 ++++
kernel/trace/trace.c | 160 ++++++++++++++++++++++++++++---------
kernel/trace/trace.h | 13 +++
kernel/trace/trace_boot.c | 5 +
kernel/trace/trace_events.c | 76 ++++++++++--------
5 files changed, 202 insertions(+), 71 deletions(-)
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [PATCH v10 1/3] tracing: Make the backup instance non-reusable
From: Masami Hiramatsu (Google) @ 2026-04-01 6:37 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <177502545902.1311542.15822886746171808738.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Since there is no reason to reuse the backup instance, make it
readonly (but erasable).
Note that only backup instances are readonly, because
other trace instances will be empty unless it is writable.
Only backup instances have copy entries from the original.
With this change, most of the trace control files are removed
from the backup instance, including eventfs enable/filter etc.
# find /sys/kernel/tracing/instances/backup/events/ | wc -l
4093
# find /sys/kernel/tracing/instances/boot_map/events/ | wc -l
9573
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v9:
- Add forcibly readonly check in open() operations.
Changes in v8:
- Remove read-only checks in read() operation.
Changes in v7:
- Return -EACCES instead of -EPERM.
Changes in v6:
- Remove tracing_on file from readonly instances.
- Remove unused writable_mode from tracing_init_tracefs_percpu().
- Cleanup init_tracer_tracefs() and create_event_toplevel_files().
- Remove TRACE_MODE_WRITE_MASK.
- Add TRACE_ARRAY_FL_RDONLY.
Changes in v5:
- Rebased on the latest for-next (and hide show_event_filters/triggers
if the instance is readonly.
Changes in v4:
- Make trace data erasable. (not reusable)
Changes in v3:
- Resuse the beginning part of event_entries for readonly files.
- Remove readonly file_operations and checking readonly flag in
each write operation.
Changes in v2:
- Use readonly file_operations to prohibit writing instead of
checking flags in write() callbacks.
- Remove writable files from eventfs.
---
kernel/trace/trace.c | 81 +++++++++++++++++++++++++++----------------
kernel/trace/trace.h | 7 ++++
kernel/trace/trace_boot.c | 5 ++-
kernel/trace/trace_events.c | 76 +++++++++++++++++++++++-----------------
4 files changed, 104 insertions(+), 65 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 7f2fbf9c36df..fdffa62a2b4e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3414,6 +3414,11 @@ int tracing_open_generic_tr(struct inode *inode, struct file *filp)
if (ret)
return ret;
+ if ((filp->f_mode & FMODE_WRITE) && trace_array_is_readonly(tr)) {
+ trace_array_put(tr);
+ return -EACCES;
+ }
+
filp->private_data = inode->i_private;
return 0;
@@ -6435,6 +6440,11 @@ static int tracing_clock_open(struct inode *inode, struct file *file)
if (ret)
return ret;
+ if ((file->f_mode & FMODE_WRITE) && trace_array_is_readonly(tr)) {
+ trace_array_put(tr);
+ return -EACCES;
+ }
+
ret = single_open(file, tracing_clock_show, inode->i_private);
if (ret < 0)
trace_array_put(tr);
@@ -8771,17 +8781,22 @@ static __init void create_trace_instances(struct dentry *d_tracer)
static void
init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
{
+ umode_t writable_mode = TRACE_MODE_WRITE;
int cpu;
+ if (trace_array_is_readonly(tr))
+ writable_mode = TRACE_MODE_READ;
+
trace_create_file("available_tracers", TRACE_MODE_READ, d_tracer,
- tr, &show_traces_fops);
+ tr, &show_traces_fops);
- trace_create_file("current_tracer", TRACE_MODE_WRITE, d_tracer,
- tr, &set_tracer_fops);
+ trace_create_file("current_tracer", writable_mode, d_tracer,
+ tr, &set_tracer_fops);
- trace_create_file("tracing_cpumask", TRACE_MODE_WRITE, d_tracer,
+ trace_create_file("tracing_cpumask", writable_mode, d_tracer,
tr, &tracing_cpumask_fops);
+ /* Options are used for changing print-format even for readonly instance. */
trace_create_file("trace_options", TRACE_MODE_WRITE, d_tracer,
tr, &tracing_iter_fops);
@@ -8791,12 +8806,36 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
trace_create_file("trace_pipe", TRACE_MODE_READ, d_tracer,
tr, &tracing_pipe_fops);
- trace_create_file("buffer_size_kb", TRACE_MODE_WRITE, d_tracer,
+ trace_create_file("buffer_size_kb", writable_mode, d_tracer,
tr, &tracing_entries_fops);
trace_create_file("buffer_total_size_kb", TRACE_MODE_READ, d_tracer,
tr, &tracing_total_entries_fops);
+ trace_create_file("trace_clock", writable_mode, d_tracer, tr,
+ &trace_clock_fops);
+
+ trace_create_file("timestamp_mode", TRACE_MODE_READ, d_tracer, tr,
+ &trace_time_stamp_mode_fops);
+
+ tr->buffer_percent = 50;
+
+ trace_create_file("buffer_subbuf_size_kb", writable_mode, d_tracer,
+ tr, &buffer_subbuf_size_fops);
+
+ create_trace_options_dir(tr);
+
+ if (tr->range_addr_start)
+ trace_create_file("last_boot_info", TRACE_MODE_READ, d_tracer,
+ tr, &last_boot_fops);
+
+ for_each_tracing_cpu(cpu)
+ tracing_init_tracefs_percpu(tr, cpu);
+
+ /* Read-only instance has above files only. */
+ if (trace_array_is_readonly(tr))
+ return;
+
trace_create_file("free_buffer", 0200, d_tracer,
tr, &tracing_free_buffer_fops);
@@ -8808,49 +8847,29 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
trace_create_file("trace_marker_raw", 0220, d_tracer,
tr, &tracing_mark_raw_fops);
- trace_create_file("trace_clock", TRACE_MODE_WRITE, d_tracer, tr,
- &trace_clock_fops);
-
- trace_create_file("tracing_on", TRACE_MODE_WRITE, d_tracer,
- tr, &rb_simple_fops);
-
- trace_create_file("timestamp_mode", TRACE_MODE_READ, d_tracer, tr,
- &trace_time_stamp_mode_fops);
-
- tr->buffer_percent = 50;
-
trace_create_file("buffer_percent", TRACE_MODE_WRITE, d_tracer,
- tr, &buffer_percent_fops);
-
- trace_create_file("buffer_subbuf_size_kb", TRACE_MODE_WRITE, d_tracer,
- tr, &buffer_subbuf_size_fops);
+ tr, &buffer_percent_fops);
trace_create_file("syscall_user_buf_size", TRACE_MODE_WRITE, d_tracer,
- tr, &tracing_syscall_buf_fops);
+ tr, &tracing_syscall_buf_fops);
- create_trace_options_dir(tr);
+ trace_create_file("tracing_on", TRACE_MODE_WRITE, d_tracer,
+ tr, &rb_simple_fops);
trace_create_maxlat_file(tr, d_tracer);
if (ftrace_create_function_files(tr, d_tracer))
MEM_FAIL(1, "Could not allocate function filter files");
- if (tr->range_addr_start) {
- trace_create_file("last_boot_info", TRACE_MODE_READ, d_tracer,
- tr, &last_boot_fops);
#ifdef CONFIG_TRACER_SNAPSHOT
- } else {
+ if (!tr->range_addr_start)
trace_create_file("snapshot", TRACE_MODE_WRITE, d_tracer,
tr, &snapshot_fops);
#endif
- }
trace_create_file("error_log", TRACE_MODE_WRITE, d_tracer,
tr, &tracing_err_log_fops);
- for_each_tracing_cpu(cpu)
- tracing_init_tracefs_percpu(tr, cpu);
-
ftrace_init_tracefs(tr, d_tracer);
}
@@ -9635,7 +9654,7 @@ __init static void enable_instances(void)
* Backup buffers can be freed but need vfree().
*/
if (backup)
- tr->flags |= TRACE_ARRAY_FL_VMALLOC;
+ tr->flags |= TRACE_ARRAY_FL_VMALLOC | TRACE_ARRAY_FL_RDONLY;
if (start || backup) {
tr->flags |= TRACE_ARRAY_FL_BOOT | TRACE_ARRAY_FL_LAST_BOOT;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 6abd9e16ef21..2aa11178af59 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -464,6 +464,7 @@ enum {
TRACE_ARRAY_FL_MOD_INIT = BIT(3),
TRACE_ARRAY_FL_MEMMAP = BIT(4),
TRACE_ARRAY_FL_VMALLOC = BIT(5),
+ TRACE_ARRAY_FL_RDONLY = BIT(6),
};
#ifdef CONFIG_MODULES
@@ -493,6 +494,12 @@ extern unsigned long trace_adjust_address(struct trace_array *tr, unsigned long
extern struct trace_array *printk_trace;
+static inline bool trace_array_is_readonly(struct trace_array *tr)
+{
+ /* backup instance is read only. */
+ return tr->flags & TRACE_ARRAY_FL_RDONLY;
+}
+
/*
* The global tracer (top) should be the first trace array added,
* but we check the flag anyway.
diff --git a/kernel/trace/trace_boot.c b/kernel/trace/trace_boot.c
index dbe29b4c6a7a..2ca2541c8a58 100644
--- a/kernel/trace/trace_boot.c
+++ b/kernel/trace/trace_boot.c
@@ -61,7 +61,8 @@ trace_boot_set_instance_options(struct trace_array *tr, struct xbc_node *node)
v = memparse(p, NULL);
if (v < PAGE_SIZE)
pr_err("Buffer size is too small: %s\n", p);
- if (tracing_resize_ring_buffer(tr, v, RING_BUFFER_ALL_CPUS) < 0)
+ if (trace_array_is_readonly(tr) ||
+ tracing_resize_ring_buffer(tr, v, RING_BUFFER_ALL_CPUS) < 0)
pr_err("Failed to resize trace buffer to %s\n", p);
}
@@ -597,7 +598,7 @@ trace_boot_enable_tracer(struct trace_array *tr, struct xbc_node *node)
p = xbc_node_find_value(node, "tracer", NULL);
if (p && *p != '\0') {
- if (tracing_set_tracer(tr, p) < 0)
+ if (trace_array_is_readonly(tr) || tracing_set_tracer(tr, p) < 0)
pr_err("Failed to set given tracer: %s\n", p);
}
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index de807a9e2371..7ddcee312471 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1401,6 +1401,9 @@ static int __ftrace_set_clr_event(struct trace_array *tr, const char *match,
{
int ret;
+ if (trace_array_is_readonly(tr))
+ return -EACCES;
+
mutex_lock(&event_mutex);
ret = __ftrace_set_clr_event_nolock(tr, match, sub, event, set, mod);
mutex_unlock(&event_mutex);
@@ -2969,8 +2972,8 @@ event_subsystem_dir(struct trace_array *tr, const char *name,
} else
__get_system(system);
- /* ftrace only has directories no files */
- if (strcmp(name, "ftrace") == 0)
+ /* ftrace only has directories no files, readonly instance too. */
+ if (strcmp(name, "ftrace") == 0 || trace_array_is_readonly(tr))
nr_entries = 0;
else
nr_entries = ARRAY_SIZE(system_entries);
@@ -3135,28 +3138,30 @@ event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
int ret;
static struct eventfs_entry event_entries[] = {
{
- .name = "enable",
+ .name = "format",
.callback = event_callback,
- .release = event_release,
},
+#ifdef CONFIG_PERF_EVENTS
{
- .name = "filter",
+ .name = "id",
.callback = event_callback,
},
+#endif
+#define NR_RO_EVENT_ENTRIES (1 + IS_ENABLED(CONFIG_PERF_EVENTS))
+/* Readonly files must be above this line and counted by NR_RO_EVENT_ENTRIES. */
{
- .name = "trigger",
+ .name = "enable",
.callback = event_callback,
+ .release = event_release,
},
{
- .name = "format",
+ .name = "filter",
.callback = event_callback,
},
-#ifdef CONFIG_PERF_EVENTS
{
- .name = "id",
+ .name = "trigger",
.callback = event_callback,
},
-#endif
#ifdef CONFIG_HIST_TRIGGERS
{
.name = "hist",
@@ -3189,7 +3194,10 @@ event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
if (!e_events)
return -ENOMEM;
- nr_entries = ARRAY_SIZE(event_entries);
+ if (trace_array_is_readonly(tr))
+ nr_entries = NR_RO_EVENT_ENTRIES;
+ else
+ nr_entries = ARRAY_SIZE(event_entries);
name = trace_event_name(call);
ei = eventfs_create_dir(name, e_events, event_entries, nr_entries, file);
@@ -4532,31 +4540,44 @@ create_event_toplevel_files(struct dentry *parent, struct trace_array *tr)
int nr_entries;
static struct eventfs_entry events_entries[] = {
{
- .name = "enable",
+ .name = "header_page",
.callback = events_callback,
},
{
- .name = "header_page",
+ .name = "header_event",
.callback = events_callback,
},
+#define NR_RO_TOP_ENTRIES 2
+/* Readonly files must be above this line and counted by NR_RO_TOP_ENTRIES. */
{
- .name = "header_event",
+ .name = "enable",
.callback = events_callback,
},
};
- entry = trace_create_file("set_event", TRACE_MODE_WRITE, parent,
- tr, &ftrace_set_event_fops);
- if (!entry)
- return -ENOMEM;
+ if (!trace_array_is_readonly(tr)) {
+ entry = trace_create_file("set_event", TRACE_MODE_WRITE, parent,
+ tr, &ftrace_set_event_fops);
+ if (!entry)
+ return -ENOMEM;
- trace_create_file("show_event_filters", TRACE_MODE_READ, parent, tr,
- &ftrace_show_event_filters_fops);
+ /* There are not as crucial, just warn if they are not created */
+ trace_create_file("show_event_filters", TRACE_MODE_READ, parent, tr,
+ &ftrace_show_event_filters_fops);
- trace_create_file("show_event_triggers", TRACE_MODE_READ, parent, tr,
- &ftrace_show_event_triggers_fops);
+ trace_create_file("show_event_triggers", TRACE_MODE_READ, parent, tr,
+ &ftrace_show_event_triggers_fops);
- nr_entries = ARRAY_SIZE(events_entries);
+ trace_create_file("set_event_pid", TRACE_MODE_WRITE, parent,
+ tr, &ftrace_set_event_pid_fops);
+
+ trace_create_file("set_event_notrace_pid",
+ TRACE_MODE_WRITE, parent, tr,
+ &ftrace_set_event_notrace_pid_fops);
+ nr_entries = ARRAY_SIZE(events_entries);
+ } else {
+ nr_entries = NR_RO_TOP_ENTRIES;
+ }
e_events = eventfs_create_events_dir("events", parent, events_entries,
nr_entries, tr);
@@ -4565,15 +4586,6 @@ create_event_toplevel_files(struct dentry *parent, struct trace_array *tr)
return -ENOMEM;
}
- /* There are not as crucial, just warn if they are not created */
-
- trace_create_file("set_event_pid", TRACE_MODE_WRITE, parent,
- tr, &ftrace_set_event_pid_fops);
-
- trace_create_file("set_event_notrace_pid",
- TRACE_MODE_WRITE, parent, tr,
- &ftrace_set_event_notrace_pid_fops);
-
tr->event_dir = e_events;
return 0;
^ permalink raw reply related
* [PATCH v10 2/3] tracing: Remove the backup instance automatically after read
From: Masami Hiramatsu (Google) @ 2026-04-01 6:37 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <177502545902.1311542.15822886746171808738.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Since the backup instance is readonly, after reading all data
via pipe, no data is left on the instance. Thus it can be
removed safely after closing all files.
This also removes it if user resets the ring buffer manually
via 'trace' file.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v10:
- Fix to use trace_array_destroy() to make it safer.
- Use cancel_work_sync() to wait for cancel when removing.
- Remove unneeded work_pending() check in trace_array_kick_autoremove()
(for trace_array_cancel_autoremove() to avoid deadlock of workqueue
it still use work_pending() check)
Changes in v9:
- Fix to initialize autoremove workqueue only for backup.
- Fix to return -ENODEV if trace_array_get() refers freeing instance.
Changes in v6:
- Fix typo in comment.
- Only when there is a readonly trace array, initialize autoremove_wq.
- Fix to exit loop in trace_array_get() if tr is found in the list.
Changes in v4:
- Update description.
---
kernel/trace/trace.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++----
kernel/trace/trace.h | 6 ++++
2 files changed, 79 insertions(+), 6 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index fdffa62a2b4e..3b3ec6645feb 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -539,8 +539,59 @@ void trace_set_ring_buffer_expanded(struct trace_array *tr)
tr->ring_buffer_expanded = true;
}
+static void trace_array_autoremove(struct work_struct *work)
+{
+ struct trace_array *tr = container_of(work, struct trace_array, autoremove_work);
+
+ trace_array_destroy(tr);
+}
+
+static struct workqueue_struct *autoremove_wq;
+
+static void trace_array_kick_autoremove(struct trace_array *tr)
+{
+ if (autoremove_wq)
+ queue_work(autoremove_wq, &tr->autoremove_work);
+}
+
+static void trace_array_cancel_autoremove(struct trace_array *tr)
+{
+ /*
+ * Since this can be called inside trace_array_autoremove(),
+ * it has to avoid deadlock of the workqueue.
+ */
+ if (work_pending(&tr->autoremove_work))
+ cancel_work_sync(&tr->autoremove_work);
+}
+
+static void trace_array_init_autoremove(struct trace_array *tr)
+{
+ INIT_WORK(&tr->autoremove_work, trace_array_autoremove);
+}
+
+static void trace_array_start_autoremove(void)
+{
+ if (autoremove_wq)
+ return;
+
+ autoremove_wq = alloc_workqueue("tr_autoremove_wq",
+ WQ_UNBOUND | WQ_HIGHPRI, 0);
+ if (!autoremove_wq)
+ pr_warn("Unable to allocate tr_autoremove_wq. autoremove disabled.\n");
+}
+
LIST_HEAD(ftrace_trace_arrays);
+static int __trace_array_get(struct trace_array *this_tr)
+{
+ /* When free_on_close is set, this is not available anymore. */
+ if (autoremove_wq && this_tr->free_on_close)
+ return -ENODEV;
+
+ this_tr->ref++;
+ return 0;
+}
+
int trace_array_get(struct trace_array *this_tr)
{
struct trace_array *tr;
@@ -548,8 +599,7 @@ int trace_array_get(struct trace_array *this_tr)
guard(mutex)(&trace_types_lock);
list_for_each_entry(tr, &ftrace_trace_arrays, list) {
if (tr == this_tr) {
- tr->ref++;
- return 0;
+ return __trace_array_get(tr);
}
}
@@ -560,6 +610,12 @@ static void __trace_array_put(struct trace_array *this_tr)
{
WARN_ON(!this_tr->ref);
this_tr->ref--;
+ /*
+ * When free_on_close is set, prepare removing the array
+ * when the last reference is released.
+ */
+ if (this_tr->ref == 1 && this_tr->free_on_close)
+ trace_array_kick_autoremove(this_tr);
}
/**
@@ -4829,6 +4885,10 @@ static void update_last_data(struct trace_array *tr)
/* Only if the buffer has previous boot data clear and update it. */
tr->flags &= ~TRACE_ARRAY_FL_LAST_BOOT;
+ /* If this is a backup instance, mark it for autoremove. */
+ if (tr->flags & TRACE_ARRAY_FL_VMALLOC)
+ tr->free_on_close = true;
+
/* Reset the module list and reload them */
if (tr->scratch) {
struct trace_scratch *tscratch = tr->scratch;
@@ -8442,8 +8502,8 @@ struct trace_array *trace_array_find_get(const char *instance)
guard(mutex)(&trace_types_lock);
tr = trace_array_find(instance);
- if (tr)
- tr->ref++;
+ if (tr && __trace_array_get(tr) < 0)
+ tr = NULL;
return tr;
}
@@ -8540,6 +8600,8 @@ trace_array_create_systems(const char *name, const char *systems,
if (ftrace_allocate_ftrace_ops(tr) < 0)
goto out_free_tr;
+ trace_array_init_autoremove(tr);
+
ftrace_init_trace_array(tr);
init_trace_flags_index(tr);
@@ -8650,7 +8712,9 @@ struct trace_array *trace_array_get_by_name(const char *name, const char *system
list_for_each_entry(tr, &ftrace_trace_arrays, list) {
if (tr->name && strcmp(tr->name, name) == 0) {
- tr->ref++;
+ /* if this fails, @tr is going to be removed. */
+ if (__trace_array_get(tr) < 0)
+ tr = NULL;
return tr;
}
}
@@ -8689,6 +8753,7 @@ static int __remove_instance(struct trace_array *tr)
set_tracer_flag(tr, 1ULL << i, 0);
}
+ trace_array_cancel_autoremove(tr);
tracing_set_nop(tr);
clear_ftrace_function_probes(tr);
event_trace_del_tracer(tr);
@@ -9653,8 +9718,10 @@ __init static void enable_instances(void)
/*
* Backup buffers can be freed but need vfree().
*/
- if (backup)
+ if (backup) {
tr->flags |= TRACE_ARRAY_FL_VMALLOC | TRACE_ARRAY_FL_RDONLY;
+ trace_array_start_autoremove();
+ }
if (start || backup) {
tr->flags |= TRACE_ARRAY_FL_BOOT | TRACE_ARRAY_FL_LAST_BOOT;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 2aa11178af59..a34f2d9479db 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -455,6 +455,12 @@ struct trace_array {
* we do not waste memory on systems that are not using tracing.
*/
bool ring_buffer_expanded;
+ /*
+ * If the ring buffer is a read only backup instance, it will be
+ * removed after dumping all data via pipe, because no readable data.
+ */
+ bool free_on_close;
+ struct work_struct autoremove_work;
};
enum {
^ permalink raw reply related
* [PATCH v10 3/3] tracing/Documentation: Add a section about backup instance
From: Masami Hiramatsu (Google) @ 2026-04-01 6:38 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <177502545902.1311542.15822886746171808738.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add a section about backup instance to the debugging.rst.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v10:
- Add a newline.
Changes in v6:
- Fix typos.
---
Documentation/trace/debugging.rst | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/Documentation/trace/debugging.rst b/Documentation/trace/debugging.rst
index 4d88c346fc38..bca1710d92bf 100644
--- a/Documentation/trace/debugging.rst
+++ b/Documentation/trace/debugging.rst
@@ -159,3 +159,22 @@ If setting it from the kernel command line, it is recommended to also
disable tracing with the "traceoff" flag, and enable tracing after boot up.
Otherwise the trace from the most recent boot will be mixed with the trace
from the previous boot, and may make it confusing to read.
+
+Using a backup instance for keeping previous boot data
+------------------------------------------------------
+
+It is also possible to record trace data at system boot time by specifying
+events with the persistent ring buffer, but in this case the data before the
+reboot will be lost before it can be read. This problem can be solved by a
+backup instance. From the kernel command line::
+
+ reserve_mem=12M:4096:trace trace_instance=boot_map@trace,sched,irq trace_instance=backup=boot_map
+
+On boot up, the previous data in the "boot_map" is copied to the "backup"
+instance, and the "sched:*" and "irq:*" events for the current boot are traced
+in the "boot_map". Thus the user can read the previous boot data from the "backup"
+instance without stopping the trace.
+
+Note that this "backup" instance is readonly, and will be removed automatically
+if you clear the trace data or read out all trace data from the "trace_pipe"
+or the "trace_pipe_raw" files.
^ permalink raw reply related
* [PATCH] mm/vmscan: add tracepoint for changes in min_seq and max_seq
From: wangzhen @ 2026-04-01 7:17 UTC (permalink / raw)
To: akpm@linux-foundation.org
Cc: rostedt@goodmis.org, mhiramat@kernel.org,
mathieu.desnoyers@efficios.com, hannes@cmpxchg.org,
david@kernel.org, mhocko@kernel.org, zhengqi.arch@bytedance.com,
shakeel.butt@linux.dev, ljs@kernel.org, axelrasmussen@google.com,
yuanchu@google.com, weixugc@google.com, jiayuan.chen@shopee.com,
wangzhen, linux-kernel@vger.kernel.org,
linux-trace-kernel@vger.kernel.org, linux-mm@kvack.org,
kasong@tencent.com, baolin.wang@linux.alibaba.com,
baohua@kernel.org
From 20f685d13826ebaa86705a689128893be6316c7b Mon Sep 17 00:00:00 2001
From: w00021541 <wangzhen5@hihonor.com>
Date: Tue, 31 Mar 2026 16:00:48 +0800
Subject: [PATCH] mm/vmscan: add tracepoint for changes in min_seq and max_seq
Currently, when tracking the changes in min_seq and max_seq of mglru,
it can only be done through debugfs. The granularity of debugfs is too
coarse when we debug the generations of mglru. there's no way to trace
the increase in min_seq and max_seq, so we add 2 tracepoints,
mm_vmscan_lru_inc_min_seq and mm_vmscan_lru_inc_max_seq.
Test results:
kswapd0-94 [003] d..1. 72.664921: mm_vmscan_lru_inc_min_seq: memcg_id=87 type=0 swappiness=140 min_seq=2
kswapd0-94 [003] d..1. 72.664922: mm_vmscan_lru_inc_min_seq: memcg_id=87 type=1 swappiness=140 min_seq=1
kswapd0-94 [003] d..1. 72.667303: mm_vmscan_lru_inc_max_seq: memcg_id=87 swappiness=140 max_seq=4
kswapd0-94 [000] d..1. 72.807691: mm_vmscan_lru_inc_min_seq: memcg_id=25 type=0 swappiness=140 min_seq=2
kswapd0-94 [000] d..1. 72.807692: mm_vmscan_lru_inc_min_seq: memcg_id=25 type=1 swappiness=140 min_seq=1
kswapd0-94 [000] d..1. 72.810955: mm_vmscan_lru_inc_max_seq: memcg_id=25 swappiness=140 max_seq=4
kswapd0-94 [005] d..1. 73.482586: mm_vmscan_lru_inc_min_seq: memcg_id=91 type=0 swappiness=140 min_seq=2
kswapd0-94 [005] d..1. 73.482588: mm_vmscan_lru_inc_min_seq: memcg_id=91 type=1 swappiness=140 min_seq=1
kswapd0-94 [005] d..1. 73.485509: mm_vmscan_lru_inc_max_seq: memcg_id=91 swappiness=140 max_seq=4
kswapd0-94 [000] d..1. 77.708630: mm_vmscan_lru_inc_min_seq: memcg_id=88 type=0 swappiness=140 min_seq=1
kswapd0-94 [000] d..1. 77.709491: mm_vmscan_lru_inc_min_seq: memcg_id=88 type=0 swappiness=140 min_seq=2
kswapd0-94 [000] d..1. 77.709491: mm_vmscan_lru_inc_min_seq: memcg_id=88 type=1 swappiness=140 min_seq=1
kswapd0-94 [000] d..1. 77.715550: mm_vmscan_lru_inc_max_seq: memcg_id=88 swappiness=140 max_seq=4
kswapd0-94 [003] d..1. 77.749288: mm_vmscan_lru_inc_min_seq: memcg_id=4 type=0 swappiness=140 min_seq=2
kswapd0-94 [003] d..1. 77.749290: mm_vmscan_lru_inc_min_seq: memcg_id=4 type=1 swappiness=140 min_seq=1
kswapd0-94 [003] d..1. 77.763055: mm_vmscan_lru_inc_max_seq: memcg_id=4 swappiness=140 max_seq=4
Signed-off-by: w00021541 <wangzhen5@hihonor.com>
---
include/trace/events/vmscan.h | 49 +++++++++++++++++++++++++++++++++++
mm/vmscan.c | 6 +++++
2 files changed, 55 insertions(+)
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index ea58e4656abf..6f2ae5c597fd 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -488,6 +488,55 @@ TRACE_EVENT(mm_vmscan_lru_shrink_active,
show_reclaim_flags(__entry->reclaim_flags))
);
+#ifdef CONFIG_LRU_GEN
+TRACE_EVENT(mm_vmscan_lru_inc_min_seq,
+
+ TP_PROTO(struct mem_cgroup *memcg, int type, int swappiness, unsigned long min_seq),
+
+ TP_ARGS(memcg, type, swappiness, min_seq),
+
+ TP_STRUCT__entry(
+ __field(int, id)
+ __field(int, type)
+ __field(int, swappiness)
+ __field(unsigned long, min_seq)
+ ),
+
+ TP_fast_assign(
+ __entry->id = mem_cgroup_id(memcg);
+ __entry->type = type;
+ __entry->swappiness = swappiness;
+ __entry->min_seq = min_seq;
+ ),
+
+ TP_printk("memcg_id=%d type=%d swappiness=%d min_seq=%ld",
+ __entry->id, __entry->type,
+ __entry->swappiness, __entry->min_seq)
+);
+
+TRACE_EVENT(mm_vmscan_lru_inc_max_seq,
+
+ TP_PROTO(struct mem_cgroup *memcg, int swappiness, unsigned long max_seq),
+
+ TP_ARGS(memcg, swappiness, max_seq),
+
+ TP_STRUCT__entry(
+ __field(int, id)
+ __field(int, swappiness)
+ __field(unsigned long, max_seq)
+ ),
+
+ TP_fast_assign(
+ __entry->id = mem_cgroup_id(memcg);
+ __entry->swappiness = swappiness;
+ __entry->max_seq = max_seq;
+ ),
+
+ TP_printk("memcg_id=%d swappiness=%d max_seq=%ld",
+ __entry->id, __entry->swappiness, __entry->max_seq)
+);
+#endif /* CONFIG_LRU_GEN */
+
TRACE_EVENT(mm_vmscan_node_reclaim_begin,
TP_PROTO(int nid, int order, gfp_t gfp_flags),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fc9373e8251..6b5d21ee45ba 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3848,6 +3848,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
int zone;
int remaining = MAX_LRU_BATCH;
struct lru_gen_folio *lrugen = &lruvec->lrugen;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
int hist = lru_hist_from_seq(lrugen->min_seq[type]);
int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
@@ -3892,6 +3893,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
done:
reset_ctrl_pos(lruvec, type, true);
WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
+ trace_mm_vmscan_lru_inc_min_seq(memcg, type, swappiness, lrugen->min_seq[type]);
return true;
}
@@ -3902,6 +3904,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
bool success = false;
bool seq_inc_flag = false;
struct lru_gen_folio *lrugen = &lruvec->lrugen;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
DEFINE_MIN_SEQ(lruvec);
VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
@@ -3947,6 +3950,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
reset_ctrl_pos(lruvec, type, true);
WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
+ trace_mm_vmscan_lru_inc_min_seq(memcg, type, swappiness, lrugen->min_seq[type]);
success = true;
}
@@ -3959,6 +3963,7 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness
int prev, next;
int type, zone;
struct lru_gen_folio *lrugen = &lruvec->lrugen;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
restart:
if (seq < READ_ONCE(lrugen->max_seq))
return false;
@@ -4012,6 +4017,7 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness
WRITE_ONCE(lrugen->timestamps[next], jiffies);
/* make sure preceding modifications appear */
smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
+ trace_mm_vmscan_lru_inc_max_seq(memcg, swappiness, lrugen->max_seq);
unlock:
spin_unlock_irq(&lruvec->lru_lock);
--
2.17.1
^ permalink raw reply related
* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Kiryl Shutsemau @ 2026-04-01 9:40 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Breno Leitao, Jonathan Corbet, Shuah Khan, linux-kernel,
linux-trace-kernel, linux-doc, oss, paulmck, rostedt, kernel-team
In-Reply-To: <20260401100237.ff9a37505d041a00e4d8658a@kernel.org>
On Wed, Apr 01, 2026 at 10:02:37AM +0900, Masami Hiramatsu wrote:
> On Tue, 31 Mar 2026 11:18:33 +0100
> Kiryl Shutsemau <kirill@shutemov.name> wrote:
>
> > On Wed, Mar 25, 2026 at 11:22:04PM +0900, Masami Hiramatsu wrote:
> > > > diff --git a/init/main.c b/init/main.c
> > > > index 453ac9dff2da0..14a04c283fa48 100644
> > > > --- a/init/main.c
> > > > +++ b/init/main.c
> > > > @@ -416,9 +416,64 @@ static int __init warn_bootconfig(char *str)
> > > > return 0;
> > > > }
> > > >
> > > > +/*
> > > > + * do_early_param() is defined later in this file but called from
> > > > + * bootconfig_apply_early_params() below, so we need a forward declaration.
> > > > + */
> > > > +static int __init do_early_param(char *param, char *val,
> > > > + const char *unused, void *arg);
> > > > +
> > > > +/*
> > > > + * bootconfig_apply_early_params - dispatch kernel.* keys from the embedded
> > > > + * bootconfig as early_param() calls.
> > > > + *
> > > > + * early_param() handlers must run before most of the kernel initialises
> > > > + * (e.g. before the GIC driver reads irqchip.gicv3_pseudo_nmi). A bootconfig
> > > > + * attached to the initrd arrives too late for this because the initrd is not
> > > > + * mapped yet when early params are processed. The embedded bootconfig lives
> > > > + * in the kernel image itself (.init.data), so it is always reachable.
> > > > + *
> > > > + * This function is called from setup_boot_config() which runs in
> > > > + * start_kernel() before parse_early_param(), making the timing correct.
> > > > + */
> > > > +static void __init bootconfig_apply_early_params(void)
> > >
> > > [sashiko comment]
> > > | Does this run early enough for architectural parameters?
> > > | While setup_boot_config() runs before parse_early_param() in start_kernel(),
> > > | it runs after setup_arch(). setup_boot_config() relies on xbc_init() which
> > > | uses the memblock allocator, requiring setup_arch() to have already
> > > | initialized it.
> > > | However, the kernel expects many early parameters (like mem=, earlycon,
> > > | noapic, and iommu) to be parsed during setup_arch() via the architecture's
> > > | call to parse_early_param(). Since setup_arch() completes before
> > > | setup_boot_config() runs, will these architectural early parameters be
> > > | silently ignored because the decisions they influence were already
> > > | finalized?
> > >
> > > This is the major reason that I did not support early parameter
> > > in bootconfig. Some archs initialize kernel_cmdline in setup_arch()
> > > and setup early parameters in it.
> > > To fix this, we need to change setup_arch() for each architecture so
> > > that it calls this bootconfig_apply_early_params().
> >
> > Hi Masami,
> >
> > I have a question about bootconfig design. Is it necessary to parse the
> > bootconfig at boot time?
> >
> > We could avoid a lot of complexity if we flattened the bootconfig into a
> > simple key-value list before embedding it in the kernel image or
> > attaching it to the initrd. That would eliminate the need for memory
> > allocations and allow the config to be used earlier during boot.
>
> Hi Kiryl,
>
> Thanks for a good question.
>
> If it is embedded in the kernel, yes, we can compile it. But if it is
> attached to initrd, the kernel needs to verify it. So anyway we have to
> parse the key-value. Of course we can make it a binary data structure,
> but it still needs verification. Moreover, memory allocation is not
> required by design (the first design uses a statically allocated data).
>
> Even if it is normalized as key-value, we can not trust the contents
> without outer verification system, like verified boot. Therefore, our
> basic strategy is to parse the text.
Hm. I am not sure what issues verification aims to solve. Could you
elaborate.
With normalized key-value structure, we should be able to extract the
needed value in the same way as with normal kernel command line, no?
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* [PATCH v2 1/2] tracing/hist: rebuild full_name on each hist_field_name() call
From: Pengpeng Hou @ 2026-04-01 11:22 UTC (permalink / raw)
To: rostedt
Cc: mhiramat, mathieu.desnoyers, tom.zanussi, linux-kernel,
linux-trace-kernel, pengpeng
hist_field_name() uses a static MAX_FILTER_STR_VAL buffer for fully
qualified variable-reference names, but it currently appends into that
buffer with strcat() without rebuilding it first. As a result, repeated
calls append a new "system.event.field" name onto the previous one,
which can eventually run past the end of full_name.
Build the name with snprintf() on each call and return NULL if the fully
qualified name does not fit in MAX_FILTER_STR_VAL.
Fixes: 067fe038e70f ("tracing: Add variable reference handling to hist triggers")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
Changes since v1: https://lore.kernel.org/all/20260329030950.32503-1-pengpeng@iscas.ac.cn/
- rebuild full_name on each call instead of falling back to field->name
- return NULL on overflow as suggested
- split out the snprintf() length check instead of using an inline if
kernel/trace/trace_events_hist.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 73ea180cad55..f9c8a4f078ea 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -1361,12 +1361,14 @@ static const char *hist_field_name(struct hist_field *field,
field->flags & HIST_FIELD_FL_VAR_REF) {
if (field->system) {
static char full_name[MAX_FILTER_STR_VAL];
+ int len;
+
+ len = snprintf(full_name, sizeof(full_name), "%s.%s.%s",
+ field->system, field->event_name,
+ field->name);
+ if (len >= sizeof(full_name))
+ return NULL;
- strcat(full_name, field->system);
- strcat(full_name, ".");
- strcat(full_name, field->event_name);
- strcat(full_name, ".");
- strcat(full_name, field->name);
field_name = full_name;
} else
field_name = field->name;
--
2.50.1 (Apple Git-155)
^ permalink raw reply related
* [PATCH v2 2/2] tracing/hist: reject synthetic-field strings that exceed MAX_FILTER_STR_VAL
From: Pengpeng Hou @ 2026-04-01 11:22 UTC (permalink / raw)
To: rostedt
Cc: mhiramat, mathieu.desnoyers, tom.zanussi, linux-kernel,
linux-trace-kernel, pengpeng
The synthetic field helpers build a prefixed synthetic field name and a
hist trigger command in fixed MAX_FILTER_STR_VAL staging buffers. Even
when each individual field or filter string stays within that limit, the
combined "keys=...:synthetic_...=... if ..." command can exceed 256
bytes and overrun the scratch buffer.
Keep MAX_FILTER_STR_VAL as the tracing-side limit and reject synthetic
field names or generated commands that do not fit in that bound before
formatting them into the fixed buffers.
Fixes: 02205a6752f2 ("tracing: Add support for 'field variables'")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
Changes since v1: https://lore.kernel.org/all/20260329030950.32503-1-pengpeng@iscas.ac.cn/
- keep MAX_FILTER_STR_VAL as the fixed tracing-side limit
- reject overlong synthetic names and generated commands with -E2BIG
- drop the previous dynamic-allocation approach
kernel/trace/trace_events_hist.c | 57 +++++++++++++++++++++++---------
1 file changed, 41 insertions(+), 16 deletions(-)
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index f9c8a4f078ea..4172c91605af 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -2966,12 +2966,16 @@ find_synthetic_field_var(struct hist_trigger_data *target_hist_data,
struct hist_field *event_var;
char *synthetic_name;
+ if ((sizeof("synthetic_") - 1) + strlen(field_name) >=
+ MAX_FILTER_STR_VAL)
+ return ERR_PTR(-E2BIG);
+
synthetic_name = kzalloc(MAX_FILTER_STR_VAL, GFP_KERNEL);
if (!synthetic_name)
return ERR_PTR(-ENOMEM);
- strcpy(synthetic_name, "synthetic_");
- strcat(synthetic_name, field_name);
+ scnprintf(synthetic_name, MAX_FILTER_STR_VAL, "synthetic_%s",
+ field_name);
event_var = find_event_var(target_hist_data, system, event_name, synthetic_name);
@@ -3018,6 +3022,8 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data,
struct hist_field *event_var;
char *saved_filter;
char *cmd;
+ size_t cmdlen;
+ size_t off;
int ret;
if (target_hist_data->n_field_var_hists >= SYNTH_FIELDS_MAX) {
@@ -3048,13 +3054,36 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data,
/* See if a synthetic field variable has already been created */
event_var = find_synthetic_field_var(target_hist_data, subsys_name,
event_name, field_name);
- if (!IS_ERR_OR_NULL(event_var))
+ if (IS_ERR(event_var))
+ return event_var;
+ if (event_var)
return event_var;
var_hist = kzalloc_obj(*var_hist);
if (!var_hist)
return ERR_PTR(-ENOMEM);
+ saved_filter = find_trigger_filter(hist_data, file);
+
+ cmdlen = strlen("keys=") + strlen(":synthetic_") +
+ strlen(field_name) + strlen("=") + strlen(field_name);
+ first = true;
+ for_each_hist_key_field(i, hist_data) {
+ key_field = hist_data->fields[i];
+ if (!first)
+ cmdlen++;
+ cmdlen += strlen(key_field->field->name);
+ first = false;
+ }
+
+ if (saved_filter)
+ cmdlen += strlen(" if ") + strlen(saved_filter);
+
+ if (cmdlen >= MAX_FILTER_STR_VAL) {
+ kfree(var_hist);
+ return ERR_PTR(-E2BIG);
+ }
+
cmd = kzalloc(MAX_FILTER_STR_VAL, GFP_KERNEL);
if (!cmd) {
kfree(var_hist);
@@ -3062,28 +3091,24 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data,
}
/* Use the same keys as the compatible histogram */
- strcat(cmd, "keys=");
+ off = scnprintf(cmd, MAX_FILTER_STR_VAL, "keys=");
+ first = true;
for_each_hist_key_field(i, hist_data) {
key_field = hist_data->fields[i];
- if (!first)
- strcat(cmd, ",");
- strcat(cmd, key_field->field->name);
+ off += scnprintf(cmd + off, MAX_FILTER_STR_VAL - off, "%s%s",
+ first ? "" : ",", key_field->field->name);
first = false;
}
/* Create the synthetic field variable specification */
- strcat(cmd, ":synthetic_");
- strcat(cmd, field_name);
- strcat(cmd, "=");
- strcat(cmd, field_name);
+ off += scnprintf(cmd + off, MAX_FILTER_STR_VAL - off,
+ ":synthetic_%s=%s", field_name, field_name);
/* Use the same filter as the compatible histogram */
- saved_filter = find_trigger_filter(hist_data, file);
- if (saved_filter) {
- strcat(cmd, " if ");
- strcat(cmd, saved_filter);
- }
+ if (saved_filter)
+ scnprintf(cmd + off, MAX_FILTER_STR_VAL - off, " if %s",
+ saved_filter);
var_hist->cmd = kstrdup(cmd, GFP_KERNEL);
if (!var_hist->cmd) {
--
2.50.1 (Apple Git-155)
^ permalink raw reply related
* Re: [PATCH] rv: Allow epoll in rtapp-sleep monitor
From: Gabriele Monaco @ 2026-04-01 11:47 UTC (permalink / raw)
To: Nam Cao; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <87v7ec6n1n.fsf@yellow.woof>
On Tue, 2026-03-31 at 17:23 +0200, Nam Cao wrote:
> Gabriele Monaco <gmonaco@redhat.com> writes:
> > On Tue, 2026-03-31 at 12:49 +0200, Nam Cao wrote:
> > > diff --git a/kernel/trace/rv/monitors/sleep/sleep.c
> > > b/kernel/trace/rv/monitors/sleep/sleep.c
> > > index c1347da69e9d..59091863c17c 100644
> > > --- a/kernel/trace/rv/monitors/sleep/sleep.c
> > > +++ b/kernel/trace/rv/monitors/sleep/sleep.c
> > > @@ -162,6 +164,11 @@ static void handle_sys_enter(void *data, struct
> > > pt_regs
> > > *regs, long id)
> > > break;
> > > }
> > > break;
> > > +#ifdef __NR_epoll_wait
> > > + case __NR_epoll_wait:
> > > + ltl_atom_set(mon, LTL_EPOLL_WAIT, true);
> > > + break;
> > > +#endif
> >
> > Sashiko (the AI bot) wonders why this isn't ltl_atom_update() like other
> > things
> > around here. Is that intentional?
>
> No that's not intentional. It does not affect verification result, but
> still should be fixed. I will send v2.
>
> Funnily a colleague just told me earlier today about how good AIs are at
> reviewing..
Well, they're probably already better than (most of) us in those kind of
consistency assessments..
Anyway, I tried your changes and it seems to work as I'd expect. Feel free to
send a V2 and I can include it in the pull request.
Thanks,
Gabriele
^ permalink raw reply
* [PATCH v2] rv: Allow epoll in rtapp-sleep monitor
From: Nam Cao @ 2026-04-01 13:08 UTC (permalink / raw)
To: Gabriele Monaco; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel, Nam Cao
Since commit 0c43094f8cc9 ("eventpoll: Replace rwlock with spinlock"),
epoll_wait is real-time-safe syscall for sleeping.
Add epoll_wait to the list of rt-safe sleeping APIs.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
---
v2: Use ltl_atom_update() instead of ltl_atom_set() in handle_sys_enter()
---
kernel/trace/rv/monitors/sleep/sleep.c | 8 ++
kernel/trace/rv/monitors/sleep/sleep.h | 98 ++++++++++++-----------
tools/verification/models/rtapp/sleep.ltl | 1 +
3 files changed, 61 insertions(+), 46 deletions(-)
diff --git a/kernel/trace/rv/monitors/sleep/sleep.c b/kernel/trace/rv/monitors/sleep/sleep.c
index c1347da69e9d..8dfe5ec13e19 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.c
+++ b/kernel/trace/rv/monitors/sleep/sleep.c
@@ -49,6 +49,7 @@ static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bo
ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
ltl_atom_set(mon, LTL_CLOCK_NANOSLEEP, false);
ltl_atom_set(mon, LTL_FUTEX_WAIT, false);
+ ltl_atom_set(mon, LTL_EPOLL_WAIT, false);
ltl_atom_set(mon, LTL_FUTEX_LOCK_PI, false);
ltl_atom_set(mon, LTL_BLOCK_ON_RT_MUTEX, false);
}
@@ -63,6 +64,7 @@ static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bo
ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_TAI, false);
ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
ltl_atom_set(mon, LTL_CLOCK_NANOSLEEP, false);
+ ltl_atom_set(mon, LTL_EPOLL_WAIT, false);
if (strstarts(task->comm, "migration/"))
ltl_atom_set(mon, LTL_TASK_IS_MIGRATION, true);
@@ -162,6 +164,11 @@ static void handle_sys_enter(void *data, struct pt_regs *regs, long id)
break;
}
break;
+#ifdef __NR_epoll_wait
+ case __NR_epoll_wait:
+ ltl_atom_update(current, LTL_EPOLL_WAIT, true);
+ break;
+#endif
}
}
@@ -174,6 +181,7 @@ static void handle_sys_exit(void *data, struct pt_regs *regs, long ret)
ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_MONOTONIC, false);
ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_TAI, false);
ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
+ ltl_atom_set(mon, LTL_EPOLL_WAIT, false);
ltl_atom_update(current, LTL_CLOCK_NANOSLEEP, false);
}
diff --git a/kernel/trace/rv/monitors/sleep/sleep.h b/kernel/trace/rv/monitors/sleep/sleep.h
index 2ab46fd218d2..95dc2727c059 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.h
+++ b/kernel/trace/rv/monitors/sleep/sleep.h
@@ -15,6 +15,7 @@ enum ltl_atom {
LTL_ABORT_SLEEP,
LTL_BLOCK_ON_RT_MUTEX,
LTL_CLOCK_NANOSLEEP,
+ LTL_EPOLL_WAIT,
LTL_FUTEX_LOCK_PI,
LTL_FUTEX_WAIT,
LTL_KERNEL_THREAD,
@@ -40,6 +41,7 @@ static const char *ltl_atom_str(enum ltl_atom atom)
"ab_sl",
"bl_on_rt_mu",
"cl_na",
+ "ep_wa",
"fu_lo_pi",
"fu_wa",
"ker_th",
@@ -75,39 +77,41 @@ static_assert(RV_NUM_BA_STATES <= RV_MAX_BA_STATES);
static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
{
- bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
- bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
- bool val40 = task_is_rcu || task_is_migration;
- bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
- bool val41 = futex_lock_pi || val40;
- bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
- bool val5 = block_on_rt_mutex || val41;
- bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
- bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
- bool val32 = abort_sleep || kthread_should_stop;
bool woken_by_nmi = test_bit(LTL_WOKEN_BY_NMI, mon->atoms);
- bool val33 = woken_by_nmi || val32;
bool woken_by_hardirq = test_bit(LTL_WOKEN_BY_HARDIRQ, mon->atoms);
- bool val34 = woken_by_hardirq || val33;
bool woken_by_equal_or_higher_prio = test_bit(LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
mon->atoms);
- bool val14 = woken_by_equal_or_higher_prio || val34;
bool wake = test_bit(LTL_WAKE, mon->atoms);
- bool val13 = !wake;
- bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
+ bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
+ bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
+ bool sleep = test_bit(LTL_SLEEP, mon->atoms);
+ bool rt = test_bit(LTL_RT, mon->atoms);
+ bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
bool nanosleep_clock_tai = test_bit(LTL_NANOSLEEP_CLOCK_TAI, mon->atoms);
bool nanosleep_clock_monotonic = test_bit(LTL_NANOSLEEP_CLOCK_MONOTONIC, mon->atoms);
- bool val24 = nanosleep_clock_monotonic || nanosleep_clock_tai;
- bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
- bool val25 = nanosleep_timer_abstime && val24;
- bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
- bool val18 = clock_nanosleep && val25;
+ bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
+ bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
bool futex_wait = test_bit(LTL_FUTEX_WAIT, mon->atoms);
- bool val9 = futex_wait || val18;
+ bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
+ bool epoll_wait = test_bit(LTL_EPOLL_WAIT, mon->atoms);
+ bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
+ bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
+ bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
+ bool val42 = task_is_rcu || task_is_migration;
+ bool val43 = futex_lock_pi || val42;
+ bool val5 = block_on_rt_mutex || val43;
+ bool val34 = abort_sleep || kthread_should_stop;
+ bool val35 = woken_by_nmi || val34;
+ bool val36 = woken_by_hardirq || val35;
+ bool val14 = woken_by_equal_or_higher_prio || val36;
+ bool val13 = !wake;
+ bool val26 = nanosleep_clock_monotonic || nanosleep_clock_tai;
+ bool val27 = nanosleep_timer_abstime && val26;
+ bool val18 = clock_nanosleep && val27;
+ bool val20 = val18 || epoll_wait;
+ bool val9 = futex_wait || val20;
bool val11 = val9 || kernel_thread;
- bool sleep = test_bit(LTL_SLEEP, mon->atoms);
bool val2 = !sleep;
- bool rt = test_bit(LTL_RT, mon->atoms);
bool val1 = !rt;
bool val3 = val1 || val2;
@@ -124,39 +128,41 @@ static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
static void
ltl_possible_next_states(struct ltl_monitor *mon, unsigned int state, unsigned long *next)
{
- bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
- bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
- bool val40 = task_is_rcu || task_is_migration;
- bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
- bool val41 = futex_lock_pi || val40;
- bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
- bool val5 = block_on_rt_mutex || val41;
- bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
- bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
- bool val32 = abort_sleep || kthread_should_stop;
bool woken_by_nmi = test_bit(LTL_WOKEN_BY_NMI, mon->atoms);
- bool val33 = woken_by_nmi || val32;
bool woken_by_hardirq = test_bit(LTL_WOKEN_BY_HARDIRQ, mon->atoms);
- bool val34 = woken_by_hardirq || val33;
bool woken_by_equal_or_higher_prio = test_bit(LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
mon->atoms);
- bool val14 = woken_by_equal_or_higher_prio || val34;
bool wake = test_bit(LTL_WAKE, mon->atoms);
- bool val13 = !wake;
- bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
+ bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
+ bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
+ bool sleep = test_bit(LTL_SLEEP, mon->atoms);
+ bool rt = test_bit(LTL_RT, mon->atoms);
+ bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
bool nanosleep_clock_tai = test_bit(LTL_NANOSLEEP_CLOCK_TAI, mon->atoms);
bool nanosleep_clock_monotonic = test_bit(LTL_NANOSLEEP_CLOCK_MONOTONIC, mon->atoms);
- bool val24 = nanosleep_clock_monotonic || nanosleep_clock_tai;
- bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
- bool val25 = nanosleep_timer_abstime && val24;
- bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
- bool val18 = clock_nanosleep && val25;
+ bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
+ bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
bool futex_wait = test_bit(LTL_FUTEX_WAIT, mon->atoms);
- bool val9 = futex_wait || val18;
+ bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
+ bool epoll_wait = test_bit(LTL_EPOLL_WAIT, mon->atoms);
+ bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
+ bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
+ bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
+ bool val42 = task_is_rcu || task_is_migration;
+ bool val43 = futex_lock_pi || val42;
+ bool val5 = block_on_rt_mutex || val43;
+ bool val34 = abort_sleep || kthread_should_stop;
+ bool val35 = woken_by_nmi || val34;
+ bool val36 = woken_by_hardirq || val35;
+ bool val14 = woken_by_equal_or_higher_prio || val36;
+ bool val13 = !wake;
+ bool val26 = nanosleep_clock_monotonic || nanosleep_clock_tai;
+ bool val27 = nanosleep_timer_abstime && val26;
+ bool val18 = clock_nanosleep && val27;
+ bool val20 = val18 || epoll_wait;
+ bool val9 = futex_wait || val20;
bool val11 = val9 || kernel_thread;
- bool sleep = test_bit(LTL_SLEEP, mon->atoms);
bool val2 = !sleep;
- bool rt = test_bit(LTL_RT, mon->atoms);
bool val1 = !rt;
bool val3 = val1 || val2;
diff --git a/tools/verification/models/rtapp/sleep.ltl b/tools/verification/models/rtapp/sleep.ltl
index 6379bbeb6212..6f26c4810f78 100644
--- a/tools/verification/models/rtapp/sleep.ltl
+++ b/tools/verification/models/rtapp/sleep.ltl
@@ -5,6 +5,7 @@ RT_FRIENDLY_SLEEP = (RT_VALID_SLEEP_REASON or KERNEL_THREAD)
RT_VALID_SLEEP_REASON = FUTEX_WAIT
or RT_FRIENDLY_NANOSLEEP
+ or EPOLL_WAIT
RT_FRIENDLY_NANOSLEEP = CLOCK_NANOSLEEP
and NANOSLEEP_TIMER_ABSTIME
--
2.47.3
^ permalink raw reply related
* [BUG] bpf: kprobe_multi allows BPF_F_SLEEPABLE programs, causing sleeping-in-atomic splat
From: Varun R Mallya @ 2026-04-01 13:44 UTC (permalink / raw)
To: bpf
Cc: linux-kernel, kpsingh, mattbobrowski, ast, daniel, andrii,
martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, rostedt,
mhiramat, mathieu.desnoyers, linux-trace-kernel
Hi,
I found a bug in the kprobe_multi BPF link creation path where the kernel
fails to reject programs with BPF_F_SLEEPABLE set.
kprobe.multi programs run in an atomic/RCU context and cannot sleep.
However, bpf_link_create() with BPF_TRACE_KPROBE_MULTI does not validate
whether the program being attached has BPF_F_SLEEPABLE set. This allows
sleepable helpers such as bpf_copy_from_user() to be invoked from a
non-sleepable context, triggering a "sleeping function called from invalid
context" splat.
Reproducer:
/* crash.bpf.c */
SEC("kprobe.multi")
int handle_kprobe_multi_crash(struct pt_regs *ctx)
{
char buf[16];
int ret;
ret = bpf_copy_from_user(buf, sizeof(buf), (void *)ctx->sp);
bpf_printk("bpf_copy_from_user ret: %d\n", ret);
return 0;
}
/* loader: manually set BPF_F_SLEEPABLE before load, then attach via
bpf_link_create() with BPF_TRACE_KPROBE_MULTI */
bpf_program__set_flags(prog, BPF_F_SLEEPABLE);
Kernel output (bpf-next 7.0.0-rc5+):
[ 483.577248] BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:
169
[ 483.577364] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
[ 483.577420] preempt_count: 1, expected: 0
[ 483.577453] RCU nest depth: 2, expected: 0
[ 483.577486] INFO: lockdep is turned off.
Attachment successful! The bug is confirmed.
Check 'sudo dmesg -w' for BPF prints and warnings.
Triggering v[ 483.577569] CfsPU: 7 UID: 0 PID: 1787 Comm: sudo Tainted: G W _
rea7.0.0-rc5+ #4 PREEMPT(full)
d[ 483.577571] T.ainted: [W]=WARN..
Su[ 483.577572] Hrvardware name: Bochs Bochs, BIOS iBochs 01/01/2011v
[ 483.577573] Call Trace:
[ 483.577575] e<TASK>
[ 483.577578] d!dump_stack_lvl+0x54/0x70
[ 483.577582] If __might_resched+b0x200/0x220
p[ 483.577585] f__might_fault+0x_c2c/0x80
o[ 483.577587] p_copy_from_user+y0x23/0x80
_[ 483.577591] bpf_copy_from_user+0x27/0x50
[ 483.577594] fbpf_prog_1906cf6roa66546b5e_handlem__kprobe_multi_crash+0x33/0x4d
u[ 483.577596] skprobe_multi_linek_handler+0x15d/r re0x260
[ 483.577599] t? kprobe_multi_lurink_handler+0x99n/0x260
[ 483.577602] e? __pfx_vfs_readd +0x10/0x10
0[ 483.577604] in? ksys_read+0x7c/0x100
[ 483.577605] dm? vfs_read+0x4/0esg, x360
t[ 483.577607] fprobe_ftrace_entry+0x3b8/0x480
[ 483.577609] h? ksys_read+0x7ce /0x100
e[ 483.577610] x? fprobe_ftrace_pentry+0x93/0x480l
o[ 483.577612] it? lock_release+0x42/0x330
[ 483.577615] ? vfs_read+0x9/0wx360
o[ 483.577616] r? __x64_sys_rt_sigaction+0xde/0xke120
[ 483.577619] ? vfs_read+0x9/0x360
[ 483.577620] d? ksys_read+0x7c./0x100
[ 483.577622]
? do_syscall_64+0x143/0xf80
[ 483.577624] ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 483.577625] ? trace_irq_disable+0x1d/0xc0
[ 483.577627] ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 483.577631] </TASK>
I am sending in a patch that fixes this.
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
^ permalink raw reply
* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Masami Hiramatsu @ 2026-04-01 13:48 UTC (permalink / raw)
To: Breno Leitao
Cc: Jonathan Corbet, Shuah Khan, linux-kernel, linux-trace-kernel,
linux-doc, oss, paulmck, rostedt, kernel-team
In-Reply-To: <acvjcCqIAeHyIiQN@gmail.com>
Hi Breno,
On Tue, 31 Mar 2026 08:27:59 -0700
Breno Leitao <leitao@debian.org> wrote:
> hello Masami,
>
> On Tue, Mar 31, 2026 at 12:58:27PM +0900, Masami Hiramatsu wrote:
>
> > > 3) Ensure that early bootconfig parameters don't overwrite the boot command
> > > line. For example, if the boot command line has foo=bar and bootconfig
> > > later has foo=baz, the command line value should take precedence.
> > > This prevents early boot code (in setup_arch()) from seeing a parameter
> > > value that will be changed later.
> >
> > OK, this also needs to be considered. Currently we just pass the bootconfig
> > parameters right before bootloader given parameters as "extra_command_line"
> > if "bootconfig" in cmdline or CONFIG_BOOT_CONFIG_FORCE=y.
> >
> > [boot_config(.kernel)]<command_line>[ -- [boot_config(.init)][init_command_line]]
> >
> > This is currently expected behavior. The bootconfig parameters are
> > expected to be overridden by command_line or command_line are appended.
>
> That's correct, and I have no intention of changing this behavior. Here's
> the current approach:
>
> 1) Early parameters from the bootloader are parsed first in setup_arch()
>
> 2) Subsequently, bootconfig_apply_early_params() is invoked. Any early
> parameter that was already parsed from the bootloader (in setup_arch())
> will be skipped at this stage.
Ah, I meant if we skip these parameters, we should not show it in the
command line via extra_command_line. This is still a minor issue at this
point. It should find early parameters in kernel.* parameters and do not
show it in extra_command_line, because those parameters are ignored.
So it is better to make a separated patch to fix that.
For example, if we pass
kernel.mem=1G
via bootconfig, it will be shown in the /proc/cmdline, but it is
not applied. This can confuse user.
>
> > If we change this for early params, we also should change the expected
> > output of /proc/cmdline too. I think we have 2 options;
> >
> > - As before, we expect the parameters provided by the boot configuration
> > to be processed first and then overridden later by the command line.
> >
> > Or,
> >
> > - ignore all parameters which is given from the command line, this also
> > updates existing setup_boot_config() (means xbc_snprint_cmdline() ).
> >
> > Anyway, this behavior change will also be a bit critical... We have
> > to announce it.
>
> As mentioned above, I don't anticipate any changes to existing behavior.
> Bootconfig parsing remains unchanged. The only modification is that
> bootconfig_apply_early_params() will skip any early config parameter
> that's already present in the bootloader command line.
Yes, but it is just different from existing one.
Suppose that if we have "early" and "normal" keys in the kernel, those
are handled by early_param() and __setup() respectively.
If we use bootconfig, like
kernel {
early = bconf_val
normal = bconf_val
}
And passes "early=foo normal=bar" via cmdline.
In this case, the /proc/cmdline eventually has
"early=bconf_val normal=bconf_val early=foo normal=bar"
And the "normal" callback called with "bconf_val" and "bar" twice.
However, "early" callback will be called with "foo" only once.
That can confuse users too.
I believe it's important for the system to behave in a way that is
as close as possible to the user's mind model.
Because the behavior is inconsistent when multiple parameters with
the same name are specified on the kernel command line, it is
necessary to ensure that users can later look at it and infer what
happened.
I mean, if a parameter is skipped, it should not be printed at
/proc/cmdline, because it can mislead user (and maybe bug reporter)
when a problem happens.
>
> > > +Note that embedded bootconfig is parsed after ``setup_arch()``, so
> > > +early options that are consumed during architecture initialization
> > > +(e.g., ``mem=``, ``memmap=``, ``earlycon``, ``noapic``, ``nolapic``,
> > > +``acpi=``, ``numa=``, ``iommu=``) may not take effect from bootconfig.
> > > +
> >
> > This is easy to explain, but it's quite troublesome for users to
> > determine which parameters are unavailable.
>
> Agreed. This turned out to be significantly more complex than I
> initially anticipated.
Yeah, that's complicated.
>
> I'm uncertain whether we can accomplish this without examining every
> early_parameter() implementation in depth.
Agreed. My proposal is something like a divide and conquer approach.
Since these are implemented architecture by architecture, you need to
check the implementation for each architecture, and I think it's best
to implement them one by one using Kconfig.
>
> > Currently we can identify
> > it by `git grep early_param -- arch/${ARCH}`. But it is setup in
> > setup_arch() we need to track the source code. (Or ask AI :))
>
> The challenge extends beyond that. There are numerous early_parameter()
> definitions scattered throughout the kernel that may or may not be
> utilized by setup_arch().
>
> For example, consider `early_param("mitigations", ..)` in
> ./kernel/cpu.c. This modifies the cpu_mitigations global variable, which
> is referenced in various locations across different architectures.
>
> It's worth noting that we have over 300 early_parameter() instances in
> the kernel.
>
> Given this, analyzing all these early parameters and examining each one
> individually represents a substantial amount of work.
Yes, that may require a substantial amount of work. But to improve
the kernel framework around the parameter handling, eventually we
need to examine each early parameter.
>
> Are there alternative approaches? At this point, I'm leaning toward
> breaking bootconfig's dependency on memblock, allowing us to invoke it
> before setup_arch(). Is this the only practical solution available?!
Basically, the memblock dependency comes from allocating copy of data.
Only for the embedded bootconfig, we can just pass copy memory block
to the xbc_init(). Something like;
xbc_init() {
xbc_data = memblock_alloc();
memcpy(xbc_data, data);
__xbc_init(xbc_data);
}
embedded_xbc_init() {
__xbc_init(embedded_bootconfig_data);
}
Afterwards, we can pass mixture of embedded bootcofnigt and initrd
bootconfig data to parser again.
(But in this case, we must be careful not to override the early
parameters that we have already applied.)
Thank you,
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [PATCH bpf] bpf: Reject sleepable kprobe_multi programs at attach time
From: Varun R Mallya @ 2026-04-01 13:49 UTC (permalink / raw)
To: varunrmallya
Cc: andrii, ast, bpf, daniel, eddyz87, jolsa, kpsingh, linux-kernel,
linux-trace-kernel, martin.lau, mathieu.desnoyers, mattbobrowski,
memxor, mhiramat, rostedt, song, yonghong.song
In-Reply-To: <ac0hLyMuPHiZfnVc@computer>
kprobe.multi programs run in atomic/RCU context and cannot sleep.
However, bpf_kprobe_multi_link_attach() did not validate whether the
program being attached had the sleepable flag set, allowing sleepable
helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
context.
This causes a "sleeping function called from invalid context" splat:
BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
preempt_count: 1, expected: 0
RCU nest depth: 2, expected: 0
Fix this by rejecting sleepable programs early in
bpf_kprobe_multi_link_attach(), before any further processing.
Fixes: 0dcac272540613d41c05e89679e4ddb978b612f1 ("bpf: Add multi kprobe link")
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
---
kernel/trace/bpf_trace.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0b040a417442..af7079aa0f36 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2752,6 +2752,10 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
if (!is_kprobe_multi(prog))
return -EINVAL;
+ /* kprobe_multi is not allowed to be sleepable. */
+ if (prog->sleepable)
+ return -EINVAL;
+
/* Writing to context is not allowed for kprobes. */
if (prog->aux->kprobe_write_ctx)
return -EINVAL;
--
2.53.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox