* Re: [PATCH v6 12/43] KVM: guest_memfd: Call arch invalidate hooks on conversion
From: Fuad Tabba @ 2026-05-20 14:30 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-12-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> When memory in guest_memfd is converted from private to shared, the
> platform-specific state associated with the guest-private pages must be
> invalidated or cleaned up.
>
> Iterate over the folios in the affected range and call the
> kvm_arch_gmem_invalidate() hook for each PFN range. This allows
> architectures to perform necessary teardown, such as updating hardware
> metadata or encryption states, before the pages are transitioned to the
> shared state.
>
> Invoke this helper after indicating to KVM's mmu code that an invalidation
> is in progress to stop in-flight page faults from succeeding.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Minor nit below, but lgtm.
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> virt/kvm/guest_memfd.c | 41 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 41 insertions(+)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 9d82642a025e9..baf4b88dead1f 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -603,6 +603,42 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> return safe;
> }
>
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> + struct folio_batch fbatch;
> + pgoff_t next = start;
> + int i;
> +
> + folio_batch_init(&fbatch);
> + while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
> + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> + struct folio *folio = fbatch.folios[i];
> + pgoff_t start_index, end_index;
> + kvm_pfn_t start_pfn, end_pfn;
> +
> + start_index = max(start, folio->index);
> + end_index = min(end, folio_next_index(folio));
> + /*
> + * end_index is either in folio or points to
> + * the first page of the next folio. Hence,
> + * all pages in range [start_index, end_index)
> + * are contiguous.
> + */
> + start_pfn = folio_file_pfn(folio, start_index);
> + end_pfn = start_pfn + end_index - start_index;
> +
> + kvm_arch_gmem_invalidate(start_pfn, end_pfn);
> + }
> +
> + folio_batch_release(&fbatch);
> + cond_resched();
> + }
> +}
> +#else
> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
> +#endif
> +
> static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> size_t nr_pages, uint64_t attrs,
> pgoff_t *err_index)
> @@ -643,7 +679,12 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> */
>
> kvm_gmem_invalidate_begin(inode, start, end);
> +
> + if (!to_private)
> + kvm_gmem_invalidate(inode, start, end);
> +
> mas_store_prealloc(&mas, xa_mk_value(attrs));
> +
Why the unrelated extra space?
> kvm_gmem_invalidate_end(inode, start, end);
> out:
> filemap_invalidate_unlock(mapping);
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
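The index arithmetic in kvm_gmem_invalidate() can be looked at in isolation. The following is only a userspace sketch under stated assumptions, not the kernel code: struct fake_folio, struct pfn_range and clamp_to_folio() are hypothetical names made up for illustration, and a folio is modelled as a physically contiguous run of pages whose first PFN is known.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a folio: a physically contiguous run of pages. */
struct fake_folio {
	uint64_t index;     /* first file index covered by the folio */
	uint64_t nr_pages;  /* number of pages in the folio */
	uint64_t first_pfn; /* PFN backing 'index' */
};

/* Half-open PFN range [start_pfn, end_pfn). */
struct pfn_range {
	uint64_t start_pfn;
	uint64_t end_pfn;
};

/*
 * Clamp the half-open index range [start, end) to the folio and translate
 * the result into a PFN range. The caller must pass a range that overlaps
 * the folio, which is what a folio lookup over [start, end) would return.
 */
static struct pfn_range clamp_to_folio(const struct fake_folio *f,
				       uint64_t start, uint64_t end)
{
	uint64_t folio_end = f->index + f->nr_pages;
	uint64_t start_index = start > f->index ? start : f->index;
	uint64_t end_index = end < folio_end ? end : folio_end;
	struct pfn_range r;

	assert(start_index < end_index);

	/* Pages inside one folio are contiguous, so plain offsets suffice. */
	r.start_pfn = f->first_pfn + (start_index - f->index);
	r.end_pfn = r.start_pfn + (end_index - start_index);
	return r;
}
```

With a four-page folio at index 8, a request for [10, 20) is clamped to the last two pages of the folio, while a request that covers the folio entirely yields the whole folio.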
^ permalink raw reply
* Re: [PATCH] kconfig: add optional warnings for changed input values
From: Nicolas Schier @ 2026-05-20 14:31 UTC (permalink / raw)
To: Pengpeng Hou
Cc: Nathan Chancellor, Masahiro Yamada, linux-kbuild, Jonathan Corbet,
Shuah Khan, Randy Dunlap, Thomas Meyer, Miguel Ojeda, linux-doc,
linux-kernel
In-Reply-To: <20260406233001.1-kconfig-warn-changed-input-pengpeng@iscas.ac.cn>
On Mon, Apr 06, 2026 at 11:06:19PM +0800, Pengpeng Hou wrote:
> When reading .config input, Kconfig stores user-provided values first and
> then resolves the final value after applying dependencies, ranges, and
> other constraints.
>
> If the final value differs from the user's input, Kconfig already tracks
> that state internally, but it does not provide any focused diagnostic to
> show which explicit inputs were adjusted. This is particularly confusing
> for requested values that get forced down by unmet dependencies or clamped
> by ranges.
>
> Add an opt-in diagnostic controlled by KCONFIG_WARN_CHANGED_INPUT.
> Emit the warnings from conf_write() and conf_write_defconfig() after
> value resolution and through the existing message callback path so the
> default behavior stays unchanged and interactive frontends remain usable.
>
> Document the new environment variable and add tests for both olddefconfig
> and savedefconfig.
>
> Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
> ---
Thanks a lot for this patch! I know quite a few people who have been
waiting for this feature. Just one minor nit-pick and two minor issues
found by Sashiko; see below.
[...]
> @@ -759,7 +825,10 @@ int conf_write_defconfig(const char *filename)
> {
> struct symbol *sym;
> struct menu *menu;
> + struct gstr gs = str_new();
> FILE *out;
> + bool warn_changed_input = conf_warn_changed_input_enabled();
> + bool found = false;
nit-picking: I'd favor a more descriptive variable name (e.g.
'changed_input_found'), as I expect my future self will otherwise have to
dig into conf_warn_changed_input_enabled() to find out what 'found'
really means.
[...]
> @@ -798,6 +870,13 @@ int conf_write_defconfig(const char *filename)
> print_symbol_for_dotconfig(out, sym);
> }
> fclose(out);
> +
> + conf_clear_written_flags();
> +
> + if (found)
> + conf_message("%s", str_get(&gs));
Sashiko complains [1] that conf_message() may truncate the output to
4096 bytes, which can easily be provoked, e.g. by switching ARCH.
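If the limit applies per call, one possible way around the truncation is to hand the accumulated text to the message callback one line at a time. The sketch below is an assumption-laden illustration, not the Kconfig code: emit_fn, emit_per_line() and track_longest() are hypothetical names, and the callback merely stands in for something like conf_message().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sink; stands in for a size-limited message callback. */
typedef void (*emit_fn)(const char *chunk, size_t len, void *ctx);

/*
 * Split 'text' at newlines and emit every non-empty line on its own, so
 * that no single call has to carry the whole accumulated buffer.
 * Returns the number of chunks emitted.
 */
static size_t emit_per_line(const char *text, emit_fn emit, void *ctx)
{
	size_t chunks = 0;

	assert(text && emit);

	while (*text) {
		const char *nl = strchr(text, '\n');
		size_t len = nl ? (size_t)(nl - text) : strlen(text);

		if (len) {
			emit(text, len, ctx);
			chunks++;
		}
		text += len;
		if (nl)
			text++; /* skip the newline itself */
	}
	return chunks;
}

/* Example sink: remember the longest chunk seen so far. */
static void track_longest(const char *chunk, size_t len, void *ctx)
{
	size_t *longest = ctx;

	(void)chunk;
	if (len > *longest)
		*longest = len;
}
```

The trade-off is that each warning becomes a separate message, which may or may not be what the frontends expect.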
[...]
> @@ -809,7 +888,10 @@ int conf_write(const char *name)
> const char *str;
> char tmpname[PATH_MAX + 1], oldname[PATH_MAX + 1];
> char *env;
> + struct gstr gs = str_new();
> bool need_newline = false;
> + bool warn_changed_input = conf_warn_changed_input_enabled();
> + bool found = false;
>
> if (!name)
> name = conf_get_configname();
> @@ -859,6 +941,8 @@ int conf_write(const char *name)
> } else if (!sym_is_choice(sym) &&
> !(sym->flags & SYMBOL_WRITTEN)) {
> sym_calc_value(sym);
> + if (warn_changed_input)
> + conf_append_changed_input_warning(&gs, sym, &found);
> if (!(sym->flags & SYMBOL_WRITE))
> goto next;
Sashiko asks about possibly duplicated warnings:
| Will duplicate warning messages be emitted for symbols that have multiple menu
| entries and are forced off (so SYMBOL_WRITE is not set)?
| Since this skips the rest of the loop via goto next;, the symbol is never
| marked with SYMBOL_WRITTEN (which happens later in the block). When the menu
| traversal encounters the same symbol at its next menu node, it will process it
| again and redundantly append the exact same warning.
From what I can find in the in-tree Kconfig files, we do not have symbols
that are accessible from multiple menu entries, but it would be good if
someone else could double-check that.
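Should such symbols turn up, a per-symbol marker would be one way to keep the warning from being appended twice. This is only a sketch assuming a spare flag bit were available; struct fake_symbol, FAKE_SYMBOL_WARNED and should_warn() are hypothetical and are not existing Kconfig names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit, not an existing SYMBOL_* flag. */
#define FAKE_SYMBOL_WARNED 0x1u

/* Hypothetical, heavily reduced stand-in for a Kconfig symbol. */
struct fake_symbol {
	const char *name;
	unsigned int flags;
};

/*
 * Return true only the first time it is called for a symbol, so a second
 * visit through another menu node does not repeat the warning.
 */
static bool should_warn(struct fake_symbol *sym)
{
	assert(sym != NULL);

	if (sym->flags & FAKE_SYMBOL_WARNED)
		return false;

	sym->flags |= FAKE_SYMBOL_WARNED;
	return true;
}
```

The marker would have to be cleared again together with the other per-write state, otherwise a second write in the same session would stay silent.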
So, thanks again for this small but great feature!
Tested-by: Nicolas Schier <nsc@kernel.org>
Reviewed-by: Nicolas Schier <nsc@kernel.org>
Thanks!
[1]: http://sashiko.dev/#/patchset/20260406233001.1-kconfig-warn-changed-input-pengpeng%40iscas.ac.cn
^ permalink raw reply
* Re: [PATCH v4 2/3] drm/panthor: Implement evicted status for GEM objects
From: Boris Brezillon @ 2026-05-20 14:33 UTC (permalink / raw)
To: Nicolas Frattaroli
Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Steven Price, Liviu Dudau, Jonathan Corbet,
Shuah Khan, Tvrtko Ursulin, dri-devel, linux-kernel, kernel,
linux-doc
In-Reply-To: <20260520-panthor-bo-reclaim-observability-v4-2-a47ab61cb80d@collabora.com>
On Wed, 20 May 2026 15:04:49 +0200
Nicolas Frattaroli <nicolas.frattaroli@collabora.com> wrote:
> For fdinfo to be able to fill its evicted counter with data, panthor
> needs to keep track of whether a GEM object has ever been reclaimed.
> Just checking whether the pages are resident isn't enough, as newly
> allocated objects also won't be resident.
>
> Do this with a new atomic_t member on panthor_gem_object. It's increased
> when an object gets evicted by the shrinker, and saturates at INT_MAX.
> This means that once an object has been evicted at least once, its
> reclaim counter will never return to 0.
>
> Due to this, it's possible to distinguish evicted non-resident pages
> from newly allocated non-resident pages by checking whether
> reclaimed_count is != 0
>
> Use this new member to then set the appropriate DRM_GEM_OBJECT_EVICTED
> status flag for fdinfo.
>
> Also add a new column and status flag to the panthor gems debugfs: the
> column is the number of times an object has been evicted, whereas the
> flag indicates whether it currently is evicted.
>
> Reviewed-by: Steven Price <steven.price@arm.com>
> Signed-off-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
> ---
> drivers/gpu/drm/panthor/panthor_gem.c | 18 ++++++++++++++----
> drivers/gpu/drm/panthor/panthor_gem.h | 10 ++++++++++
> 2 files changed, 24 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
> index 13295d7a593d..068aa935c8fc 100644
> --- a/drivers/gpu/drm/panthor/panthor_gem.c
> +++ b/drivers/gpu/drm/panthor/panthor_gem.c
> @@ -687,6 +687,8 @@ static void panthor_gem_evict_locked(struct panthor_gem_object *bo)
> if (drm_WARN_ON_ONCE(bo->base.dev, !bo->backing.pages))
> return;
>
> + atomic_add_unless(&bo->reclaimed_count, 1, INT_MAX);
> +
> panthor_gem_dev_map_cleanup_locked(bo);
> panthor_gem_backing_cleanup_locked(bo);
> panthor_gem_update_reclaim_state_locked(bo, NULL);
> @@ -788,6 +790,8 @@ static enum drm_gem_object_status panthor_gem_status(struct drm_gem_object *obj)
>
> if (drm_gem_is_imported(&bo->base) || bo->backing.pages)
> res |= DRM_GEM_OBJECT_RESIDENT;
> + else if (atomic_read(&bo->reclaimed_count))
> + res |= DRM_GEM_OBJECT_EVICTED;
Could we drop that change so we can at least have patch 2 and 3 merged
while the discussion on the fdinfo semantics is going on?
>
> return res;
> }
> @@ -1595,6 +1599,7 @@ static void panthor_gem_debugfs_print_flag_names(struct seq_file *m)
> static const char * const gem_state_flags_names[] = {
> [PANTHOR_DEBUGFS_GEM_STATE_IMPORTED_BIT] = "imported",
> [PANTHOR_DEBUGFS_GEM_STATE_EXPORTED_BIT] = "exported",
> + [PANTHOR_DEBUGFS_GEM_STATE_EVICTED_BIT] = "evicted",
> };
>
> static const char * const gem_usage_flags_names[] = {
> @@ -1625,6 +1630,7 @@ static void panthor_gem_debugfs_bo_print(struct panthor_gem_object *bo,
> {
> enum panthor_gem_reclaim_state reclaim_state = bo->reclaim_state;
> unsigned int refcount = kref_read(&bo->base.refcount);
> + int reclaimed_count = atomic_read(&bo->reclaimed_count);
> char creator_info[32] = {};
> size_t resident_size;
> u32 gem_usage_flags = bo->debugfs.flags;
> @@ -1638,16 +1644,20 @@ static void panthor_gem_debugfs_bo_print(struct panthor_gem_object *bo,
>
> snprintf(creator_info, sizeof(creator_info),
> "%s/%d", bo->debugfs.creator.process_name, bo->debugfs.creator.tgid);
> - seq_printf(m, "%-32s%-16d%-16d%-16zd%-16zd0x%-16lx",
> + seq_printf(m, "%-32s%-16d%-16d%-11d%-16zd%-16zd0x%-16lx",
> creator_info,
> bo->base.name,
> refcount,
> + reclaimed_count,
> bo->base.size,
> resident_size,
> drm_vma_node_start(&bo->base.vma_node));
>
> if (drm_gem_is_imported(&bo->base))
> gem_state_flags |= PANTHOR_DEBUGFS_GEM_STATE_FLAG_IMPORTED;
> + else if (!resident_size && reclaimed_count)
> + gem_state_flags |= PANTHOR_DEBUGFS_GEM_STATE_FLAG_EVICTED;
> +
> if (bo->base.dma_buf)
> gem_state_flags |= PANTHOR_DEBUGFS_GEM_STATE_FLAG_EXPORTED;
>
> @@ -1671,8 +1681,8 @@ static void panthor_gem_debugfs_print_bos(struct panthor_device *ptdev,
>
> panthor_gem_debugfs_print_flag_names(m);
>
> - seq_puts(m, "created-by global-name refcount size resident-size file-offset state usage label\n");
> - seq_puts(m, "----------------------------------------------------------------------------------------------------------------------------------------------\n");
> + seq_puts(m, "created-by global-name refcount evictions size resident-size file-offset state usage label\n");
> + seq_puts(m, "---------------------------------------------------------------------------------------------------------------------------------------------------------\n");
>
> scoped_guard(mutex, &ptdev->gems.lock) {
> list_for_each_entry(bo, &ptdev->gems.node, debugfs.node) {
> @@ -1680,7 +1690,7 @@ static void panthor_gem_debugfs_print_bos(struct panthor_device *ptdev,
> }
> }
>
> - seq_puts(m, "==============================================================================================================================================\n");
> + seq_puts(m, "=========================================================================================================================================================\n");
> seq_printf(m, "Total size: %zd, Total resident: %zd, Total reclaimable: %zd\n",
> totals.size, totals.resident, totals.reclaimable);
> }
> diff --git a/drivers/gpu/drm/panthor/panthor_gem.h b/drivers/gpu/drm/panthor/panthor_gem.h
> index ae0491d0b121..56d63137b4eb 100644
> --- a/drivers/gpu/drm/panthor/panthor_gem.h
> +++ b/drivers/gpu/drm/panthor/panthor_gem.h
> @@ -19,12 +19,16 @@ struct panthor_vm;
> enum panthor_debugfs_gem_state_flags {
> PANTHOR_DEBUGFS_GEM_STATE_IMPORTED_BIT = 0,
> PANTHOR_DEBUGFS_GEM_STATE_EXPORTED_BIT = 1,
> + PANTHOR_DEBUGFS_GEM_STATE_EVICTED_BIT = 2,
>
> /** @PANTHOR_DEBUGFS_GEM_STATE_FLAG_IMPORTED: GEM BO is PRIME imported. */
> PANTHOR_DEBUGFS_GEM_STATE_FLAG_IMPORTED = BIT(PANTHOR_DEBUGFS_GEM_STATE_IMPORTED_BIT),
>
> /** @PANTHOR_DEBUGFS_GEM_STATE_FLAG_EXPORTED: GEM BO is PRIME exported. */
> PANTHOR_DEBUGFS_GEM_STATE_FLAG_EXPORTED = BIT(PANTHOR_DEBUGFS_GEM_STATE_EXPORTED_BIT),
> +
> + /** @PANTHOR_DEBUGFS_GEM_STATE_FLAG_EVICTED: GEM BO is evicted to swap. */
> + PANTHOR_DEBUGFS_GEM_STATE_FLAG_EVICTED = BIT(PANTHOR_DEBUGFS_GEM_STATE_EVICTED_BIT),
> };
>
> enum panthor_debugfs_gem_usage_flags {
> @@ -172,6 +176,12 @@ struct panthor_gem_object {
> /** @reclaim_state: Cached reclaim state */
> enum panthor_gem_reclaim_state reclaim_state;
>
> + /**
> + * @reclaimed_count: How many times object has been evicted to swap.
> + * The count saturates at %INT_MAX and will never wrap around to 0.
> + */
> + atomic_t reclaimed_count;
> +
> /**
> * @exclusive_vm_root_gem: Root GEM of the exclusive VM this GEM object
> * is attached to.
>
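The saturating behaviour described in the commit message can be mimicked in userspace with C11 atomics. This is a rough sketch and not the kernel implementation: add_unless() only imitates the semantics of the kernel's atomic_add_unless() as used in the patch, and count_eviction() is a hypothetical helper name.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Add 'a' to '*v' unless '*v' currently equals 'u'.
 * Returns true if the addition was performed.
 */
static bool add_unless(atomic_int *v, int a, int u)
{
	int old;

	assert(v != NULL);

	old = atomic_load(v);
	while (old != u) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + a))
			return true;
	}
	return false;
}

/* Count one eviction; the counter sticks at INT_MAX and never wraps. */
static void count_eviction(atomic_int *reclaimed_count)
{
	add_unless(reclaimed_count, 1, INT_MAX);
}
```

Because the counter can only move away from 0 and never wraps, a non-zero value keeps meaning "evicted at least once", which is the property the status check relies on.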
^ permalink raw reply
* Re: (subset) [PATCH v4 1/1] leds: Introduce the multi_max_intensity sysfs attribute
From: Lee Jones @ 2026-05-20 14:34 UTC (permalink / raw)
To: lee, pavel, Armin Wolf
Cc: linux-kernel, corbet, skhan, linux-leds, linux-doc, wse,
jacek.anaszewski, pobrn, m.tretter
In-Reply-To: <20260509214603.262368-2-W_Armin@gmx.de>
On Sat, 09 May 2026 23:46:03 +0200, Armin Wolf wrote:
> Some multicolor LEDs support global brightness control in hardware,
> meaning that the maximum intensity of the color components is not
> connected to the maximum global brightness. Such LEDs cannot be
> described properly by the current multicolor LED class interface,
> because it assumes that the maximum intensity of each color component
> is described by the maximum global brightness of the LED.
>
> [...]
Applied, thanks!
[1/1] leds: Introduce the multi_max_intensity sysfs attribute
commit: b1a9b7a904af2c793850f83a4801a013a718fc47
--
Lee Jones [李琼斯]
^ permalink raw reply
* [PATCH bpf-next] bpf, docs: add LOAD_AQCUIRE and STORE_RELEASE instructions
From: Alexis Lothoré (eBPF Foundation) @ 2026-05-20 14:36 UTC (permalink / raw)
To: David Vernet, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
Jonathan Corbet, Shuah Khan
Cc: ebpf, Bastien Curutchet, Thomas Petazzoni, bpf, bpf, linux-doc,
linux-kernel, Alexis Lothoré (eBPF Foundation)
Commit 880442305a39 ("bpf: Introduce load-acquire and store-release
instructions") introduced the LOAD_ACQUIRE and STORE_RELEASE atomic
instruction modifiers. Those are currently not described in the
documentation, despite being used in the verifier and the various JIT
compilers supporting them.
Add the missing entries in the instruction set documentation.
Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
---
.../bpf/standardization/instruction-set.rst | 21 ++++++++++++++-------
1 file changed, 14 insertions(+), 7 deletions(-)
diff --git a/Documentation/bpf/standardization/instruction-set.rst b/Documentation/bpf/standardization/instruction-set.rst
index 39c74611752b..4f10bcd03150 100644
--- a/Documentation/bpf/standardization/instruction-set.rst
+++ b/Documentation/bpf/standardization/instruction-set.rst
@@ -695,22 +695,24 @@ arithmetic operations in the 'imm' field to encode the atomic operation:
*(u64 *)(dst + offset) += src
In addition to the simple atomic operations, there also is a modifier and
-two complex atomic operations:
+four complex atomic operations:
.. table:: Complex atomic operations
=========== ================ ===========================
imm value description
=========== ================ ===========================
- FETCH 0x01 modifier: return old value
- XCHG 0xe0 | FETCH atomic exchange
- CMPXCHG 0xf0 | FETCH atomic compare and exchange
+ FETCH 0x0001 modifier: return old value
+ XCHG 0x00e0 | FETCH atomic exchange
+ CMPXCHG 0x00f0 | FETCH atomic compare and exchange
+ LOAD_ACQ 0x0100 atomic load with barrier
+ STORE_REL 0x0110 atomic store with barrier
=========== ================ ===========================
The ``FETCH`` modifier is optional for simple atomic operations, and
-always set for the complex atomic operations. If the ``FETCH`` flag
-is set, then the operation also overwrites ``src`` with the value that
-was in memory before it was modified.
+always set for the ``XCHG`` and ``CMPXCHG`` complex atomic operations. If
+the ``FETCH`` flag is set, then the operation also overwrites ``src`` with
+the value that was in memory before it was modified.
The ``XCHG`` operation atomically exchanges ``src`` with the value
addressed by ``dst + offset``.
@@ -721,6 +723,11 @@ The ``CMPXCHG`` operation atomically compares the value addressed by
value that was at ``dst + offset`` before the operation is zero-extended
and loaded back to ``R0``.
+The ``LOAD_ACQ`` and ``STORE_REL`` operations implement lighter LOAD and
+STORE memory barriers than full barriers. The corresponding accesses must
+be aligned, but are allowed for any access size (8-bit up to 64-bit
+operations).
+
64-bit immediate instructions
-----------------------------
---
base-commit: ceeb3aa37bff895116944acf4347fcded0b7692d
change-id: 20260520-bpf-insn-doc-756b369ca328
Best regards,
--
Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
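For readers less familiar with acquire/release ordering, the usual pairing can be shown with C11 atomics. This is a userspace analogy only, assuming the BPF instructions follow the conventional acquire/release semantics; struct mailbox, publish() and try_consume() are hypothetical names and nothing here is BPF code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical one-slot mailbox: plain payload guarded by an atomic flag. */
struct mailbox {
	int payload;
	atomic_int ready;
};

/* Writer: fill the payload, then publish it with a store-release. */
static void publish(struct mailbox *m, int value)
{
	assert(m != NULL);

	m->payload = value;
	atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/*
 * Reader: a load-acquire that observes ready == 1 also guarantees that
 * the payload written before the matching store-release is visible.
 * Returns 1 and fills '*out' on success, 0 if nothing was published yet.
 */
static int try_consume(struct mailbox *m, int *out)
{
	assert(m != NULL && out != NULL);

	if (!atomic_load_explicit(&m->ready, memory_order_acquire))
		return 0;

	*out = m->payload;
	return 1;
}
```

Neither side needs a full barrier here: the release store orders the earlier payload write before the flag, and the acquire load orders the later payload read after it.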
^ permalink raw reply related
* Re: [PATCH mm-unstable v17 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
From: David Hildenbrand (Arm) @ 2026-05-20 14:43 UTC (permalink / raw)
To: Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, Usama Arif
In-Reply-To: <CAA1CXcCD5ooRJonAVp2LvnoCrQwcs1-NsAYomXbHTVNSe5X0cw@mail.gmail.com>
>> Calculate maximum allowed empty PTEs or PTEs mapping the shared zeropage ... ?
>>
>>> + * PTEs for the given collapse operation.
>>
>> We usually indent here (second line of subject), I think. Same applies to the
>> other doc below.
>
> Hmm tbh I couldn't find an example of what you meant here. There are
> some that put a space between the first sentence and the @ list.
Yeah, we usually try to make it fit in a single line.
But never mind, leave it as is.
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH v6 13/43] KVM: guest_memfd: Return early if range already has requested attributes
From: Fuad Tabba @ 2026-05-20 14:44 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-13-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Extract a helper out of kvm_gmem_range_is_private() that checks that a
> range has given attributes.
>
> Optimize setting memory attributes by returning early if all pages in the
> requested range already have the requested attributes.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> virt/kvm/guest_memfd.c | 33 +++++++++++++++++++++++----------
> 1 file changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index baf4b88dead1f..034b72b4947fb 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -86,6 +86,23 @@ static bool kvm_gmem_is_shared_mem(struct inode *inode, pgoff_t index)
> return !kvm_gmem_is_private_mem(inode, index);
> }
>
> +static bool kvm_gmem_range_has_attributes(struct maple_tree *mt,
> + pgoff_t index, size_t nr_pages,
> + u64 attributes)
> +{
> + pgoff_t end = index + nr_pages - 1;
> + void *entry;
> +
> + lockdep_assert(mt_lock_is_held(mt));
> +
> + mt_for_each(mt, entry, index, end) {
> + if (xa_to_value(entry) != attributes)
> + return false;
> + }
> +
> + return true;
> +}
> +
> static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
> pgoff_t index, struct folio *folio)
> {
> @@ -649,12 +666,15 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> pgoff_t end = start + nr_pages;
> struct maple_tree *mt;
> struct ma_state mas;
> - int r;
> + int r = 0;
>
> mt = &gi->attributes;
>
> filemap_invalidate_lock(mapping);
>
> + if (kvm_gmem_range_has_attributes(mt, start, nr_pages, attrs))
> + goto out;
> +
> mas_init(&mas, mt, start);
> r = kvm_gmem_mas_preallocate(&mas, attrs, start, nr_pages);
> if (r) {
> @@ -1140,20 +1160,13 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
> static bool kvm_gmem_range_is_private(struct gmem_inode *gi, pgoff_t index,
> size_t nr_pages, struct kvm *kvm, gfn_t gfn)
> {
> - pgoff_t end = index + nr_pages - 1;
> - void *entry;
> -
> if (vm_memory_attributes)
> return kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + nr_pages,
> KVM_MEMORY_ATTRIBUTE_PRIVATE,
> KVM_MEMORY_ATTRIBUTE_PRIVATE);
>
> - mt_for_each(&gi->attributes, entry, index, end) {
> - if (xa_to_value(entry) != KVM_MEMORY_ATTRIBUTE_PRIVATE)
> - return false;
> - }
> -
> - return true;
> + return kvm_gmem_range_has_attributes(&gi->attributes, index, nr_pages,
> + KVM_MEMORY_ATTRIBUTE_PRIVATE);
> }
>
> static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
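The early-return idea can be sketched without the maple tree. The code below is a simplified model under explicit assumptions, not the guest_memfd code: attributes live in a flat array with one entry per page, and range_has_attributes() and set_attributes() are hypothetical userspace helpers. Unlike mt_for_each(), which only visits entries that exist in the tree, the array has an entry for every index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if every page in [index, index + nr_pages) already has 'want'. */
static bool range_has_attributes(const uint64_t *attrs, size_t index,
				 size_t nr_pages, uint64_t want)
{
	size_t i;

	for (i = index; i < index + nr_pages; i++) {
		if (attrs[i] != want)
			return false;
	}
	return true;
}

/*
 * Set 'want' on [index, index + nr_pages) of an array with 'len' entries.
 * Returns the number of entries written; 0 means the range already had
 * the requested attributes and the expensive path was skipped.
 */
static size_t set_attributes(uint64_t *attrs, size_t len, size_t index,
			     size_t nr_pages, uint64_t want)
{
	size_t i;

	assert(index + nr_pages <= len);

	if (range_has_attributes(attrs, index, nr_pages, want))
		return 0;

	for (i = index; i < index + nr_pages; i++)
		attrs[i] = want;

	return nr_pages;
}
```

In the patch the skipped work is the preallocation and invalidation sequence rather than a simple loop, so the saving is larger than this model suggests.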
^ permalink raw reply
* Re: [PATCH v4 1/4] Introducing pw_lock() and per-cpu queue & flush work
From: Frederic Weisbecker @ 2026-05-20 14:47 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Leonardo Bras, Jonathan Corbet, Shuah Khan, Peter Zijlstra,
Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
Randy Dunlap, Feng Tang, Dapeng Mi, Kees Cook, Marco Elver,
Jakub Kicinski, Li RongQing, Eric Biggers, Paul E. McKenney,
Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
Thomas Weißschuh, Thomas Gleixner, Douglas Anderson,
Gary Guo, Christian Brauner, Pasha Tatashin, Coiby Xu,
Masahiro Yamada, linux-doc, linux-kernel, linux-mm,
linux-rt-devel, Marcelo Tosatti
In-Reply-To: <20260520134832.WS7TrMnu@linutronix.de>
Le Wed, May 20, 2026 at 03:48:32PM +0200, Sebastian Andrzej Siewior a écrit :
> How likely is it that you had users before late_initcall()? Also
> can it happen that one of them uses one function to lock and the other
> unlock in this brief window? There is no check if this was used before
> static_branch usage.
Or let alone initialization on the wrong member of the union.
--
Frederic Weisbecker
SUSE Labs
^ permalink raw reply
* Re: [PATCH v4 4/4] slub: apply new pw_queue_on() interface
From: Sebastian Andrzej Siewior @ 2026-05-20 14:53 UTC (permalink / raw)
To: Leonardo Bras
Cc: Jonathan Corbet, Shuah Khan, Peter Zijlstra, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
Randy Dunlap, Feng Tang, Dapeng Mi, Kees Cook, Marco Elver,
Jakub Kicinski, Li RongQing, Eric Biggers, Paul E. McKenney,
Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
Thomas Weißschuh, Thomas Gleixner, Douglas Anderson,
Gary Guo, Christian Brauner, Pasha Tatashin, Coiby Xu,
Masahiro Yamada, Frederic Weisbecker, linux-doc, linux-kernel,
linux-mm, linux-rt-devel, Marcelo Tosatti
In-Reply-To: <20260519012754.240804-5-leobras.c@gmail.com>
On 2026-05-18 22:27:50 [-0300], Leonardo Bras wrote:
> @@ -4733,121 +4735,121 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
>
> /*
> * We assume the percpu sheaves contain only local objects although it's
> * not completely guaranteed, so we verify later.
> */
> if (unlikely(node_requested && node != numa_mem_id())) {
> stat(s, ALLOC_NODE_MISMATCH);
> return NULL;
> }
>
> - if (!local_trylock(&s->cpu_sheaves->lock))
> + if (!pw_trylock_local(&s->cpu_sheaves->lock))
> return NULL;
alloc_from_pcs() can be called from kmalloc_nolock()/ NMI context.
I don't remember why exactly local_trylock_t was introduced here instead
of a per-CPU spinlock_t. But there should be nothing wrong with a
trylock on it from NMI as you do here.
One thing worth noting, on !PREEMPT_RT, spin_trylock() always succeeds
on UP. kmalloc_nolock() checks for it, not sure about other callers.
Sebastian
^ permalink raw reply
* Re: [PATCH v4 3/4] swap: apply new pw_queue_on() interface
From: Sebastian Andrzej Siewior @ 2026-05-20 15:07 UTC (permalink / raw)
To: Leonardo Bras
Cc: Jonathan Corbet, Shuah Khan, Peter Zijlstra, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
Randy Dunlap, Feng Tang, Dapeng Mi, Kees Cook, Marco Elver,
Jakub Kicinski, Li RongQing, Eric Biggers, Paul E. McKenney,
Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
Thomas Weißschuh, Thomas Gleixner, Douglas Anderson,
Gary Guo, Christian Brauner, Pasha Tatashin, Coiby Xu,
Masahiro Yamada, Frederic Weisbecker, linux-doc, linux-kernel,
linux-mm, linux-rt-devel, Marcelo Tosatti
In-Reply-To: <20260519012754.240804-4-leobras.c@gmail.com>
On 2026-05-18 22:27:49 [-0300], Leonardo Bras wrote:
after digesting the slub patch,
> @@ -882,38 +879,38 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
> * If the paired barrier is done at any later step, e.g. after the
> * loop, CPU #x will just exit at (C) and miss flushing out all of its
> * added pages.
> */
> WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> smp_mb();
>
> cpumask_clear(&has_mm_work);
> cpumask_clear(&has_bh_work);
> for_each_online_cpu(cpu) {
> - struct work_struct *mm_work = &per_cpu(lru_add_drain_work, cpu);
> + struct pw_struct *mm_pw = &per_cpu(lru_add_drain_pw, cpu);
> struct work_struct *bh_work = &per_cpu(bh_add_drain_work, cpu);
>
> if (cpu_needs_mm_drain(cpu)) {
> - INIT_WORK(mm_work, lru_add_drain_per_cpu);
> - queue_work_on(cpu, mm_percpu_wq, mm_work);
> + INIT_PW(mm_pw, lru_add_drain_per_cpu, cpu);
> + pw_queue_on(cpu, mm_percpu_wq, mm_pw);
> __cpumask_set_cpu(cpu, &has_mm_work);
> }
>
> if (cpu_needs_bh_drain(cpu)) {
> INIT_WORK(bh_work, bh_add_drain_per_cpu);
> queue_work_on(cpu, mm_percpu_wq, bh_work);
> __cpumask_set_cpu(cpu, &has_bh_work);
> }
> }
>
> for_each_cpu(cpu, &has_mm_work)
> - flush_work(&per_cpu(lru_add_drain_work, cpu));
> + pw_flush(&per_cpu(lru_add_drain_pw, cpu));
>
> for_each_cpu(cpu, &has_bh_work)
> flush_work(&per_cpu(bh_add_drain_work, cpu));
Why do we have two iterations here? Is it just a proof of concept that
is not complete yet? I am curious why it is okay/needed to "remove" the
one workqueue but not the other. Maybe the one does not bother as much
as the other does.
But essentially we can't use a spinlock_t here because, due to the
hotpath nature of the code, it would kill performance. So instead we do
it anyway, but behind a switch, so that the only ones who pay for it are
those who do not want to suffer from workqueue interruptions on a NOHZ
full system, right?
I thought that this improved since commit
ff042f4a9b050 ("mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu")
Did it get worse or was it not entirely gone?
> done:
> mutex_unlock(&lock);
> }
>
> void lru_add_drain_all(void)
> {
Sebastian
^ permalink raw reply
* Re: [PATCH bpf-next] bpf, docs: add LOAD_AQCUIRE and STORE_RELEASE instructions
From: bot+bpf-ci @ 2026-05-20 15:18 UTC (permalink / raw)
To: alexis.lothore, void, ast, daniel, andrii, martin.lau, eddyz87,
memxor, song, yonghong.song, jolsa, corbet, skhan
Cc: ebpf, bastien.curutchet, thomas.petazzoni, bpf, bpf, linux-doc,
linux-kernel, alexis.lothore, ast, andrii, daniel, martin.lau,
eddyz87, yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260520-bpf-insn-doc-v1-1-74d7dada9bfc@bootlin.com>
[-- Attachment #1: Type: text/plain, Size: 4336 bytes --]
> diff --git a/Documentation/bpf/standardization/instruction-set.rst b/Documentation/bpf/standardization/instruction-set.rst
> --- a/Documentation/bpf/standardization/instruction-set.rst
> +++ b/Documentation/bpf/standardization/instruction-set.rst
> @@ -695,22 +695,24 @@
> *(u64 *)(dst + offset) += src
>
> In addition to the simple atomic operations, there also is a modifier and
> -two complex atomic operations:
> +four complex atomic operations:
>
> .. table:: Complex atomic operations
>
> =========== ================ ===========================
> imm value description
> =========== ================ ===========================
> - FETCH 0x01 modifier: return old value
> - XCHG 0xe0 | FETCH atomic exchange
> - CMPXCHG 0xf0 | FETCH atomic compare and exchange
> + FETCH 0x0001 modifier: return old value
> + XCHG 0x00e0 | FETCH atomic exchange
> + CMPXCHG 0x00f0 | FETCH atomic compare and exchange
> + LOAD_ACQ 0x0100 atomic load with barrier
> + STORE_REL 0x0110 atomic store with barrier
> =========== ================ ===========================
>
> The ``FETCH`` modifier is optional for simple atomic operations, and
> -always set for the complex atomic operations. If the ``FETCH`` flag
> -is set, then the operation also overwrites ``src`` with the value that
> -was in memory before it was modified.
> +always set for the ``XCHG`` and ``CMPXCHG`` complex atomic operations. If
> +the ``FETCH`` flag is set, then the operation also overwrites ``src`` with
> +the value that was in memory before it was modified.
>
> The ``XCHG`` operation atomically exchanges ``src`` with the value
> addressed by ``dst + offset``.
[ ... ]
> @@ -721,6 +723,11 @@
> value that was at ``dst + offset`` before the operation is zero-extended
> and loaded back to ``R0``.
>
> +The ``LOAD_ACQ`` and ``STORE_REL`` operations implement lighter LOAD and
> +STORE memory barriers than full barriers. The corresponding accesses must
> +be aligned, but are allowed for any access size (8-bit up to 64-bit
> +operations).
Does the LOAD_ACQ documentation need to clarify its non-standard register
semantics? BPF atomic operations are encoded as STX instructions, which
conventionally use dst as the memory base pointer and src as the value
register. However, LOAD_ACQ reverses this pattern by using src as the
memory base pointer and dst as the destination register for the fetched
value, acting like an LDX instruction.
Additionally, should the documentation clarify that the loaded value is
zero-extended into the 64-bit destination register for 8, 16, and 32-bit
accesses? Omitting these details might lead to incorrect implementations
based on standard STX rules.
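To make the register-role point concrete, here is a minimal C sketch. The
imm constants are copied from the quoted table; the helper names
(atomic_base_is_src, load_acq_u8) are hypothetical and are not taken from
the kernel, the verifier, or any JIT. It assumes GCC or Clang for the
__atomic_load_n builtin, which is used only as a stand-in for what a
runtime might do for an 8-bit LOAD_ACQ.

```c
#include <stdbool.h>
#include <stdint.h>

/* imm values copied from the quoted "Complex atomic operations" table. */
enum {
	FETCH     = 0x0001,
	XCHG      = 0x00e0 | FETCH,
	CMPXCHG   = 0x00f0 | FETCH,
	LOAD_ACQ  = 0x0100,
	STORE_REL = 0x0110,
};

/*
 * Hypothetical helper: true when the memory base pointer lives in src
 * rather than dst. Per the review comment above, only LOAD_ACQ behaves
 * this way; every other atomic operation addresses dst + offset.
 */
static bool atomic_base_is_src(int32_t imm)
{
	return imm == LOAD_ACQ;
}

/*
 * Hypothetical model of an 8-bit LOAD_ACQ: an acquire load whose result
 * is zero-extended into a 64-bit destination register.
 */
static uint64_t load_acq_u8(const uint8_t *src)
{
	return (uint64_t)__atomic_load_n(src, __ATOMIC_ACQUIRE);
}
```

An implementation that applied the usual STX convention to LOAD_ACQ would
dereference dst instead of src, which is the failure mode the question
above is about.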
This contradiction creates specification ambiguity: the documentation
states that LOAD_ACQ and STORE_REL "are allowed for any access size
(8-bit up to 64-bit operations)", but earlier in the document (line 671)
explicitly states "8-bit and 16-bit wide atomic operations are not
supported."
The instruction encoding section (lines 667-671) only defines two size
encodings for ATOMIC mode:
- {ATOMIC, W, STX} for 32-bit operations
- {ATOMIC, DW, STX} for 64-bit operations
No encoding is shown for B (byte/8-bit) or H (half-word/16-bit) with
ATOMIC mode.
If LOAD_ACQ/STORE_REL truly support 8-bit and 16-bit sizes, shouldn't the
encoding section document how B and H size modifiers work with ATOMIC
mode, and line 671 clarify the exception?
Alternatively, if LOAD_ACQ/STORE_REL only support 32-bit and 64-bit sizes
like other atomic operations, should line 728 be corrected to state
"32-bit and 64-bit operations" instead of "8-bit up to 64-bit operations"?
For a standardized instruction set specification used by multiple
implementations (verifier, JIT compilers, potentially other BPF runtimes),
internal contradictions can lead to implementation divergence and
interoperability issues.
> +
> 64-bit immediate instructions
> -----------------------------
>
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26170628668
^ permalink raw reply
* Re: [PATCH v6 14/43] KVM: guest_memfd: Advertise KVM_SET_MEMORY_ATTRIBUTES2 ioctl
From: Fuad Tabba @ 2026-05-20 15:22 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-14-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Introduce KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES to advertise the
> availability of the KVM_SET_MEMORY_ATTRIBUTES2 ioctl.
>
> KVM_SET_MEMORY_ATTRIBUTES2 is a guest_memfd-scoped version of the existing
> KVM_SET_MEMORY_ATTRIBUTES VM ioctl. It allows userspace to manage memory
> attributes, such as KVM_MEMORY_ATTRIBUTE_PRIVATE, directly on a guest_memfd
> file descriptor.
>
> This new version uses struct kvm_memory_attributes2, which adds an
> error_offset field to the output. This allows KVM to return the specific
> offset that triggered an error, which is especially useful for handling
> EAGAIN results caused by transient page reference counts during attribute
> conversions.
>
> Update the KVM API documentation to define the new ioctl and its behavior,
> and add the necessary UAPI definitions and capability checks.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Suggested-by: Michael Roth <michael.roth@amd.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> Documentation/virt/kvm/api.rst | 78 +++++++++++++++++++++++++++++++++++++++++-
> include/uapi/linux/kvm.h | 2 ++
> virt/kvm/kvm_main.c | 5 +++
> 3 files changed, 84 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 52bbbb553ce10..55c2701d9ed49 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -117,7 +117,7 @@ description:
> x86 includes both i386 and x86_64.
>
> Type:
> - system, vm, or vcpu.
> + system, vm, vcpu or guest_memfd.
>
> Parameters:
> what parameters are accepted by the ioctl.
> @@ -6361,6 +6361,8 @@ S390:
> Returns -EINVAL if the VM has the KVM_VM_S390_UCONTROL flag set.
> Returns -EINVAL if called on a protected VM.
>
> +.. _KVM_SET_MEMORY_ATTRIBUTES:
> +
> 4.141 KVM_SET_MEMORY_ATTRIBUTES
> -------------------------------
>
> @@ -6553,6 +6555,80 @@ KVM_S390_KEYOP_SSKE
> Sets the storage key for the guest address ``guest_addr`` to the key
> specified in ``key``, returning the previous value in ``key``.
>
> +4.145 KVM_SET_MEMORY_ATTRIBUTES2
> +---------------------------------
> +
> +:Capability: KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES
> +:Architectures: all
> +:Type: guest_memfd ioctl
> +:Parameters: struct kvm_memory_attributes2 (in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Errors:
> +
> + ========== ===============================================================
> + EINVAL The specified `offset` or `size` were invalid (e.g. not
> + page aligned, causes an overflow, or size is zero).
> + EFAULT The parameter address was invalid.
> + EAGAIN Some page within requested range had unexpected refcounts. The
> + offset of the page will be returned in `error_offset`.
> + ENOMEM Ran out of memory trying to track private/shared state
> + ========== ===============================================================
> +
> +KVM_SET_MEMORY_ATTRIBUTES2 is an extension to
> +KVM_SET_MEMORY_ATTRIBUTES that supports returning (writing) values to
> +userspace. The original (pre-extension) fields are shared with
> +KVM_SET_MEMORY_ATTRIBUTES identically.
> +
> +Attribute values are shared with KVM_SET_MEMORY_ATTRIBUTES.
> +
> +::
> +
> + struct kvm_memory_attributes2 {
> + /* in */
> + union {
> + __u64 address;
> + __u64 offset;
> + };
> + __u64 size;
> + __u64 attributes;
> + __u64 flags;
> + /* out */
> + __u64 error_offset;
> + __u64 reserved[11];
> + };
> +
> + #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
> +
> +Set attributes for a range of offsets within a guest_memfd to
> +KVM_MEMORY_ATTRIBUTE_PRIVATE to limit the specified guest_memfd backed
> +memory range for guest_use. Even if KVM_CAP_GUEST_MEMFD_MMAP is
> +supported, after a successful call to set
> +KVM_MEMORY_ATTRIBUTE_PRIVATE, the requested range will not be mappable
> +into host userspace and will only be mappable by the guest.
> +
> +To allow the range to be mappable into host userspace again, call
> +KVM_SET_MEMORY_ATTRIBUTES2 on the guest_memfd again with
> +KVM_MEMORY_ATTRIBUTE_PRIVATE unset.
> +
> +KVM does not directly manipulate the memory contents of pages during
> +attribute updates. However, the process of setting these attributes,
> +which includes operations such as unmapping pages from the host or
> +stage-2 page tables, may result in side effects on memory contents
> +that vary across different trusted firmware implementations.
> +
> +If this ioctl returns -EAGAIN, the offset of the page with unexpected
> +refcounts will be returned in `error_offset`. This can occur if there
> +are transient refcounts on the pages, taken by other parts of the
> +kernel.
> +
> +Userspace is expected to figure out how to remove all known refcounts
> +on the shared pages, such as refcounts taken by get_user_pages(), and
> +try the ioctl again. A possible source of these long term refcounts is
> +if the guest_memfd memory was pinned in IOMMU page tables.
> +
> +See also: :ref: `KVM_SET_MEMORY_ATTRIBUTES`.
> +
> .. _kvm_run:
>
> 5. The kvm_run structure
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 0b55258573d3d..f437fd0f1350c 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -996,6 +996,7 @@ struct kvm_enable_cap {
> #define KVM_CAP_S390_USER_OPEREXEC 246
> #define KVM_CAP_S390_KEYOP 247
> #define KVM_CAP_S390_VSIE_ESAMODE 248
> +#define KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES 249
>
> struct kvm_irq_routing_irqchip {
> __u32 irqchip;
> @@ -1648,6 +1649,7 @@ struct kvm_memory_attributes {
> __u64 flags;
> };
>
> +/* Available with KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES */
> #define KVM_SET_MEMORY_ATTRIBUTES2 _IOWR(KVMIO, 0xd2, struct kvm_memory_attributes2)
>
> struct kvm_memory_attributes2 {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4d7bf52b7b717..cec02d68d7039 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4972,6 +4972,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> return 1;
> case KVM_CAP_GUEST_MEMFD_FLAGS:
> return kvm_gmem_get_supported_flags(kvm);
> + case KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES:
> + if (vm_memory_attributes)
> + return 0;
> +
> + return kvm_supported_mem_attributes(kvm);
> #endif
> default:
> break;
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH 00/12] misc/syncobj: add /dev/syncobj device
From: Xaver Hugl @ 2026-05-20 15:27 UTC (permalink / raw)
To: Christian König
Cc: Julian Orth, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
David Airlie, Simona Vetter, Sumit Semwal, Jonathan Corbet,
Shuah Khan, Arnd Bergmann, Greg Kroah-Hartman, dri-devel,
linux-kernel, linux-media, linaro-mm-sig, linux-doc,
wayland-devel, Michel Dänzer
In-Reply-To: <c9fbfdaf-2a58-4423-8dc5-6e29a88f6293@amd.com>
> In general the answer is yes, userspace needs to take care of inserting fences when wait before signal is used and the work can not be submitted to the HW for some reason.
>
> Currently we only have an IOCTL to insert the signaled dummy fence at some timeline sequence, but it should be trivial as well to insert a signaled fence with an error code.
>
> But the compositor needs to be able to handle that case anyway, because it can be that a malicious or just buggy client just never inserts the fence.
>
> So that a device is hot plugged is not different to just a client not inserting the fence in the first place.
A buggy client can always freeze its own surface, it doesn't need
handling beyond cleaning up properly when the client disconnects.
The hotplug case is different, since currently a well-behaved client
can only attempt to signal the point in the syncobj... but the drm
device is gone, so the ioctl will fail and the client's surface is
frozen, even though it did everything right.
So afaict, whatever new ioctl is added for this will need to be
independent of the drm device, or be special cased not to fail when
the device is removed.
> >> One problem is that only syncfile allows for querying such error codes at the moment, we have patches pending to add that to syncobj as well but we lack a compositor with support for that as userspace client.
> > As long as the error case can be detected with an eventfd,
>
> Yeah that's the problem. The eventfd only tells you if the operation is completed (or at least has materialized).
>
> To query the error you would need to ask the underlying syncobj or syncfile directly.
Issuing an additional ioctl after the eventfd fired for this rare case
wouldn't be particularly nice, but also not difficult. If we'd get
that with the eventfd directly, that would be much better though.
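The "additional ioctl after the eventfd fired" flow could look roughly
like this on the compositor side. This is only a sketch: wait_then_query
and the query_error_fn callback are made-up names, and the callback stands
in for whatever error-query interface the pending syncobj patches end up
providing. Passing NULL models today's situation, where no such query
exists.

```c
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/eventfd.h>	/* eventfd(2) creates the fd passed in as efd */
#include <unistd.h>

/* Hypothetical callback standing in for a per-syncobj error query. */
typedef int (*query_error_fn)(void *syncobj);

/*
 * Wait for the eventfd tied to a timeline point, then ask for the error.
 * Returns 1 on timeout, -1 if poll() or read() fails, otherwise the
 * result of query() (0 when query is NULL).
 */
static int wait_then_query(int efd, int timeout_ms,
			   query_error_fn query, void *syncobj)
{
	struct pollfd pfd = { .fd = efd, .events = POLLIN };
	uint64_t count;
	int n = poll(&pfd, 1, timeout_ms);

	if (n == 0)
		return 1;
	if (n < 0)
		return -1;

	/* Drain the counter so the eventfd can be reused for the next point. */
	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;

	/* The eventfd only says "completed"; the error needs a second step. */
	return query ? query(syncobj) : 0;
}
```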
> Ah! I think I got the problem now. You basically want to avoid importing the syncobj because when the wrong device goes away you are busted.
Exactly.
> The reason we didn't considered having the IOCTLs on the FD is because if you don't import them and instead keep them around you can run out file descriptors quite quickly.
>
> When you have an use case where you receive an FD from the client and do a one shot conversion to an eventfd that will probably work, but for keeping them in the long run you need some kind of container for the syncobjs, don't you?
Compositors always run with vastly increased fd limits since they have
to handle a lot of fds for dmabufs alone, so keeping the fd around
wouldn't be an issue for us.
> > A device-independent way to create and use syncobj would still be
> > useful to us though, both to simplify the compositor and to improve
> > the software rendering use cases.
>
> Yeah not sure how to cleanly do that. We could have a dummy /dev/dri/rendersync or something like that, but that would be quite a hack.
I think for userspace it would be less of a hack than searching for a
random drm node that can import it. I'd gladly take another solution
as well though, if there is one.
- Xaver
^ permalink raw reply
* Re: [PATCH bpf-next] bpf, docs: add LOAD_AQCUIRE and STORE_RELEASE instructions
From: Alexis Lothoré @ 2026-05-20 15:46 UTC (permalink / raw)
To: bot+bpf-ci, alexis.lothore, void, ast, daniel, andrii, martin.lau,
eddyz87, memxor, song, yonghong.song, jolsa, corbet, skhan
Cc: ebpf, bastien.curutchet, thomas.petazzoni, bpf, bpf, linux-doc,
linux-kernel, martin.lau, clm, ihor.solodrai
In-Reply-To: <ffe4e22a8e5d0fceefa7f9ecb378e038b4686aa417addd57d2af243b67607c0a@mail.kernel.org>
On Wed May 20, 2026 at 5:18 PM CEST, bot+bpf-ci wrote:
>> diff --git a/Documentation/bpf/standardization/instruction-set.rst b/Documentation/bpf/standardization/instruction-set.rst
>> --- a/Documentation/bpf/standardization/instruction-set.rst
>> +++ b/Documentation/bpf/standardization/instruction-set.rst
>> @@ -695,22 +695,24 @@
>> *(u64 *)(dst + offset) += src
>>
>> In addition to the simple atomic operations, there also is a modifier and
>> -two complex atomic operations:
>> +four complex atomic operations:
>>
>> .. table:: Complex atomic operations
>>
>> =========== ================ ===========================
>> imm value description
>> =========== ================ ===========================
>> - FETCH 0x01 modifier: return old value
>> - XCHG 0xe0 | FETCH atomic exchange
>> - CMPXCHG 0xf0 | FETCH atomic compare and exchange
>> + FETCH 0x0001 modifier: return old value
>> + XCHG 0x00e0 | FETCH atomic exchange
>> + CMPXCHG 0x00f0 | FETCH atomic compare and exchange
>> + LOAD_ACQ 0x0100 atomic load with barrier
>> + STORE_REL 0x0110 atomic store with barrier
>> =========== ================ ===========================
>>
>> The ``FETCH`` modifier is optional for simple atomic operations, and
>> -always set for the complex atomic operations. If the ``FETCH`` flag
>> -is set, then the operation also overwrites ``src`` with the value that
>> -was in memory before it was modified.
>> +always set for the ``XCHG`` and ``CMPXCHG`` complex atomic operations. If
>> +the ``FETCH`` flag is set, then the operation also overwrites ``src`` with
>> +the value that was in memory before it was modified.
>>
>> The ``XCHG`` operation atomically exchanges ``src`` with the value
>> addressed by ``dst + offset``.
>
> [ ... ]
>
>> @@ -721,6 +723,11 @@
>> value that was at ``dst + offset`` before the operation is zero-extended
>> and loaded back to ``R0``.
>>
>> +The ``LOAD_ACQ`` and ``STORE_REL`` operations implement lighter LOAD and
>> +STORE memory barriers than full barriers. The corresponding accesses must
>> +be aligned, but are allowed for any access size (8-bit up to 64-bit
>> +operations).
>
> Does the LOAD_ACQ documentation need to clarify its non-standard register
> semantics? BPF atomic operations are encoded as STX instructions, which
> conventionally use dst as the memory base pointer and src as the value
> register. However, LOAD_ACQ reverses this pattern by using src as the
> memory base pointer and dst as the destination register for the fetched
> value, acting like an LDX instruction.
True, I can specify that both are encoded as STX instructions anyway.
> Additionally, should the documentation clarify that the loaded value is
> zero-extended into the 64-bit destination register for 8, 16, and 32-bit
> accesses? Omitting these details might lead to incorrect implementations
> based on standard STX rules.
I guess I can add this detail, yes.
> This contradiction creates specification ambiguity: the documentation
> states that LOAD_ACQ and STORE_REL "are allowed for any access size
> (8-bit up to 64-bit operations)", but earlier in the document (line 671)
> explicitly states "8-bit and 16-bit wide atomic operations are not
> supported."
>
> The instruction encoding section (lines 667-671) only defines two size
> encodings for ATOMIC mode:
> - {ATOMIC, W, STX} for 32-bit operations
> - {ATOMIC, DW, STX} for 64-bit operations
>
> No encoding is shown for B (byte/8-bit) or H (half-word/16-bit) with
> ATOMIC mode.
>
> If LOAD_ACQ/STORE_REL truly support 8-bit and 16-bit sizes, shouldn't the
> encoding section document how B and H size modifiers work with ATOMIC
> mode, and line 671 clarify the exception?
This point, and the corresponding mentions of the "atomic32 conformance
group" and "atomic64 conformance group", made me realize that the kernel
doc seems to be in sync with the eBPF ISA RFC
(https://www.rfc-editor.org/rfc/rfc9669.html). It makes me wonder if
it's really ok to add those LOAD_ACQUIRE/STORE_RELEASE mentions in the
kernel doc only?
> Alternatively, if LOAD_ACQ/STORE_REL only support 32-bit and 64-bit sizes
> like other atomic operations, should line 728 be corrected to state
> "32-bit and 64-bit operations" instead of "8-bit up to 64-bit operations"?
>
> For a standardized instruction set specification used by multiple
> implementations (verifier, JIT compilers, potentially other BPF runtimes),
> internal contradictions can lead to implementation divergence and
> interoperability issues.
>
>> +
>> 64-bit immediate instructions
>> -----------------------------
>>
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26170628668
--
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
^ permalink raw reply
* Re: [PATCH 02/15] accel/qda: Add QDA driver documentation
From: Tomeu Vizoso @ 2026-05-20 15:47 UTC (permalink / raw)
To: Dmitry Baryshkov
Cc: ekansh.gupta, Oded Gabbay, Jonathan Corbet, Shuah Khan,
Joerg Roedel, Will Deacon, Robin Murphy, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Sumit Semwal, Christian König, Bharath Kumar,
Chenna Kesava Raju, srini, andersson, konradybcio, robin.clark,
linux-kernel, dri-devel, linux-doc, linux-arm-msm, iommu,
linux-media, linaro-mm-sig
In-Reply-To: <paiohsil5pmvm7cf6jxrhaj2225bgvlt3scrag4x6gbkyosow5@l4tbakbnxcvo>
On Wed, May 20, 2026 at 4:12 PM Dmitry Baryshkov
<dmitry.baryshkov@oss.qualcomm.com> wrote:
>
> On Tue, May 19, 2026 at 11:45:52AM +0530, Ekansh Gupta via B4 Relay wrote:
> > From: Ekansh Gupta <ekansh.gupta@oss.qualcomm.com>
> >
> > Add documentation for the Qualcomm DSP Accelerator (QDA) driver under
> > Documentation/accel/qda/. The documentation covers the driver
> > architecture, GEM-based buffer management, IOMMU context bank
> > isolation, and the RPMsg transport layer.
> >
> > The user-space API section describes the DRM IOCTLs for session
> > management, GEM buffer allocation, and remote procedure invocation via
> > the FastRPC protocol, along with a typical application lifecycle
> > example. Sections for dynamic debug and basic testing are also
> > included.
> >
> > Wire the new documentation into the Compute Accelerators index at
> > Documentation/accel/index.rst.
> >
> > Assisted-by: Claude:claude-4-6-sonnet
> > Signed-off-by: Ekansh Gupta <ekansh.gupta@oss.qualcomm.com>
> > ---
> > Documentation/accel/index.rst | 1 +
> > Documentation/accel/qda/index.rst | 13 ++++
> > Documentation/accel/qda/qda.rst | 146 ++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 160 insertions(+)
> >
> > diff --git a/Documentation/accel/index.rst b/Documentation/accel/index.rst
> > index cbc7d4c3876a..5901ea7f784c 100644
> > --- a/Documentation/accel/index.rst
> > +++ b/Documentation/accel/index.rst
> > @@ -10,4 +10,5 @@ Compute Accelerators
> > introduction
> > amdxdna/index
> > qaic/index
> > + qda/index
> > rocket/index
> > diff --git a/Documentation/accel/qda/index.rst b/Documentation/accel/qda/index.rst
> > new file mode 100644
> > index 000000000000..013400cf9c25
> > --- /dev/null
> > +++ b/Documentation/accel/qda/index.rst
> > @@ -0,0 +1,13 @@
> > +.. SPDX-License-Identifier: GPL-2.0-only
> > +
> > +==================================
> > +accel/qda Qualcomm DSP Accelerator
> > +==================================
> > +
> > +The QDA driver provides a DRM accel based interface for Qualcomm DSP offload.
> > +It uses the FastRPC protocol and integrates with DRM and GEM infrastructure
> > +for device and buffer management.
> > +
> > +.. toctree::
> > +
> > + qda
> > diff --git a/Documentation/accel/qda/qda.rst b/Documentation/accel/qda/qda.rst
> > new file mode 100644
> > index 000000000000..9f49af6e6acc
> > --- /dev/null
> > +++ b/Documentation/accel/qda/qda.rst
> > @@ -0,0 +1,146 @@
> > +.. SPDX-License-Identifier: GPL-2.0-only
> > +
> > +=====================================
> > +Qualcomm DSP Accelerator (QDA) Driver
> > +=====================================
> > +
> > +Introduction
> > +============
> > +
> > +The QDA driver is a DRM accel driver for Qualcomm's DSPs. It provides a
> > +DRM accel based interface for Qualcomm DSP offload, supporting workloads
> > +such as AI inference, computer vision, audio processing, and sensor offload
> > +on Qualcomm SoCs. It uses the FastRPC protocol and integrates with DRM and
> > +GEM infrastructure for device and buffer management.
> > +
> > +Key Features
> > +============
> > +
> > +* **DRM accel Interface**: Exposes a standard character device node
> > + (e.g., ``/dev/accel/accel0``) via the DRM accel subsystem.
> > +* **FastRPC Protocol**: Implements the FastRPC protocol for communication
> > + between the application processor and the DSP.
> > +* **GEM Buffer Management**: Uses the DRM GEM interface for buffer
> > + allocation, lifecycle management, and DMA-BUF import/export.
> > +* **IOMMU Isolation**: Uses IOMMU context banks to enforce memory isolation
> > + between different DSP user sessions.
> > +* **Modular Design**: Clean separation between the core DRM logic, the
> > + memory manager, and the RPMsg-based transport layer.
> > +
> > +Architecture
> > +============
> > +
> > +The QDA driver consists of several functional blocks:
> > +
> > +1. **Core Driver (``qda_drv``)**: Manages device registration, file operations,
> > + and DRM accel integration.
> > +2. **Memory Manager (``qda_memory_manager``)**: A flexible memory management
> > + layer that handles IOMMU context banks. It supports pluggable backends
> > + (such as DMA-coherent) to adapt to different SoC memory architectures.
> > +3. **GEM Subsystem**: Implements the DRM GEM interface for buffer management:
> > +
> > + * **``qda_gem``**: Core GEM object management, including allocation, mmap
> > + operations, and buffer lifecycle management.
> > + * **``qda_prime``**: PRIME import functionality for DMA-BUF interoperability
> > + with other kernel subsystems.
> > +
> > +4. **Transport Layer (``qda_rpmsg``)**: Abstraction over the RPMsg framework
> > + to handle low-level message passing with the DSP firmware.
> > +5. **Compute Bus (``qda_compute_bus``)**: A custom virtual bus used to
> > + enumerate and manage the specific compute context banks defined in the
> > + device tree. The bus was introduced because IOMMU context banks (CBs) are
> > + synthetic constructs — not real platform devices — making a platform driver
> > + an incorrect abstraction for them. The earlier platform-driver approach also
> > + had a race condition: device nodes were created before the RPMsg channel
> > + resources were fully initialized, and because ``probe`` runs asynchronously,
> > + applications could open a CB device and attempt to start a session before
> > + the underlying transport was ready. The compute bus makes CB lifetime
> > + explicitly subordinate to the parent QDA device, closing that window.
> > +6. **FastRPC Core (``qda_fastrpc``)**: Implements the protocol logic for
> > + marshalling arguments and handling remote invocations.
> > +
> > +User-Space API
> > +==============
> > +
> > +The driver exposes a set of DRM-compliant IOCTLs:
> > +
> > +* ``DRM_IOCTL_QDA_QUERY``: Query DSP type (e.g., "cdsp", "adsp")
> > + and capabilities.
> > +* ``DRM_IOCTL_QDA_REMOTE_SESSION_CREATE``: Initialize a new process context
> > + on the DSP.
> > +* ``DRM_IOCTL_QDA_REMOTE_INVOKE``: Submit a remote method invocation (the
> > + primary execution unit).
> > +* ``DRM_IOCTL_QDA_GEM_CREATE``: Allocate a GEM buffer object for DSP usage.
> > +* ``DRM_IOCTL_QDA_GEM_MMAP_OFFSET``: Retrieve mmap offsets for memory mapping.
> > +* ``DRM_IOCTL_QDA_REMOTE_MAP`` / ``DRM_IOCTL_QDA_REMOTE_MUNMAP``: Map or unmap
> > + buffers into the DSP's virtual address space. Each accepts a ``request``
> > + field selecting between a legacy operation (``QDA_MAP_REQUEST_LEGACY`` /
> > + ``QDA_MUNMAP_REQUEST_LEGACY``) and an attribute-based operation
> > + (``QDA_MAP_REQUEST_ATTR`` / ``QDA_MUNMAP_REQUEST_ATTR``).
>
> Explain what happens if the users don't map the buffers into the DSP
> space. Will DRM_IOCTL_QDA_REMOTE_INVOKE handle the mapping or not? What
> is the difference between those two modes?
>
> Would the driver benefit from using GPUVM?
>
> > +
> > +Usage Example
> > +=============
> > +
> > +A typical lifecycle for a user-space application:
> > +
> > +1. **Discovery**: Open ``/dev/accel/accel*`` and use
> > + ``DRM_IOCTL_QDA_QUERY`` to identify the DSP domain served by that
> > + device node.
> > +2. **Initialization**: Call ``DRM_IOCTL_QDA_REMOTE_SESSION_CREATE`` to
> > + establish a session and create a process context on the DSP.
> > +3. **Memory**: Allocate buffers via ``DRM_IOCTL_QDA_GEM_CREATE`` or import
> > + DMA-BUFs (PRIME fd) from other drivers using ``DRM_IOCTL_PRIME_FD_TO_HANDLE``.
> > +4. **Execution**: Use ``DRM_IOCTL_QDA_REMOTE_INVOKE`` to pass arguments and
> > + execute functions on the DSP.
> > +5. **Cleanup**: Close file descriptors to automatically release resources and
> > + detach the session.
>
> I'd have expected the description of the actual example. I.e. clone the
> app from https://the.addr, prepare clang >= NN.MM, QAIC (https://foo),
> run make, run the app, check the results. I'd remind that DRM Accel has
> a very specific requirement of having the working toolchain in the
> open-source.
We have been getting submissions lately that don't fulfill that
requirement so I will point to the precise part of the documentation
that explains it:
https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#open-source-userspace-requirements
For an example of a submission that complies, see:
https://lore.kernel.org/dri-devel/20260114-thames-v2-0-e94a6636e050@tomeuvizoso.net/
Most importantly, notice how the proposed Thames Mesa driver generates
machine code for all the hardware units, and doesn't use any blob for
that.
Regards,
Tomeu
> > +
> > +Internal Implementation
> > +=======================
> > +
> > +Memory Management
> > +-----------------
> > +The driver's memory manager creates virtual "IOMMU devices" that map to
> > +hardware context banks. This allows the driver to manage multiple isolated
> > +address spaces. The implementation uses a DMA-coherent backend to ensure data consistency
> > +between the CPU and DSP without manual cache maintenance in most cases.
>
> GEM usage?
>
> > +
> > +Debugging
> > +=========
> > +The driver includes extensive dynamic debug support. Enable it via the
> > +kernel's dynamic debug control:
> > +
> > +.. code-block:: bash
> > +
> > + echo "file drivers/accel/qda/* +p" > /sys/kernel/debug/dynamic_debug/control
> > +
> > +Testing
> > +=======
> > +The QDA driver can be exercised using the ``fastrpc_test`` utility from the
> > +FastRPC userspace library. Run the test application:
>
> pointer
>
> > +
> > +.. code-block:: bash
> > +
> > + fastrpc_test -d 3 -U 1 -t linux -a v68
> > +
> > +**Options**
> > +
> > +``-d domain``
> > + Select the DSP domain to run on:
> > +
> > + * ``0`` — ADSP
> > + * ``1`` — MDSP
> > + * ``2`` — SDSP
> > + * ``3`` — CDSP *(default on targets with CDSP)*
> > +
> > +``-U unsigned_PD``
> > + Select signed or unsigned protection domain:
> > +
> > + * ``0`` — signed PD
> > + * ``1`` — unsigned PD *(default)*
> > +
> > +``-t target``
> > + Target platform: ``android`` or ``linux`` *(default: linux)*
> > +
> > +``-a arch_version``
> > + DSP architecture version, e.g. ``v68``, ``v75`` *(default: v68)*
> >
> > --
> > 2.34.1
> >
> >
>
> --
> With best wishes
> Dmitry
^ permalink raw reply
* [PATCH v2] docs: submitting-patches: Clarify that "reviewer" is a person
From: Krzysztof Kozlowski @ 2026-05-20 15:48 UTC (permalink / raw)
To: Jonathan Corbet, Shuah Khan, workflows, linux-doc, linux-kernel
Cc: Krzysztof Kozlowski, Greg Kroah-Hartman, Vlastimil Babka,
Andrew Morton, David Hildenbrand, Linus Torvalds, Randy Dunlap,
Mark Brown
The common understanding of the word "reviewer" is: a person performing
review work [1]. Tools are not persons, thus cannot be reviewers in this
sense. Tools also cannot make statements and cannot take responsibility
for the review.
Our docs already clearly mark that "Reviewed-by" must come from a
person:
- "By offering my Reviewed-by: tag, I state that:"
Usage of first person "I" and word "state"
- "A Reviewed-by tag is *a statement of opinion* that the patch is an
appropriate modification of the kernel without any remaining serious"
Only a person can make a statement of opinion.
- "Any interested reviewer (who has done the work) can offer a
Reviewed-by"
A person can offer a tag, thus the above does not grant a tool
permission to offer a tag.
However, this might not be enough, so let's clarify that only a person
with a known identity can state the "Reviewer's statement of oversight".
Link: https://en.wiktionary.org/wiki/reviewer [1]
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
---
Changes in v2:
1. Add tags
2. Rephrase/simplify a bit commit msg. Rephrase title - drop "in
English".
3. Add "with known identity", suggested by David Hildenbrand. I retained
previous tags, assuming this change is within spirit of previous
version and there were no objections on the list.
---
Documentation/process/submitting-patches.rst | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
index d7290e208e72..cc6a1f73d7f2 100644
--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -581,12 +581,12 @@ By offering my Reviewed-by: tag, I state that:
A Reviewed-by tag is a statement of opinion that the patch is an
appropriate modification of the kernel without any remaining serious
-technical issues. Any interested reviewer (who has done the work) can
-offer a Reviewed-by tag for a patch. This tag serves to give credit to
-reviewers and to inform maintainers of the degree of review which has been
-done on the patch. Reviewed-by: tags, when supplied by reviewers known to
-understand the subject area and to perform thorough reviews, will normally
-increase the likelihood of your patch getting into the kernel.
+technical issues. Any interested reviewer (who has done the work and is a
+person with known identity) can offer a Reviewed-by tag for a patch. This tag
+serves to give credit to reviewers and to inform maintainers of the degree of
+review which has been done on the patch. Reviewed-by: tags, when supplied by
+reviewers known to understand the subject area and to perform thorough reviews,
+will normally increase the likelihood of your patch getting into the kernel.
Both Tested-by and Reviewed-by tags, once received on mailing list from tester
or reviewer, should be added by author to the applicable patches when sending
--
2.53.0
^ permalink raw reply related
* Re: [PATCH bpf-next] bpf, docs: add LOAD_AQCUIRE and STORE_RELEASE instructions
From: Alexei Starovoitov @ 2026-05-20 16:07 UTC (permalink / raw)
To: Alexis Lothoré
Cc: bot+bpf-ci, David Vernet, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
Jonathan Corbet, Shuah Khan, ebpf,
Bastien Curutchet (eBPF Foundation), Thomas Petazzoni, bpf, bpf,
open list:DOCUMENTATION, LKML, Martin KaFai Lau, Chris Mason,
Ihor Solodrai
In-Reply-To: <DINMD7Y3ZG8Q.3GZGX7SX9CN57@bootlin.com>
On Wed, May 20, 2026 at 5:46 PM Alexis Lothoré
<alexis.lothore@bootlin.com> wrote:
>
> On Wed May 20, 2026 at 5:18 PM CEST, bot+bpf-ci wrote:
> >> diff --git a/Documentation/bpf/standardization/instruction-set.rst b/Documentation/bpf/standardization/instruction-set.rst
> >> --- a/Documentation/bpf/standardization/instruction-set.rst
> >> +++ b/Documentation/bpf/standardization/instruction-set.rst
> >> @@ -695,22 +695,24 @@
> >> *(u64 *)(dst + offset) += src
> >>
> >> In addition to the simple atomic operations, there also is a modifier and
> >> -two complex atomic operations:
> >> +four complex atomic operations:
> >>
> >> .. table:: Complex atomic operations
> >>
> >> =========== ================ ===========================
> >> imm value description
> >> =========== ================ ===========================
> >> - FETCH 0x01 modifier: return old value
> >> - XCHG 0xe0 | FETCH atomic exchange
> >> - CMPXCHG 0xf0 | FETCH atomic compare and exchange
> >> + FETCH 0x0001 modifier: return old value
> >> + XCHG 0x00e0 | FETCH atomic exchange
> >> + CMPXCHG 0x00f0 | FETCH atomic compare and exchange
> >> + LOAD_ACQ 0x0100 atomic load with barrier
> >> + STORE_REL 0x0110 atomic store with barrier
> >> =========== ================ ===========================
> >>
> >> The ``FETCH`` modifier is optional for simple atomic operations, and
> >> -always set for the complex atomic operations. If the ``FETCH`` flag
> >> -is set, then the operation also overwrites ``src`` with the value that
> >> -was in memory before it was modified.
> >> +always set for the ``XCHG`` and ``CMPXCHG`` complex atomic operations. If
> >> +the ``FETCH`` flag is set, then the operation also overwrites ``src`` with
> >> +the value that was in memory before it was modified.
> >>
> >> The ``XCHG`` operation atomically exchanges ``src`` with the value
> >> addressed by ``dst + offset``.
> >
> > [ ... ]
> >
> >> @@ -721,6 +723,11 @@
> >> value that was at ``dst + offset`` before the operation is zero-extended
> >> and loaded back to ``R0``.
> >>
> >> +The ``LOAD_ACQ`` and ``STORE_REL`` operations implement lighter LOAD and
> >> +STORE memory barriers than full barriers. The corresponding accesses must
> >> +be aligned, but are allowed for any access size (8-bit up to 64-bit
> >> +operations).
> >
> > Does the LOAD_ACQ documentation need to clarify its non-standard register
> > semantics? BPF atomic operations are encoded as STX instructions, which
> > conventionally use dst as the memory base pointer and src as the value
> > register. However, LOAD_ACQ reverses this pattern by using src as the
> > memory base pointer and dst as the destination register for the fetched
> > value, acting like an LDX instruction.
>
> True, I can specify that both are anyway supported by a STX instruction.
>
> > Additionally, should the documentation clarify that the loaded value is
> > zero-extended into the 64-bit destination register for 8, 16, and 32-bit
> > accesses? Omitting these details might lead to incorrect implementations
> > based on standard STX rules.
>
> I guess I can add this detail, yes.
>
> > This contradiction creates specification ambiguity: the documentation
> > states that LOAD_ACQ and STORE_REL "are allowed for any access size
> > (8-bit up to 64-bit operations)", but earlier in the document (line 671)
> > explicitly states "8-bit and 16-bit wide atomic operations are not
> > supported."
> >
> > The instruction encoding section (lines 667-671) only defines two size
> > encodings for ATOMIC mode:
> > - {ATOMIC, W, STX} for 32-bit operations
> > - {ATOMIC, DW, STX} for 64-bit operations
> >
> > No encoding is shown for B (byte/8-bit) or H (half-word/16-bit) with
> > ATOMIC mode.
> >
> > If LOAD_ACQ/STORE_REL truly support 8-bit and 16-bit sizes, shouldn't the
> > encoding section document how B and H size modifiers work with ATOMIC
> > mode, and line 671 clarify the exception?
>
> This point, and the corresponding mentions to the "atomic32 conformance
> group" and "atomic64 conformance group", made me realize that the kernel
> doc seems to be in sync with the eBPF ISA RFC
> (https://www.rfc-editor.org/rfc/rfc9669.html). It makes me wonder if
> it's really ok to add those LOAD_ACQUIRE/STORE_RELEASE mentions in the
> kernel doc only ?
It's ok. It already diverged a bit. Eventually we will do an RFC update.
^ permalink raw reply
* Re: [PATCH RESEND bpf-next v10 2/8] bpf: clear list node owner and unlink before drop
From: Eduard Zingerman @ 2026-05-20 16:28 UTC (permalink / raw)
To: Kaitao Cheng
Cc: bpf, Alexei Starovoitov, linux-kernel, linux-doc, ast, memxor,
corbet, martin.lau, daniel, andrii, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, chengkaitao,
skhan, vmalik, linux-kselftest, martin.lau, clm, ihor.solodrai,
bot+bpf-ci
In-Reply-To: <47b928ac-25d9-481c-8764-8f840c2dcafa@linux.dev>
On Wed, 2026-05-20 at 17:55 +0800, Kaitao Cheng wrote:
> On 2026/5/20 06:56, Eduard Zingerman wrote:
> > On Mon, 2026-05-18 at 11:02 +0800, Kaitao Cheng wrote:
> >
> > [...]
> >
> > > > > > The patch does have a bug, however. To fix the issues we are seeing now,
> > > > > > I propose the additional changes below and would appreciate feedback.
> > > > > >
> > > > > > --- a/kernel/bpf/helpers.c
> > > > > > +++ b/kernel/bpf/helpers.c
> > > > > > @@ -2263,8 +2263,10 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
> > > > > > if (!head->next || list_empty(head))
> > > > > > goto unlock;
> > > > > > list_for_each_safe(pos, n, head) {
> > > > > > - WRITE_ONCE(container_of(pos,
> > > > > > - struct bpf_list_node_kern, list_head)->owner, NULL);
> > > > > > + struct bpf_list_node_kern *node;
> > > > > > +
> > > > > > + node = container_of(pos, struct bpf_list_node_kern, list_head);
> > > > > > + WRITE_ONCE(node->owner, BPF_PTR_POISON);
> > > > > > list_move_tail(pos, &drain);
> > > > > > }
> > > > > > unlock:
> > > > > > @@ -2272,8 +2274,12 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
> > > > > > __bpf_spin_unlock_irqrestore(spin_lock);
> > > > > >
> > > > > > while (!list_empty(&drain)) {
> > > > > > + struct bpf_list_node_kern *node;
> > > > > > +
> > > > > > pos = drain.next;
> > > > > > + node = container_of(pos, struct bpf_list_node_kern, list_head);
> > > > > > list_del_init(pos);
> > > > > > + WRITE_ONCE(node->owner, NULL);
> >
> > Is CPU allowed to reorder the stores in list_del_init() and WRITE_ONCE()?
> > If it is, I think there is a race here.
>
> Thanks for taking a close look at this. You are right that there is an
> ordering issue here, but I don't think the specific sequence illustrated
> by the example below is problematic.
>
> > Thread #1:
> > enter bpf_list_head_free()
> > acquire H1 lock
> > list_move_tail(pos, &drain); // reordered
> > <-- ip here -->
> > WRITE_ONCE(node->owner, BPF_PTR_POISON); // reordered
> >
> > Thread #2:
> >
> > acquire H1 lock
> > n = bpf_refcount_acquire()
> > release H1 lock
> > acquire H2 lock
> > enter __bpf_list_add()
> > <-- ip here -->
> > cmpxchg(&node->owner, NULL, BPF_PTR_POISON)
>
> Even if the stores from list_move_tail(pos, &drain) become visible before
> WRITE_ONCE(node->owner, BPF_PTR_POISON), node->owner is not NULL in that
> window. Before the WRITE_ONCE(), it still points to H1. After the WRITE_ONCE(),
> it is BPF_PTR_POISON. In both cases, __bpf_list_add() will fail:
>
> cmpxchg(&node->owner, NULL, BPF_PTR_POISON)
>
> because the old value is neither NULL nor expected to become NULL from this
> part of bpf_list_head_free().
>
>
> However, I agree that your original concern about the ordering between
> list_del_init() and WRITE_ONCE(node->owner, NULL) is valid for the later
> drain loop:
>
> list_del_init(pos);
> WRITE_ONCE(node->owner, NULL);
>
> Here owner == NULL is the signal that the node can be inserted into another
> list. Since WRITE_ONCE() does not provide release ordering, another CPU may
> observe owner == NULL and successfully acquire the node in __bpf_list_add()
> before the list_del_init() stores are visible. In that case __bpf_list_add()
> can link the node into H2, and the delayed stores from list_del_init() may
> then overwrite the node's list pointers and corrupt the H2 list.
>
> So the fix should be to publish owner == NULL with release ordering after the
> node has been fully unlinked, for example:
>
> ```
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -2279,7 +2279,8 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
> pos = drain.next;
> node = container_of(pos, struct bpf_list_node_kern, list_head);
> list_del_init(pos);
> - WRITE_ONCE(node->owner, NULL);
> + /* Ensure __bpf_list_add() sees the node as unlinked. */
> + smp_store_release(&node->owner, NULL);
> /* The contained type can also have resources, including a
> * bpf_list_head which needs to be freed.
> */
> @@ -2607,7 +2608,8 @@ static struct bpf_list_node *__bpf_list_del(struct bpf_list_head *head,
> return NULL;
>
> list_del_init(n);
> - WRITE_ONCE(node->owner, NULL);
> + /* Ensure __bpf_list_add() sees the node as unlinked. */
> + smp_store_release(&node->owner, NULL);
> return (struct bpf_list_node *)n;
> }
> ```
>
> The existing cmpxchg() in __bpf_list_add() is a successful RMW with return
> value, so it is fully ordered and is sufficient on the acquire side.
Hi Kaitao,
Thank you for the analysis. I agree with the smp_store_release()
approach, could you please respin the series?
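For readers following the ordering argument, here is a minimal userspace
sketch using C11 atomics rather than the kernel primitives. The struct and
both function names are hypothetical stand-ins, not the code in
kernel/bpf/helpers.c: the release store plays the role of
smp_store_release(), and the sequentially consistent compare-exchange
approximates the fully ordered cmpxchg() on the __bpf_list_add() side.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bpf_list_node_kern. */
struct node {
	struct node *next, *prev;	/* plays the role of list_head */
	_Atomic(void *) owner;		/* NULL means "free to insert" */
};

/*
 * Unlink side: reset the list pointers first, then publish owner == NULL
 * with release ordering so the pointer stores cannot become visible
 * after the NULL store.
 */
static void node_unlink_and_release(struct node *n)
{
	n->next = n;
	n->prev = n;
	atomic_store_explicit(&n->owner, NULL, memory_order_release);
}

/*
 * Insert side: claim the node only if owner is currently NULL. A
 * successful claim is ordered after the unlink stores, so the claimer
 * may safely rewrite next/prev.
 */
static bool node_try_claim(struct node *n, void *claim_marker)
{
	void *expected = NULL;

	return atomic_compare_exchange_strong_explicit(&n->owner, &expected,
						       claim_marker,
						       memory_order_seq_cst,
						       memory_order_seq_cst);
}
```

With a plain relaxed store in node_unlink_and_release(), a claimer could
win the compare-exchange while the next/prev stores are still in flight,
which is the corruption scenario described above.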
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: Oliver Upton @ 2026-05-20 17:47 UTC (permalink / raw)
To: David Woodhouse
Cc: Paolo Bonzini, Marc Zyngier, Will Deacon, Jonathan Corbet,
Shuah Khan, kvm, Linux Doc Mailing List,
Kernel Mailing List, Linux, Sean Christopherson, Jim Mattson,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <add71b6f61edc6357e1fddad83273b2cba697d10.camel@infradead.org>
On Wed, May 20, 2026 at 12:33:52AM +0100, David Woodhouse wrote:
> On Tue, 2026-05-19 at 15:57 -0700, Oliver Upton wrote:
> > What ifs and maybes do not meet the bar, in my opinion, for preserving
> > bug emulation in KVM. Of course there could be a little flexibility with
> > that but we need to have some way of discriminating between bug fixes
> > and genuine guest expectations around the behavior of virtual hardware.
>
> I believe you have this completely backwards.
No, I really don't.
Leaving every bugfix that could _possibly_ have a guest-visible impact
subject to drive-by scrutiny many years after the dust has settled is
not an acceptable working dynamic. Especially since it would appear
that the rest of the ecosystem has long since moved on from this
particular issue.
If this matters to you so deeply then please, be part of the solution
instead. You may find that reviewing patches leads to better outcomes
than getting belligerent with the arm64 folks every time you guys
decide to rebase your kernel. Hell, hypotheticals actually have a lot
more weight in the context of a review. And if your testing is extensive
enough to catch these sort of subtleties, don't you think it's better
done against mainline?
Maybe it's just me but I am left feeling disappointed that we all
haven't found a productive way of working together. I've tried to bridge
the gap here; we obviously need to do something that at least fixes the
UAPI breakage. Although apparently we don't even care to meet that low
of a bar.
> A stable and mature platform doesn't get to play in its ivory tower and
> randomly inflict breakage on guests because they "deserve it".
Really? Aren't you asking for us to emulate something completely broken
for you?
Thanks,
Oliver
^ permalink raw reply
* Re: [PATCH v3 04/12] x86,fs/resctrl: Program PLZA through kmode arch hooks
From: Babu Moger @ 2026-05-20 17:49 UTC (permalink / raw)
To: Luck, Tony
Cc: corbet, reinette.chatre, Dave.Martin, james.morse, tglx, bp,
dave.hansen, skhan, x86, mingo, hpa, akpm, rdunlap,
pawan.kumar.gupta, feng.tang, dapeng1.mi, kees, elver, lirongqing,
paulmck, bhelgaas, seanjc, alexandre.chartre, yazen.ghannam,
peterz, chang.seok.bae, kim.phillips, xin, naveen,
thomas.lendacky, linux-doc, linux-kernel, eranian, peternewman,
sos-linux-ext-patches
In-Reply-To: <agzPTMvJ_LdEmKXe@agluck-desk3>
Hi Tony,
On 5/19/26 15:59, Luck, Tony wrote:
> On Thu, Apr 30, 2026 at 06:24:49PM -0500, Babu Moger wrote:
>> +void resctrl_arch_configure_kmode(cpumask_var_t cpu_mask, u32 closid, u32 rmid, bool enable)
>> +{
>> + union msr_pqr_plza_assoc plza = { 0 };
>> +
>> + plza.split.rmid = rmid;
>> + plza.split.rmid_en = 1;
>
> Shouldn't there be a parameter for the value of rmid_en?
I realized that behavior is not required; it was actually due to a
mistake in my v2 series implementation.
Below are the relevant definitions:
GLOBAL_ASSIGN_CTRL_INHERIT_MON_PER_CPU:
The CLOSID is applied to kernel work, while the RMID used for monitoring
is inherited from the currently running user task.
No separate monitoring group is assigned for kernel work, so kernel
execution naturally inherits the user-space RMID.
GLOBAL_ASSIGN_CTRL_ASSIGN_MON_PER_CPU:
Both CLOSID and RMID are explicitly assigned to kernel work.
This allows assigning a dedicated monitoring group for kernel execution
and therefore requires a separate RMID.
Example: For GLOBAL_ASSIGN_CTRL_INHERIT_MON_PER_CPU:
# mount -t resctrl resctrl /sys/fs/resctrl
# cat /sys/fs/resctrl/info/kernel_mode
[inherit_ctrl_and_mon:group=//]
global_assign_ctrl_inherit_mon_per_cpu:group=none
global_assign_ctrl_assign_mon_per_cpu:group=none
# mkdir /sys/fs/resctrl/ctrl1 (PQR_ASSOC closid=1 rmid=1)
This configures all the CPU threads to use closid=1 and rmid=1 for both
allocation and monitoring across user and kernel modes.
# echo "global_assign_ctrl_inherit_mon_per_cpu:group=ctrl1//" \
> /sys/fs/resctrl/info/kernel_mode
# cat /sys/fs/resctrl/info/kernel_mode
inherit_ctrl_and_mon:group=none
[global_assign_ctrl_inherit_mon_per_cpu:group=ctrl1//]
global_assign_ctrl_assign_mon_per_cpu:group=none
This overrides the previous configuration, and PQR_PLZA_ASSOC is written.
Possible options:
1. (closid=1, rmid_en=0, rmid=1)
Here, hardware uses closid=1 for kernel work, but RMID tracking is
disabled for kernel mode.
As a result, reading RMID 1 reports only user-mode activity.
This contradicts the definition of this mode, since kernel work is
expected to inherit the user RMID for monitoring.
2. (closid=1, rmid_en=1, rmid=1)
In this case, RMID tracking is enabled for both user and kernel modes.
Reading RMID 1 reports combined user + kernel activity.
This aligns with the expected inherit_monitoring behavior.
The preferred approach is to separate kernel monitoring by assigning it
a dedicated monitoring group and updating PQR_PLZA_ASSOC to use a
different RMID (e.g., closid=1, rmid_en=1, rmid=2). This is exactly the
behavior implemented by GLOBAL_ASSIGN_CTRL_ASSIGN_MON_PER_CPU.
Thanks
Babu
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: David Woodhouse @ 2026-05-20 18:29 UTC (permalink / raw)
To: Oliver Upton
Cc: Paolo Bonzini, Marc Zyngier, Will Deacon, Jonathan Corbet,
Shuah Khan, kvm, Linux Doc Mailing List,
Kernel Mailing List, Linux, Sean Christopherson, Jim Mattson,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <ag3zr7-11FO3k-Wv@kernel.org>
On Wed, 2026-05-20 at 10:47 -0700, Oliver Upton wrote:
> On Wed, May 20, 2026 at 12:33:52AM +0100, David Woodhouse wrote:
> > On Tue, 2026-05-19 at 15:57 -0700, Oliver Upton wrote:
> > > What ifs and maybes do not meet the bar, in my opinion, for preserving
> > > bug emulation in KVM. Of course there could be a little flexibility with
> > > that but we need to have some way of discriminating between bug fixes
> > > and genuine guest expectations around the behavior of virtual hardware.
> >
> > I believe you have this completely backwards.
>
> No, I really don't.
>
> Leaving every bugfix that could _possibly_ have a guest-visible impact
> subject to drive-by scrutiny many years after the dust has settled is
> not an acceptable working dynamic. Especially since it would appear
> that the rest of the ecosystem has long since moved on from this
> particular issue.
That's reductio ad absurdum.
I can continue to work around this one internally, sure.
But I'm also concerned about the general case because not only did you
refuse it, but you *also* said that this change in guest-visible
behaviour "should've happened without a change to the revision number".
Which seems to indicate that not only are you being randomly
obstructive about a one-line fix, you *also* don't actually understand
the general concept of what is expected of KVM, which this
Documentation patch is intending to clarify.
It was *right* to bump the IIDR from 1 to 2 when this guest visible
behaviour was changed. The only problem was not letting userspace
select the old revision. I'm really concerned that we now appear to
have a regression of understanding of even the part we previously *did*
get right.
> If this matters to you so deeply then please, be part of the solution
> instead. You may find that reviewing patches leads to better outcomes
> than getting belligerent with the arm64 folks every time you guys
> decide to rebase your kernel. Hell, hypotheticals actually have a lot
> more weight in the context of a review. And if your testing is extensive
> enough to catch these sort of subtleties, don't you think it's better
> done against mainline?
Yes. Definitely. That's why my series with the fixes is more *test*
than actual fix, giving a nice simple framework for any such changes in
future. It checks that GICR_CTLR_IR|GICR_CTLR_CES are visible only with
IIDR.rev=3 for example.
And we're making progress on the amount of downstream crap, but it
doesn't help when we seem to have an impedance mismatch on the very
question of what it means to support customers on KVM at scale. This
thread is not exactly encouraging my engineers to poke their heads
above the parapet.
> Maybe it's just me but I am left feeling disappointed that we all
> haven't found a productive way of working together. I've tried to bridge
> the gap here; we obviously need to do something that at least fixes the
> UAPI breakage. Although apparently we don't even care to meet that low
> of a bar.
>
> > A stable and mature platform doesn't get to play in its ivory tower and
> > randomly inflict breakage on guests because they "deserve it".
>
> Really? Aren't you asking for us to emulate something completely broken
> for you?
No. I'm asking for a path to be able to *fix* it.
As things stand, if I just drop these patches and launch guests on a
new kernel, those guests will see writable IGROUPR registers and may
try to use them. And then if I have to roll *back* a kernel deployment,
those guests may lose interrupts.
The *only* time a guest-visible feature (or bugfix, nobody cares about
the difference outside the ivory tower) can be enabled is when the
kernel deployment is finished and stable and *won't* be rolled back.
And *then* new launches (and reboots) can get it.
And one day, when the last guest which was launched *without* it is
finally rebooted and sees the new model, *then* maybe we no longer need
that one line if() statement to support IIDR version 1.
2018 was basically *yesterday*. And I'm kind of scared that I even have
to explain it.
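The rollout argument above can be sketched as a per-VM revision gate. This
is only an illustration under stated assumptions: the struct and function
names are hypothetical and are not the vgic code; the revision thresholds
are taken from this thread (IGROUPR writable from IIDR revision 2,
GICR_CTLR IR/CES visible at revision 3), and treating later revisions as
also exposing IR/CES is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-VM state: the IIDR revision userspace selected. */
struct vm_gic_model {
	unsigned int iidr_rev;
};

/* Guest writes to IGROUPR take effect only from revision 2 onwards. */
static bool igroupr_writable(const struct vm_gic_model *m)
{
	return m->iidr_rev >= 2;
}

/*
 * GICR_CTLR.IR and GICR_CTLR.CES are exposed at revision 3; extending
 * this to later revisions is an assumption of this sketch.
 */
static bool gicr_ctlr_ir_ces_visible(const struct vm_gic_model *m)
{
	return m->iidr_rev >= 3;
}
```

A guest launched with revision 1 keeps seeing the revision 1 behaviour on
any host kernel, old or new, so rolling the kernel back cannot take away
something the guest has already started to rely on.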
^ permalink raw reply
* Re: [RFC PATCH 3/5] mm/damon/core: floor effective quota size at minimum region size
From: Ravi Jonnalagadda @ 2026-05-20 18:37 UTC (permalink / raw)
To: SeongJae Park
Cc: damon, linux-mm, linux-kernel, linux-doc, akpm, corbet, bijan311,
ajayjoshi, honggyu.kim, yunjeong.mun
In-Reply-To: <20260517184705.4652-1-sj@kernel.org>
On Sun, May 17, 2026 at 11:47 AM SeongJae Park <sj@kernel.org> wrote:
>
> On Sat, 16 May 2026 14:03:55 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
>
> > The CONSIST quota goal tuner initializes esz_bp to 0, producing an
> > effective quota size (esz) of 1 byte on the first tick.
> > damos_quota_is_full() rejects all regions when esz < min_region_sz
> > (default PAGE_SIZE = 4096), so no regions can be tried and no
> > feedback reaches the tuner — a bootstrapping deadlock.
>
> That depends on whether the goal is already [over]-achieved. If the goal is
> achieved, the tuner will think no change is needed, so keep the
> effectively-zero quota. If the goal is over-achieved, the tuner will think the
> DAMOS scheme should be less aggressive, but it is already effectively-zero
> quota, so keep having effectively-zero quota.
>
> If the goal is under-achieved, the logic will iteratively increase the internal
> esz (esz_bp), until it exceeds the min_region_sz, and finally start making some
> effects.
>
> So, unless the goal is already [over]-achieved, there is no deadlock. If the
> goal is already [over]-achieved, why would we want to make DAMOS do something?
>
> Am I missing something?
>
Hello SJ,
You're not missing anything; you're right. Stock DAMON's
feed-loop tuner ramps esz_bp out of the seed quickly under an
under-achieved goal -- on the order of ten-some ticks at the
1000ms reset_interval the in-tree DAMON modules use, so the
floor isn't gating anything that wouldn't bootstrap on its own.
No deadlock.
I owe a clearer accounting of where this patch and patch 1 came
from, since the same origin story applies to both. Both came
from a parallel debug effort and should not have been carried
into this set.
The work that produced this series came out of an effort to
enable hardware-sampled hotness as a DAMON access source -- the
companion AMD IBS RFC
https://lore.kernel.org/linux-mm/20260516223439.4033-1-ravis.opensrc@gmail.com/
-- and to characterise its closed-loop convergence with the
existing CONSIST tuner on a heterogeneous DRAM+CXL setup. Early
in that effort I was hitting random NMI-context hangs on the per-CPU
report path that prevented runs from completing, and while
debugging those hangs I wasn't sure which direction the
convergence anomalies were coming from -- the sampling backend,
the report-ring path, the tuner shape, or the quota controller.
I made two controller-side experiments as scaffolding while I
narrowed the problem down:
- A per-tick growth cap on the goal-feedback tuner
("max_delta_bp" at 5%/tick) to slow how fast esz could grow
on a transient. That cap stretches the bootstrap above
from ~13s to several minutes, so I added a floor at
min_region_sz to short-cut the bootstrap. That landed here
as patch 3.
- A separate access-rate seeding helper (clear-on-migration)
for the goal-feedback loop. Some early versions of that
helper left the access-rate fields in inconsistent states
and damon_moving_sum() landed in an underflow path I hadn't
seen before. I added a saturating-subtract guard to that
function. That landed here as patch 1.
Once cpuhp-related fixes on the per-CPU sampling path landed
and the NMI stability problem was actually resolved, the
convergence anomalies were tracked down to the sampling/ring
side, not the controller side. I dropped the max_delta_bp knob
and fixed the seeding helper to maintain its invariants.
Patches 1 and 3 were carried into this set even though their
justifications had gone away:
- Patch 1: with the seeding helper fixed, stock callers don't
reach the underflow path -- the invariant holds at every
aggregation boundary in stock DAMON, as you noted. Belongs
with the seeding helper if and when that work goes upstream.
- Patch 3 (this one): with max_delta_bp dropped, the slow
bootstrap doesn't happen -- the ~13s ramp is fast enough
that there's no problem to solve. Once stability was
sorted I also moved the closed-loop runs in the companion
RFC to the temporal tuner, where the bootstrap concern
this patch addresses doesn't even arise (esz_bp saturates
to ULONG_MAX immediately when score=0).
Apologies; both should have come out when
max_delta_bp and the seeding helper did.
Dropping patches 1 and 3.
Patches 2, 4, and 5 are independent of this scaffolding; I'll
reply on each thread separately with the relevant context.
Thanks again for the careful review.
Best,
Ravi
> I'd like to discuss this high level thing first, before digging deep into the
> details.
>
>
> Thanks,
> SJ
>
> [...]
^ permalink raw reply
* Re: [PATCH v5 12/13] Documentation: ABI: testing: add docs for ad9910 sysfs entries
From: Rodrigo Alencar @ 2026-05-20 18:47 UTC (permalink / raw)
To: rodrigo.alencar, linux-iio, devicetree, linux-kernel, linux-doc,
linux-hardening
Cc: Lars-Peter Clausen, Michael Hennerich, Jonathan Cameron,
David Lechner, Andy Shevchenko, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Philipp Zabel, Jonathan Corbet, Shuah Khan,
Kees Cook, Gustavo A. R. Silva
In-Reply-To: <20260517-ad9910-iio-driver-v5-12-31599c88314a@analog.com>
On 26/05/17 07:37PM, Rodrigo Alencar via B4 Relay wrote:
> From: Rodrigo Alencar <rodrigo.alencar@analog.com>
>
> Add custom ABI documentation file for the DDS AD9910 with sysfs entries to
> control Parallel Port, Digital Ramp Generator and OSK parameters.
...
> +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_frequency_offset
> +KernelVersion:
> +Contact: linux-iio@vger.kernel.org
> +Description:
> + For a channel that allows frequency control through buffers, this
> + represents the base frequency value in Hz. The actual output frequency
> + is derived from this offset combined with the processed buffer sample
> + value.
> +
> +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_frequency_scale
> +KernelVersion:
> +Contact: linux-iio@vger.kernel.org
> +Description:
> + For a channel that allows frequency control through buffers, this
> + represents the frequency modulation gain. This value multiplies the
> + buffer input sample value before it is added to a frequency offset.
> +
> +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_phase_offset
> +KernelVersion:
> +Contact: linux-iio@vger.kernel.org
> +Description:
> + For a channel that allows phase control through buffers, this
> + represents the base phase value in radians. The actual output phase is
> + derived from this offset combined with the processed buffer sample
> + value.
> +
> +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_scale_offset
> +KernelVersion:
> +Contact: linux-iio@vger.kernel.org
> +Description:
> + For a channel that allows amplitude control through buffers, this
> + represents the value for a base amplitude scale. The actual output
> + amplitude scale is derived from this offset combined with the processed
> + buffer sample value.
> +
This will become just offset with altcurrent channels. I noticed we have an IIO_PHASE
iio_chan_type; could we have an IIO_FREQUENCY too? The parallel port needs actual raw
frequency values in that case to be written to the DMA buffer.
Then we may have buffer capable channels for the parallel port:
out_altcurrent120
offset
out_phase120
offset
out_frequency120
scale
offset
The problem is that the math for the actual frequency output is:
f_OUT = f_FTW + (f_RAW * FM)
where f_FTW is a base frequency (already scaled), FM is a
modulation gain and f_RAW is the contribution from the parallel
port, which is also already scaled:
f_RAW = RAW * f_SYSCLK / 2^32
f_FTW = FTW * f_SYSCLK / 2^32
so the above becomes:
f_OUT = (FTW * f_SYSCLK / 2^32) + (RAW * f_SYSCLK / 2^32) * FM
f_OUT = (FTW/FM + RAW) * f_SYSCLK * FM / 2^32
if I make:
SCALE = f_SYSCLK * FM / 2^32
OFFSET = FTW/FM
f_OUT = (OFFSET + RAW) * SCALE
That would work for an IIO_FREQUENCY channel type; the problem is that both
scale and offset would depend on the modulation gain (FM)... I suppose
scale should be setting that and offset assumes it is constant to act
only on FTW.
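To sanity-check the algebra above, here is a minimal user-space sketch
that evaluates both forms of f_OUT. It is not driver code: the function
names are hypothetical and it uses floating point, which the kernel
would not.

```c
#include <assert.h>
#include <stdint.h>

#define TWO_POW_32 4294967296.0

/* Direct form from the thread: f_OUT = f_FTW + f_RAW * FM. */
static double f_out_direct(uint32_t ftw, uint32_t raw, double fm,
			   double f_sysclk)
{
	const double lsb = f_sysclk / TWO_POW_32; /* Hz per tuning-word LSB */

	return ftw * lsb + (raw * lsb) * fm;
}

/* IIO-style form: f_OUT = (OFFSET + RAW) * SCALE. */
static double f_out_iio(uint32_t ftw, uint32_t raw, double fm,
			double f_sysclk)
{
	double scale, offset;

	assert(fm != 0.0); /* OFFSET = FTW / FM is undefined for FM == 0 */
	scale = f_sysclk * fm / TWO_POW_32;
	offset = ftw / fm;

	return (offset + raw) * scale;
}

/* Relative difference between the two forms (assumes a positive f_OUT). */
static double f_out_rel_diff(uint32_t ftw, uint32_t raw, double fm,
			     double f_sysclk)
{
	double a = f_out_direct(ftw, raw, fm, f_sysclk);
	double b = f_out_iio(ftw, raw, fm, f_sysclk);
	double d = a > b ? a - b : b - a;

	return a != 0.0 ? d / a : d;
}
```

Both scale and offset are computed from fm here, which is the coupling
described above: changing FM alone moves OFFSET even though FTW is
untouched.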
I suppose we can keep altcurrent for other modes as phase and frequency
can be attributes (knobs) for them. However, in parallel mode we are effectively
pushing frequency, phase or amplitude values into the buffer.
The polar destination is a corner case, but can be solved when both
phase and altcurrent channels are enabled. When that happens we can
change the scan_type with has_ext_scan_type = 1, so the 16-bit data
bus is split between the two.
With the above, all of those *_offset and *_scale custom ABI can be dropped.
--
Kind regards,
Rodrigo Alencar
* Re: [PATCH v6 05/43] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
From: Sean Christopherson @ 2026-05-20 18:59 UTC (permalink / raw)
To: Fuad Tabba
Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CA+EHjTw-cUM=FrJevtSDtR7K6MwUfGfOx21LMFDn7DAy5bFzYw@mail.gmail.com>
On Wed, May 20, 2026, Fuad Tabba wrote:
> On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
> <devnull+ackerleytng.google.com@kernel.org> wrote:
> >
> > From: Sean Christopherson <seanjc@google.com>
> >
> > Implement kvm_gmem_get_memory_attributes() for guest_memfd to allow the KVM
> > core and architecture code to query per-GFN memory attributes.
> >
> > kvm_gmem_get_memory_attributes() finds the memory slot for a given GFN and
> > > queries the guest_memfd file's attributes to determine if the page is marked as
> > private.
> >
> > If vm_memory_attributes is not enabled, there is no shared/private tracking
> > at the VM level. Install the guest_memfd implementation as long as
> > guest_memfd is enabled to give guest_memfd a chance to respond on
> > attributes.
> >
> > guest_memfd should look up attributes regardless of whether this memslot is
> > gmem-only since attributes are now tracked by gmem regardless of whether
> > mmap() is enabled.
> >
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > ---
> > include/linux/kvm_host.h | 2 ++
> > virt/kvm/guest_memfd.c | 31 +++++++++++++++++++++++++++++++
> > virt/kvm/kvm_main.c | 3 +++
> > 3 files changed, 36 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index c5ba2cb34e45c..28a54298d27db 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2557,6 +2557,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> > struct kvm_gfn_range *range);
> > #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
> >
> > +unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn);
> > +
> > #ifdef CONFIG_KVM_GUEST_MEMFD
> > int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index 5011d38820d0d..f055e058a3f28 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -509,6 +509,37 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
> > return 0;
> > }
> >
> > +unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> > +{
> > + struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> > + struct inode *inode;
> > +
> > + /*
> > + * If this gfn has no associated memslot, there's no chance of the gfn
> > + * being backed by private memory, since guest_memfd must be used for
> > + * private memory, and guest_memfd must be associated with some memslot.
> > + */
> > + if (!slot)
> > + return 0;
> > +
> > + CLASS(gmem_get_file, file)(slot);
> > + if (!file)
> > + return 0;
> > +
> > + inode = file_inode(file);
> > +
> > + /*
> > + * Rely on the maple tree's internal RCU lock to ensure a
> > + * stable result. This result can become stale as soon as the
> > + * lock is dropped, so the caller _must_ still protect
> > + * consumption of private vs. shared by checking
> > + * mmu_invalidate_retry_gfn() under mmu_lock to serialize
> > + * against ongoing attribute updates.
> > + */
> > + return kvm_gmem_get_attributes(inode, kvm_gmem_get_index(slot, gfn));
> > +}
>
> Doesn't this imply that all consumers of kvm_mem_is_private() should
> validate the result using mmu_lock and the invalidation sequence?
> sev_handle_rmp_fault() calls kvm_mem_is_private() without holding
> mmu_lock and without any retry mechanism. Is that a problem?
Yes, but my understanding is that sev_handle_rmp_fault() can tolerate false
positives and false negatives. It's not optimal, but it's "fine", and already
KVM's existing behavior, e.g. KVM gets the PFN and then smashes the RMP, without
ensuring the PFN is fresh.
Mike, is that all correct?
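
For reference, the "snapshot, look up, recheck" pattern that the quoted
comment asks callers to follow can be modeled in user space roughly as
below. This is a simplified, single-threaded sketch under my own
assumptions, not KVM code: struct inval_model and the model_*() helpers
are hypothetical, and the locking (mmu_lock in KVM) is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the invalidation state a VM tracks. */
struct inval_model {
	unsigned long seq;        /* bumped when an invalidation completes */
	int in_progress;          /* number of invalidations still running */
	unsigned long start, end; /* gfn range being invalidated: [start, end) */
};

static void model_invalidate_begin(struct inval_model *m,
				   unsigned long start, unsigned long end)
{
	m->in_progress++;
	m->start = start;
	m->end = end;
}

static void model_invalidate_end(struct inval_model *m)
{
	assert(m->in_progress > 0);
	m->seq++;
	m->in_progress--;
}

/*
 * True if a value looked up after sampling @snap may be stale for @gfn
 * and must be discarded. In the real code this check runs under a lock.
 */
static bool model_retry(const struct inval_model *m, unsigned long snap,
			unsigned long gfn)
{
	if (m->in_progress && gfn >= m->start && gfn < m->end)
		return true;

	return m->seq != snap;
}
```

A consumer that follows the pattern samples seq first, does the
attribute lookup, and uses the result only if model_retry() returns
false. A consumer that skips the check, as discussed above, has to be
able to live with both stale answers.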
* Re: [RFC PATCH 0/7] mm/damon: hardware-sampled access reports + AMD IBS Op example
From: Ravi Jonnalagadda @ 2026-05-20 19:01 UTC (permalink / raw)
To: SeongJae Park
Cc: damon, linux-mm, linux-kernel, linux-doc, akpm, corbet, bijan311,
ajayjoshi, honggyu.kim, yunjeong.mun, bharata, Akinobu Mita
In-Reply-To: <20260519061905.89681-1-sj@kernel.org>
On Mon, May 18, 2026 at 11:19 PM SeongJae Park <sj@kernel.org> wrote:
>
> + Akinobu
>
> Hello Ravi,
>
> On Sat, 16 May 2026 15:34:25 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
>
> > Hi all,
> >
> > This is an RFC, not for merge. The series exercises and validates
> > damon_report_access() -- the consumer API SeongJae introduced in [1]
> > -- as a substrate for ingesting access reports from hardware-sampling
> > sources. The series includes one worked-example backend, an AMD IBS
> > Op module (damon_ibs.ko), that runs on Zen 3+ silicon via the
> > existing perf event subsystem.
>
> Thank you for sharing this great RFC series!
>
> [...]
> > Why a hardware-source primitive complements existing primitives
> > ===============================================================
> [...]
> > Both primitives produce a view of hotness that converges to the
> > true distribution over the aggregation interval. For systems where
> > the address space is small relative to the aggregation rate, this is
> > the right tool. On large heterogeneous-memory systems with goal-
> > driven schemes asking the closed-loop tuner to converge on a target
> > distribution, a complementary lower-latency view of accesses can
> > tighten the loop -- reducing the time DAMON's nr_accesses takes to
> > reflect the workload's actual access distribution, which in turn
> > reduces ramp duration and oscillation amplitude during convergence
> > of goal-driven schemes.
> >
> > A hardware-sampling primitive provides this complementary view:
> > hardware retirement records each access at its natural event rate,
> > with a physical address per sample, independent of TLB state and
> > independent of the unmap/fault path.
>
> Yes, I fully agree. Different multiple access check primitives have different
> characteristics.
>
> [...]
>
> > Demonstration
> > =============
> [...]
> > In both regimes, convergence to target is quick, and the workload's
> > measured DRAM share then holds within 1.3 percentage points of
> > target with standard deviation under 1.3 percentage points, sustained
> > over runs of 15-30 minutes per target.
>
> I understand this demonstration shows your AMD IBS-based version of DAMON is
> functioning as expected. Thank you for sharing this!
>
> [...]
> > What's in this series
> > =====================
> >
> > Patch 1. mm/damon/core: refcount ops owner module to prevent
> > rmmod UAF
> > Patch 2. mm/damon/paddr: export damon_pa_* ops for IBS module
> > Patch 3. mm/damon/core: replace mutex-protected report buffer
> > with per-CPU lockless ring
> > Patch 4. mm/damon/core: flat-array snapshot + bsearch in ring-
> > drain loop
> > Patch 5. mm/damon: add sysfs binding and dispatch hookup for
> > paddr_ibs operations
> > Patch 6. mm/damon/core: accept paddr_ibs in node_eligible_mem_bp
> > ops check
> > Patch 7. mm/damon/damon_ibs: add AMD IBS-based access sampling
> > backend
> >
> > Patches 1, 3, and 4 are general infrastructure that benefits any
> > consumer of damon_report_access(). Patches 2, 5, 6, and 7 are the
> > worked-example backend (paddr_ibs ops, sysfs binding, IBS module).
>
> I didn't read the detailed code of each patch. But my high level understanding
> is as below.
>
> Patches 1 and 2 are needed for supporting loadable module-based DAMON operation
> sets (access sampling backend).
>
> Patch 3 is needed for supporting access check primitives that can provide the
> access information only in NMI context. It can also speed up the access
> reporting in general, though.
>
> Patch 4 makes DAMON's internal reported access information retrieval faster, so
> will help any reporting-based DAMON operation set use case.
>
> Patches 5-7 are required for only the IBS-based DAMON operations set
> (paddr_ibs).
>
> So I agree patch 4 is a general infrastructure improvement that benefits
> multiple use cases.
>
> Patch 3 is also arguably general infrastructure improvement, as it will make
> the reporting faster in general.
>
> Patch 1 is not technically coupled with paddr_ibs, and will be needed for
> general loadable module based access check primitives. But, should we support
> loadable modules? If so, why?
>
> Patch 2 is also not technically coupled with paddr_ibs, to my understanding, so
> should be categorized together with patch 1? In other words, if we agree we
> should support loadable modules based DAMON operation sets, this should be
> useful for not only paddr_ibs but more general cases.
>
> Correct me if I'm wrong.
>
> >
> >
> > Patches worth folding into damon/next
> > =====================================
> >
> > Patches 1, 3, and 4 are not specific to IBS or to this RFC's
> > backend. Each is preparatory infrastructure that any consumer of
> > damon_report_access() will need:
> >
> > - Patch 1 (refcount ops owner) -- any modular ops set, including
> > out-of-tree backends, needs clean module unload to avoid UAF
> > on damon_unregister_ops.
> > - Patch 3 (per-CPU lockless ring) -- damon_report_access() cannot
> > be called from NMI context with the current mutex-protected
> > buffer. Hardware samplers all need NMI-safe submission.
> > - Patch 4 (flat-array snapshot + bsearch drain) -- the linear-
> > scan drain is O(reports x regions) and exceeds the sample
> > interval at high-CPU x large-region products. Bsearch brings
> > it to O(reports x log regions).
> >
> > If these belong directly on damon/next as preparatory patches for
> > damon_report_access() rather than living inside an IBS-specific
> > track, we are happy to rebase and resend them that way.
>
> So I'm a bit unsure about patch 1. If we don't have a plan to support loadable
> modules based DAMON operations set, we might not need it for now.
>
> For patches 3 and 4, I agree those will be useful in general. Nonetheless, I'd
> slightly prefer to do those optimizations at the later part of the long term
> project.
>
> >
> >
> > Relation to prior and ongoing work
> > ==================================
> >
> > The IBS sampling pattern in patch 7 -- attr.config=0 to use IBS Op
> > default config, dc_phy_addr_valid filter, NMI-safe sample submission
> > -- is derived from concepts in Bharata B Rao's pghot RFC v5 [3].
> > The attribution header is in mm/damon/damon_ibs.c and the patch
> > carries a Suggested-by: trailer.
> >
> > Bharata's pghot v7 [4] introduces a different IBS driver targeting
> > the new IBS Memory Profiler (IBS-MProf) facility, which Bharata
> > describes as a facility "that will be present in future AMD
> > processors" -- a separate IBS instance from the one this RFC's
> > backend uses. This version of the driver, based on v5 [3], is an
> > example of how DAMON can benefit from an AMD IBS hardware
> > source and independently validates the importance of IBS information.
> > It is not meant to be merged in the current form.
> > @Bharata if you see a path where IBS samples can be consumed
> > by DAMON at some point, will be happy to collaborate.
> >
> > Akinobu Mita's perf-event-based access-check RFC [5] explores a
> > configurable perf-event-driven access source for DAMON. IBS has
> > vendor-specific MSR setup beyond what perf_event_attr alone
> > expresses (e.g. dc_phy_addr_valid filtering on the produced sample,
> > not on the perf attr), so the IBS path here appears complementary
> > to [5] -- operators choose based on whether their hardware sampler
> > fits stock perf or needs additional kernel-side setup.
>
> So apparently there are multiple approaches to develop and use h/w-based access
> monitoring. Akinobu and you are trying to do that using DAMON as the frontend,
> and already made the working prototypes. There were more people who showed
> interest and will to contribute to this project other than you, too. I 100%
> agree h/w-based access monitoring can be useful, and I of course think using
> DAMON as the frontend is the right approach. I'm all for getting this
> upstreamed.
>
> I was therefore spending time on thinking about in what long-term maintainable
> shape this capability can successfully be upstreamed. I suggested
> damon_report_access() as the internal interface between DAMON and the h/w-based
> access check primitives, and apparently we all (I, Ravi and Akinobu in this
> context) agreed. Akinobu thankfully revisioned his implementation based on
> damon_report_access() interface. Ravi also implemented this RFC based on the
> interface.
>
> After making the consensus with Akinobu, I was taking time on the user space
> interface. When I was discussing with Akinobu, my idea was extending the user
> interface for the page faults based monitoring v3 [1]. But, recently I decided
> to make this more general, so proposed data attributes monitoring extension [2]
> at LSFMMBPF. The patch series for the initial change [3] is merged into mm-new
> for more testing, today. The cover letter of the patch series is also sharing
> how it will be extended for h/w based access monitoring in long term.
>
> I of course want us to go in this direction. I believe you already had chances
> to take a look on the long term plan and didn't make some voice because you
> don't strongly disagree about the plan. If not, please make a voice.
>
Hi SJ,
One layering question I'd like to flag before the plan is written,
since it affects how this RFC's substrate slots in:
In [3], .apply_probes is a periodic per-region classifier driven
from kdamond_fn after .check_accesses, in process context, that
applies a (folio -> bool) predicate to each region's sampling_addr
and accounts the results in r->probe_hits[]. damon_report_access()
on the other hand is a per-event delivery callback into a per-CPU
buffer, called from the access source (NMI for IBS / PEBS / SPE,
process context for page-fault-based sources). These appear to
me to sit at different layers - delivery vs. classification.
The reason I want to confirm this: NMI context for HW samplers
precludes the operations .apply_probes can do today (no mutex, no
kmalloc, no sleep, no folio lookup that touches pte_lock). And
the data shape is inverted - .apply_probes asks "does region R's
sampling_addr have attribute A?", evaluated on the kdamond-chosen
address; an HW sample announces "PA Y was accessed at retirement
time T", arriving asynchronously and needing to find the region
it falls into. If access events end up routed through
.apply_probes in the long-term plan, the IBS / PEBS / SPE
backends would each need a deferral path under it (per-CPU ring
for NMI-safe submission, region mapping at drain time).
Happy to be wrong here if you see a unified shape that handles
both - just want to surface the constraint before the plan is
written.
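
To make the "per-CPU ring for NMI-safe submission, region mapping at
drain time" shape concrete, here is a minimal user-space sketch with
C11 atomics. It is not the code from patches 3 or 4: the struct and
function names are hypothetical, it models a single ring with one
producer and one consumer, and it assumes the regions array is sorted
and non-overlapping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define RING_SIZE 8u /* hypothetical; must be a power of two */
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "power of two");

struct report_ring {
	_Atomic unsigned int head; /* written by the producer only */
	_Atomic unsigned int tail; /* written by the consumer only */
	uint64_t paddr[RING_SIZE];
};

struct region {
	uint64_t start, end; /* [start, end) */
	unsigned int nr_hits;
};

/* Delivery: no lock, no allocation, never blocks. Drops when full. */
static bool ring_push(struct report_ring *r, uint64_t paddr)
{
	unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return false;
	r->paddr[head % RING_SIZE] = paddr;
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* bsearch() comparator: is the address below, inside or above the region? */
static int region_cmp(const void *key, const void *elem)
{
	uint64_t pa = *(const uint64_t *)key;
	const struct region *reg = elem;

	if (pa < reg->start)
		return -1;
	return pa >= reg->end ? 1 : 0;
}

/* Drain: map each queued sample to its region; returns samples consumed. */
static unsigned int ring_drain(struct report_ring *r, struct region *regions,
			       size_t nr_regions)
{
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	unsigned int head = atomic_load_explicit(&r->head, memory_order_acquire);
	unsigned int n = 0;

	for (; tail != head; tail++, n++) {
		uint64_t pa = r->paddr[tail % RING_SIZE];
		struct region *hit = bsearch(&pa, regions, nr_regions,
					     sizeof(*regions), region_cmp);

		if (hit)
			hit->nr_hits++;
	}
	atomic_store_explicit(&r->tail, tail, memory_order_release);
	return n;
}
```

ring_push() is the delivery layer and does nothing an NMI handler could
not do; the region lookup only happens in ring_drain(), which would run
in process context. The single-producer assumption holds only if
nothing else on the same CPU pushes to the same ring.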
On the loadable-module question for patches 1 and 2: agreed it's a
genuinely open architectural call, not just a paddr_ibs convenience.
- paddr_ibs (this RFC) targets the existing IBS Op facility on
Zen 3+ silicon via the perf event subsystem and uses a
vendor-specific overflow-handler filter that perf_event_attr
cannot express (dc_phy_addr_valid in IBS_OP_DATA3). Bharata's
pghot v7 [pghot-v7] introduces a separate IBS driver targeting
the new IBS-MProf facility on future AMD silicon via direct MSR
programming -
not perf at all. These are two AMD-specific HW samplers with
non-overlapping silicon coverage and non-overlapping kernel
paths. A distro shipping a single kernel image to a fleet
with mixed silicon needs runtime-selectable backends, which
obj-y can't do across exclusive `depends on` chains.
- Akinobu's perf-event RFC v3 [akinobu-v3] is a useful contrast:
it stays builtin because it's a generic configurable
perf_event_attr passthrough, no vendor-specific code in the
overflow handler. The tristate case is specifically for the
backends that need vendor logic outside perf_event_attr
(IBS dc_phy_addr_valid, future ARM SPE record-format
handling, future Intel PEBS DLA quirks if they need
kernel-side filtering beyond what perf delivers).
Bharata, I would value your perspective on two related questions: in
your long-term plan for pghot, do you see the legacy IBS Op path
(this RFC) staying as a DAMON-side backend, while the new IBS-MProf
path lands under pghot? Or do you envision both IBS facilities
eventually feeding through a common HW-sampler primitive (pghot or
DAMON), with the frontend selectable by user config? And on existing
Zen 3+ silicon: is the legacy IBS Op driver in this RFC the right
home for those processors going forward?
Thanks,
Ravi
> Assuming you don't have concern on the long term plan yet, I will take time to
> write down more formal and detailed plan. It will explain the overall roadmap,
> timeline and how we could collaborate. On top of that, we could further
> discuss.
>
> >
> >
> > Specific asks
> > =============
> >
> > To SeongJae:
> >
> > 1. Patches 1, 3, and 4 are infrastructure that benefits any consumer
> > of damon_report_access(), not just the IBS backend in this RFC.
> > Would these belong directly on damon/next as preparatory patches
> > for damon_report_access(), rather than living inside an
> > IBS-specific track? Happy to rebase and resend them that way if
> > you'd prefer that shape. Tested-by: tags can come along.
>
> I'm still thinking about how we can collaborate well. The answer for the above
> question would be a part of that. In other words, I have no good answer right
> now, sorry. Could you please give me more time to think more and share the
> plan? I will share the plan as another mail. On the thread, we could further
> discuss. Of course, we could have DAMON beer/coffee/tea chats [4] like
> additional discussions before/after/during the plan discussion.
>
> So, long story short, we agreed this project (h/w-based data access monitoring)
> should be upstreamed. But give me a little more time to think about how we
> will do it and collaborate. It will take some time, so please bear with me.
> Sorry for making you wait, but I'm pretty sure, and promise, that we will
> eventually make it.
>
> [1] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org
> [2] https://lwn.net/Articles/1071256/
> [3] https://lore.kernel.org/20260518234119.97569-1-sj@kernel.org
> [4] https://docs.google.com/document/d/1v43Kcj3ly4CYqmAkMaZzLiM2GEnWfgdGbZAH3mi2vpM/edit?usp=sharing
>
>
> Thanks,
> SJ
>
> [...]