* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: John Ericson @ 2026-06-09 17:27 UTC (permalink / raw)
To: Li Chen, Andy Lutomirski
Cc: Christian Brauner, Kees Cook, Al Viro, linux-fsdevel, linux-api,
LKML, linux-mm, linux-arch, linux-doc, linux-kselftest, x86,
Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Jan Kara, Jonathan Corbet,
Shuah Khan
In-Reply-To: <19eacd64508.26b92c022125848.262962729296162879@linux.beauty>
On Tue, Jun 9, 2026, at 10:43 AM, Li Chen wrote:
> Hi Andy,
>
> ---- On Tue, 09 Jun 2026 08:01:57 +0800 Andy Lutomirski <luto@kernel.org> wrote ---
> > [...]
> >
> > After contemplating this for a bit... why pidfd? Doesn't a pidfd
> > refer to an actual process that is, or at least was, running? This
> > new thing is a process that we are contemplating spawning. I can
> > imagine that basically all pidfd APIs would be a bit confused by the
> > nonexistence of the process in question.
> >
>
> Yes, I think that is a real concern.
>
> In my current local WIP I tried to keep that distinction explicit.
> pidfd_spawn_open() returns a pidfs-backed builder fd, not a normal pidfd
> referring to a process. The builder fd is allocated as an anonymous pidfs
> file with builder-specific file operations:
>
> file = pidfs_alloc_anon_file("[pidfd_spawn]",
> &pidfd_spawn_builder_fops, builder,
> O_RDWR);
>
What does your builder fd point to, explicitly? For example in my other reply I
talked about how it was "real" process state. In my FreeBSD patch, for example,
I found there was already a status for a process "in exec", and I figured that
was clean to reuse for one of these "embryonic" processes that also hadn't
started running. I would reckon that Linux probably has some similar notions.
> and the normal pidfd helpers still reject it because it does not use the
> ordinary pidfd file operations:
>
> struct pid *pidfd_pid(const struct file *file)
> {
> if (file->f_op != &pidfs_file_operations)
> return ERR_PTR(-EBADF);
> return file_inode(file)->i_private;
> }
>
> So the current split is:
>
> builder_fd = pidfd_spawn_open(...); /* builder object */
> pidfd_config(builder_fd, ...);
> child_pidfd = pidfd_spawn_run(builder_fd, ...); /* real pidfd */
>
> Only the last fd is a normal pidfd for an actual child process. The builder
> fd is only accepted by the builder operations.
>
> This avoids having to define what waitid(P_PIDFD), pidfd_send_signal(),
> pidfd_getfd(), poll(), etc. mean before the process exists.
I wouldn't be so sure this is necessary/good. For example, I think it could
make sense to wait on a process that has yet to be started; one just waits for
both the process to start and the process to exit. Obviously a blocking syscall
in the thread that is spawning the process is not useful, but the asynchronous
poll variation seems fine.
As long as there is real process state here, it shouldn't be too hard to
implement.
> The downside is that it adds a separate open-style entry point and is less
> uniform than the pidfd_open(0, PIDFD_EMPTY) spelling Christian sketched.
I do think there is no point having two file descriptors. The file descriptor
that previously referred to the builder/embryonic process then can refer to the
real process, right?
> If people think there is a better way to represent the pre-spawn builder
> state, or if the preference is to integrate it directly into pidfd_open()
> with an explicit empty/future-pidfd state, I would be happy to discuss that.
Hope the above answers your question? I suppose my ideas lean more on the
"future" than "empty" side --- there is indeed a thread in the thread group,
with real VM/namespace/file descriptor etc. state. Moreover, state gets
initialized before the process is started, so the actual start is a pretty
lightweight step of just letting the scheduler know the now-ready process can
be scheduled. The only thing that distinguishes the embryonic process from a
real one is simply that it isn't running --- i.e. isn't (yet) available to be
scheduled --- so the pidfds holders are free to poke at its state.
Cheers,
John
^ permalink raw reply
* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Jann Horn @ 2026-06-09 17:53 UTC (permalink / raw)
To: Florian Weimer
Cc: Mateusz Guzik, Christian Brauner, Li Chen, Kees Cook,
Alexander Viro, linux-fsdevel, linux-api, linux-kernel, linux-mm,
linux-arch, linux-doc, linux-kselftest, x86, Arnd Bergmann,
Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Jan Kara, Jonathan Corbet,
Shuah Khan
In-Reply-To: <lhubjdk1c1m.fsf@oldenburg.str.redhat.com>
On Tue, Jun 9, 2026 at 8:08 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Jann Horn:
>
> >> Per the above, the primary win would stem from *NOT* messing with mm.
> >
> > As you write below, I think we have that with CLONE_MM? The C function
> > vfork() is kind of a terrible API because of its returns-twice
> > behavior, but I think if process cloning with CLONE_VM|CLONE_VFORK was
> > wrapped by libc in a way similar to clone() (with the child executing
> > a separate handler function), or if it was used in the implementation
> > of some higher-level process-spawning API, it would be a perfectly
> > fine API?
>
> No, there is still a problem with SIGTSTP handling because we cannot
> atomically unmask the signal during execve. We need to unblock SIGTSTP
> before execve in the new process, but this means that it can get
> suspended by SIGTSTP. Consequently, the execve never happens and the
> original process is stuck in vfork:
>
> posix_spawn: parent can get stuck in uninterruptible sleep if child
> receives SIGTSTP early enough
> <https://inbox.sourceware.org/libc-help/2921668c-773e-465d-9480-0abb6f979bf9@www.fastmail.com/>
>
> More on the low-level side, it's difficult to make sure that execve gets
> a consistent snapshot of the environ vector. Both vfork and execve need
> to be async-signal-safe. Any locking or memory allocation (except for
> the stack …) persists in the original process after vfork returns. The
I think that's not entirely accurate; if you call set_robust_list() on
a futex list, then call execve(), the futexes should be released once
the process switches to a new MM, in
begin_new_exec -> exec_mmap -> exec_mm_release -> futex_exec_release
-> futex_cleanup -> exit_robust_list.
So in theory you could use clone() with CLONE_VM and without
CLONE_VFORK, and let the parent either wait for a futex that is
released on exec, or somehow asynchronously check later whether the
futex is still held... probably not the nicest building block but
maybe workable? Though I guess it would fit more nicely if there was a
"munmap() this range on exec" API...
> environ vector can be large, so making a copy on the stack is not ideal.
> It's even harder for getenv/setenv/unsetenv implementations that use
> locking instead of software transactional memory.
Makes sense, that kind of sounds like a pain inherent in being able to
execute from signal handler context...
^ permalink raw reply
* Re: [PATCH v3] cpu/hotplug: Fix NULL kobject warning in cpuhp_smt_enable()
From: Catalin Marinas @ 2026-06-09 17:58 UTC (permalink / raw)
To: Jinjie Ruan
Cc: Will Deacon, corbet, skhan, punit.agrawal, jic23,
osama.abdelkader, chenl311, fengchengwen, suzuki.poulose, maz,
lpieralisi, timothy.hayes, sascha.bischoff, arnd,
mrigendra.chaubey, pierre.gondois, dietmar.eggemann, yangyicong,
sudeep.holla, linux-arm-kernel, linux-doc, linux-kernel
In-Reply-To: <02932ef7-5819-4cf5-8e78-8fd3fd40274f@huawei.com>
Hi Jinjie,
On Wed, Jun 03, 2026 at 02:38:11PM +0800, Jinjie Ruan wrote:
> On 6/2/2026 7:09 PM, Will Deacon wrote:
> > On Wed, May 20, 2026 at 10:20:23AM +0800, Jinjie Ruan wrote:
> >> When booting with ACPI, arm64 smp_prepare_cpus() currently sets all
> >> enumerated CPUs as "present" regardless of their status in the MADT. This
> >> causes issues with SMT hotplug control. For instance, with QEMU's
> >> "-smp 4,maxcpus=8" configuration, the MADT GICC entries are populated as
> >> follows: the first four CPUs are marked Enabled while the remaining four
> >> are marked Online Capable to support potential hot-plugging.
> >>
> >> Fix this by:
> >>
> >> 1. When booting with ACPI, checking the ACPI_MADT_ENABLED flag in the GICC
> >> entry before calling set_cpu_present() during SMP initialization.
> >>
> >> 2. Properly managing the present mask in acpi_map_cpu() and
> >> acpi_unmap_cpu() to support actual CPU hotplug events, This aligns with
> >> other architectures like x86 and LoongArch.
> >>
> >> 3. Update the arm64 CPU hotplug documentation to no longer state that all
> >> online-capable vCPUs are marked as present by the kernel at boot time.
> >>
> >> This ensures that only physically available or explicitly enabled CPUs
> >> are in the present mask, keeping the SMT control logic consistent with
> >> the actual hardware state.
> >
> > Please can you check the Sashiko review comment?
> >
> > https://sashiko.dev/#/patchset/20260520022023.126670-1-ruanjinjie@huawei.com
>
> I think commit eba4675008a6 ("arm64: arch_register_cpu() variant to
> check if an ACPI handle is now available.") introduced this bug.
>
> It introduced an architectural safety block inside
> arch_unregister_cpu(). If a hot-unplug operation is determined to be a
> physical hardware removal (where _STA evaluates to
> !ACPI_STA_DEVICE_PRESENT), it aborts the unregistration transaction
> early to protect unreadied arm64 infrastructure, thereby skipping
> unregister_cpu().
>
> However, the generic ACPI processor driver path in
> acpi_processor_post_eject() currently treats arch_unregister_cpu() as
> an unconditional void operation. When arch_unregister_cpu() bails out
> early, the subsequent cleanup flow blindly proceeds to call
> acpi_unmap_cpu(), clears global per-cpu processor arrays, and
> unconditionally free the 'struct acpi_processor' object.
>
> I think we can fix this by:
>
> 1. Refactoring arch_unregister_cpu() to return an integer
> transaction status. It returns -EOPNOTSUPP when aborting due to physical
> hot-remove blocking, -EINVAL/-EIO on firmware failures, and 0 only upon
> successful unregistration.
>
> 2. Guarding the downstream execution flow in
> acpi_processor_post_eject(). If arch_unregister_cpu() returns a error
> code, the hot-unplug transaction is considered aborted.
I wonder whether we need all this guarding. In the worst case, we could
rewrite the function, something like below, to always unregister and
only warn:
void arch_unregister_cpu(int cpu)
{
acpi_handle acpi_handle = acpi_get_processor_handle(cpu);
struct cpu *c = &per_cpu(cpu_devices, cpu);
acpi_status status;
unsigned long long sta;
if (!acpi_handle) {
pr_err_once("Removing a CPU without associated ACPI handle\n");
} else {
status = acpi_evaluate_integer(acpi_handle, "_STA", NULL, &sta);
if (!ACPI_FAILURE(status) &&
cpu_present(cpu) && !(sta & ACPI_STA_DEVICE_PRESENT))
pr_err_once("Changing CPU present bit is not supported\n");
}
unregister_cpu(c);
}
However, on the first condition, can we actually trigger !acpi_handle?
If not, we could just drop it. I tried to look up the paths and I don't
think we'd ever end up in this function with !acpi_handle. So this
leaves us with the next checks.
On the second/third conditions, it's more about preventing physical CPU
hotplug as we haven't properly defined it for arm yet but we could just
add a WARN_ONCE() to make it more visible and still proceed with the
unregistering. I think with your proposal, we don't fully unroll the
state anyway just by returning an error in arch_unregister_cpu(), so I'd
rather continue here.
What does firmware do for virtual CPU hotplug w.r.t. _STA? I noticed a
slight change in wording in the cpu-hotplug.rst doc with your patch from
On virtual systems the _STA method must always report the CPU as
``present``
to
On virtual systems the _STA method must report the CPU as ``present``
when it is activated by the firmware
Was your intention that _STA.PRESENT can become 0 when hot-unplugging
virtual CPUs?
--
Catalin
^ permalink raw reply
* Re: [PATCH v6 08/12] PCI: liveupdate: Inherit ACS flags in incoming preserved devices
From: David Matlack @ 2026-06-09 18:40 UTC (permalink / raw)
To: Pranjal Shrivastava
Cc: kexec, linux-doc, linux-kernel, linux-mm, linux-pci,
Adithya Jayachandran, Alexander Graf, Alex Williamson,
Bjorn Helgaas, Chris Li, David Rientjes, Jacob Pan,
Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
Pratyush Yadav, Saeed Mahameed, Samiullah Khawaja, Shuah Khan,
Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <aihLTgs1Y49OXQaV@google.com>
On Tue, Jun 9, 2026 at 10:20 AM Pranjal Shrivastava <praan@google.com> wrote:
>
> On Mon, Jun 08, 2026 at 09:56:41PM +0000, David Matlack wrote:
> > On 2026-06-07 08:37 PM, Pranjal Shrivastava wrote:
> > > On Fri, May 22, 2026 at 08:24:06PM +0000, David Matlack wrote:
> > > > Inherit Access Control Services (ACS) flags on all incoming preserved
> > > > devices (endpoints and upstream bridges) during a Live Update.
> > > >
> > > > Inheriting ACS flags avoids changing routing rules while memory
> > > > transactions are in flight from preserved devices. This is also strictly
> > > > necessary to ensure that IOMMU group assignments do not change across
> > > > a Live Update for preserved devices, as changing ACS configurations can
> > > > split or merge IOMMU groups.
> > > >
> > > > Cache the inherited ACS controls established by the previous kernel in
> > > > struct pci_dev so that ACS controls do not change after a reset
> > > > (pci_restore_state() calls pci_enable_acs()).
> > > >
> > > > To simplify ACS inheritance, reject preserving any devices that require
> > > > quirks to enable ACS as those quirks would also have to take Live Update
> > > > into account.
> > > >
> > > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > > ---
> > > > drivers/pci/liveupdate.c | 68 ++++++++++++++++++++++++++++++++++
> > > > drivers/pci/liveupdate.h | 11 ++++++
> > > > drivers/pci/pci.c | 5 +++
> > > > drivers/pci/pci.h | 5 +++
> > > > drivers/pci/quirks.c | 7 ++++
> > > > include/linux/pci_liveupdate.h | 6 +++
> > > > 6 files changed, 102 insertions(+)
> > > >
> > >
> > > [...]
> > >
> > > >
> > > > +void pci_liveupdate_init_acs(struct pci_dev *dev)
> > > > +{
> > > > + guard(rwsem_read)(&pci_liveupdate.rwsem);
> > > > +
> > > > + if (!dev->acs_cap || !dev->liveupdate.incoming)
> > > > + return;
> > > > +
> > > > + pci_read_config_word(dev, dev->acs_cap + PCI_ACS_CTRL, &dev->liveupdate.acs_ctrl);
> > >
> > > I might be thinking out loud here, but as an attacker, this motivates me
> > > to somehow hack the EP FW to mis-report the PCI_ACS_CTRL register across
> > > a liveupdate to fool the incoming kernel. If the FW feeds a 0, it silently
> > > strips ACS protections.
> > >
> > > Should we also serialize ACS state in ser somehow to ensure we aren't
> > > fooled by something like this?
> >
> > What does "EP FW" mean?
>
> I was referring to the Endpoint Firmware (basically any SW running on
> a downstream device)
>
> >
> > Does such an attacker even need Live Update to attack the system? It
> > seems like such an attacker could route TLPs in whatever malicious way
> > they want regardless of Live Update.
> >
>
> I agree that compromised PCIe devices are a menace anyway. But I was
> talking about the potential window opened up by Live Update here,
> suppose we have Device A & B assigned to 2 different VMs (implying they
> are in separate IOMMU groups because the switch set ACS_RR = 1).
>
> Now, the attacker has an opportunity with Liveupdate, since the devices
> are already assigned, if *somehow* it flips a bit like ACS_RR, the
> incoming kernel might see both the devices in the same IOMMU group.
> Who detects this case and what happens if this happens if the devices
> are kept assigned to these VMs?
I suspect that would be caught during the restore of the iommufds to
which those devices are attached.
The kernel would attempt to restore each device into a separate domain
(since that's how they were preserved before the Live Update) but that
will fail because they are in the same group now. Even if one of the
devices was not preserved, that will still cause a failure when a user
tries to start using that device (e.g. to try to attach it to a
different VM).
^ permalink raw reply
* Re: [PATCH v9 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: David Hildenbrand (Arm) @ 2026-06-09 18:41 UTC (permalink / raw)
To: Breno Leitao
Cc: Miaohe Lin, Andrew Morton, Lorenzo Stoakes, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
Naoya Horiguchi, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
lance.yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team
In-Reply-To: <aig7jzwDHfVCxikl@gmail.com>
On 6/9/26 18:15, Breno Leitao wrote:
> On Tue, Jun 09, 2026 at 04:41:01PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/9/26 12:56, Breno Leitao wrote:
>>> get_any_page() collapses every HWPoisonHandlable() rejection into a
>>> single -EIO via the __get_hwpoison_page() -> -EBUSY -> shake_page()
>>> -> retry path. That is correct for the transient case (a userspace
>>> folio briefly off LRU during migration or compaction, which a later
>>> shake can drag back), but wrong for stable kernel-owned pages: slab,
>>> page-table, large-kmalloc and PG_reserved pages will never become
>>> HWPoisonHandlable(), so the retry loop is wasted work and the final
>>> -EIO loses the "this is structurally unrecoverable" information.
>>> memory_failure() then maps -EIO into MF_MSG_GET_HWPOISON, which the
>>> panic-on-unrecoverable sysctl deliberately does not act on.
>>>
>>> Introduce HWPoisonKernelOwned(), a small predicate that positively
>>> identifies pages the hwpoison handler cannot recover from:
>>>
>>> HWPoisonKernelOwned(p, flags) :=
>>> !(MF_SOFT_OFFLINE && page_has_movable_ops(p)) &&
>>> (PageReserved(p) ||
>>> PageSlab(head) || PageTable(head) || PageLargeKmalloc(head))
>>>
>>> where head = compound_head(p).
>>>
>>> PG_reserved is a per-page flag (PF_NO_COMPOUND) and is tested on the
>>> page directly. The slab, page-table and large-kmalloc page-type bits
>>> are only stored on the head page, so those tests resolve the compound
>>> head first, then re-read compound_head(page) afterwards: a concurrent
>>> split or compound free that moves head invalidates the just-read flags
>>> and the loop retries. The lookup still takes no refcount, mirroring
>>> the rest of get_any_page(); the recheck closes the common split race,
>>> and a residual free->alloc->free in the same window can only mis-tag
>>> a genuinely poisoned page, never reclassify a handlable one.
>>>
>>> The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors the
>>> same exception in HWPoisonHandlable(): soft-offline is allowed to
>>> migrate movable_ops pages even though they are not on the LRU, and
>>> we must not pre-empt that with an unrecoverable verdict.
>>>
>>> The list is intentionally not exhaustive. vmalloc and kernel-stack
>>> pages, for example, do not carry a page_type bit and would need a
>>> different oracle; they keep going through the existing retry path
>>> unchanged. This is the smallest set we can identify with certainty
>>> by page type.
>>>
>>> Wire the helper into the top of get_any_page() to short-circuit
>>> those pages before the retry loop runs. On a hit, drop the caller's
>>> MF_COUNT_INCREASED reference (if any) and return -ENOTRECOVERABLE
>>> straight away. Pages outside the helper's positive list still take
>>> the existing retry path and return -EIO, leaving operator-visible
>>> behaviour for those cases unchanged.
>>>
>>> Extend the unhandlable-page pr_err() to fire for either errno and
>>> update the get_hwpoison_page() kerneldoc to document the new return.
>>>
>>> memory_failure() still folds every negative return into
>>> MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
>>> this patch on its own only changes the errno that soft_offline_page()
>>> can propagate to its callers. A follow-up wires -ENOTRECOVERABLE
>>> through memory_failure() and reports MF_MSG_KERNEL for the
>>> unrecoverable cases, which is what the
>>> panic_on_unrecoverable_memory_failure sysctl observes.
>>>
>>> Suggested-by: David Hildenbrand <david@kernel.org>
>>> Suggested-by: Lance Yang <lance.yang@linux.dev>
>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>> ---
>>> mm/memory-failure.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>>> 1 file changed, 58 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index f4d3e6e20e13..eed9de387694 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -1325,6 +1325,46 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
>>> return PageLRU(page) || is_free_buddy_page(page);
>>> }
>>>
>>> +/*
>>> + * Positive identification of pages the hwpoison handler cannot recover.
>>> + * These page types are owned by kernel internals (no userspace mapping
>>> + * to unmap, no file mapping to invalidate, no migration target), so the
>>> + * shake_page() / retry loop in get_any_page() can never turn them into
>>> + * something HWPoisonHandlable() will accept. Short-circuit them to
>>> + * -ENOTRECOVERABLE so callers can panic on operator request instead of
>>> + * spinning through retries that exit as a transient-looking -EIO.
>>> + *
>>> + * The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors
>>> + * HWPoisonHandlable(): soft-offline is allowed to migrate movable_ops
>>> + * pages even though they are not on the LRU.
>>> + */
>>> +static inline bool HWPoisonKernelOwned(struct page *page, unsigned long flags)
>>> +{
>>> + struct page *head;
>>> +
>>> + if ((flags & MF_SOFT_OFFLINE) && page_has_movable_ops(page))
>>> + return false;
>>> +
>>
>> On a second look: Do we really need that? The page types below never support
>> migration. So I guess that check is not required?
>>
>> Apart from that, looks good with two comments:
>>
>> a) HWPoisonKernelOwned: this is not the common style for us to name functions.
>>
>> is_kernel_owned_page() or sth like that would do.
>
> Ack, I will rename it is_kernel_owned_page()
>
> In my defence, most of the functions similar to HWPoisonKernelOwned()
> has this name format, and I got this discussion earlier (with Lance?
> I think). Here are the similar function names in that file:
>
> * HWPoisonHandlable
> * PageHWPoisonTakenOff()
> * SetPageHWPoisonTakenOff
Some of these probably date back to our old way of handling page flags and
things, like PageLRU.
But we really should stop :)
>
> I will update in the new version.
Thanks! Probably best to wait a bit, the merge window is coming up either way,
so this will have to wait a bit either way.
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH v4 2/2] iio: dac: Add AD5529R DAC driver support
From: Andy Shevchenko @ 2026-06-09 18:44 UTC (permalink / raw)
To: Janani Sunil
Cc: Lars-Peter Clausen, Michael Hennerich, Jonathan Cameron,
David Lechner, Nuno Sá, Andy Shevchenko, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Philipp Zabel, Jonathan Corbet,
Shuah Khan, linux-iio, devicetree, linux-kernel, linux-doc,
Janani Sunil
In-Reply-To: <20260609-ad5529r-driver-v4-2-2e4c02234a1a@analog.com>
On Tue, Jun 09, 2026 at 05:00:21PM +0200, Janani Sunil wrote:
> Add support for AD5529R 16-channel, 12/16 bit Digital to Analog Converter
Looks like half-dropped sentence...
> +static const struct regmap_range ad5529r_16bit_read_only_ranges[] = {
> + regmap_reg_range(AD5529R_REG_DAC_DATA_READBACK_BASE,
> + (AD5529R_REG_DAC_DATA_READBACK_BASE + 15 * 2)),
Too many parentheses.
> + regmap_reg_range(AD5529R_REG_TSENS_ALERT_FLAG, AD5529R_REG_TSENS_SHTD_FLAG),
> + regmap_reg_range(AD5529R_REG_FUNC_BUSY, AD5529R_REG_FUNC_BUSY),
> + regmap_reg_range(AD5529R_REG_INIT_CRC_ERR_STAT, AD5529R_REG_INIT_CRC_ERR_STAT),
> +};
...
> +static int ad5529r_reset(struct ad5529r_state *st)
> +{
> + struct reset_control *rst;
> + int ret;
> +
> + rst = devm_reset_control_get_optional_exclusive(&st->spi->dev, NULL);
> + if (IS_ERR(rst))
> + return PTR_ERR(rst);
> +
> + if (rst) {
> + ret = reset_control_deassert(rst);
> + if (ret)
> + return ret;
> + } else {
> + ret = regmap_write(st->regmap_8bit, AD5529R_REG_INTERFACE_CONFIG_A,
> + AD5529R_INTERFACE_CONFIG_A_SW_RESET);
> + if (ret)
> + return ret;
> + }
> +
> + /*
> + * Wait 10 ms for digital initialization to complete.
> + * Per datasheet, Interface Status A register NOT_READY_ERR bit is
> + * set if SPI transactions are attempted before digital initialization
> + * completes.
> + */
> + fsleep(10000);
10 * USEC_PER_MSEC
(needs time.h if not yet included).
> + return regmap_write(st->regmap_8bit, AD5529R_REG_INTERFACE_CONFIG_A,
> + AD5529R_INTERFACE_CONFIG_A_SDO_ENABLE |
> + AD5529R_INTERFACE_CONFIG_A_ADDR_ASCENSION);
> +}
...
> + case IIO_CHAN_INFO_OFFSET:
> + range_idx = st->output_range_idx[chan->channel];
> +
> + if (ad5529r_output_ranges_mv[range_idx][0] < 0)
> + *val = -(1 << (st->model_data->resolution - 1));
From the C standard point of view this is problematic code in case
if resolution == 32, but I believe it won't happen IRL.
> + else
> + *val = 0;
> +
> + return IIO_VAL_INT;
...
> +static int ad5529r_find_output_range(const s32 *vals)
> +{
> + for (unsigned int i = 0; i < ARRAY_SIZE(ad5529r_output_ranges_mv); i++) {
> + if (vals[0] == ad5529r_output_ranges_mv[i][0] * 1000 &&
> + vals[1] == ad5529r_output_ranges_mv[i][1] * 1000)
Make suffix of the array _mV (yes, like in SI).
> + return i;
> + }
> +
> + return -EINVAL;
> +}
...
> +static int ad5529r_parse_channel_ranges(struct device *dev,
> + struct ad5529r_state *st)
> +{
> + int ret, range_idx;
> + u32 ch;
> + s32 vals[2];
Reversed xmas tree order.
> + device_for_each_child_node_scoped(dev, child) {
> + range_idx = AD5529R_RANGE_0V_5V;
> +
> + ret = fwnode_property_read_u32(child, "reg", &ch);
> + if (ret)
> + return dev_err_probe(dev, ret,
> + "Missing reg property in channel node\n");
> +
> + if (ch >= 16)
> + return dev_err_probe(dev, -EINVAL,
> + "Invalid channel number: %u\n", ch);
> +
> + /* Read u32 property into s32 to handle negative voltage ranges */
> + if (!fwnode_property_read_u32_array(child,
> + "adi,output-range-microvolt",
> + (u32 *)vals, ARRAY_SIZE(vals))) {
Yeah... Do you know how negative numbers will look in DT? Just as regular
negative integers and DT compiler will take care to pack them in unsigned
variable?
> + range_idx = ad5529r_find_output_range(vals);
> + if (range_idx < 0)
> + return dev_err_probe(dev, range_idx,
> + "Invalid range [%d %d] for ch %u\n",
> + vals[0], vals[1], ch);
> + }
> +
> + st->output_range_idx[ch] = range_idx;
> + ret = regmap_write(st->regmap_16bit,
> + AD5529R_REG_OUT_RANGE(ch), range_idx);
> + if (ret)
> + return dev_err_probe(dev, ret,
> + "Failed to configure range for ch %u\n",
> + ch);
> + }
> +
> + return 0;
> +}
...
> +static int ad5529r_probe(struct spi_device *spi)
> +{
> + struct device *dev = &spi->dev;
> + struct iio_dev *indio_dev;
> + struct ad5529r_state *st;
> + bool external_vref;
> + int ret;
> +
> + indio_dev = devm_iio_device_alloc(dev, sizeof(*st));
> + if (!indio_dev)
> + return -ENOMEM;
> +
> + st = iio_priv(indio_dev);
> +
> + st->spi = spi;
> +
> + st->model_data = spi_get_device_match_data(spi);
> + if (!st->model_data)
> + return dev_err_probe(dev, -EINVAL, "Failed to identify device variant\n");
> +
> + ret = devm_regulator_bulk_get_enable(dev, ARRAY_SIZE(ad5529r_supply_names),
> + ad5529r_supply_names);
> + if (ret)
> + return dev_err_probe(dev, ret,
> + "Failed to get and enable regulators\n");
> +
> + ret = devm_regulator_get_enable_optional(dev, "hvss");
> + if (ret && ret != -ENODEV)
> + return dev_err_probe(dev, ret,
> + "Failed to get and enable hvss regulator\n");
> +
> + /*
> + * The datasheet mentions a 4.096V external reference for correct
> + * operation.
> + */
> + ret = devm_regulator_get_enable_optional(dev, "vref");
> + if (ret && ret != -ENODEV)
> + return dev_err_probe(dev, ret,
> + "Failed to get and enable vref regulator\n");
> +
> + external_vref = ret != -ENODEV;
This can be written as
if (ret == -ENODEV)
external_vref = false;
else if (ret)
return dev_err_probe(dev, ret,
"Failed to get and enable vref regulator\n");
else
external_vref = true;
> + st->regmap_8bit = devm_regmap_init_spi(spi, &ad5529r_regmap_8bit_config);
> + if (IS_ERR(st->regmap_8bit))
> + return dev_err_probe(dev, PTR_ERR(st->regmap_8bit),
> + "Failed to initialize 8-bit regmap\n");
> +
> + st->regmap_16bit = devm_regmap_init_spi(spi, &ad5529r_regmap_16bit_config);
> + if (IS_ERR(st->regmap_16bit))
> + return dev_err_probe(dev, PTR_ERR(st->regmap_16bit),
> + "Failed to initialize 16-bit regmap\n");
> +
> + ret = ad5529r_reset(st);
> + if (ret)
> + return dev_err_probe(dev, ret, "Failed to reset device\n");
> +
> + ret = regmap_assign_bits(st->regmap_16bit, AD5529R_REG_REF_SEL,
> + AD5529R_REF_SEL_INTERNAL_REF,
> + external_vref ? 0 : AD5529R_REF_SEL_INTERNAL_REF);
> + if (ret)
> + return dev_err_probe(dev, ret, "Failed to configure reference\n");
> +
> + ret = ad5529r_parse_channel_ranges(dev, st);
> + if (ret)
> + return ret;
> +
> + indio_dev->name = st->model_data->model_name;
> + indio_dev->info = &ad5529r_info;
> + indio_dev->modes = INDIO_DIRECT_MODE;
> + indio_dev->channels = st->model_data->channels;
> + indio_dev->num_channels = st->model_data->num_channels;
> +
> + return devm_iio_device_register(dev, indio_dev);
> +}
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply
* Re: [PATCH v5 00/34] Cleaning up the KVM clock mess
From: David Woodhouse @ 2026-06-09 18:50 UTC (permalink / raw)
To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
Paul Durrant, Jonathan Cameron, Sascha Bischoff, Marc Zyngier,
Joey Gouly, Jack Allister, Dongli Zhang, joe.jin, kvm, linux-doc,
linux-kernel, xen-devel, linux-kselftest
In-Reply-To: <20260608145455.89187-1-dwmw2@infradead.org>
[-- Attachment #1: Type: text/plain, Size: 1716 bytes --]
On Mon, 2026-06-08 at 15:47 +0100, David Woodhouse wrote:
> This is v5 of the series to clean up the KVM clock, rebased onto
> tip/timers/ptp (which now includes Thomas's ktime snapshot series and
> the read_snapshot patches for hyperv, kvmclock, and vmclock).
>
> The KVM clock has historically suffered from three problems:
>
> 1. Imprecision: get_kvmclock_ns() computed the clock from the *host*
> TSC without applying guest TSC scaling, causing systemic drift from
> the values the guest computes from its own TSC.
>
> 2. Unnecessary discontinuities: gratuitous KVM_REQ_MASTERCLOCK_UPDATE
> requests caused the master clock reference point to be re-snapshotted,
> yanking the guest's clock due to arithmetic precision differences.
>
> 3. No precise migration API: the existing KVM_[GS]ET_CLOCK only allows
> setting the clock at a given UTC reference time, which is necessarily
> imprecise. There was no way to preserve the exact arithmetic
> relationship between guest TSC and KVM clock across live migration.
>
> This series addresses all three, and adds new APIs for precise clock
> migration and TSC frequency reporting. As an added bonus, it now rips
> out the whole pvclock_gtod_data hack which was shadowing the kernel's
> timekeeping, and uses ktime snapshots as $DEITY (well, Thomas) intended.
Updated series with Sashiko's fixes pushed out to
https://git.infradead.org/?p=users/dwmw2/linux.git;a=shortlog;h=refs/heads/kvmclock6
I'll give it a while for meat-based reviewers to opine before sending
another round. (Is there a way to get Sashiko to take another look
without sending it all to the list again?)
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* [PATCH net-next] docs: networking: add guidance on what to push via extack
From: Jakub Kicinski @ 2026-06-09 19:09 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, Jakub Kicinski,
corbet, skhan, linux-doc
Every now and then someone tries to duplicated extack
messages to dmesg. Document our guidance against this.
Also indicate that system level faults should continue
to go to system logs. The high level thinking is to try
to distinguish between what's important to the user vs
system admin.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: corbet@lwn.net
CC: skhan@linuxfoundation.org
CC: linux-doc@vger.kernel.org
---
Documentation/networking/driver.rst | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/Documentation/networking/driver.rst b/Documentation/networking/driver.rst
index 195a916dc0de..abd366dd5e43 100644
--- a/Documentation/networking/driver.rst
+++ b/Documentation/networking/driver.rst
@@ -128,3 +128,16 @@ to be freed up.
If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you
must not keep any reference to that SKB and you must not attempt
to free it up.
+
+Error message reporting
+=======================
+
+Number of driver configuration interfaces pass a Netlink extended ACK
+(``extack``) object to the driver (either directly as an argument or
+as a member of a parameter struct). The drivers should try to report
+most errors via the ``extack`` object. System level exceptions,
+indicating that system or device is misbehaving or is in bad state
+should continue to be reported to system logs.
+
+Messages should be passed **either** via ``extack`` **or** to system logs.
+Drivers should not try to report the same information to both.
--
2.54.0
^ permalink raw reply related
* Re: [PATCH v6 08/12] PCI: liveupdate: Inherit ACS flags in incoming preserved devices
From: Pranjal Shrivastava @ 2026-06-09 19:25 UTC (permalink / raw)
To: David Matlack
Cc: kexec, linux-doc, linux-kernel, linux-mm, linux-pci,
Adithya Jayachandran, Alexander Graf, Alex Williamson,
Bjorn Helgaas, Chris Li, David Rientjes, Jacob Pan,
Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
Pratyush Yadav, Saeed Mahameed, Samiullah Khawaja, Shuah Khan,
Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <CALzav=dagHvcS8kbTti5rmMoks9DXuCpO3AjptkQ8z_PdG9JyQ@mail.gmail.com>
On Tue, Jun 09, 2026 at 11:40:39AM -0700, David Matlack wrote:
> On Tue, Jun 9, 2026 at 10:20 AM Pranjal Shrivastava <praan@google.com> wrote:
> >
> > On Mon, Jun 08, 2026 at 09:56:41PM +0000, David Matlack wrote:
> > > On 2026-06-07 08:37 PM, Pranjal Shrivastava wrote:
> > > > On Fri, May 22, 2026 at 08:24:06PM +0000, David Matlack wrote:
> > > > > Inherit Access Control Services (ACS) flags on all incoming preserved
> > > > > devices (endpoints and upstream bridges) during a Live Update.
> > > > >
> > > > > Inheriting ACS flags avoids changing routing rules while memory
> > > > > transactions are in flight from preserved devices. This is also strictly
> > > > > necessary to ensure that IOMMU group assignments do not change across
> > > > > a Live Update for preserved devices, as changing ACS configurations can
> > > > > split or merge IOMMU groups.
> > > > >
> > > > > Cache the inherited ACS controls established by the previous kernel in
> > > > > struct pci_dev so that ACS controls do not change after a reset
> > > > > (pci_restore_state() calls pci_enable_acs()).
> > > > >
> > > > > To simplify ACS inheritance, reject preserving any devices that require
> > > > > quirks to enable ACS as those quirks would also have to take Live Update
> > > > > into account.
> > > > >
> > > > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > > > ---
> > > > > drivers/pci/liveupdate.c | 68 ++++++++++++++++++++++++++++++++++
> > > > > drivers/pci/liveupdate.h | 11 ++++++
> > > > > drivers/pci/pci.c | 5 +++
> > > > > drivers/pci/pci.h | 5 +++
> > > > > drivers/pci/quirks.c | 7 ++++
> > > > > include/linux/pci_liveupdate.h | 6 +++
> > > > > 6 files changed, 102 insertions(+)
> > > > >
> > > >
> > > > [...]
> > > >
> > > > >
> > > > > +void pci_liveupdate_init_acs(struct pci_dev *dev)
> > > > > +{
> > > > > + guard(rwsem_read)(&pci_liveupdate.rwsem);
> > > > > +
> > > > > + if (!dev->acs_cap || !dev->liveupdate.incoming)
> > > > > + return;
> > > > > +
> > > > > + pci_read_config_word(dev, dev->acs_cap + PCI_ACS_CTRL, &dev->liveupdate.acs_ctrl);
> > > >
> > > > I might be thinking out loud here, but as an attacker, this motivates me
> > > > to somehow hack the EP FW to mis-report the PCI_ACS_CTRL register across
> > > > a liveupdate to fool the incoming kernel. If the FW feeds a 0, it silently
> > > > strips ACS protections.
> > > >
> > > > Should we also serialize ACS state in ser somehow to ensure we aren't
> > > > fooled by something like this?
> > >
> > > What does "EP FW" mean?
> >
> > I was referring to the Endpoint Firmware (basically any SW running on
> > a downstream device)
> >
> > >
> > > Does such an attacker even need Live Update to attack the system? It
> > > seems like such an attacker could route TLPs in whatever malicious way
> > > they want regardless of Live Update.
> > >
> >
> > I agree that compromised PCIe devices are a menace anyway. But I was
> > talking about the potential window opened up by Live Update here,
> > suppose we have Device A & B assigned to 2 different VMs (implying they
> > are in separate IOMMU groups because the switch set ACS_RR = 1).
> >
> > Now, the attacker has an opportunity with Liveupdate, since the devices
> > are already assigned, if *somehow* it flips a bit like ACS_RR, the
> > incoming kernel might see both the devices in the same IOMMU group.
> > Who detects this case and what happens if this happens if the devices
> > are kept assigned to these VMs?
>
> I suspect that would be caught during the restore of the iommufds to
> which those devices are attached.
>
> The kernel would attempt to restore each device into a separate domain
> (since that's how they were preserved before the Live Update) but that
> will fail because they are in the same group now. Even if one of the
> devices was not preserved, that will still cause a failure when a user
> tries to start using that device (e.g. to try to attach it to a
> different VM).
Yes, IOMMU would eventually catch-up but what about the DMAs that were
done already? Say to an NVMe disk? We'll have to wipe the entire disk in
such a case? Also, we wouldn't know the offending device..
If such situations aren't a problem, then I guess it's fine.
Thanks,
Praan
^ permalink raw reply
* [PATCH v3 2/3] hwmon: Add update_interval_us chip attribute
From: Ferdinand Schwenk via B4 Relay @ 2026-06-09 19:43 UTC (permalink / raw)
To: Guenter Roeck, Jonathan Corbet, Shuah Khan
Cc: linux-hwmon, linux-kernel, linux-doc, richard.leitner,
Ferdinand Schwenk
In-Reply-To: <20260609-hwmon-ina238-update-interval-us-v2-v3-0-016b55567950@advastore.com>
From: Ferdinand Schwenk <ferdinand.schwenk@advastore.com>
Some hardware monitoring chips support update intervals below one
millisecond. The existing update_interval attribute uses millisecond
granularity, which causes sub-millisecond steps to round to the same
value and become inaccessible from userspace.
Introduce update_interval_us, a companion chip-level attribute that
expresses the same update interval in microseconds. Drivers
implementing this attribute should also implement update_interval for
compatibility with millisecond-based userspace interfaces.
Signed-off-by: Ferdinand Schwenk <ferdinand.schwenk@advastore.com>
---
Documentation/ABI/testing/sysfs-class-hwmon | 14 ++++++++++++++
Documentation/hwmon/sysfs-interface.rst | 4 ++++
drivers/hwmon/hwmon.c | 1 +
include/linux/hwmon.h | 2 ++
4 files changed, 21 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-class-hwmon b/Documentation/ABI/testing/sysfs-class-hwmon
index cfd0d0bab483..b185bdfc7186 100644
--- a/Documentation/ABI/testing/sysfs-class-hwmon
+++ b/Documentation/ABI/testing/sysfs-class-hwmon
@@ -27,6 +27,20 @@ Description:
Some devices have a variable update rate or interval.
This attribute can be used to change it to the desired value.
+What: /sys/class/hwmon/hwmonX/update_interval_us
+Description:
+ The interval at which the chip will update readings,
+ expressed in microseconds.
+ Unit: microsecond
+
+ RW
+
+ Some devices have a variable update rate or interval and
+ require finer-than-millisecond control.
+ This attribute can be used to change it to the desired value.
+ Drivers implementing this attribute should also implement
+ update_interval for millisecond-based userspace interfaces.
+
What: /sys/class/hwmon/hwmonX/inY_min
Description:
Voltage min value.
diff --git a/Documentation/hwmon/sysfs-interface.rst b/Documentation/hwmon/sysfs-interface.rst
index f76e9f8cc1ad..94e1bbce172a 100644
--- a/Documentation/hwmon/sysfs-interface.rst
+++ b/Documentation/hwmon/sysfs-interface.rst
@@ -106,6 +106,10 @@ Global attributes
`update_interval`
The interval at which the chip will update readings.
+`update_interval_us`
+ The interval at which the chip will update readings,
+ expressed in microseconds for finer resolution.
+
********
Voltages
diff --git a/drivers/hwmon/hwmon.c b/drivers/hwmon/hwmon.c
index 9695dca62a7e..c481bb535e35 100644
--- a/drivers/hwmon/hwmon.c
+++ b/drivers/hwmon/hwmon.c
@@ -572,6 +572,7 @@ static const char * const hwmon_chip_attrs[] = {
[hwmon_chip_curr_reset_history] = "curr_reset_history",
[hwmon_chip_power_reset_history] = "power_reset_history",
[hwmon_chip_update_interval] = "update_interval",
+ [hwmon_chip_update_interval_us] = "update_interval_us",
[hwmon_chip_alarms] = "alarms",
[hwmon_chip_samples] = "samples",
[hwmon_chip_curr_samples] = "curr_samples",
diff --git a/include/linux/hwmon.h b/include/linux/hwmon.h
index 301a83afbd66..5a5882fee299 100644
--- a/include/linux/hwmon.h
+++ b/include/linux/hwmon.h
@@ -39,6 +39,7 @@ enum hwmon_chip_attributes {
hwmon_chip_power_reset_history,
hwmon_chip_register_tz,
hwmon_chip_update_interval,
+ hwmon_chip_update_interval_us,
hwmon_chip_alarms,
hwmon_chip_samples,
hwmon_chip_curr_samples,
@@ -55,6 +56,7 @@ enum hwmon_chip_attributes {
#define HWMON_C_POWER_RESET_HISTORY BIT(hwmon_chip_power_reset_history)
#define HWMON_C_REGISTER_TZ BIT(hwmon_chip_register_tz)
#define HWMON_C_UPDATE_INTERVAL BIT(hwmon_chip_update_interval)
+#define HWMON_C_UPDATE_INTERVAL_US BIT(hwmon_chip_update_interval_us)
#define HWMON_C_ALARMS BIT(hwmon_chip_alarms)
#define HWMON_C_SAMPLES BIT(hwmon_chip_samples)
#define HWMON_C_CURR_SAMPLES BIT(hwmon_chip_curr_samples)
--
2.54.0
^ permalink raw reply related
* [PATCH v3 0/3] hwmon: ina238: add samples and update_interval_us support
From: Ferdinand Schwenk via B4 Relay @ 2026-06-09 19:43 UTC (permalink / raw)
To: Guenter Roeck, Jonathan Corbet, Shuah Khan
Cc: linux-hwmon, linux-kernel, linux-doc, richard.leitner,
Ferdinand Schwenk
The INA238 family exposes ADC averaging and conversion timing controls
through ADC_CONFIG. Add support for the samples and update_interval
chip attributes and introduce a generic update_interval_us companion
attribute for devices that need sub-millisecond resolution.
The shortest INA238 conversion time steps are below 1 ms, so several
valid hardware settings collapse to the same update_interval value when
reported only in milliseconds. Keep update_interval in milliseconds as
required by the existing hwmon ABI and provide update_interval_us to
report and program the same total conversion cycle time with microsecond
resolution.
Patch 1 adds samples and update_interval support for the INA238 family.
Patch 2 adds the generic hwmon update_interval_us attribute and
documents it.
Patch 3 wires the new attribute up in the INA238 driver.
Link: https://lore.kernel.org/all/20260522-hwmon-ina238-add-samples-update-interval-v1-0-e1acfceb447e@advastore.com/
---
v2:
- keep update_interval in milliseconds to preserve the existing ABI
- add the generic update_interval_us hwmon chip attribute and documentation
- implement update_interval_us for ina238
- report and program intervals using the active averaging count
v3:
- add missing Signed-off-by trailer to all patches
- address truncation concern by tightening update_interval input clamping in patch 1/3
- preserve equivalent clamping in patch 3/3 after introducing update_interval_us
- Link to v2: https://lore.kernel.org/r/20260608-hwmon-ina238-update-interval-us-v2-v2-0-2d939fbb2ea1@advastore.com
---
Ferdinand Schwenk (3):
hwmon: ina238: add support for samples and update_interval
hwmon: Add update_interval_us chip attribute
hwmon: ina238: add update_interval_us attribute
Documentation/ABI/testing/sysfs-class-hwmon | 14 +++
Documentation/hwmon/ina238.rst | 4 +
Documentation/hwmon/sysfs-interface.rst | 4 +
drivers/hwmon/hwmon.c | 1 +
drivers/hwmon/ina238.c | 162 +++++++++++++++++++++++++++-
include/linux/hwmon.h | 2 +
6 files changed, 185 insertions(+), 2 deletions(-)
---
base-commit: 028ef9c96e96197026887c0f092424679298aae8
change-id: 20260608-hwmon-ina238-update-interval-us-v2-13448de0a083
Best regards,
--
Ferdinand Schwenk <ferdinand.schwenk@advastore.com>
^ permalink raw reply
* [PATCH v3 1/3] hwmon: ina238: add support for samples and update_interval
From: Ferdinand Schwenk via B4 Relay @ 2026-06-09 19:43 UTC (permalink / raw)
To: Guenter Roeck, Jonathan Corbet, Shuah Khan
Cc: linux-hwmon, linux-kernel, linux-doc, richard.leitner,
Ferdinand Schwenk
In-Reply-To: <20260609-hwmon-ina238-update-interval-us-v2-v3-0-016b55567950@advastore.com>
From: Ferdinand Schwenk <ferdinand.schwenk@advastore.com>
Expose INA238 ADC averaging count (AVG) and conversion timing
(VBUSCT/VSHCT/VTCT) through chip-level hwmon attributes:
chip/samples
chip/update_interval
Use per-chip conversion-time lookup tables so the same helpers work
for INA228/INA237/INA238/INA700/INA780 and SQ52206. Cache ADC_CONFIG
in driver data and update it on writes to avoid extra register reads
during read-modify-write updates.
Report update_interval in milliseconds as required by the hwmon ABI.
Compute it from raw ADC cycle time multiplied by the active averaging
count, and apply the inverse mapping on writes so programmed conversion
time tracks the selected sample count.
Clamp user-provided update_interval before unit scaling to prevent
overflow in arithmetic conversions.
Also combine chip attributes in HWMON_CHANNEL_INFO using a bitwise OR
for a single logical chip channel.
Signed-off-by: Ferdinand Schwenk <ferdinand.schwenk@advastore.com>
---
drivers/hwmon/ina238.c | 144 ++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 142 insertions(+), 2 deletions(-)
diff --git a/drivers/hwmon/ina238.c b/drivers/hwmon/ina238.c
index ff67b03189f7..dc5dd3ad2557 100644
--- a/drivers/hwmon/ina238.c
+++ b/drivers/hwmon/ina238.c
@@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/of.h>
#include <linux/regmap.h>
+#include <linux/util_macros.h>
/* INA238 register definitions */
#define INA238_CONFIG 0x0
@@ -49,6 +50,17 @@
#define INA238_DIAG_ALERT_BUSUL BIT(3)
#define INA238_DIAG_ALERT_POL BIT(2)
+/* INA238_ADC_CONFIG register field masks and shifts */
+#define INA238_ADC_CONFIG_MODE_MASK GENMASK(15, 12)
+#define INA238_ADC_CONFIG_VBUSCT_SHIFT 9
+#define INA238_ADC_CONFIG_VBUSCT_MASK GENMASK(11, 9)
+#define INA238_ADC_CONFIG_VSHCT_SHIFT 6
+#define INA238_ADC_CONFIG_VSHCT_MASK GENMASK(8, 6)
+#define INA238_ADC_CONFIG_VTCT_SHIFT 3
+#define INA238_ADC_CONFIG_VTCT_MASK GENMASK(5, 3)
+#define INA238_ADC_CONFIG_AVG_SHIFT 0
+#define INA238_ADC_CONFIG_AVG_MASK GENMASK(2, 0)
+
#define INA238_REGISTERS 0x20
#define INA238_RSHUNT_DEFAULT 2500 /* uOhm */
@@ -101,6 +113,21 @@ static const struct regmap_config ina238_regmap_config = {
.val_bits = 16,
};
+/* Lookup table for conversion times in usec for INA238 family */
+static const u16 ina238_conv_time[] = {
+ 50, 84, 150, 280, 540, 1052, 2074, 4120,
+};
+
+/* Lookup table for conversion times in usec for SQ52206 */
+static const u16 sq52206_conv_time[] = {
+ 66, 118, 310, 566, 1070, 2090, 4140, 8230,
+};
+
+/* Lookup table for number of samples used in averaging mode */
+static const int ina238_avg_samples[] = {
+ 1, 4, 16, 64, 128, 256, 512, 1024,
+};
+
enum ina238_ids { ina228, ina237, ina238, ina700, ina780, sq52206 };
struct ina238_config {
@@ -112,6 +139,7 @@ struct ina238_config {
u32 power_calculate_factor; /* fixed parameter for power calculation, from datasheet */
u32 bus_voltage_lsb; /* bus voltage LSB, in nV */
int current_lsb; /* current LSB, in uA */
+ const u16 *conv_time; /* conversion time lookup table */
};
struct ina238_data {
@@ -124,6 +152,7 @@ struct ina238_data {
int current_lsb; /* current LSB, in uA */
int power_lsb; /* power LSB, in uW */
int energy_lsb; /* energy LSB, in uJ */
+ u16 adc_config; /* cached ADC_CONFIG register value */
};
static const struct ina238_config ina238_config[] = {
@@ -135,6 +164,7 @@ static const struct ina238_config ina238_config[] = {
.config_default = INA238_CONFIG_DEFAULT,
.bus_voltage_lsb = INA238_BUS_VOLTAGE_LSB,
.temp_resolution = 16,
+ .conv_time = ina238_conv_time,
},
[ina237] = {
.has_20bit_voltage_current = false,
@@ -144,6 +174,7 @@ static const struct ina238_config ina238_config[] = {
.config_default = INA238_CONFIG_DEFAULT,
.bus_voltage_lsb = INA238_BUS_VOLTAGE_LSB,
.temp_resolution = 12,
+ .conv_time = ina238_conv_time,
},
[ina238] = {
.has_20bit_voltage_current = false,
@@ -153,6 +184,7 @@ static const struct ina238_config ina238_config[] = {
.config_default = INA238_CONFIG_DEFAULT,
.bus_voltage_lsb = INA238_BUS_VOLTAGE_LSB,
.temp_resolution = 12,
+ .conv_time = ina238_conv_time,
},
[ina700] = {
.has_20bit_voltage_current = false,
@@ -163,6 +195,7 @@ static const struct ina238_config ina238_config[] = {
.bus_voltage_lsb = INA238_BUS_VOLTAGE_LSB,
.temp_resolution = 12,
.current_lsb = 480,
+ .conv_time = ina238_conv_time,
},
[ina780] = {
.has_20bit_voltage_current = false,
@@ -173,6 +206,7 @@ static const struct ina238_config ina238_config[] = {
.bus_voltage_lsb = INA238_BUS_VOLTAGE_LSB,
.temp_resolution = 12,
.current_lsb = 2400,
+ .conv_time = ina238_conv_time,
},
[sq52206] = {
.has_20bit_voltage_current = false,
@@ -182,6 +216,7 @@ static const struct ina238_config ina238_config[] = {
.config_default = SQ52206_CONFIG_DEFAULT,
.bus_voltage_lsb = SQ52206_BUS_VOLTAGE_LSB,
.temp_resolution = 16,
+ .conv_time = sq52206_conv_time,
},
};
@@ -261,6 +296,97 @@ static int ina228_read_voltage(struct ina238_data *data, int channel, long *val)
return 0;
}
+/* Converting ADC_CONFIG register value to update_interval in usec */
+static inline u32 ina238_reg_to_interval_us(struct ina238_data *data)
+{
+ const u16 *ct = data->config->conv_time;
+ u32 vbusct = ct[(data->adc_config & INA238_ADC_CONFIG_VBUSCT_MASK) >>
+ INA238_ADC_CONFIG_VBUSCT_SHIFT];
+ u32 vshct = ct[(data->adc_config & INA238_ADC_CONFIG_VSHCT_MASK) >>
+ INA238_ADC_CONFIG_VSHCT_SHIFT];
+ u32 vtct = ct[(data->adc_config & INA238_ADC_CONFIG_VTCT_MASK) >>
+ INA238_ADC_CONFIG_VTCT_SHIFT];
+
+ return vbusct + vshct + vtct;
+}
+
+static inline u32 ina238_samples(struct ina238_data *data)
+{
+ return ina238_avg_samples[(data->adc_config & INA238_ADC_CONFIG_AVG_MASK) >>
+ INA238_ADC_CONFIG_AVG_SHIFT];
+}
+
+/* Converting update_interval in msec to a single conversion time in usec */
+static inline u32 ina238_interval_ms_to_conv_time(long interval, u32 samples)
+{
+ u64 interval_us;
+
+ interval = clamp_val(interval, 0, INT_MAX / 1000);
+ interval_us = (u64)interval * 1000;
+
+ /*
+ * update_interval reports the ADC cycle time including averaging.
+ * The target per-field conversion time is interval_us / (samples * 3).
+ */
+ return DIV_ROUND_CLOSEST_ULL(interval_us, samples * 3);
+}
+
+static int ina238_read_chip(struct device *dev, u32 attr, long *val)
+{
+ struct ina238_data *data = dev_get_drvdata(dev);
+
+ switch (attr) {
+ case hwmon_chip_samples:
+ *val = ina238_samples(data);
+ return 0;
+ case hwmon_chip_update_interval:
+ /* Return in msec */
+ *val = DIV_ROUND_CLOSEST(ina238_reg_to_interval_us(data) *
+ ina238_samples(data), 1000);
+ return 0;
+ default:
+ return -EOPNOTSUPP;
+ }
+}
+
+static int ina238_write_chip(struct device *dev, u32 attr, long val)
+{
+ struct ina238_data *data = dev_get_drvdata(dev);
+ u16 adc_config;
+ int idx, ret;
+
+ switch (attr) {
+ case hwmon_chip_samples:
+ idx = find_closest(val, ina238_avg_samples,
+ ARRAY_SIZE(ina238_avg_samples));
+ adc_config = (data->adc_config & ~INA238_ADC_CONFIG_AVG_MASK) |
+ (idx << INA238_ADC_CONFIG_AVG_SHIFT);
+ ret = regmap_write(data->regmap, INA238_ADC_CONFIG, adc_config);
+ if (ret)
+ return ret;
+ data->adc_config = adc_config;
+ return 0;
+ case hwmon_chip_update_interval:
+ val = ina238_interval_ms_to_conv_time(val, ina238_samples(data));
+ idx = find_closest(val, data->config->conv_time,
+ ARRAY_SIZE(ina238_conv_time));
+ adc_config = (data->adc_config &
+ ~(INA238_ADC_CONFIG_VBUSCT_MASK |
+ INA238_ADC_CONFIG_VSHCT_MASK |
+ INA238_ADC_CONFIG_VTCT_MASK)) |
+ ((u16)idx << INA238_ADC_CONFIG_VBUSCT_SHIFT) |
+ ((u16)idx << INA238_ADC_CONFIG_VSHCT_SHIFT) |
+ ((u16)idx << INA238_ADC_CONFIG_VTCT_SHIFT);
+ ret = regmap_write(data->regmap, INA238_ADC_CONFIG, adc_config);
+ if (ret)
+ return ret;
+ data->adc_config = adc_config;
+ return 0;
+ default:
+ return -EOPNOTSUPP;
+ }
+}
+
static int ina238_read_in(struct device *dev, u32 attr, int channel,
long *val)
{
@@ -587,6 +713,8 @@ static int ina238_read(struct device *dev, enum hwmon_sensor_types type,
u32 attr, int channel, long *val)
{
switch (type) {
+ case hwmon_chip:
+ return ina238_read_chip(dev, attr, val);
case hwmon_in:
return ina238_read_in(dev, attr, channel, val);
case hwmon_curr:
@@ -607,6 +735,8 @@ static int ina238_write(struct device *dev, enum hwmon_sensor_types type,
u32 attr, int channel, long val)
{
switch (type) {
+ case hwmon_chip:
+ return ina238_write_chip(dev, attr, val);
case hwmon_in:
return ina238_write_in(dev, attr, channel, val);
case hwmon_curr:
@@ -629,6 +759,14 @@ static umode_t ina238_is_visible(const void *drvdata,
bool has_energy = data->config->has_energy;
switch (type) {
+ case hwmon_chip:
+ switch (attr) {
+ case hwmon_chip_samples:
+ case hwmon_chip_update_interval:
+ return 0644;
+ default:
+ return 0;
+ }
case hwmon_in:
switch (attr) {
case hwmon_in_input:
@@ -692,6 +830,8 @@ static umode_t ina238_is_visible(const void *drvdata,
HWMON_I_MIN | HWMON_I_MIN_ALARM)
static const struct hwmon_channel_info * const ina238_info[] = {
+ HWMON_CHANNEL_INFO(chip,
+ HWMON_C_SAMPLES | HWMON_C_UPDATE_INTERVAL),
HWMON_CHANNEL_INFO(in,
/* 0: shunt voltage */
INA238_HWMON_IN_CONFIG,
@@ -798,8 +938,8 @@ static int ina238_probe(struct i2c_client *client)
}
/* Setup ADC_CONFIG register */
- ret = regmap_write(data->regmap, INA238_ADC_CONFIG,
- INA238_ADC_CONFIG_DEFAULT);
+ data->adc_config = INA238_ADC_CONFIG_DEFAULT;
+ ret = regmap_write(data->regmap, INA238_ADC_CONFIG, data->adc_config);
if (ret < 0) {
dev_err(dev, "error configuring the device: %d\n", ret);
return -ENODEV;
--
2.54.0
^ permalink raw reply related
* [PATCH v3 3/3] hwmon: ina238: add update_interval_us attribute
From: Ferdinand Schwenk via B4 Relay @ 2026-06-09 19:43 UTC (permalink / raw)
To: Guenter Roeck, Jonathan Corbet, Shuah Khan
Cc: linux-hwmon, linux-kernel, linux-doc, richard.leitner,
Ferdinand Schwenk
In-Reply-To: <20260609-hwmon-ina238-update-interval-us-v2-v3-0-016b55567950@advastore.com>
From: Ferdinand Schwenk <ferdinand.schwenk@advastore.com>
The INA238 family supports eight conversion time steps from 50 us to
4120 us (SQ52206: 66 us to 8230 us). At the millisecond granularity of
update_interval, the four shortest steps (50, 84, 150, 280 us) all
round to the same value and cannot be individually selected.
Add support for the generic update_interval_us attribute, which reports
and programs the same ADC cycle time as update_interval but in
microseconds, giving userspace full access to all conversion time steps.
Both attributes reflect the total cycle time including the active
averaging count: the reported value is the raw conversion time
multiplied by the number of averaged samples, and writes apply the
inverse mapping.
Signed-off-by: Ferdinand Schwenk <ferdinand.schwenk@advastore.com>
---
Documentation/hwmon/ina238.rst | 4 +++
drivers/hwmon/ina238.c | 70 ++++++++++++++++++++++++++----------------
2 files changed, 48 insertions(+), 26 deletions(-)
diff --git a/Documentation/hwmon/ina238.rst b/Documentation/hwmon/ina238.rst
index 43950d1ec551..a75b79e17d9d 100644
--- a/Documentation/hwmon/ina238.rst
+++ b/Documentation/hwmon/ina238.rst
@@ -106,4 +106,8 @@ energy1_input Energy measurement (uJ)
temp1_input Die temperature measurement (mC)
temp1_max Maximum die temperature threshold (mC)
temp1_max_alarm Maximum die temperature alarm
+
+samples ADC averaging count (1, 4, 16, 64, 128, 256, 512, 1024)
+update_interval Total ADC conversion cycle time including averaging (ms)
+update_interval_us Total ADC conversion cycle time including averaging (us)
======================= =======================================================
diff --git a/drivers/hwmon/ina238.c b/drivers/hwmon/ina238.c
index dc5dd3ad2557..080a93fcc9f7 100644
--- a/drivers/hwmon/ina238.c
+++ b/drivers/hwmon/ina238.c
@@ -316,19 +316,36 @@ static inline u32 ina238_samples(struct ina238_data *data)
INA238_ADC_CONFIG_AVG_SHIFT];
}
-/* Converting update_interval in msec to a single conversion time in usec */
-static inline u32 ina238_interval_ms_to_conv_time(long interval, u32 samples)
+/* Converting update_interval(_us) to a per-field conversion time in usec.
+ * interval_us is the total ADC cycle time including averaging in microseconds.
+ * All three conversion fields (VBUSCT, VSHCT, VTCT) are set equal, so the
+ * per-field time is interval_us / (samples * 3).
+ */
+static inline u32 ina238_interval_us_to_conv_time(u32 interval_us, u32 samples)
{
- u64 interval_us;
+ return DIV_ROUND_CLOSEST_ULL(interval_us, samples * 3);
+}
- interval = clamp_val(interval, 0, INT_MAX / 1000);
- interval_us = (u64)interval * 1000;
+/* Write a per-field conversion time (in usec) to the ADC_CONFIG register */
+static int ina238_write_conv_time(struct ina238_data *data, u32 conv_time_us)
+{
+ u16 adc_config;
+ int idx, ret;
- /*
- * update_interval reports the ADC cycle time including averaging.
- * The target per-field conversion time is interval_us / (samples * 3).
- */
- return DIV_ROUND_CLOSEST_ULL(interval_us, samples * 3);
+ idx = find_closest(conv_time_us, data->config->conv_time,
+ ARRAY_SIZE(ina238_conv_time));
+ adc_config = (data->adc_config &
+ ~(INA238_ADC_CONFIG_VBUSCT_MASK |
+ INA238_ADC_CONFIG_VSHCT_MASK |
+ INA238_ADC_CONFIG_VTCT_MASK)) |
+ ((u16)idx << INA238_ADC_CONFIG_VBUSCT_SHIFT) |
+ ((u16)idx << INA238_ADC_CONFIG_VSHCT_SHIFT) |
+ ((u16)idx << INA238_ADC_CONFIG_VTCT_SHIFT);
+ ret = regmap_write(data->regmap, INA238_ADC_CONFIG, adc_config);
+ if (ret)
+ return ret;
+ data->adc_config = adc_config;
+ return 0;
}
static int ina238_read_chip(struct device *dev, u32 attr, long *val)
@@ -344,6 +361,10 @@ static int ina238_read_chip(struct device *dev, u32 attr, long *val)
*val = DIV_ROUND_CLOSEST(ina238_reg_to_interval_us(data) *
ina238_samples(data), 1000);
return 0;
+ case hwmon_chip_update_interval_us:
+ /* Return in usec */
+ *val = ina238_reg_to_interval_us(data) * ina238_samples(data);
+ return 0;
default:
return -EOPNOTSUPP;
}
@@ -367,21 +388,16 @@ static int ina238_write_chip(struct device *dev, u32 attr, long val)
data->adc_config = adc_config;
return 0;
case hwmon_chip_update_interval:
- val = ina238_interval_ms_to_conv_time(val, ina238_samples(data));
- idx = find_closest(val, data->config->conv_time,
- ARRAY_SIZE(ina238_conv_time));
- adc_config = (data->adc_config &
- ~(INA238_ADC_CONFIG_VBUSCT_MASK |
- INA238_ADC_CONFIG_VSHCT_MASK |
- INA238_ADC_CONFIG_VTCT_MASK)) |
- ((u16)idx << INA238_ADC_CONFIG_VBUSCT_SHIFT) |
- ((u16)idx << INA238_ADC_CONFIG_VSHCT_SHIFT) |
- ((u16)idx << INA238_ADC_CONFIG_VTCT_SHIFT);
- ret = regmap_write(data->regmap, INA238_ADC_CONFIG, adc_config);
- if (ret)
- return ret;
- data->adc_config = adc_config;
- return 0;
+ /* Convert ms to us before passing to the shared helper */
+ val = clamp_val(val, 0, INT_MAX / 1000) * 1000;
+ return ina238_write_conv_time(data,
+ ina238_interval_us_to_conv_time((u32)val,
+ ina238_samples(data)));
+ case hwmon_chip_update_interval_us:
+ val = clamp_val(val, 0, INT_MAX);
+ return ina238_write_conv_time(data,
+ ina238_interval_us_to_conv_time((u32)val,
+ ina238_samples(data)));
default:
return -EOPNOTSUPP;
}
@@ -763,6 +779,7 @@ static umode_t ina238_is_visible(const void *drvdata,
switch (attr) {
case hwmon_chip_samples:
case hwmon_chip_update_interval:
+ case hwmon_chip_update_interval_us:
return 0644;
default:
return 0;
@@ -831,7 +848,8 @@ static umode_t ina238_is_visible(const void *drvdata,
static const struct hwmon_channel_info * const ina238_info[] = {
HWMON_CHANNEL_INFO(chip,
- HWMON_C_SAMPLES | HWMON_C_UPDATE_INTERVAL),
+ HWMON_C_SAMPLES | HWMON_C_UPDATE_INTERVAL |
+ HWMON_C_UPDATE_INTERVAL_US),
HWMON_CHANNEL_INFO(in,
/* 0: shunt voltage */
INA238_HWMON_IN_CONFIG,
--
2.54.0
^ permalink raw reply related
* Re: [PATCH net-next] docs: networking: add guidance on what to push via extack
From: Joe Damato @ 2026-06-09 19:51 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, corbet,
skhan, linux-doc
In-Reply-To: <20260609190919.1139517-1-kuba@kernel.org>
On Tue, Jun 09, 2026 at 12:09:19PM -0700, Jakub Kicinski wrote:
> Every now and then someone tries to duplicated extack
> messages to dmesg. Document our guidance against this.
> Also indicate that system level faults should continue
> to go to system logs. The high level thinking is to try
> to distinguish between what's important to the user vs
> system admin.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> CC: corbet@lwn.net
> CC: skhan@linuxfoundation.org
> CC: linux-doc@vger.kernel.org
> ---
> Documentation/networking/driver.rst | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/Documentation/networking/driver.rst b/Documentation/networking/driver.rst
> index 195a916dc0de..abd366dd5e43 100644
> --- a/Documentation/networking/driver.rst
> +++ b/Documentation/networking/driver.rst
> @@ -128,3 +128,16 @@ to be freed up.
> If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you
> must not keep any reference to that SKB and you must not attempt
> to free it up.
> +
> +Error message reporting
> +=======================
> +
> +Number of driver configuration interfaces pass a Netlink extended ACK
Maybe I'm reading it wrong, but seems like it should be: "A number of" ?
Reviewed-by: Joe Damato <joe@dama.to>
^ permalink raw reply
* Re: [PATCH v17 00/28] Add new general DRM property "color format"
From: Daniel Stone @ 2026-06-09 20:11 UTC (permalink / raw)
To: Nicolas Frattaroli
Cc: Harry Wentland, Leo Li, Rodrigo Siqueira, Alex Deucher,
Christian König, David Airlie, Simona Vetter,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
Andrzej Hajda, Neil Armstrong, Robert Foss, Laurent Pinchart,
Jonas Karlman, Jernej Skrabec, Sandy Huang, Heiko Stübner,
Andy Yan, Jani Nikula, Rodrigo Vivi, Joonas Lahtinen,
Tvrtko Ursulin, Dmitry Baryshkov, Sascha Hauer, Rob Herring,
Jonathan Corbet, Shuah Khan, kernel, amd-gfx, dri-devel,
linux-kernel, linux-arm-kernel, linux-rockchip, intel-gfx,
intel-xe, linux-doc, wayland-devel, Werner Sembach,
Andri Yngvason, Cristian Ciocaltea, Marius Vlad, Dmitry Baryshkov,
Andy Yan
In-Reply-To: <20260609-color-format-v17-0-35739b5782cc@collabora.com>
Hi Nicolas,
On Tue, 9 Jun 2026 at 13:44, Nicolas Frattaroli
<nicolas.frattaroli@collabora.com> wrote:
> this is a follow-up to
> https://lore.kernel.org/all/20250911130739.4936-1-marius.vlad@collabora.com/
> which in of itself is a follow-up to
> https://lore.kernel.org/dri-devel/20240115160554.720247-1-andri@yngvason.is/ where
> a new DRM connector property has been added allowing users to
> force a particular color format.
>
> That in turn was actually also a follow-up from Werner Sembach's posted at
> https://lore.kernel.org/dri-devel/20210630151018.330354-1-wse@tuxedocomputers.com/
>
> As the number of cooks have reached critical mass, I'm hoping I'll be
> the last person to touch this particular series.
Thanks for seeing this through!
I've pushed this now, minus the Intel patches which they can merge
through their own tree. The AMD tree required a pretty trivial
conflict resolution which seems to work OK here.
Cheers,
Daniel
^ permalink raw reply
* [PATCH net-next 0/3] docs: net: more adjustments to docs
From: Jakub Kicinski @ 2026-06-09 20:12 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, corbet, linux-doc,
bpf, Jakub Kicinski
A few small updates to the docs. Mostly typos this time.
This is trying to prepare docs for getting fed directly
into AI reviews.
Jakub Kicinski (3):
docs: net: fix minor issues with XDP metadata docs
docs: net: tls-offload: document tls_dev_del, tls_dev_resync, and
rekey
docs: net: fix minor issues with devlink docs
.../networking/devlink/devlink-health.rst | 12 +++++---
.../networking/devlink/devlink-params.rst | 2 +-
.../networking/devlink/devlink-port.rst | 5 +++-
.../networking/devlink/devlink-trap.rst | 8 +++--
Documentation/networking/devlink/index.rst | 10 +++----
Documentation/networking/tls-offload.rst | 29 ++++++++++++++++++
Documentation/networking/xdp-rx-metadata.rst | 2 +-
Documentation/networking/xsk-tx-metadata.rst | 30 ++++++++++---------
8 files changed, 69 insertions(+), 29 deletions(-)
--
2.54.0
^ permalink raw reply
* [PATCH net-next 1/3] docs: net: fix minor issues with XDP metadata docs
From: Jakub Kicinski @ 2026-06-09 20:12 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, corbet, linux-doc,
bpf, Jakub Kicinski, skhan, ast, daniel, hawk, john.fastabend,
sdf
In-Reply-To: <20260609201224.1191391-1-kuba@kernel.org>
Minor updates to the XDP metadata documentation:
- s/union/struct/ for xsk_tx_metadata
- document nested request and completion metadata fields
- point capability queries at the xsk-features attribute
- fix grammar in the XDP RX metadata guide
- typos
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: corbet@lwn.net
CC: skhan@linuxfoundation.org
CC: ast@kernel.org
CC: daniel@iogearbox.net
CC: hawk@kernel.org
CC: john.fastabend@gmail.com
CC: sdf@fomichev.me
CC: linux-doc@vger.kernel.org
CC: bpf@vger.kernel.org
---
Documentation/networking/xdp-rx-metadata.rst | 2 +-
Documentation/networking/xsk-tx-metadata.rst | 30 +++++++++++---------
2 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
index ce96f4c99505..efdf5eeb49e7 100644
--- a/Documentation/networking/xdp-rx-metadata.rst
+++ b/Documentation/networking/xdp-rx-metadata.rst
@@ -36,7 +36,7 @@ metadata available in which case the driver returns ``-ENODATA``.
Not all kfuncs have to be implemented by the device driver; when not
implemented, the default ones that return ``-EOPNOTSUPP`` will be used
-to indicate the device driver have not implemented this kfunc.
+to indicate the device driver has not implemented this kfunc.
Within an XDP frame, the metadata layout (accessed via ``xdp_buff``) is
diff --git a/Documentation/networking/xsk-tx-metadata.rst b/Documentation/networking/xsk-tx-metadata.rst
index df53a10ccac3..f240930ca1df 100644
--- a/Documentation/networking/xsk-tx-metadata.rst
+++ b/Documentation/networking/xsk-tx-metadata.rst
@@ -14,9 +14,9 @@ General Design
The headroom for the metadata is reserved via ``tx_metadata_len`` and
``XDP_UMEM_TX_METADATA_LEN`` flag in ``struct xdp_umem_reg``. The metadata
length is therefore the same for every socket that shares the same umem.
-The metadata layout is a fixed UAPI, refer to ``union xsk_tx_metadata`` in
+The metadata layout is a fixed UAPI, refer to ``struct xsk_tx_metadata`` in
``include/uapi/linux/if_xdp.h``. Thus, generally, the ``tx_metadata_len``
-field above should contain ``sizeof(union xsk_tx_metadata)``.
+field above should contain ``sizeof(struct xsk_tx_metadata)``.
Note that in the original implementation the ``XDP_UMEM_TX_METADATA_LEN``
flag was not required. Applications might attempt to create a umem
@@ -45,15 +45,16 @@ the metadata area is ignored by the kernel as well.
The flags field enables the particular offload:
- ``XDP_TXMD_FLAGS_TIMESTAMP``: requests the device to put transmission
- timestamp into ``tx_timestamp`` field of ``union xsk_tx_metadata``.
+ timestamp into ``completion.tx_timestamp`` field of
+ ``struct xsk_tx_metadata``.
- ``XDP_TXMD_FLAGS_CHECKSUM``: requests the device to calculate L4
- checksum. ``csum_start`` specifies byte offset of where the checksumming
- should start and ``csum_offset`` specifies byte offset where the
- device should store the computed checksum.
+ checksum. ``request.csum_start`` specifies byte offset of where the
+ checksumming should start and ``request.csum_offset`` specifies byte offset
+ where the device should store the computed checksum.
- ``XDP_TXMD_FLAGS_LAUNCH_TIME``: requests the device to schedule the
packet for transmission at a pre-determined time called launch time. The
- value of launch time is indicated by ``launch_time`` field of
- ``union xsk_tx_metadata``.
+ value of launch time is indicated by ``request.launch_time`` field of
+ ``struct xsk_tx_metadata``.
Besides the flags above, in order to trigger the offloads, the first
packet's ``struct xdp_desc`` descriptor should set ``XDP_TX_METADATA``
@@ -63,9 +64,9 @@ only the first chunk should carry the metadata.
Software TX Checksum
====================
-For development and testing purposes its possible to pass
+For development and testing purposes it's possible to pass
``XDP_UMEM_TX_SW_CSUM`` flag to ``XDP_UMEM_REG`` UMEM registration call.
-In this case, when running in ``XDK_COPY`` mode, the TX checksum
+In this case, when running in ``XDP_COPY`` mode, the TX checksum
is calculated on the CPU. Do not enable this option in production because
it will negatively affect performance.
@@ -118,7 +119,7 @@ schedule with a 1-second cycle time, with all Tx Queues open at all times.
The value of the launch time that is programmed in the Advanced Transmit
Context Descriptor is a relative offset to the starting time of the Qbv
-transmission window of the queue. The Frst flag of the descriptor can be
+transmission window of the queue. The First flag of the descriptor can be
set to schedule the packet for the next Qbv cycle. Therefore, the horizon
of the launch time for i225 and i226 is the ending time of the next cycle
of the Qbv transmission window of the queue. For example, when the Qbv
@@ -129,9 +130,10 @@ running.
Querying Device Capabilities
============================
-Every devices exports its offloads capabilities via netlink netdev family.
-Refer to ``xsk-flags`` features bitmask in
-``Documentation/netlink/specs/netdev.yaml``.
+Every device exports its offload capabilities via the Netlink netdev family.
+Query the ``xsk-features`` attribute in
+``Documentation/netlink/specs/netdev.yaml``. Its bits are defined by the
+``xsk-flags`` enum.
- ``tx-timestamp``: device supports ``XDP_TXMD_FLAGS_TIMESTAMP``
- ``tx-checksum``: device supports ``XDP_TXMD_FLAGS_CHECKSUM``
--
2.54.0
^ permalink raw reply related
* [PATCH net-next 2/3] docs: net: tls-offload: document tls_dev_del, tls_dev_resync, and rekey
From: Jakub Kicinski @ 2026-06-09 20:12 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, corbet, linux-doc,
bpf, Jakub Kicinski, john.fastabend, sd, skhan
In-Reply-To: <20260609201224.1191391-1-kuba@kernel.org>
Fill in some gaps in the TLS offload doc:
- describe the tls_dev_del and tls_dev_resync callbacks
- add a mention of rekeying being out of scope for now
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: john.fastabend@gmail.com
CC: sd@queasysnail.net
CC: corbet@lwn.net
CC: skhan@linuxfoundation.org
CC: linux-doc@vger.kernel.org
---
Documentation/networking/tls-offload.rst | 29 ++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
index c173f537bf4d..a41f46885e8c 100644
--- a/Documentation/networking/tls-offload.rst
+++ b/Documentation/networking/tls-offload.rst
@@ -104,6 +104,29 @@ at the end of kernel structures (see :c:member:`driver_state` members
in ``include/net/tls.h``) to avoid additional allocations and pointer
dereferences.
+When the offloaded connection is destroyed the core calls
+the :c:member:`tls_dev_del` callback so the driver can release per-direction
+state:
+
+.. code-block:: c
+
+ void (*tls_dev_del)(struct net_device *netdev,
+ struct tls_context *ctx,
+ enum tls_offload_ctx_dir direction);
+
+``tls_dev_del`` is mandatory whenever ``tls_dev_add`` is provided.
+
+The third TLS device callback is :c:member:`tls_dev_resync`, called by the core
+to synchronize the TCP stream with the record boundaries:
+
+.. code-block:: c
+
+ int (*tls_dev_resync)(struct net_device *netdev,
+ struct sock *sk, u32 seq, u8 *rcd_sn,
+ enum tls_offload_ctx_dir direction);
+
+See the `Resync handling`_ section for details.
+
TX
--
@@ -381,6 +404,12 @@ synchronization with an exponential back off (first after 2 encrypted
records, then after 4 records, after 8, after 16... up until every
128 records).
+Rekey
+=====
+
+Offload does not currently support TLS 1.3, therefore key rotation
+is not a concern for offloaded connections at this point.
+
Error handling
==============
--
2.54.0
^ permalink raw reply related
* [PATCH net-next 3/3] docs: net: fix minor issues with devlink docs
From: Jakub Kicinski @ 2026-06-09 20:12 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, corbet, linux-doc,
bpf, Jakub Kicinski, jiri, skhan
In-Reply-To: <20260609201224.1191391-1-kuba@kernel.org>
Update devlink documentation to match current code:
- describe health reporter defaults (it's currently under "callbacks"),
best-effort auto-dump, and port-scoped reporters
- fix generic parameter names and values
- fix nested devlink setup wording and registration ordering
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: jiri@resnulli.us
CC: corbet@lwn.net
CC: skhan@linuxfoundation.org
CC: linux-doc@vger.kernel.org
---
Documentation/networking/devlink/devlink-health.rst | 12 ++++++++----
Documentation/networking/devlink/devlink-params.rst | 2 +-
Documentation/networking/devlink/devlink-port.rst | 5 ++++-
Documentation/networking/devlink/devlink-trap.rst | 8 +++++---
Documentation/networking/devlink/index.rst | 10 +++++-----
5 files changed, 23 insertions(+), 14 deletions(-)
diff --git a/Documentation/networking/devlink/devlink-health.rst b/Documentation/networking/devlink/devlink-health.rst
index 4d10536377ab..bedac58a2f36 100644
--- a/Documentation/networking/devlink/devlink-health.rst
+++ b/Documentation/networking/devlink/devlink-health.rst
@@ -33,7 +33,9 @@ asynchronously. All health reports handling is done by ``devlink``.
* Recovery procedures
* Diagnostics procedures
* Object dump procedures
- * Out Of Box initial parameters
+
+Drivers also provide default values for generic reporter parameters when
+creating a health reporter.
Different parts of the driver can register different types of health reporters
with different handlers.
@@ -45,8 +47,9 @@ Actions
* A log is being send to the kernel trace events buffer
* Health status and statistics are being updated for the reporter instance
- * Object dump is being taken and saved at the reporter instance (as long as
- auto-dump is set and there is no other dump which is already stored)
+ * Object dump is being taken and saved at the reporter instance. This is
+ best effort and skipped when recovery is aborted, auto-dump is disabled,
+ no dump callback is registered, or a dump is already stored.
* Auto recovery attempt is being done. Depends on:
- Auto-recovery configuration
@@ -75,7 +78,8 @@ User Interface
==============
User can access/change each reporter's parameters and driver specific callbacks
-via ``devlink``, e.g per error type (per health reporter):
+via ``devlink``, e.g. per error type (per health reporter). Reporters may be
+registered for the whole devlink instance or for a specific devlink port.
* Configure reporter's generic parameters (like: disable/enable auto recovery)
* Invoke recovery procedure
diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst
index ea17756dcda6..ca19ee3e63c8 100644
--- a/Documentation/networking/devlink/devlink-params.rst
+++ b/Documentation/networking/devlink/devlink-params.rst
@@ -122,7 +122,7 @@ own name.
* - ``enable_iwarp``
- Boolean
- Enable handling of iWARP traffic in the device.
- * - ``internal_err_reset``
+ * - ``internal_error_reset``
- Boolean
- When enabled, the device driver will reset the device on internal
errors.
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
index 5e397798a402..9374ebe70f48 100644
--- a/Documentation/networking/devlink/devlink-port.rst
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -38,7 +38,7 @@ Devlink port flavours are described below.
- This indicates an eswitch port representing a port of PCI
subfunction (SF).
* - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
- - This indicates a virtual port for the PCI virtual function.
+ - Any virtual port facing the user.
Devlink port can have a different type based on the link layer described below.
@@ -134,6 +134,9 @@ Users may also set the IPsec crypto capability of the function using
Users may also set the IPsec packet capability of the function using
`devlink port function set ipsec_packet` command.
+The ``migratable`` attribute may be set only on ports with
+``DEVLINK_PORT_FLAVOUR_PCI_VF``.
+
Users may also set the maximum IO event queues of the function
using `devlink port function set max_io_eqs` command.
diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst
index 5885e21e2212..ac5bf9337198 100644
--- a/Documentation/networking/devlink/devlink-trap.rst
+++ b/Documentation/networking/devlink/devlink-trap.rst
@@ -516,9 +516,11 @@ Generic Packet Trap Groups
Generic packet trap groups are used to aggregate logically related packet
traps. These groups allow the user to batch operations such as setting the trap
-action of all member traps. In addition, ``devlink-trap`` can report aggregated
-per-group packets and bytes statistics, in case per-trap statistics are too
-narrow. The description of these groups must be added to the following table:
+action of all member drop traps whose action may legally change. Exception and
+control traps remain unchanged. In addition, ``devlink-trap`` can report
+aggregated per-group packets and bytes statistics, in case per-trap statistics
+are too narrow. The description of these groups must be added to the following
+table:
.. list-table:: List of Generic Packet Trap Groups
:widths: 10 90
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index f7ba7dcf477d..32f70879ddd0 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -13,8 +13,8 @@ new APIs prefixed by ``devl_*``. The older APIs handle all the locking
in devlink core, but don't allow registration of most sub-objects once
the main devlink object is itself registered. The newer ``devl_*`` APIs assume
the devlink instance lock is already held. Drivers can take the instance
-lock by calling ``devl_lock()``. It is also held all callbacks of devlink
-netlink commands.
+lock by calling ``devl_lock()``. It is also held across all callbacks of
+devlink netlink commands.
Drivers are encouraged to use the devlink instance lock for their own needs.
@@ -33,11 +33,11 @@ devlink instances created underneath. In that case, drivers should make
lock of both nested and parent instances at the same time, devlink
instance lock of the parent instance should be taken first, only then
instance lock of the nested instance could be taken.
- - Driver should use object-specific helpers to setup the
- nested relationship:
+ - Driver should use object-specific helpers to setup the nested relationship
+ before registering the nested devlink instance:
- ``devl_nested_devlink_set()`` - called to setup devlink -> nested
- devlink relationship (could be user for multiple nested instances.
+ devlink relationship (could be used for multiple nested instances).
- ``devl_port_fn_devlink_set()`` - called to setup port function ->
nested devlink relationship.
- ``devlink_linecard_nested_dl_set()`` - called to setup linecard ->
--
2.54.0
^ permalink raw reply related
* Re: [PATCH v3 1/3] hwmon: ina238: add support for samples and update_interval
From: Guenter Roeck @ 2026-06-09 20:44 UTC (permalink / raw)
To: Ferdinand Schwenk
Cc: Jonathan Corbet, Shuah Khan, linux-hwmon, linux-kernel, linux-doc,
richard.leitner
In-Reply-To: <20260609-hwmon-ina238-update-interval-us-v2-v3-1-016b55567950@advastore.com>
On Tue, Jun 09, 2026 at 09:43:10PM +0200, Ferdinand Schwenk wrote:
> From: Ferdinand Schwenk <ferdinand.schwenk@advastore.com>
>
> Expose INA238 ADC averaging count (AVG) and conversion timing
> (VBUSCT/VSHCT/VTCT) through chip-level hwmon attributes:
>
> chip/samples
> chip/update_interval
>
> Use per-chip conversion-time lookup tables so the same helpers work
> for INA228/INA237/INA238/INA700/INA780 and SQ52206. Cache ADC_CONFIG
> in driver data and update it on writes to avoid extra register reads
> during read-modify-write updates.
>
> Report update_interval in milliseconds as required by the hwmon ABI.
> Compute it from raw ADC cycle time multiplied by the active averaging
> count, and apply the inverse mapping on writes so programmed conversion
> time tracks the selected sample count.
>
> Clamp user-provided update_interval before unit scaling to prevent
> overflow in arithmetic conversions.
>
> Also combine chip attributes in HWMON_CHANNEL_INFO using a bitwise OR
> for a single logical chip channel.
>
> Signed-off-by: Ferdinand Schwenk <ferdinand.schwenk@advastore.com>
Applied.
Thanks,
Guenter
^ permalink raw reply
* Re: [PATCH v3 2/3] hwmon: Add update_interval_us chip attribute
From: Guenter Roeck @ 2026-06-09 20:45 UTC (permalink / raw)
To: Ferdinand Schwenk
Cc: Jonathan Corbet, Shuah Khan, linux-hwmon, linux-kernel, linux-doc,
richard.leitner
In-Reply-To: <20260609-hwmon-ina238-update-interval-us-v2-v3-2-016b55567950@advastore.com>
On Tue, Jun 09, 2026 at 09:43:11PM +0200, Ferdinand Schwenk wrote:
> From: Ferdinand Schwenk <ferdinand.schwenk@advastore.com>
>
> Some hardware monitoring chips support update intervals below one
> millisecond. The existing update_interval attribute uses millisecond
> granularity, which causes sub-millisecond steps to round to the same
> value and become inaccessible from userspace.
>
> Introduce update_interval_us, a companion chip-level attribute that
> expresses the same update interval in microseconds. Drivers
> implementing this attribute should also implement update_interval for
> compatibility with millisecond-based userspace interfaces.
>
> Signed-off-by: Ferdinand Schwenk <ferdinand.schwenk@advastore.com>
Applied.
Thanks,
Guenter
^ permalink raw reply
* Re: [PATCH v3 0/6] alloc_tag: introduce IOCTL-based filtering for MAP
From: Abhishek Bapat @ 2026-06-09 20:47 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Andrew Morton, Kent Overstreet, Hao Ge, Shuah Khan,
Jonathan Corbet, linux-doc, linux-kernel, linux-mm, Sourav Panda
In-Reply-To: <CAJuCfpGr+K+fWK0zxVdMqGwKnr9Y26=Hn0HYFpnp-sUBcL8aKw@mail.gmail.com>
On Mon, Jun 8, 2026 at 5:29 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Jun 8, 2026 at 5:02 PM Abhishek Bapat <abhishekbapat@google.com> wrote:
> >
> > On Fri, Jun 5, 2026 at 5:09 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Fri, 5 Jun 2026 23:36:45 +0000 Abhishek Bapat <abhishekbapat@google.com> wrote:
> > >
> > > > Currently, memory allocation profiling data is primarily exposed through
> > > > /proc/allocinfo. While useful for manual inspection, this text-based
> > > > interface poses challenges for production monitoring and large-scale
> > > > analysis:
> > > >
> > > > 1. Userspace must parse large amounts of text to extract specific
> > > > fields.
> > > > 2. To find specific tags, userspace must read the entire dataset,
> > > > requiring many context switches and high data copying.
> > > > 3. The kernel currently aggregates per-CPU counters for every allocation
> > > > size, even those the user intends to filter out immediately.
> > > >
> > > > This series introduces a new IOCTL-based binary interface for allocinfo
> > > > that supports kernel-side filtering. By allowing the user to specify a
> > > > filter mask, we significantly reduce the work performed in-kernel and
> > > > the amount of data transferred to userspace.
> > >
> > > Thanks. AI review found several things - you'll want to address at
> > > least the first few.
> > >
> > > https://sashiko.dev/#/patchset/cover.1780701922.git.abhishekbapat@google.com
> >
> > All, please note I missed attaching the reason for choosing the IOCTL
> > mechanism to this cover letter, but I will attach it to the v4
> > patchset cover letter along with other changes. Thanks!
>
> Can you please add it here now so that we can review that?
I intend to add this to the end of the cover-letter passages, right
before the version change descriptions:
The ioctl() mechanism was chosen for allocinfo to address the per-CPU
counter aggregation performance bottleneck. A traditional read()
operation must report the total allocation count and sizes for every
code tag in the system. Doing so requires iterating across all CPUs to
sum their per-CPU counters for thousands of tags, which introduces
substantial runtime overhead.
The ioctl() interface allows userspace to push selective filtering
criteria directly into the kernel before the per-CPU counter
aggregation. The kernel aggregates per-CPU counters only for a small
subset of tags that match the filter. This results in significant
performance improvement.
Beyond fast filtered retrieval, the ioctl() foundation allows
introducing a context capture mechanism in the future to capture the
context for specific allocations.
^ permalink raw reply
* Re: [PATCH v3 3/3] hwmon: ina238: add update_interval_us attribute
From: Guenter Roeck @ 2026-06-09 20:49 UTC (permalink / raw)
To: Ferdinand Schwenk
Cc: Jonathan Corbet, Shuah Khan, linux-hwmon, linux-kernel, linux-doc,
richard.leitner
In-Reply-To: <20260609-hwmon-ina238-update-interval-us-v2-v3-3-016b55567950@advastore.com>
On Tue, Jun 09, 2026 at 09:43:12PM +0200, Ferdinand Schwenk wrote:
> From: Ferdinand Schwenk <ferdinand.schwenk@advastore.com>
>
> The INA238 family supports eight conversion time steps from 50 us to
> 4120 us (SQ52206: 66 us to 8230 us). At the millisecond granularity of
> update_interval, the four shortest steps (50, 84, 150, 280 us) all
> round to the same value and cannot be individually selected.
>
> Add support for the generic update_interval_us attribute, which reports
> and programs the same ADC cycle time as update_interval but in
> microseconds, giving userspace full access to all conversion time steps.
>
> Both attributes reflect the total cycle time including the active
> averaging count: the reported value is the raw conversion time
> multiplied by the number of averaged samples, and writes apply the
> inverse mapping.
>
> Signed-off-by: Ferdinand Schwenk <ferdinand.schwenk@advastore.com>
Applied.
Thanks,
Guenter
^ permalink raw reply
* Re: [PATCH v3 1/6] alloc_tag: add ioctl to /proc/allocinfo
From: Abhishek Bapat @ 2026-06-09 20:51 UTC (permalink / raw)
To: Hao Ge
Cc: Suren Baghdasaryan, Andrew Morton, Kent Overstreet, Shuah Khan,
Jonathan Corbet, linux-doc, linux-kernel, linux-mm, Sourav Panda
In-Reply-To: <2b7fa423-c721-4157-96ba-93bee5010044@linux.dev>
On Mon, Jun 8, 2026 at 6:44 PM Hao Ge <hao.ge@linux.dev> wrote:
>
> Hi Abhishek
>
>
> On 2026/6/9 08:19, Abhishek Bapat wrote:
> > On Sun, Jun 7, 2026 at 6:53 PM Hao Ge <hao.ge@linux.dev> wrote:
> >> Hi Suren and Abhishek
> >>
> >>
> >> Thanks for the new version.
> >>
> >>
> >> On 2026/6/6 07:36, Abhishek Bapat wrote:
> >>> From: Suren Baghdasaryan <surenb@google.com>
> >>>
> >>> Add the following ioctl commands for /proc/allocinfo file:
> >>>
> >>> ALLOCINFO_IOC_CONTENT_ID - gets content identifier which can be used
> >>> to check whether the file content has changed specifically due to module
> >>> load/unload. Every time a module is loaded / unloaded, the returned
> >>> value will be different. By comparing the identifier value at the
> >>> beginning and at the end of the content retrieval operation, users can
> >>> validate retrieved information for consistency.
> >>>
> >>> ALLOCINFO_IOC_GET_AT - gets the record at the specified position. This
> >>> is the position of a record in /proc/allocinfo.
> >>>
> >>> ALLOCINFO_IOC_GET_NEXT - gets the record next to the last retrieved
> >>> one. If no records were previously retrieved, returns the first
> >>> record.
> >>>
> >>> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> >>> Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
> >>> ---
> >>> Documentation/mm/allocation-profiling.rst | 5 +
> >>> .../userspace-api/ioctl/ioctl-number.rst | 2 +
> >>> MAINTAINERS | 1 +
> >>> include/linux/codetag.h | 2 +
> >>> include/uapi/linux/alloc_tag.h | 54 ++++
> >>> lib/alloc_tag.c | 232 +++++++++++++++++-
> >>> lib/codetag.c | 18 ++
> >>> 7 files changed, 312 insertions(+), 2 deletions(-)
> >>> create mode 100644 include/uapi/linux/alloc_tag.h
> >>>
> >>> diff --git a/Documentation/mm/allocation-profiling.rst b/Documentation/mm/allocation-profiling.rst
> >>> index 5389d241176a..c3a28467955f 100644
> >>> --- a/Documentation/mm/allocation-profiling.rst
> >>> +++ b/Documentation/mm/allocation-profiling.rst
> >>> @@ -46,6 +46,11 @@ sysctl:
> >>> Runtime info:
> >>> /proc/allocinfo
> >>>
> >>> + Profiling data can be retrieved either by reading `/proc/allocinfo` directly as
> >>> + text or programmatically via `ioctl()` calls defined in `<uapi/linux/alloc_tag.h>`.
> >>> + The ioctl interface supports structured binary data extraction as well as filtering
> >>> + by module name, function, file, line number, accuracy, or allocation size limits.
> >>> +
> >>> Example output::
> >>>
> >>> root@moria-kvm:~# sort -g /proc/allocinfo|tail|numfmt --to=iec
> >>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> index 331223761fff..84f6808a8578 100644
> >>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> @@ -349,6 +349,8 @@ Code Seq# Include File Comments
> >>> <mailto:luzmaximilian@gmail.com>
> >>> 0xA5 20-2F linux/surface_aggregator/dtx.h Microsoft Surface DTX driver
> >>> <mailto:luzmaximilian@gmail.com>
> >>> +0xA6 00-0F uapi/linux/alloc_tag.h Memory allocation profiling
> >>> + <mailto:surenb@google.com>
> >>> 0xAA 00-3F linux/uapi/linux/userfaultfd.h
> >>> 0xAB 00-1F linux/nbd.h
> >>> 0xAC 00-1F linux/raw.h
> >>> diff --git a/MAINTAINERS b/MAINTAINERS
> >>> index a31f6f207afd..77f3fc487691 100644
> >>> --- a/MAINTAINERS
> >>> +++ b/MAINTAINERS
> >>> @@ -16711,6 +16711,7 @@ S: Maintained
> >>> F: Documentation/mm/allocation-profiling.rst
> >>> F: include/linux/alloc_tag.h
> >>> F: include/linux/pgalloc_tag.h
> >>> +F: include/uapi/linux/alloc_tag.h
> >>> F: lib/alloc_tag.c
> >>>
> >>> MEMORY CONTROLLER DRIVERS
> >>> diff --git a/include/linux/codetag.h b/include/linux/codetag.h
> >>> index ddae7484ca45..a25a085c2df1 100644
> >>> --- a/include/linux/codetag.h
> >>> +++ b/include/linux/codetag.h
> >>> @@ -77,6 +77,8 @@ struct codetag_iterator {
> >>> void codetag_lock_module_list(struct codetag_type *cttype);
> >>> bool codetag_trylock_module_list(struct codetag_type *cttype);
> >>> void codetag_unlock_module_list(struct codetag_type *cttype);
> >>> +unsigned long codetag_get_content_id(struct codetag_type *cttype);
> >>> +unsigned int codetag_get_count(struct codetag_type *cttype);
> >>> struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
> >>> struct codetag *codetag_next_ct(struct codetag_iterator *iter);
> >>>
> >>> diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
> >>> new file mode 100644
> >>> index 000000000000..901199bad514
> >>> --- /dev/null
> >>> +++ b/include/uapi/linux/alloc_tag.h
> >>> @@ -0,0 +1,54 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> >>> +/*
> >>> + * include/linux/alloc_tag.h
> >> nit: it should be include/uapi/linux/alloc_tag.h
> >>
> >> (I guess you may have missed the comment I brought up before. It is not
> >> a critical problem though.)
> >>
> > Apologies, I missed that comment earlier. Included in the v4 patchset.
> > Thanks for bringing this up.
> >
> >>> + */
> >>> +
> >>> +#ifndef _UAPI_ALLOC_TAG_H
> >>> +#define _UAPI_ALLOC_TAG_H
> >>> +
> >>> +#include <linux/types.h>
> >>> +
> >>> +#define ALLOCINFO_STR_SIZE 64
> >>> +
> >>> +struct allocinfo_content_id {
> >>> + __u64 id;
> >>> +};
> >>> +
> >>> +struct allocinfo_tag {
> >>> + /* Longer names are trimmed */
> >>> + char modname[ALLOCINFO_STR_SIZE];
> >>> + char function[ALLOCINFO_STR_SIZE];
> >>> + char filename[ALLOCINFO_STR_SIZE];
> >>> + __u64 lineno;
> >>> +};
> >>> +
> >>> +/* The alignment ensures 32-bit compatible interfaces are not broken */
> >>> +struct allocinfo_counter {
> >>> + __u64 bytes;
> >>> + __u64 calls;
> >>> + __u8 accurate;
> >>> +} __attribute__((aligned(8)));
> >>> +
> >>> +struct allocinfo_tag_data {
> >>> + struct allocinfo_tag tag;
> >>> + struct allocinfo_counter counter;
> >>> +};
> >>> +
> >>> +struct allocinfo_get_at {
> >>> + __u64 pos; /* input */
> >>> + struct allocinfo_tag_data data;
> >>> +};
> >>> +
> >>> +#define _ALLOCINFO_IOC_CONTENT_ID 0
> >>> +#define _ALLOCINFO_IOC_GET_AT 1
> >>> +#define _ALLOCINFO_IOC_GET_NEXT 2
> >>> +
> >>> +#define ALLOCINFO_IOC_BASE 0xA6
> >>> +#define ALLOCINFO_IOC_CONTENT_ID _IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_CONTENT_ID, \
> >>> + struct allocinfo_content_id)
> >>> +#define ALLOCINFO_IOC_GET_AT _IOWR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_AT, \
> >>> + struct allocinfo_get_at)
> >>> +#define ALLOCINFO_IOC_GET_NEXT _IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_NEXT, \
> >>> + struct allocinfo_tag_data)
> >>> +
> >>> +#endif /* _UAPI_ALLOC_TAG_H */
> >>> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> >>> index d9be1cf5187d..a0577215eb3d 100644
> >>> --- a/lib/alloc_tag.c
> >>> +++ b/lib/alloc_tag.c
> >>> @@ -5,6 +5,7 @@
> >>> #include <linux/gfp.h>
> >>> #include <linux/kallsyms.h>
> >>> #include <linux/module.h>
> >>> +#include <linux/mutex.h>
> >>> #include <linux/page_ext.h>
> >>> #include <linux/pgalloc_tag.h>
> >>> #include <linux/proc_fs.h>
> >>> @@ -14,6 +15,7 @@
> >>> #include <linux/string_choices.h>
> >>> #include <linux/vmalloc.h>
> >>> #include <linux/kmemleak.h>
> >>> +#include <uapi/linux/alloc_tag.h>
> >>>
> >>> #define ALLOCINFO_FILE_NAME "allocinfo"
> >>> #define MODULE_ALLOC_TAG_VMAP_SIZE (100000UL * sizeof(struct alloc_tag))
> >>> @@ -47,6 +49,10 @@ struct allocinfo_private {
> >>> struct codetag_iterator iter;
> >>> struct codetag_iterator reported_iter;
> >>> bool print_header;
> >>> + /* ioctl uses a separate iterator not to interfere with reads */
> >>> + struct codetag_iterator ioctl_iter;
> >>> + bool positioned; /* seq_open_private() sets to 0 */
> >>> + struct mutex ioctl_lock;
> >>> };
> >>>
> >>> static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> >>> @@ -130,6 +136,229 @@ static const struct seq_operations allocinfo_seq_op = {
> >>> .show = allocinfo_show,
> >>> };
> >>>
> >>> +/*
> >>> + * Initializes seq_file operations and allocates private state when opening
> >>> + * the /proc/allocinfo procfs entry.
> >>> + */
> >>> +static int allocinfo_open(struct inode *inode, struct file *file)
> >>> +{
> >>> + int ret;
> >>> +
> >>> + ret = seq_open_private(file, &allocinfo_seq_op,
> >>> + sizeof(struct allocinfo_private));
> >>> + if (!ret) {
> >>> + struct seq_file *m = file->private_data;
> >>> + struct allocinfo_private *priv = m->private;
> >>> +
> >>> + mutex_init(&priv->ioctl_lock);
> >>> + }
> >>> + return ret;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Cleans up the seq_file state and frees up the private state allocated in
> >>> + * allocinfo_open() when closing the /proc/allocinfo file descriptor.
> >>> + */
> >>> +static int allocinfo_release(struct inode *inode, struct file *file)
> >>> +{
> >>> + return seq_release_private(inode, file);
> >>> +}
> >>> +
> >>> +/*
> >>> + * Returns a pointer to the suffix of a string so that its length fits within
> >>> + * ALLOCINFO_STR_SIZE, preserving the trailing characters.
> >>> + */
> >>> +static const char *allocinfo_str(const char *str)
> >>> +{
> >>> + size_t len = strlen(str);
> >>> +
> >>> + /* Keep an extra space for the trailing NULL. */
> >>> + if (len >= ALLOCINFO_STR_SIZE)
> >>> + str += (len - ALLOCINFO_STR_SIZE) + 1;
> >>> + return str;
> >>> +}
> >>> +
> >>> +/* Copy a string and trim from the beginning if it's too long */
> >>> +static void allocinfo_copy_str(char *dest, const char *src)
> >>> +{
> >>> + strscpy_pad(dest, allocinfo_str(src), ALLOCINFO_STR_SIZE);
> >>> +}
> >>> +
> >>> +/*
> >>> + * Populates the UAPI allocinfo_tag_data structure with active runtime
> >>> + * profiling counters extracted from the given kernel codetag.
> >>> + */
> >>> +static void allocinfo_to_params(struct codetag *ct,
> >>> + struct allocinfo_tag_data *data)
> >>> +{
> >>> + struct alloc_tag *tag = ct_to_alloc_tag(ct);
> >>> + struct alloc_tag_counters counter = alloc_tag_read(tag);
> >>> +
> >>> + if (ct->modname)
> >>> + allocinfo_copy_str(data->tag.modname, ct->modname);
> >>> + else
> >>> + data->tag.modname[0] = '\0';
> >>> + allocinfo_copy_str(data->tag.function, ct->function);
> >>> + allocinfo_copy_str(data->tag.filename, ct->filename);
> >>> + data->tag.lineno = ct->lineno;
> >>> + data->counter.bytes = counter.bytes;
> >>> + data->counter.calls = counter.calls;
> >>> + data->counter.accurate = !alloc_tag_is_inaccurate(tag);
> >>> +}
> >>> +
> >>> +/*
> >>> + * Retrieves the unique content ID representing the current allocation tag module
> >>> + * layout, allowing userspace to detect if modules were loaded / unloaded.
> >>> + */
> >>> +static int allocinfo_ioctl_get_content_id(struct seq_file *m, void __user *arg)
> >>> +{
> >>> + struct allocinfo_content_id params;
> >>> +
> >>> + codetag_lock_module_list(alloc_tag_cttype);
> >>> + params.id = codetag_get_content_id(alloc_tag_cttype);
> >>> + codetag_unlock_module_list(alloc_tag_cttype);
> >>> + if (copy_to_user(arg, ¶ms, sizeof(params)))
> >>> + return -EFAULT;
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Seeks the ioctl iterator to the specified 0-indexed tag position, reads its
> >>> + * profiling data and returns it to userspace.
> >>> + */
> >>> +static int allocinfo_ioctl_get_at(struct seq_file *m, void __user *arg)
> >>> +{
> >>> + struct allocinfo_private *priv;
> >>> + struct codetag *ct;
> >>> + __u64 pos;
> >>> + struct allocinfo_get_at params = {0};
> >>> +
> >>> + if (copy_from_user(¶ms, arg, sizeof(params)))
> >>> + return -EFAULT;
> >>> +
> >>> + priv = m->private;
> >>> + pos = params.pos;
> >>> +
> >>> + mutex_lock(&priv->ioctl_lock);
> >>> + codetag_lock_module_list(alloc_tag_cttype);
> >>> +
> >>> + if (pos >= codetag_get_count(alloc_tag_cttype)) {
> >>> + codetag_unlock_module_list(alloc_tag_cttype);
> >>> + mutex_unlock(&priv->ioctl_lock);
> >>> + return -ENOENT;
> >>> + }
> >>> +
> >>> + /* Find the codetag */
> >>> + priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> >>> + ct = codetag_next_ct(&priv->ioctl_iter);
> >>> + while (ct && pos--)
> >>> + ct = codetag_next_ct(&priv->ioctl_iter);
> >>> + if (ct) {
> >>> + allocinfo_to_params(ct, ¶ms.data);
> >>> + priv->positioned = true;
> >>> + }
> >>> +
> >>> + codetag_unlock_module_list(alloc_tag_cttype);
> >>> + mutex_unlock(&priv->ioctl_lock);
> >>> +
> >>> + if (!ct)
> >>> + return -ENOENT;
> >>> +
> >>> + if (copy_to_user(arg, ¶ms, sizeof(params)))
> >>> + return -EFAULT;
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Advances the ioctl iterator to the next allocation tag in the sequence and
> >>> + * returns its profiling data to userspace.
> >>> + */
> >>> +static int allocinfo_ioctl_get_next(struct seq_file *m, void __user *arg)
> >>> +{
> >>> + struct allocinfo_private *priv;
> >>> + struct codetag *ct;
> >>> + struct allocinfo_tag_data params;
> >>> + int ret = 0;
> >>> +
> >>> + memset(¶ms, 0, sizeof(params));
> >>> + priv = m->private;
> >>> +
> >>> + mutex_lock(&priv->ioctl_lock);
> >>> + codetag_lock_module_list(alloc_tag_cttype);
> >>> +
> >>> + if (!priv->positioned) {
> >>> + priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> >>> + priv->positioned = true;
> >>> + }
> >>> +
> >>> + ct = codetag_next_ct(&priv->ioctl_iter);
> >>> + if (ct)
> >>> + allocinfo_to_params(ct, ¶ms);
> >>> +
> >>> + if (!ct) {
> >>> + priv->positioned = false;
> >>> + ret = -ENOENT;
> >>> + }
> >>> + codetag_unlock_module_list(alloc_tag_cttype);
> >>> + mutex_unlock(&priv->ioctl_lock);
> >>> +
> >>> + if (ret == 0) {
> >>> + if (copy_to_user(arg, ¶ms, sizeof(params)))
> >>> + return -EFAULT;
> >>> + }
> >>> + return ret;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Entry point ioctl function for /proc/allocinfo routing requests to fetch the
> >>> + * layout content ID, seek to a specific tag, or read sequential tags.
> >>> + */
> >>> +static long allocinfo_ioctl(struct file *file, unsigned int cmd,
> >>> + unsigned long __arg)
> >>> +{
> >>> + void __user *arg = (void __user *)__arg;
> >>> + int ret;
> >>> +
> >>> + switch (cmd) {
> >>> + case ALLOCINFO_IOC_CONTENT_ID:
> >>> + ret = allocinfo_ioctl_get_content_id(file->private_data, arg);
> >>> + break;
> >>> + case ALLOCINFO_IOC_GET_AT:
> >>> + ret = allocinfo_ioctl_get_at(file->private_data, arg);
> >>> + break;
> >>> + case ALLOCINFO_IOC_GET_NEXT:
> >>> + ret = allocinfo_ioctl_get_next(file->private_data, arg);
> >>> + break;
> >>> + default:
> >>> + ret = -ENOIOCTLCMD;
> >>> + break;
> >>> + }
> >>> +
> >>> + return ret;
> >>> +}
> >>> +
> >>> +#ifdef CONFIG_COMPAT
> >>> +static long allocinfo_compat_ioctl(struct file *file, unsigned int cmd,
> >>> + unsigned long arg)
> >>> +{
> >>> + return allocinfo_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
> >>> +}
> >>> +#endif
> >>> +
> >>> +static const struct proc_ops allocinfo_proc_ops = {
> >>> + .proc_open = allocinfo_open,
> >>> + .proc_read_iter = seq_read_iter,
> >>> + .proc_lseek = seq_lseek,
> >>> + .proc_release = allocinfo_release,
> >>> + .proc_ioctl = allocinfo_ioctl,
> >>> +#ifdef CONFIG_COMPAT
> >>> + .proc_compat_ioctl = allocinfo_compat_ioctl,
> >>> +#endif
> >>> +
> >>> +};
> >>> +
> >>> size_t alloc_tag_top_users(struct codetag_bytes *tags, size_t count, bool can_sleep)
> >>> {
> >>> struct codetag_iterator iter;
> >>> @@ -993,8 +1222,7 @@ static int __init alloc_tag_init(void)
> >>> return 0;
> >>> }
> >>>
> >>> - if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
> >>> - sizeof(struct allocinfo_private), NULL)) {
> >>> + if (!proc_create(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_proc_ops)) {
> >>> pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
> >>> shutdown_mem_profiling(false);
> >>> return -ENOMEM;
> >>> diff --git a/lib/codetag.c b/lib/codetag.c
> >>> index 4001a7ea6675..a9cda4c962a3 100644
> >>> --- a/lib/codetag.c
> >>> +++ b/lib/codetag.c
> >>> @@ -19,6 +19,8 @@ struct codetag_type {
> >>> struct codetag_type_desc desc;
> >>> /* generates unique sequence number for module load */
> >>> unsigned long next_mod_seq;
> >>> + /* bumped on every module load and unload */
> >>> + unsigned long content_id;
> >>> };
> >>>
> >>> struct codetag_range {
> >>> @@ -50,6 +52,20 @@ void codetag_unlock_module_list(struct codetag_type *cttype)
> >>> up_read(&cttype->mod_lock);
> >>> }
> >>>
> >>> +unsigned long codetag_get_content_id(struct codetag_type *cttype)
> >>> +{
> >>> + lockdep_assert_held(&cttype->mod_lock);
> >>> +
> >>> + return cttype->content_id;
> >>> +}
> >>> +
> >>> +unsigned int codetag_get_count(struct codetag_type *cttype)
> >>> +{
> >>> + lockdep_assert_held(&cttype->mod_lock);
> >>> +
> >>> + return cttype->count;
> >>> +}
> >>> +
> >>> struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
> >>> {
> >>> struct codetag_iterator iter = {
> >>> @@ -204,6 +220,7 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
> >>>
> >>> down_write(&cttype->mod_lock);
> >>> cmod->mod_seq = ++cttype->next_mod_seq;
> >>> + ++cttype->content_id;
> >> I have a comment on the content_id bump placement.
> >>
> >> ++cttype->content_id is placed before idr_alloc and the module_load
> >>
> >> callback. If idr_alloc fails or module_load returns an error
> >>
> >> (While the chance of this occurring is very low.), the idr entry gets
> >>
> >> rolled back but content_id has already been bumped. The actual
> >>
> >> content didn't change in this case, so userspace would see a
> >>
> >> different content_id and assume the data is inconsistent when it
> >>
> >> isn't.
> >>
> >>
> >> Thanks
> >>
> >> Best Regards
> >>
> >> Hao
> > While I agree with your comment, I decided to place the counter
> > increment there because the chance of failure is low. Furthermore,
> > even if it falsely invalidates user data, the user will simply query
> > the content again. This placement also aligns with where the
> > previously used field (cttype->next_mod_seq) was incremented. Let me
> > know if you still think I should move it. Thanks!
>
> Sorry, I should have marked this as a nit when I raised the comment.
>
> Given its low probability of occurring, it doesn't block anything for now.
>
> The reason I raised this comment was just in case someone adds new logic
>
> in the feature that could fail. But if that happens, we can move both
> next_mod_seq
>
> and content_id down together.
>
>
> Thanks
>
> Best Regards
>
> Hao
>
Sounds good, so for now I am not touching this patch and keeping it as
is other than the file path nit inside the UAPI header file.
> >>> mod_id = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
> >>> if (mod_id >= 0) {
> >>> if (cttype->desc.module_load) {
> >>> @@ -368,6 +385,7 @@ void codetag_unload_module(struct module *mod)
> >>> cttype->count -= range_size(cttype, &cmod->range);
> >>> idr_remove(&cttype->mod_idr, mod_id);
> >>> kfree(cmod);
> >>> + ++cttype->content_id;
> >>> }
> >>> up_write(&cttype->mod_lock);
> >>> if (found && cttype->desc.free_section_mem)
^ permalink raw reply
* Re: [PATCH v3 4/6] alloc_tag: add accuracy based filtering to ioctl
From: Abhishek Bapat @ 2026-06-09 20:53 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Hao Ge, Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel,
linux-mm, Sourav Panda, Andrew Morton, Kent Overstreet
In-Reply-To: <CAJuCfpGbxO0zu_UAWCYNNv8RHgT=E4AF0tgG-kWoLXECOL3byA@mail.gmail.com>
On Tue, Jun 9, 2026 at 7:41 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Jun 8, 2026 at 6:26 PM Hao Ge <hao.ge@linux.dev> wrote:
> >
> > Hi Suren
> >
> >
> > On 2026/6/9 04:55, Suren Baghdasaryan wrote:
> > > On Mon, Jun 8, 2026 at 1:25 AM Hao Ge <hao.ge@linux.dev> wrote:
> > >>
> > >> On 2026/6/8 14:22, Hao Ge wrote:
> > >>> Hi Abhishek
> > >>>
> > >>>
> > >>> On 2026/6/6 07:36, Abhishek Bapat wrote:
> > >>>> Extend the allocinfo filtering mechanism to allow users to filter tags
> > >>>> based on their accuracy.
> > >>>>
> > >>>> Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
> > >>>> ---
> > >>>> include/uapi/linux/alloc_tag.h | 3 +++
> > >>>> lib/alloc_tag.c | 8 ++++++++
> > >>>> 2 files changed, 11 insertions(+)
> > >>>>
> > >>>> diff --git a/include/uapi/linux/alloc_tag.h
> > >>>> b/include/uapi/linux/alloc_tag.h
> > >>>> index 0e648192df4d..42445bdb11c5 100644
> > >>>> --- a/include/uapi/linux/alloc_tag.h
> > >>>> +++ b/include/uapi/linux/alloc_tag.h
> > >>>> @@ -20,6 +20,7 @@ struct allocinfo_tag {
> > >>>> char function[ALLOCINFO_STR_SIZE];
> > >>>> char filename[ALLOCINFO_STR_SIZE];
> > >>>> __u64 lineno;
> > >>>> + __u64 inaccurate;
> > >>>
> > >>> I was wondering if it would make sense to define inaccurate as a flags
> > >>> field
> > >>>
> > >>> (e.g. __u64 flags with ALLOCINFO_TAG_F_INACCURATE (1 <<0)),
> > >>>
> > >>> so that only bit 0 is used today and the upper bits are reserved for
> > >>> future use,
> > >>>
> > >>> aligning with current kernel codebase.
> > >>>
> > >>> This design also allows for better extensibility if we need to
> > >>>
> > >>> add new flags for any reason in the future.
> > >>>
> > >>> We also need to add flag validity checks if we go this route.
> > >>>
> > >> And I've reviewed the issue reported by Sashiko, and I think it's valid.
> > >>
> > >> When we expand the allocinfo_tag_data structure
> > >>
> > >> struct allocinfo_tag_data{
> > >>
> > >> char modname[64];
> > >>
> > >> char function[64];
> > >>
> > >> char filename[64];
> > >>
> > >> __u64 lineno;
> > >>
> > >> __u64 inaccurate;
> > >>
> > >> __u64 bytes;
> > >>
> > >> __u64 calls;
> > >>
> > >> __u8 accurate;
> > >> /* padding */
> > >>
> > >> }
> > >>
> > >> I think user space may see two fields related to inaccuracy.
> > > Yes but one field (inside allocinfo_tag) is the input parameter which
> > > user provides to specify the filtering criteria and the other is the
> > > returned tag information. It's similar to any other tag attribute
> > > which you can be included in the filters.
> > >
> > >> How do you like these modifications?
> > >>
> > >>
> > >> diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
> > >> --- a/include/uapi/linux/alloc_tag.h
> > >> +++ b/include/uapi/linux/alloc_tag.h
> > >> @@ -20,7 +20,6 @@ struct allocinfo_tag {
> > >> char function[ALLOCINFO_STR_SIZE];
> > >> char filename[ALLOCINFO_STR_SIZE];
> > >> __u64 lineno;
> > >> - __u64 inaccurate;
> > >> };
> > >>
> > >> /* The alignment ensures 32-bit compatible interfaces are not broken */
> > >> @@ -40,7 +39,7 @@ enum {
> > >> ALLOCINFO_FILTER_FUNCTION,
> > >> ALLOCINFO_FILTER_FILENAME,
> > >> ALLOCINFO_FILTER_LINENO,
> > >> - ALLOCINFO_FILTER_INACCURATE,
> > >> + ALLOCINFO_FILTER_FLAGS,
> > >> ALLOCINFO_FILTER_MIN_SIZE,
> > >> ALLOCINFO_FILTER_MAX_SIZE,
> > >> __ALLOCINFO_FILTER_LAST = ALLOCINFO_FILTER_MAX_SIZE
> > >> @@ -50,16 +49,20 @@ enum {
> > >> #define ALLOCINFO_FILTER_MASK_FUNCTION (1 <<
> > >> ALLOCINFO_FILTER_FUNCTION)
> > >> #define ALLOCINFO_FILTER_MASK_FILENAME (1 <<
> > >> ALLOCINFO_FILTER_FILENAME)
> > >> #define ALLOCINFO_FILTER_MASK_LINENO (1 << ALLOCINFO_FILTER_LINENO)
> > >> -#define ALLOCINFO_FILTER_MASK_INACCURATE (1 <<
> > >> ALLOCINFO_FILTER_INACCURATE)
> > >> +#define ALLOCINFO_FILTER_MASK_FLAGS (1 << ALLOCINFO_FILTER_FLAGS)
> > >> #define ALLOCINFO_FILTER_MASK_MIN_SIZE (1 <<
> > >> ALLOCINFO_FILTER_MIN_SIZE)
> > >> #define ALLOCINFO_FILTER_MASK_MAX_SIZE (1 <<
> > >> ALLOCINFO_FILTER_MAX_SIZE)
> > >>
> > >> #define ALLOCINFO_FILTER_MASKS \
> > >> ((1 << (__ALLOCINFO_FILTER_LAST + 1)) - 1)
> > >>
> > >> +#define ALLOCINFO_FILTER_F_INACCURATE (1ULL << 0)
> > >> +#define ALLOCINFO_FILTER_FLAGS_ALL ALLOCINFO_FILTER_F_INACCURATE
> > >> +
> > >> struct allocinfo_filter {
> > >> __u64 mask; /* bitmask of the filter fields used */
> > >> struct allocinfo_tag fields;
> > >> + __u64 flags; /* bitmask of ALLOCINFO_FILTER_F_* */
> > >> __u64 min_size;
> > >> __u64 max_size;
> > >> };
> > >> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> > >> --- a/lib/alloc_tag.c
> > >> +++ b/lib/alloc_tag.c
> > >> @@ -249,8 +249,6 @@ static bool matches_filter(struct codetag *ct,
> > >> struct allocinfo_filter *filter,
> > >> struct alloc_tag_counters *counters,
> > >> bool *fetched_counters)
> > >> {
> > >> - bool inaccurate;
> > >> -
> > >> if (!filter || !filter->mask)
> > >> return true;
> > >>
> > >> @@ -277,10 +275,11 @@ static bool matches_filter(struct codetag *ct,
> > >> struct allocinfo_filter *filter,
> > >> ct->lineno != filter->fields.lineno)
> > >> return false;
> > >>
> > >> - if (filter->mask & ALLOCINFO_FILTER_MASK_INACCURATE) {
> > >> - inaccurate = !!(ct->flags & CODETAG_FLAG_INACCURATE);
> > >> - if (inaccurate != !!(filter->fields.inaccurate))
> > >> - return false;
> > >> + if (filter->mask & ALLOCINFO_FILTER_MASK_FLAGS) {
> > >> + if (filter->flags & ALLOCINFO_FILTER_F_INACCURATE) {
> > >> + if (!(ct->flags & CODETAG_FLAG_INACCURATE))
> > > How would you filter records which have only accurate data?
> >
> >
> > Sorry, I overlooked this case.
> >
> > Since allocinfo_tag_data exposes both inaccurate (from allocinfo_tag) and
> >
> > accurate (from allocinfo_counter), userspace developers might mistakenly
> > read
> >
> > inaccurate instead of accurate when checking accuracy.
> >
> > How about we add a comment to clarify?
> >
> > struct allocinfo_tag {
> >
> > /* ... */
> >
> > __u64 lineno;
> >
> > /* filter criteria only; see allocinfo_counter.accurate for actual
> > accuracy */
> >
> > __u64 inaccurate;
>
> I think we had comments showing which block of parameters are inputs
> and which ones are outputs but I'm not opposed to an additional
> reminder here.
>
Ack, I'll include the recommended comment.
> >
> > };
> >
> >
> > LGTM for the rest.
> >
> >
> > Thanks
> >
> > Best Regards
> >
> > Hao
> >
> > > Overall I would prefer ALLOCINFO_FILTER_MASK_INACCURATE rather than
> > > ALLOCINFO_FILTER_MASK_FLAGS. The fact that this attribute is a
> > > single-bit flag is a technical detail. It's still a tag attribuite
> > > like file and module names and IMO deserves its own filter.
> > >
> > >
> > >
> > >> + return false;
> > >> + }
> > >> }
> > >>
> > >> if (filter->mask & (ALLOCINFO_FILTER_MASK_MIN_SIZE |
> > >> ALLOCINFO_FILTER_MASK_MAX_SIZE)) {
> > >> @@ -318,6 +317,10 @@ static int allocinfo_ioctl_get_at(struct seq_file
> > >> *m, void __user *arg)
> > >> if (params.filter.mask & ~ALLOCINFO_FILTER_MASKS)
> > >> return -EINVAL;
> > >>
> > >> + if ((params.filter.mask & ALLOCINFO_FILTER_MASK_FLAGS) &&
> > >> + (params.filter.flags & ~ALLOCINFO_FILTER_FLAGS_ALL))
> > >> + return -EINVAL;
> > >> +
> > >> if ((params.filter.mask & ALLOCINFO_FILTER_MASK_MIN_SIZE) &&
> > >> (params.filter.mask & ALLOCINFO_FILTER_MASK_MAX_SIZE) &&
> > >> params.filter.min_size > params.filter.max_size)
> > >>
> > >>
> > >> Thanks
> > >>
> > >> Best Regards
> > >>
> > >> Hao
> > >>
> > >>
> > >>> Thanks
> > >>>
> > >>> Best Regards
> > >>>
> > >>> Hao
> > >>>
> > >>>
> > >>>> };
> > >>>> /* The alignment ensures 32-bit compatible interfaces are not
> > >>>> broken */
> > >>>> @@ -39,6 +40,7 @@ enum {
> > >>>> ALLOCINFO_FILTER_FUNCTION,
> > >>>> ALLOCINFO_FILTER_FILENAME,
> > >>>> ALLOCINFO_FILTER_LINENO,
> > >>>> + ALLOCINFO_FILTER_INACCURATE,
> > >>>> ALLOCINFO_FILTER_MIN_SIZE,
> > >>>> ALLOCINFO_FILTER_MAX_SIZE,
> > >>>> __ALLOCINFO_FILTER_LAST = ALLOCINFO_FILTER_MAX_SIZE
> > >>>> @@ -48,6 +50,7 @@ enum {
> > >>>> #define ALLOCINFO_FILTER_MASK_FUNCTION (1 <<
> > >>>> ALLOCINFO_FILTER_FUNCTION)
> > >>>> #define ALLOCINFO_FILTER_MASK_FILENAME (1 <<
> > >>>> ALLOCINFO_FILTER_FILENAME)
> > >>>> #define ALLOCINFO_FILTER_MASK_LINENO (1 <<
> > >>>> ALLOCINFO_FILTER_LINENO)
> > >>>> +#define ALLOCINFO_FILTER_MASK_INACCURATE (1 <<
> > >>>> ALLOCINFO_FILTER_INACCURATE)
> > >>>> #define ALLOCINFO_FILTER_MASK_MIN_SIZE (1 <<
> > >>>> ALLOCINFO_FILTER_MIN_SIZE)
> > >>>> #define ALLOCINFO_FILTER_MASK_MAX_SIZE (1 <<
> > >>>> ALLOCINFO_FILTER_MAX_SIZE)
> > >>>> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> > >>>> index ddc6946f56ab..cbcd12c4ef9c 100644
> > >>>> --- a/lib/alloc_tag.c
> > >>>> +++ b/lib/alloc_tag.c
> > >>>> @@ -249,6 +249,8 @@ static bool matches_filter(struct codetag *ct,
> > >>>> struct allocinfo_filter *filter,
> > >>>> struct alloc_tag_counters *counters,
> > >>>> bool *fetched_counters)
> > >>>> {
> > >>>> + bool inaccurate;
> > >>>> +
> > >>>> if (!filter || !filter->mask)
> > >>>> return true;
> > >>>> @@ -275,6 +277,12 @@ static bool matches_filter(struct codetag *ct,
> > >>>> struct allocinfo_filter *filter,
> > >>>> ct->lineno != filter->fields.lineno)
> > >>>> return false;
> > >>>> + if (filter->mask & ALLOCINFO_FILTER_MASK_INACCURATE) {
> > >>>> + inaccurate = !!(ct->flags & CODETAG_FLAG_INACCURATE);
> > >>>> + if (inaccurate != !!(filter->fields.inaccurate))
> > >>>> + return false;
> > >>>> + }
> > >>>> +
> > >>>> if (filter->mask & (ALLOCINFO_FILTER_MASK_MIN_SIZE |
> > >>>> ALLOCINFO_FILTER_MASK_MAX_SIZE)) {
> > >>>> if (!*fetched_counters) {
> > >>>> *counters = allocinfo_prefetch_counters(ct);
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox