Linux Documentation
* Re: [PATCH v2 0/2] selftests/mm: separate GUP microbenchmarking from functional testing
From: Sarthak Sharma @ 2026-05-20  6:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Jonathan Corbet, Jason Gunthorpe, John Hubbard,
	Peter Xu, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	linux-mm, linux-kselftest, linux-kernel, linux-doc
In-Reply-To: <20260519112049.a85f34eb5f2af83e11ffc777@linux-foundation.org>

Hi Andrew!

On 5/19/26 11:50 PM, Andrew Morton wrote:
> On Tue, 19 May 2026 17:35:04 +0530 Sarthak Sharma <sarthak.sharma@arm.com> wrote:
> 
>> gup_test.c currently serves two distinct purposes: microbenchmarking
>> (GUP_FAST_BENCHMARK, PIN_FAST_BENCHMARK, PIN_LONGTERM_BENCHMARK) and
>> functional correctness testing (GUP_BASIC_TEST, PIN_BASIC_TEST,
>> DUMP_USER_PAGES_TEST). Mixing these in a single binary means functional
>> tests cannot be run or reported individually, and run_vmtests.sh must
>> invoke the binary multiple times with different flag combinations to
>> cover all configurations. This patch series separates the two concerns:
>> tools/mm/gup_bench for benchmarking and tools/testing/selftests/mm/gup_test
>> for functional testing.
>>
>> Patch 1 adds tools/mm/gup_bench.c, a standalone microbenchmark for
>> GUP_FAST, PIN_FAST and PIN_LONGTERM via the CONFIG_GUP_TEST debugfs
>> interface. It runs the same matrix of configurations as the old
>> run_gup_matrix() shell function (all three commands, read/write,
>> private/shared, four page counts, THP on/off, hugetlb), but as a
>> standalone C program under tools/mm with no dependency on kselftest.
>>
>> Patch 2 rewrites gup_test.c as a kselftest harness-based selftest. It
>> covers all five GUP kernel functions (get_user_pages, get_user_pages_fast,
>> pin_user_pages, pin_user_pages_fast, pin_user_pages with FOLL_LONGTERM)
>> plus DUMP_USER_PAGES_TEST, across 12 mapping configurations (THP on,
>> THP off and hugetlb, each across private/shared and read/write variants)
>> and four batch sizes (1, 512, 123, all pages). Results are reported as
>> standard TAP output with no command-line arguments required.
> 
> Thanks.  AI review asked a few things which seem fairly minor to me,
> but probably legitimate:
> 	https://sashiko.dev/#/patchset/20260519120506.184512-1-sarthak.sharma@arm.com

Thanks for pointing it out. I'll address the review comments and send a v3.


* Re: [PATCH v2 5/6] gpio: remove machine hogs
From: Bartosz Golaszewski @ 2026-05-20  6:55 UTC (permalink / raw)
  To: Dmitry Torokhov
  Cc: Bartosz Golaszewski, Linus Walleij, Geert Uytterhoeven,
	Frank Rowand, Mika Westerberg, Andy Shevchenko, Aaro Koskinen,
	Janusz Krzysztofik, Tony Lindgren, Russell King, Jonathan Corbet,
	Shuah Khan, linux-gpio, linux-kernel, linux-acpi,
	linux-arm-kernel, linux-omap, linux-doc
In-Reply-To: <ag1GJygtLgngKQqj@google.com>

On Wed, May 20, 2026 at 7:27 AM Dmitry Torokhov
<dmitry.torokhov@gmail.com> wrote:
> >
> > If there is no replacement maybe we can resurrect this? Or should we
> > add swnode support for hogs?
>
> Hmm, I guess it is already there so I should simply switch. Sorry about
> the noise.
>

Earlier in this series you have examples of using software node based
hogging, I hope you can use it?

Bart


* Re: [PATCH] nios2: remove the architecture
From: Arnd Bergmann @ 2026-05-20  7:06 UTC (permalink / raw)
  To: schuster.simon@siemens-energy.com, Ethan Nelson-Moore,
	Wolfram Sang
  Cc: Peter Zijlstra, Dinh Nguyen, linux-doc, devicetree, workflows,
	Linux-Arch, dmaengine, linux-i2c, linux-iio, Netdev, linux-pci,
	linux-pwm, linux-hardening, linux-kbuild,
	linux-csky@vger.kernel.org, Jonathan Corbet, Shuah Khan,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Daniel Lezcano,
	Thomas Gleixner, Alex Shi, Yanteng Si, Dongliang Mu, Hu Haowen,
	Kees Cook, Oleg Nesterov, Will Deacon, Aneesh Kumar K.V (Arm),
	Andrew Morton, Nicholas Piggin, Vinod Koul, Frank Li,
	Dave Penkler, Andi Shyti, Jonathan Cameron, David Lechner,
	Nuno Sá, Andy Shevchenko, Andrew Lunn, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Andreas Oetken
In-Reply-To: <20260519103012.blot4bssgiqfer6p@dev-vm-schuster>

On Tue, May 19, 2026, at 12:30, Simon Schuster wrote:
> On Mon, May 18, 2026 at 05:13:58PM -0700, Ethan Nelson-Moore wrote:
>
> 2035 is still a rather tight timeframe for our typical support/phase-out
> period (we would hope to get close to 2040 with the SLTS extensions),
> which is also the reason for our targeted 'lifetime extension' for the
> nios2 architecture for approximately 5 years, or more precisely ~2-3
> SLTS kernels assuming the usual cadence of 2 years between SLTS versions
> (+ some safety margin).

I think that is a reasonable target. We have a bunch of embedded
architectures that have a similarly small user base and I expect
that we will want to remove most of them at some point, as we did
for seven architectures in linux-4.17.

As long as there is a maintainer for nios2 and it's not actively
getting in the way of a specific treewide change, I don't see any
reason to remove this any earlier than the other ones.

Obviously at some point nios2 will have to get removed because
of the limit to gcc-14 or older, but that should not be a problem
for the next few LTS releases.

> Sure, I'd be glad to do so, but so far I refrained from it as I was a bit
> unsure about the netiquette (can I simply do so by self-proclamation? At
> least the git history seems to suggest so...).

Dinh already replied that he welcomes the help, and I also suggested
the same thing a year ago. As the only known user that has contributed
patches in a long time, you are obviously qualified.

Sending a patch for the MAINTAINERS file to Dinh is the first step.
Once he has sent that upstream, you can (optionally) apply for a
kernel.org account that would let you host a git tree on kernel.org
or have a tree that you both have access to.

     Arnd


* Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper
From: Ian Kent @ 2026-05-20  7:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: NeilBrown, Horst Birthelmer, Amir Goldstein, Miklos Szeredi,
	Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner,
	linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer
In-Reply-To: <fglq7n2brxwdsu7and6nt6xpgdziua754yzgxkmd33pmk6tor4@noxa5ajva7wg>

On 19/5/26 17:12, Jan Kara wrote:
> On Mon 18-05-26 21:39:13, Ian Kent wrote:
>> On 18/5/26 16:19, Jan Kara wrote:
>>> Hi Ian,
>>>
>>> On Mon 18-05-26 10:55:43, Ian Kent wrote:
>>>> On 18/5/26 07:55, NeilBrown wrote:
>>>>> On Fri, 15 May 2026, Horst Birthelmer wrote:
>>>>> According to the email you linked, a problem arises when a directory has
>>>>> a great many negative children.  Code which walks the list of children
>>>>> (such as fsnotify) while holding a lock can suffer unpredictable delays
>>>>> and result in long lock-hold times.  So maybe a limit on negative
>>>>> dentries for any parent is what we really want.  That would be clumsy to
>>>>> implement I imagine.
>>>> But the notion of dropping the dentry in ->d_delete() on last dput() is
>>>> simple enough but did see regressions (the only other place in the VFS
>>>> besides dentry_kill() that the inode is unlinked from the dentry on
>>>> dput()). I wonder if the regression was related to the test itself
>>>> deliberately recreating deleted files and if that really is normal
>>>> behaviour. By itself that should prevent almost all negative dentries
>>>> being retained. Although file systems could do this as well (think XFS
>>>> inode recycling) it should be reasonable to require it be left to the
>>>> VFS.
>>>>
>>>> But even that's not enough given that, in my case, there would still be
>>>> around 4 million dentries in the LRU cache and in fsnotify there are
>>>> directory child traversals holding the parent i_lock "spinlock" that are
>>>> going to cause problems.
>>> Do you mean there are very many positive children of a directory?
>> Didn't quantify that.
>>
>> The symptom is the "Spinlock held for more than ... seconds" occurring in
>> the log. So there are certainly a lot of children in the list, but it's
>> an assumption the ratio of positive to negative entries is roughly the
>> same as the overall ratio in the dcache.
> OK, but that's not necessarily true. I have seen these complaints from the
> kernel but in all the cases I remember it was due to negative dentries
> accumulating in a particular directory. There are certain apps such as
> ElasticSearch which really do like creating huge amounts of negative
> dentries in one directory - they use hashes as filenames and use directory
> lookup instead of a DB table lookup and lookup lots of non-existent keys...

Umm ... that's a good point, I hadn't paid much attention to ENOENT result
lookups, I'll need to check on the life cycle of those, I think they do get
hashed. That has to be the other source of negative dentries that I've
neglected ...

>
>>>> so why is this traversal even retained in fsnotify?
>>> Not sure which traversal you mean but if you set watch on a parent, you
>>> have to walk all children to set PARENT_WATCHED flag so that you don't miss
>>> events on children...
>> Yes, that traversal is what I'm questioning ... again thanks.
>>
>> I think the function name is still fsnotify_set_children_dentry_flags()
>> in recent kernels, the subject of commit 172e422ffea2 I mentioned above.
> OK, thanks.
>
>> When you say miss events are you saying that accessing the parent dentry to
>> work out if the child needs to respond to an event is quite expensive in the
>> overall event processing context, that might make more sense to me ... or do
>> I completely not yet understand the reasoning behind the need for the flag?
> Close but not quite. The cost is the overhead of dget_parent() in
> fsnotify_parent() which is often a couple of cache cold loads and atomic
> instructions to find out we don't need to send any event for the current
> write(2) or read(2) call. It gets worse if there are many IOs happening to
> dentries in the same directory from multiple CPUs because instead of
> cache-cold loads you get a cacheline contention on the parent.
>
>>>>> But what if we move dentries to the end of the list when they become
>>>>> negative, and to the start of the list when they become positive?  Then
>>>>> code which walks the child list could simply abort on the first
>>>>> negative.
>>>>>
>>>>> I doubt that would be quite as easy as it sounds, but it would at least
>>>>> be more focused on the observed symptom rather than some whole-system
>>>>> number which only vaguely correlates with the observed symptom.
>>>>>
>>>>> Maybe a completely different approach: change children-walking code to
>>>>> drop and retake the lock (with appropriate validation) periodically.
>>>>> That too would address the specific symptom.
>>>> Another good question.
>>>>
>>>> I have assumed that dropping and re-taking the lock cannot be done but
>>>> this is a question I would like answered as well. Dropping and re-taking
>>>> lock would require, as Miklos pointed out to me off-list, recording the
>>>> list position with say a cursor, introducing unwanted complexity when it
>>>> would be better to accept the cost of a single extra access to the parent
>>>> flags (which I assume is one reason to set the flag in the child).
>>> The parent access is actually more expensive than you might think. Based on
>>> experience with past fsnotify related performance regression I expect some
>>> 20% performance hit for small tmpfs writes if you add unconditional parent
>>> access to the write path.
>> That sounds like a lot for what should be a memory access of an already in
>> memory structure since the parent must be accessed to traverse the list of
>> child entries. I clearly don't fully understand the implications of what
>> I'm saying but there has been mention of another context ...
> Parent dentry is of course in memory but often cache cold - you don't need
> the parent to do e.g. write(2) to an already open file. You seem to be
> somewhat confused about the child dentry list traversal (or maybe I'm
> misunderstanding) - that happens only when placing the notification mark
> but definitely not for each IO operation.

LOL, confusion is a pretty common state of mind for me!

I do get your point though and I am confusing the traversal with other
operations. I think this answers the question I've been asking (maybe
that wasn't obvious) about the reason for the traversal (ie. the reason
to maintain a flag in the child).

While I have looked at the code here I haven't absorbed it and I
definitely don't understand it, your continued patience is appreciated
and will be beneficial when I get time to look at it a bit closer. I
do still need to use a notifications mechanism to match up with Miklos's
statmount implementation to get the full benefit of that in user space,
if I ever get a chance to work on that again.

So it sounds like it would be worth while considering a traversal that's
based on taking a reference on each dentry rather than a spinlock for
the duration. It would be tricky though, for obvious reasons, like
children added during the traversal, added overhead of getting the next
entry reference, etc.

Ian
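
For readers following the list-ordering idea quoted above (negative
children moved to the tail, positive ones to the head, so that a child
walk can stop at the first negative entry), here is a minimal user-space
sketch. It is an illustration only, not kernel code: struct toy_dentry,
struct toy_dir and every toy_* helper are hypothetical names, and it
assumes the walk only needs to flag positive children.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry; NOT the kernel's struct dentry. */
struct toy_dentry {
	struct toy_dentry *prev, *next;	/* sibling links in the parent's list */
	bool negative;			/* true when no inode is attached */
	bool parent_watched;		/* models a per-child "parent is watched" flag */
};

/* A parent directory: a circular, doubly linked list of children. */
struct toy_dir {
	struct toy_dentry head;		/* sentinel, never a real child */
};

static void toy_dir_init(struct toy_dir *dir)
{
	dir->head.prev = dir->head.next = &dir->head;
	dir->head.negative = false;
	dir->head.parent_watched = false;
}

static void toy_link(struct toy_dentry *d, struct toy_dentry *prev,
		     struct toy_dentry *next)
{
	d->prev = prev;
	d->next = next;
	prev->next = d;
	next->prev = d;
}

static void toy_unlink(struct toy_dentry *d)
{
	d->prev->next = d->next;
	d->next->prev = d->prev;
}

/* Positive children are queued at the head, negative ones at the tail. */
static void toy_queue(struct toy_dir *dir, struct toy_dentry *d)
{
	if (d->negative)
		toy_link(d, dir->head.prev, &dir->head);
	else
		toy_link(d, &dir->head, dir->head.next);
}

static void toy_add_child(struct toy_dir *dir, struct toy_dentry *d,
			  bool negative)
{
	assert(dir && d);
	d->negative = negative;
	d->parent_watched = false;
	toy_queue(dir, d);
}

/* A state change requeues the child at the end matching its new state. */
static void toy_set_negative(struct toy_dir *dir, struct toy_dentry *d,
			     bool negative)
{
	toy_unlink(d);
	d->negative = negative;
	toy_queue(dir, d);
}

/*
 * Flag every positive child and stop at the first negative one: with the
 * ordering above, everything after it is negative as well.
 * Returns the number of children actually visited.
 */
static size_t toy_mark_children(struct toy_dir *dir)
{
	size_t visited = 0;

	for (struct toy_dentry *d = dir->head.next; d != &dir->head; d = d->next) {
		if (d->negative)
			break;
		d->parent_watched = true;
		visited++;
	}
	return visited;
}
```

The walk cost then scales with the number of positive children rather
than the total, at the price of a requeue on every positive/negative
transition. In the kernel that requeue would need its own locking, which
this sketch ignores.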



* Re: [PATCH v2 1/2] mm/memcontrol: add dmem charge/uncharge functions
From: Albert Esteve @ 2026-05-20  7:22 UTC (permalink / raw)
  To: Eric Chanudet
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Maarten Lankhorst, Maxime Ripard,
	Natalie Vock, Tejun Heo, Michal Koutný, Jonathan Corbet,
	Shuah Khan, cgroups, linux-mm, linux-kernel, dri-devel,
	T.J. Mercier, Christian König, Maxime Ripard, Dave Airlie,
	linux-doc
In-Reply-To: <20260519-cgroup-dmem-memcg-double-charge-v2-1-db4d1407062b@redhat.com>

On Tue, May 19, 2026 at 6:01 PM Eric Chanudet <echanude@redhat.com> wrote:
>
> Add mem_cgroup_dmem_charge() and mem_cgroup_dmem_uncharge() to allow
> dmem pool allocations to optionally be double-charged against the memory
> controller. Take the struct cgroup from the dmem pool's css as there is
> no convenient object exported to represent these allocations. These will
> resolve the effective memory css from that cgroup and perform the
> charge.
>
> Introduce a MEMCG_DMEM stat counter to memory.stat to make the cgroup's
> dmem charge visible.
>
> Signed-off-by: Eric Chanudet <echanude@redhat.com>

Reviewed-by: Albert Esteve <aesteve@redhat.com>

> ---
>  include/linux/memcontrol.h | 16 ++++++++++++
>  mm/memcontrol.c            | 65 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 81 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index dc3fa687759b45748b2acee6d7f43da325eb50c1..8e1d49b87fb64e6114f3eb920293e14920290fe7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -39,6 +39,7 @@ enum memcg_stat_item {
>         MEMCG_ZSWAP_B,
>         MEMCG_ZSWAPPED,
>         MEMCG_ZSWAP_INCOMP,
> +       MEMCG_DMEM,
>         MEMCG_NR_STAT,
>  };
>
> @@ -1872,6 +1873,21 @@ static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
>  }
>  #endif
>
> +#if defined(CONFIG_MEMCG) && defined(CONFIG_CGROUP_DMEM)
> +bool mem_cgroup_dmem_charge(struct cgroup *cgrp, unsigned int nr_pages,
> +                           gfp_t gfp_mask);
> +void mem_cgroup_dmem_uncharge(struct cgroup *cgrp, unsigned int nr_pages);
> +#else
> +static inline bool mem_cgroup_dmem_charge(struct cgroup *cgrp,
> +                                         unsigned int nr_pages, gfp_t gfp_mask)
> +{
> +       return true;
> +}
> +static inline void mem_cgroup_dmem_uncharge(struct cgroup *cgrp,
> +                                           unsigned int nr_pages)
> +{
> +}
> +#endif
>
>  /* Cgroup v1-related declarations */
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c03d4787d466803db49cdaa90e6d6ba426b7afe2..91a7ac16b6eac2d6c3700b6885a068bf8b640706 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -433,6 +433,7 @@ static const unsigned int memcg_stat_items[] = {
>         MEMCG_ZSWAP_B,
>         MEMCG_ZSWAPPED,
>         MEMCG_ZSWAP_INCOMP,
> +       MEMCG_DMEM,
>  };
>
>  #define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items)
> @@ -1606,6 +1607,9 @@ static const struct memory_stat memory_stats[] = {
>  #ifdef CONFIG_NUMA_BALANCING
>         { "pgpromote_success",          PGPROMOTE_SUCCESS       },
>  #endif
> +#ifdef CONFIG_CGROUP_DMEM
> +       { "dmem",                       MEMCG_DMEM              },
> +#endif
>  };
>
>  /* The actual unit of the state item, not the same as the output unit */
> @@ -5909,6 +5913,67 @@ static struct cftype zswap_files[] = {
>  };
>  #endif /* CONFIG_ZSWAP */
>
> +#ifdef CONFIG_CGROUP_DMEM
> +/**
> + * mem_cgroup_dmem_charge - charge memcg for a dmem pool allocation
> + * @cgrp: cgroup of the dmem pool
> + * @nr_pages: number of pages to charge
> + * @gfp_mask: reclaim mode
> + *
> + * Charges @nr_pages to @memcg. Returns %true if the charge fit within
> + * @memcg's configured limit, %false if it doesn't.
> + */
> +bool mem_cgroup_dmem_charge(struct cgroup *cgrp, unsigned int nr_pages,
> +                           gfp_t gfp_mask)
> +{
> +       struct cgroup_subsys_state *mem_css;
> +       struct mem_cgroup *memcg;
> +
> +       /* CGROUP_DMEM and MEMCG guarantees this cannot be NULL. */
> +       mem_css = cgroup_get_e_css(cgrp, &memory_cgrp_subsys);
> +
> +       /* Use the memcg, if any, of the dmem cgroup. */
> +       memcg = mem_cgroup_from_css(mem_css);
> +       if (!memcg || mem_cgroup_is_root(memcg)) {
> +               css_put(mem_css);
> +               return false;
> +       }
> +
> +       if (try_charge_memcg(memcg, gfp_mask, nr_pages)) {
> +               css_put(mem_css);
> +               return false;
> +       }
> +
> +       mod_memcg_state(memcg, MEMCG_DMEM, nr_pages);
> +       css_put(mem_css);
> +       return true;
> +}
> +
> +/**
> + * mem_cgroup_dmem_uncharge - uncharge memcg from a dmem pool allocation
> + * @cgrp: cgroup of the dmem pool
> + * @nr_pages: number of pages to uncharge
> + */
> +void mem_cgroup_dmem_uncharge(struct cgroup *cgrp, unsigned int nr_pages)
> +{
> +       struct cgroup_subsys_state *mem_css;
> +       struct mem_cgroup *memcg;
> +
> +       /* CGROUP_DMEM and MEMCG guarantees this cannot be NULL. */
> +       mem_css = cgroup_get_e_css(cgrp, &memory_cgrp_subsys);
> +
> +       memcg = mem_cgroup_from_css(mem_css);
> +       if (!memcg || mem_cgroup_is_root(memcg)) {
> +               css_put(mem_css);
> +               return;
> +       }
> +
> +       mod_memcg_state(memcg, MEMCG_DMEM, -nr_pages);
> +       refill_stock(memcg, nr_pages);
> +       css_put(mem_css);
> +}
> +#endif /* CONFIG_CGROUP_DMEM */
> +
>  static int __init mem_cgroup_swap_init(void)
>  {
>         if (mem_cgroup_disabled())
>
> --
> 2.52.0
>
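
As a rough illustration of the pairing this patch expects from callers,
the following user-space model mimics the charge/uncharge contract
described in the kernel-doc above: a charge either fits within the limit
and bumps the dmem counter, or fails and changes nothing. struct
toy_memcg and the toy_* functions are hypothetical stand-ins; the real
functions resolve the memcg from a struct cgroup and go through
try_charge_memcg(), which this sketch does not model.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state touched by a dmem charge. */
struct toy_memcg {
	unsigned long usage;	/* pages currently charged, always <= limit */
	unsigned long limit;	/* configured limit, in pages */
	long dmem_stat;		/* models the MEMCG_DMEM counter */
};

/* Returns true if the charge fit within the limit, false otherwise. */
static bool toy_dmem_charge(struct toy_memcg *memcg, unsigned int nr_pages)
{
	if (nr_pages > memcg->limit - memcg->usage)
		return false;	/* nothing is modified on failure */

	memcg->usage += nr_pages;
	memcg->dmem_stat += (long)nr_pages;
	return true;
}

/* Must mirror an earlier successful charge of the same size. */
static void toy_dmem_uncharge(struct toy_memcg *memcg, unsigned int nr_pages)
{
	assert(nr_pages <= memcg->usage);
	memcg->usage -= nr_pages;
	memcg->dmem_stat -= (long)nr_pages;
}
```

The point of the model is only the invariant: after every successful
charge has been matched by an uncharge of the same size, both the usage
and the dmem counter are back at their starting values.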



* Re: [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin
From: Fujunjie @ 2026-05-20  8:05 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Andrew Morton, Chris Li, Kairui Song, Johannes Weiner, Nhat Pham,
	Yosry Ahmed, linux-mm, linux-kernel, linux-doc, Jonathan Corbet,
	David Hildenbrand, Ryan Roberts, Barry Song, Baolin Wang,
	Chengming Zhou, Baoquan He, Lorenzo Stoakes
In-Reply-To: <CAEmasaV=L8w4dF7ja7GkDu_7U5i+aVVH1a1qsXgtFr3wuWNOPA@mail.gmail.com>



On 5/19/2026 10:49 PM, Alexandre Ghiti wrote:
> Hi,
> 
> On Tue, May 12, 2026 at 9:46 AM Fujunjie <fujunjie1@qq.com> wrote:
>>
>>>
>>
>>
>> On 5/12/2026 12:20 PM, Alexandre Ghiti wrote:
>>> So I have been working on the exact same thing for some weeks now. My work is based on Usama's series [1].
>>>
>>> The problem with large folio swapin is that it can create swap thrashing: to swap in a large folio, swap out may be necessary, as reported in [2].
>>>
>>> I implemented quite a few throttling algorithms on top to try to avoid this issue and so far, I have had mixed/inconsistent results.
>>>
>>> How did you test this series? Did you encounter thrashing? Do you have performance numbers?
>>>
>>> Happy to talk more about this, thanks for your series!
>>>
>>> Alex
>>>
>>> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
>>> [2] https://lore.kernel.org/all/SJ0PR11MB5678A864244B09FDE4D914EEC9402@SJ0PR11MB5678.namprd11.prod.outlook.com/
>>
>> Thanks Alexandre.
>>
>> My RFC only had correctness testing so far. I tested the all-zswap path
>> and fallback cases under QEMU, but I don't have bare-metal
>> performance numbers yet.
>>
>> If you are already actively working on this, I don't want to duplicate the
>> same effort. I will pause this RFC for now and wait for your series.
>>
>> After your series is posted, I will take another look and see if there is
>> anything that still needs follow-up work.
>>
>> Thanks for letting me know.
> 
> Sorry for the late answer. I took a break because of the inconsistent
> results that I had, perhaps a fresh look could help so no worries if
> you give it a try on your end.
> 
> Happy to discuss further results if you continue.
> 
> Alex


Thanks Alexandre.

That sounds good. I will take a fresh look on my side.
If I find something useful, I will discuss with you.

Best regards,
fujunjie



* Re: [PATCH 00/12] misc/syncobj: add /dev/syncobj device
From: Christian König @ 2026-05-20  8:08 UTC (permalink / raw)
  To: Xaver Hugl
  Cc: Julian Orth, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Sumit Semwal, Jonathan Corbet,
	Shuah Khan, Arnd Bergmann, Greg Kroah-Hartman, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, linux-doc,
	wayland-devel, Michel Dänzer
In-Reply-To: <CAFZQkGz=UJqaJ_eTwKBy1pAg5xL+PLibh7W1vYf7JD7Jrx-LZQ@mail.gmail.com>

On 5/19/26 19:08, Xaver Hugl wrote:
>>> The part where we get this independent of attached hardware is quite
>>> important for us though, since we can't just ignore explicit sync once
>>> the device we previously imported the syncobj into is disconnected.
>>
>> Can you elaborate more on this?
> 
> In Wayland, the client is allowed to attach dmabuf and syncobj
> independently, they don't have to be from the same device (and the
> compositor wouldn't be able to verify the opposite anyways). The
> compositor will usually import both into the same drm device, but
> especially with compositors that render on multiple devices, that's
> not necessarily the case either.
> 
> If for example we had a system with one internal GPU and one external
> GPU, the client renders on the internal GPU and the compositor uses
> the external one. Now when the user yanks the USB C cable, afaiu

Well I would say the other way around is a pretty common use case.

In other words the compositor uses the internal GPU for composing and displaying the picture, and the client uses the external GPU for fast rendering.

> - the buffers from the client stay valid

Buffers from the hot plugged GPU don't stay valid. Accessing CPU mappings either results in a SIGBUS or is redirected to a dummy page.

DMA operations to hot plugged buffers from other GPUs (or rather more general other devices) are waited on before the underlying resource is removed (e.g. system memory or PCIe address space or whatever is backing that).

But no new DMA operations are usually permitted to start.

> - the syncobj stays valid on the client side
> - the syncobj becomes invalid on the compositor side

Nope that's not correct. The syncobj itself stays valid even if you completely hot plug the device.

It can just be that the fences inside the syncobj are terminated with an error.

> "invalid" there means either
> - the acquire point of the client is marked as signaled, before
> rendering on the client side is completed
> - the acquire point of the client is never signaled. Since the
> compositor waits for the acquire point, the Wayland surface is stuck
> forever

Both of those would be a *massive* violation of documented kernel rules for hot-plugging which could lead to random data corruption and/or deadlocks.

If you see any HW driver showing behavior like that please open up a bug report and ping the relevant maintainers immediately.

When a hotplug happens all operations of the device should return an -ENODEV error, even when exposed to other devices/application through syncobj or syncfile.

One problem is that only syncfile allows for querying such error codes at the moment. We have patches pending to add that to syncobj as well, but we lack a compositor that supports it as a userspace client.

> Afaik the latter is currently the case. The former wouldn't be much
> better though, not when it's preventable.
> 
> This is admittedly an edge case, but GPU hotunplug is something we try
> to support as well as possible in Plasma, and all the edge cases cause
> a lot of problems in combination and are a lot of headaches to handle
> (or really work around) in the compositor.

Well exactly that design is used in the Tesla 3 infotainment system for example.

So GPU hotplug is actually a pretty common use case.

> Another edge case is when the client asks the compositor to import the
> syncobj, which can fail when a hotunplug is in process, and ends up
> disconnecting the client for no fault of either client or compositor.

Well the question here is if the device the compositor is using or the client is using is gone?

If the client device is hot removed the compositor should be perfectly capable to import the syncobj.

If the compositor device is gone then you don't have a device to display anything any more, so generating the next frame doesn't seem to make sense either.

It could be that you want the compositor to be kept alive even when the display device is gone, switching over to vkms or whatever, so that a VNC session or other remote desktop still works.

>>>>> 3. It removes the need to translate between syncobjs fds and handles.
>>>>
>>>> That's a pretty big no-go as well. The differentiation between FDs and handles is completely intentional.
>>> Could you expand on why it's needed? For compositors, the handle is
>>> just an intermediary thing when translating between file descriptors.
>>
>> Well what we could do is to add an IOCTL to directly attach a syncobj file descriptor to an eventfd.
> That would be nice.

Take a look at drm_syncobj_file_fops and how drm_syncobj_add_eventfd() is used. Adding that functionality shouldn't be more than a typing exercise.

Do I see it right that this would already solve most problems on the compositor side?

Regards,
Christian.

> 
> - Xaver
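
However the eventfd ends up attached to the syncobj, the consuming side
would presumably look like any other eventfd user. Below is a minimal
sketch, assuming a Linux system; wait_eventfd() and fake_signal() are
hypothetical helper names, and fake_signal() merely stands in for the
kernel signalling the eventfd once the timeline point is ready.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for the eventfd to become readable, then drain
 * its counter. Returns true only if an event was actually consumed.
 */
static bool wait_eventfd(int efd, int timeout_ms)
{
	struct pollfd pfd = { .fd = efd, .events = POLLIN };
	uint64_t count;

	if (poll(&pfd, 1, timeout_ms) <= 0 || !(pfd.revents & POLLIN))
		return false;

	/* An eventfd read always transfers exactly 8 bytes on success. */
	return read(efd, &count, sizeof(count)) == (ssize_t)sizeof(count);
}

/* Stand-in for the kernel side: bump the eventfd counter by one. */
static bool fake_signal(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one);
}
```

A compositor would normally add the eventfd to its existing event loop
instead of calling poll() on it directly; the helper only shows the
readiness and drain semantics.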



* Re: [PATCH v2 1/4] cpufreq: Extract cpufreq_policy_init_qos() function
From: Jie Zhan @ 2026-05-20  8:10 UTC (permalink / raw)
  To: Pierre Gondois, linux-kernel
  Cc: Lifeng Zheng, Ionela Voinescu, Sumit Gupta, Zhongqiu Han,
	Rafael J. Wysocki, Viresh Kumar, Jonathan Corbet, Shuah Khan,
	Huang Rui, Mario Limonciello, Perry Yuan, K Prateek Nayak,
	Srinivas Pandruvada, Len Brown, Saravana Kannan, linux-pm,
	linux-doc
In-Reply-To: <20260511135538.522653-2-pierre.gondois@arm.com>



On 5/11/2026 9:55 PM, Pierre Gondois wrote:
> Extract the QoS related logic from cpufreq_policy_online()
> to make the function shorter/simpler.
> 
> The logic is placed in cpufreq_policy_init_qos() and is
> now executed right after the following calls:
> - cpufreq_driver->init()
> - cpufreq_table_validate_and_sort()
> 
> This helps prepare the following patches that will,
> in cpufreq_policy_init_qos():
> - treat the policy->min/max values set by drivers as QoS requests.
> - set a default policy->min/max value to all policies.
> 
> No functional change.
> 
LGTM, thanks!
Reviewed-by: Jie Zhan <zhanjie9@hisilicon.com>
> Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
> ---
>  drivers/cpufreq/cpufreq.c | 53 +++++++++++++++++++++++----------------
>  1 file changed, 32 insertions(+), 21 deletions(-)
> 
[ ... ]


* Re: [PATCH v2 2/4] cpufreq: Set default policy->min/max values for all drivers
From: Jie Zhan @ 2026-05-20  8:12 UTC (permalink / raw)
  To: Pierre Gondois, linux-kernel
  Cc: Lifeng Zheng, Ionela Voinescu, Sumit Gupta, Zhongqiu Han,
	Rafael J. Wysocki, Viresh Kumar, Jonathan Corbet, Shuah Khan,
	Huang Rui, Mario Limonciello, Perry Yuan, K Prateek Nayak,
	Srinivas Pandruvada, Len Brown, Saravana Kannan, linux-pm,
	linux-doc
In-Reply-To: <20260511135538.522653-3-pierre.gondois@arm.com>



On 5/11/2026 9:55 PM, Pierre Gondois wrote:
> Some drivers set policy->min/max in their .init() callback.
> cpufreq_set_policy() will ultimately override them through:
> cpufreq_policy_online()
> \-cpufreq_init_policy()
>   \-cpufreq_set_policy()
>     \-/* Set policy->min/max */
> Thus the policy min/max values provided are only temporary.
> 
> There is an exception if CPUFREQ_NEED_INITIAL_FREQ_CHECK is set and:
> cpufreq_policy_online()
> \-__cpufreq_driver_target()
>   \-cpufreq_driver->target()
> 
> To prepare for a following patch that will remove all
> policy->min/max initialization in the driver .init() callback
> if the min/max value is equal to the cpuinfo.min/max_freq,
> set a default policy->min/max value for all drivers.
> 
Reviewed-by: Jie Zhan <zhanjie9@hisilicon.com> 
> Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
> ---
>  drivers/cpufreq/cpufreq.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
[ ... ]


* Re: [PATCH 00/12] misc/syncobj: add /dev/syncobj device
From: Michel Dänzer @ 2026-05-20  8:13 UTC (permalink / raw)
  To: Christian König, Xaver Hugl
  Cc: Julian Orth, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Sumit Semwal, Jonathan Corbet,
	Shuah Khan, Arnd Bergmann, Greg Kroah-Hartman, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, linux-doc,
	wayland-devel
In-Reply-To: <dff60378-4e47-4753-8878-feec6e1c2690@amd.com>

On 5/19/26 18:00, Christian König wrote:
> On 5/19/26 17:31, Xaver Hugl wrote:
>> Am Di., 19. Mai 2026 um 15:29 Uhr schrieb Christian König
>> <christian.koenig@amd.com>:
>>>> 1. This series makes the ability to manipulate syncobjs available
>>>> independently of attached hardware.
>>>> 2. It makes it available under a consistent path /dev/syncobj.
>>>
>>> Exactly that is a big no-go. This has to be under /dev/dri.
>> FWIW udmabuf is also under /dev directly, but I don't think any
>> compositor developer would complain about a different path.
>> What are the rules for that? Could this simply be put in /dev/dri/syncobj?
> 
> The syncobj are actually the DRM specific way of doing things. The general kernel wide way is to use sync files (see drivers/dma-buf/sync_file.c).
> 
> But there have already been tons of problems with those sync files. E.g. they don't support your use case at all since they don't have wait before submit behavior.
> 
> So there are already ways to do this, but the Linux kernel so far told everybody that this is forbidden. The DRM syncobj wait before signal functionality is much better, but it is basically the second attempt at this.

I'm not quite sure what you're getting at here, just to be clear though:

While the syncobj Wayland protocol extension supports wait-before-submit behaviour at the Wayland protocol level, it doesn't need or cause wait-before-submit behaviour for DMA fences in the kernel. The usual rules apply to fences attached to syncobj timeline points. The wait-before-submit behaviour at the Wayland protocol level comes from allowing submit before a fence is attached to the acquire timeline point.

(It took me a while to realize this distinction, before which I mistakenly thought the kernel's DMA fence rules would prohibit wait-before-submit behaviour at the Wayland protocol level as well)


-- 
Earthling Michel Dänzer       \        GNOME / Xwayland / Mesa developer
https://redhat.com             \               Libre software enthusiast


* Re: [PATCH v2 3/4] cpufreq: Remove driver default policy->min/max init
From: Jie Zhan @ 2026-05-20  8:15 UTC (permalink / raw)
  To: Pierre Gondois, linux-kernel
  Cc: Lifeng Zheng, Ionela Voinescu, Sumit Gupta, Zhongqiu Han,
	Rafael J. Wysocki, Viresh Kumar, Jonathan Corbet, Shuah Khan,
	Huang Rui, Mario Limonciello, Perry Yuan, K Prateek Nayak,
	Srinivas Pandruvada, Len Brown, Saravana Kannan, linux-pm,
	linux-doc
In-Reply-To: <20260511135538.522653-4-pierre.gondois@arm.com>



On 5/11/2026 9:55 PM, Pierre Gondois wrote:
> Prior to [1], drivers were setting policy->min/max and
> the value was used as a QoS constraint. After that change,
> the values were only temporarily used: cpufreq_set_policy()
> ultimately overriding them through:
> cpufreq_policy_online()
> \-cpufreq_init_policy()
>   \-cpufreq_set_policy()
>     \-/* Set policy->min/max */
> 
> This patch reinstates the initial behaviour. This will allow
> drivers to request min/max QoS frequencies if desired.
> For instance, the cppc driver advertises a lowest non-linear
> frequency, which should be used as a min QoS value.
> 
> To avoid having drivers setting policy->min/max to default
> values which are considered as QoS values (i.e. the reason
> why [1] was introduced), remove the initialization of
> policy->min/max in .init() callbacks wherever the
> policy->min/max values are identical to the
> policy->cpuinfo.min/max_freq.
> 
> Indeed, the previous patch ("cpufreq: Set default
> policy->min/max values for all drivers") makes this initialization
> redundant.
> 
> The only drivers where these values are different are:
> - gx-suspmod.c (min)
> - cppc-cpufreq.c (min)
> - longrun.c
> 
> [1]
> commit 521223d8b3ec ("cpufreq: Fix initialization of min and
> max frequency QoS requests")
> 
Acked-by: Jie Zhan <zhanjie9@hisilicon.com>
for the CPPC part. The rest looks fine to me as well, but I may have missed something there.
> Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
> ---
>  drivers/cpufreq/amd-pstate.c      | 14 ++++++--------
>  drivers/cpufreq/cppc_cpufreq.c    |  5 ++---
>  drivers/cpufreq/cpufreq-nforce2.c |  4 ++--
>  drivers/cpufreq/freq_table.c      |  7 +++----
>  drivers/cpufreq/gx-suspmod.c      |  2 +-
>  drivers/cpufreq/intel_pstate.c    |  3 ---
>  drivers/cpufreq/pcc-cpufreq.c     | 10 ++++------
>  drivers/cpufreq/pxa3xx-cpufreq.c  |  5 ++---
>  drivers/cpufreq/sh-cpufreq.c      |  6 ++----
>  drivers/cpufreq/virtual-cpufreq.c |  5 +----
>  10 files changed, 23 insertions(+), 38 deletions(-)
> 
[ ... ]


* Re: [PATCH v2 4/4] cpufreq: Use policy->min/max init as QoS request
From: Jie Zhan @ 2026-05-20  8:38 UTC (permalink / raw)
  To: Pierre Gondois, linux-kernel
  Cc: Lifeng Zheng, Ionela Voinescu, Sumit Gupta, Zhongqiu Han,
	Rafael J. Wysocki, Viresh Kumar, Jonathan Corbet, Shuah Khan,
	Huang Rui, Mario Limonciello, Perry Yuan, K Prateek Nayak,
	Srinivas Pandruvada, Len Brown, Saravana Kannan, linux-pm,
	linux-doc
In-Reply-To: <20260511135538.522653-5-pierre.gondois@arm.com>



On 5/11/2026 9:55 PM, Pierre Gondois wrote:
> Consider policy->min/max being set in the driver .init()
> callback as a QoS request. Impacted drivers are:
> - gx-suspmod.c (min)
> - cppc-cpufreq.c (min)
> - longrun.c (min/max)
> 
> Update the documentation accordingly.
> 
> Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
> ---
>  Documentation/cpu-freq/cpu-drivers.rst | 10 ++++++++--
>  drivers/cpufreq/cpufreq.c              | 12 ++++++++++--
>  2 files changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/cpu-freq/cpu-drivers.rst b/Documentation/cpu-freq/cpu-drivers.rst
> index c5635ac3de547..ab4f3c0f3a89b 100644
> --- a/Documentation/cpu-freq/cpu-drivers.rst
> +++ b/Documentation/cpu-freq/cpu-drivers.rst
> @@ -114,8 +114,14 @@ Then, the driver must fill in the following values:
>  |policy->cur			    | The current operating frequency of   |
>  |				    | this CPU (if appropriate)		   |
>  +-----------------------------------+--------------------------------------+
> -|policy->min,			    |					   |
> -|policy->max,			    |					   |
> +|policy->min			    | If set by the driver in ->init(),    |
> +|				    | used as initial minimum frequency	   |
> +|				    | QoS request.			   |
> ++-----------------------------------+--------------------------------------+
> +|policy->max			    | If set by the driver in ->init(),    |
> +|				    | used as initial maximum frequency	   |
> +|				    | QoS request.			   |
> ++-----------------------------------+--------------------------------------+
Hi Pierre,

Trivial bit: add the general meaning alongside its driver usage at the init
stage, and mention it defaults to cpuinfo_min/max_freq if not set?

I mean something like:
The minimum/maximum scaling frequency.  If set by the driver in ->init(),
used as initial minimum/maximum frequency QoS request; otherwise, follow
policy->cpuinfo.min/max_freq.
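
To make the suggested wording concrete, here is a minimal userspace sketch of
the semantics as I read them from this series. The names `sketch_policy` and
`sketch_initial_qos()` are hypothetical and this is not the cpufreq core code;
treating 0 as "not set by the driver" is also an assumption made only for the
illustration.

```c
/* Hypothetical stand-in for the relevant fields of struct cpufreq_policy. */
struct sketch_policy {
	unsigned int cpuinfo_min_freq;	/* hardware limits, in kHz */
	unsigned int cpuinfo_max_freq;
	unsigned int min;		/* 0 means "not set in ->init()" (assumption) */
	unsigned int max;
};

/*
 * Pick the initial min/max frequency QoS request: use the value the driver
 * set in ->init() if there is one, otherwise follow the cpuinfo limits.
 */
static void sketch_initial_qos(const struct sketch_policy *p,
			       unsigned int *qos_min, unsigned int *qos_max)
{
	*qos_min = p->min ? p->min : p->cpuinfo_min_freq;
	*qos_max = p->max ? p->max : p->cpuinfo_max_freq;
}
```

With a driver that only sets policy->min (as cppc does for the lowest
non-linear frequency), the max request would simply follow cpuinfo.max_freq.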

Thanks,
Jie
>  |policy->policy and, if necessary,  |					   |
>  |policy->governor		    | must contain the "default policy" for|
>  |				    | this CPU. A few moments later,       |
[ ... ]


* Re: [PATCH v3 0/5] Support the FEAT_HDBSS introduced in Armv9.5
From: Tian Zheng @ 2026-05-20  8:51 UTC (permalink / raw)
  To: Will Deacon, maz, oupton, catalin.marinas, corbet, pbonzini,
	Tian Zheng
  Cc: kernel-team, yuzenghui, wangzhou1, liuyonglong, yezhenyu2,
	joey.gouly, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, skhan, suzuki.poulose, leo.bras, Jonathan Cameron
In-Reply-To: <177918656142.736362.17906576792384645789.b4-ty@kernel.org>



On 5/19/2026 11:23 PM, Will Deacon wrote:
> On Wed, 25 Feb 2026 12:04:16 +0800, Tian Zheng wrote:
>> This series of patches adds support for the Hardware Dirty state tracking
>> Structure (HDBSS) feature, which is introduced by the ARM architecture
>> in the DDI0601(ID121123) version.
>>
>> The HDBSS feature is an extension to the architecture that enhances
>> tracking translation table descriptors' dirty state, identified as
>> FEAT_HDBSS. This feature utilizes hardware assistance to achieve dirty
>> page tracking, aiming to significantly reduce the overhead of scanning
>> for dirty pages.
>>
>> [...]
> 
> Applied sysreg definitions to arm64 (for-next/sysregs), thanks!
> 
> [1/5] arm64/sysreg: Add HDBSS related register information
>        https://git.kernel.org/arm64/c/72f7be0c2e30
> 
> Cheers,

Thanks!
Tian



* Re: [PATCH v2 1/2] tools/mm: add a standalone GUP microbenchmark
From: Mike Rapoport @ 2026-05-20  8:55 UTC (permalink / raw)
  To: Sarthak Sharma
  Cc: Andrew Morton, David Hildenbrand, Jonathan Corbet,
	Jason Gunthorpe, John Hubbard, Peter Xu, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Shuah Khan, linux-mm, linux-kselftest, linux-kernel,
	linux-doc, Mark Brown
In-Reply-To: <20260519120506.184512-2-sarthak.sharma@arm.com>

(added broonie)

Hi,

On Tue, May 19, 2026 at 05:35:05PM +0530, Sarthak Sharma wrote:
> Add a command-line tool for benchmarking get_user_pages fast-path
> (GUP_FAST), pin_user_pages fast-path (PIN_FAST), and pin_user_pages
> longterm (PIN_LONGTERM) via the CONFIG_GUP_TEST debugfs interface.
> 
> When invoked without arguments, gup_bench runs the same matrix of
> configurations as run_gup_matrix() in run_vmtests.sh: all three GUP
> commands across read/write, private/shared mappings, and a range of
> page counts, with THP on/off for regular mappings and hugetlb for huge
> page mappings.
> 
> This tool is a mix of reused and new logic. The mapping/setup path comes
> from selftests/mm/gup_test.c, while the default benchmark matrix matches
> run_gup_matrix() in run_vmtests.sh. The standalone CLI and tools/mm
> integration are added here so tools/mm does not depend on kselftest.
> 
> Add gup_bench to BUILD_TARGETS and INSTALL_TARGETS in tools/mm/Makefile,
> and ignore the resulting binary in tools/mm/.gitignore. While here, also
> add the missing thp_swap_allocator_test entry to .gitignore.
> 
> Add tools/mm/gup_bench.c to the GUP entry in MAINTAINERS.
> 
> Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: Sarthak Sharma <sarthak.sharma@arm.com>
> ---
>  MAINTAINERS          |   1 +
>  tools/mm/.gitignore  |   2 +
>  tools/mm/Makefile    |   6 +-
>  tools/mm/gup_bench.c | 491 +++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 497 insertions(+), 3 deletions(-)
>  create mode 100644 tools/mm/gup_bench.c

...
 
> +/*
> + * Local HugeTLB setup helpers for gup_bench.
> + *
> + * These helpers were copied from tools/testing/selftests/mm/ and adjusted to
> + * remove the ksft formatting. Keep this copy local so tools/mm does not
> + * depend on ksft output behavior.
> + */

It looks like selftests of at least 5 subsystems besides mm use hugetlb:

$ git grep -l "Hugepagesize:" tools/testing/selftests/ | grep -v "selftests/mm"
tools/testing/selftests/arm64/mte/check_hugetlb_options.c
tools/testing/selftests/cgroup/test_hugetlb_memcg.c
tools/testing/selftests/kvm/lib/test_util.c
tools/testing/selftests/memfd/common.c
tools/testing/selftests/net/tcp_mmap.c

It seems that we need to better share the common code in
tools/testing/selftests.

And adding another copy of the hugetlb detection and setup code does not
seem like a great idea.

> +
> +static unsigned int psize(void)
> +{
> +	static unsigned int __page_size;
> +
> +	if (!__page_size)
> +		__page_size = sysconf(_SC_PAGESIZE);
> +	return __page_size;
> +}
> +
> +static unsigned long default_huge_page_size(void)
> +{
> +	FILE *f = fopen("/proc/meminfo", "r");
> +	unsigned long hpage_size = 0;
> +	char buf[256];
> +
> +	if (!f)
> +		return 0;
> +	while (fgets(buf, sizeof(buf), f)) {
> +		if (sscanf(buf, "Hugepagesize:       %lu kB", &hpage_size) == 1)
> +			break;
> +	}
> +	fclose(f);
> +	hpage_size <<= 10;
> +	return hpage_size;
> +}

-- 
Sincerely yours,
Mike.


* Re: [PATCH v2 1/2] tools/mm: add a standalone GUP microbenchmark
From: Dev Jain @ 2026-05-20  9:02 UTC (permalink / raw)
  To: Mike Rapoport, Sarthak Sharma
  Cc: Andrew Morton, David Hildenbrand, Jonathan Corbet,
	Jason Gunthorpe, John Hubbard, Peter Xu, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Shuah Khan, linux-mm, linux-kselftest, linux-kernel,
	linux-doc, Mark Brown
In-Reply-To: <ag13GbKcLMIoHOHj@kernel.org>



On 20/05/26 2:25 pm, Mike Rapoport wrote:
> (added broonie)
> 
> Hi,
> 
> On Tue, May 19, 2026 at 05:35:05PM +0530, Sarthak Sharma wrote:
>> Add a command-line tool for benchmarking get_user_pages fast-path
>> (GUP_FAST), pin_user_pages fast-path (PIN_FAST), and pin_user_pages
>> longterm (PIN_LONGTERM) via the CONFIG_GUP_TEST debugfs interface.
>>
>> When invoked without arguments, gup_bench runs the same matrix of
>> configurations as run_gup_matrix() in run_vmtests.sh: all three GUP
>> commands across read/write, private/shared mappings, and a range of
>> page counts, with THP on/off for regular mappings and hugetlb for huge
>> page mappings.
>>
>> This tool is a mix of reused and new logic. The mapping/setup path comes
>> from selftests/mm/gup_test.c, while the default benchmark matrix matches
>> run_gup_matrix() in run_vmtests.sh. The standalone CLI and tools/mm
>> integration are added here so tools/mm does not depend on kselftest.
>>
>> Add gup_bench to BUILD_TARGETS and INSTALL_TARGETS in tools/mm/Makefile,
>> and ignore the resulting binary in tools/mm/.gitignore. While here, also
>> add the missing thp_swap_allocator_test entry to .gitignore.
>>
>> Add tools/mm/gup_bench.c to the GUP entry in MAINTAINERS.
>>
>> Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
>> Signed-off-by: Sarthak Sharma <sarthak.sharma@arm.com>
>> ---
>>  MAINTAINERS          |   1 +
>>  tools/mm/.gitignore  |   2 +
>>  tools/mm/Makefile    |   6 +-
>>  tools/mm/gup_bench.c | 491 +++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 497 insertions(+), 3 deletions(-)
>>  create mode 100644 tools/mm/gup_bench.c
> 
> ...
>  
>> +/*
>> + * Local HugeTLB setup helpers for gup_bench.
>> + *
>> + * These helpers were copied from tools/testing/selftests/mm/ and adjusted to
>> + * remove the ksft formatting. Keep this copy local so tools/mm does not
>> + * depend on ksft output behavior.
>> + */
> 
> It looks like selftests of at least 5 subsystems besides mm use hugetlb:
> 
> $ git grep -l "Hugepagesize:" tools/testing/selftests/ | grep -v "selftests/mm"
> tools/testing/selftests/arm64/mte/check_hugetlb_options.c
> tools/testing/selftests/cgroup/test_hugetlb_memcg.c
> tools/testing/selftests/kvm/lib/test_util.c
> tools/testing/selftests/memfd/common.c
> tools/testing/selftests/net/tcp_mmap.c
> 
> It seems that we need to better share the common code in
> tools/testing/selftests.
> 
> And adding another copy of the hugetlb detection and setup code does not
> seem like a great idea.


Does it sound too insane to just do some sort of #include "../testing/selftests/mm/..."
to use the common helpers?
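
For illustration only, the shared part could be as small as a header-only
helper that prints nothing, so it carries no ksft dependency. The names
`parse_hugepagesize_line()` and `default_hugepage_size_quiet()` are
hypothetical, and where such a header would live is left open; this is a
sketch, not existing selftests code.

```c
#include <stdio.h>

/*
 * Parse one /proc/meminfo line. Returns 1 and stores the size in bytes
 * in *bytes if the line is the "Hugepagesize:" entry, 0 otherwise.
 */
static inline int parse_hugepagesize_line(const char *line,
					  unsigned long *bytes)
{
	unsigned long kb;

	if (sscanf(line, "Hugepagesize: %lu kB", &kb) != 1)
		return 0;
	*bytes = kb << 10;	/* kB to bytes */
	return 1;
}

/* Default huge page size in bytes, or 0 if it cannot be determined. */
static inline unsigned long default_hugepage_size_quiet(void)
{
	FILE *f = fopen("/proc/meminfo", "r");
	unsigned long bytes = 0;
	char buf[256];

	if (!f)
		return 0;
	while (fgets(buf, sizeof(buf), f)) {
		if (parse_hugepagesize_line(buf, &bytes))
			break;
	}
	fclose(f);
	return bytes;
}
```

Callers that want ksft-style reporting could wrap the return value
themselves, which would keep the output policy out of the shared code.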

> 
>> +
>> +static unsigned int psize(void)
>> +{
>> +	static unsigned int __page_size;
>> +
>> +	if (!__page_size)
>> +		__page_size = sysconf(_SC_PAGESIZE);
>> +	return __page_size;
>> +}
>> +
>> +static unsigned long default_huge_page_size(void)
>> +{
>> +	FILE *f = fopen("/proc/meminfo", "r");
>> +	unsigned long hpage_size = 0;
>> +	char buf[256];
>> +
>> +	if (!f)
>> +		return 0;
>> +	while (fgets(buf, sizeof(buf), f)) {
>> +		if (sscanf(buf, "Hugepagesize:       %lu kB", &hpage_size) == 1)
>> +			break;
>> +	}
>> +	fclose(f);
>> +	hpage_size <<= 10;
>> +	return hpage_size;
>> +}
> 



* Re: [PATCH v2 1/3] dt-bindings: iio: dac: Add AD5529R
From: Jonathan Cameron @ 2026-05-20  9:41 UTC (permalink / raw)
  To: Janani Sunil
  Cc: David Lechner, Janani Sunil, Lars-Peter Clausen,
	Michael Hennerich, Nuno Sá, Andy Shevchenko, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Philipp Zabel, Jonathan Corbet,
	Shuah Khan, linux-iio, devicetree, linux-kernel, linux-doc,
	rodrigo.alencar
In-Reply-To: <e245de68-555a-42c8-900b-a4abbaa4ea3e@gmail.com>

On Tue, 19 May 2026 09:13:24 +0200
Janani Sunil <jan.sun97@gmail.com> wrote:

> On 5/16/26 21:25, David Lechner wrote:
> > On 5/8/26 7:48 AM, Jonathan Cameron wrote:  
> >> On Fri, 8 May 2026 13:55:47 +0200
> >> Janani Sunil <janani.sunil@analog.com> wrote:
> >>  
> >>> Devicetree bindings for AD5529R 16 channel 12/16 bit high voltage,
> >>> buffered voltage output digital-to-analog converter (DAC) with an
> >>> integrated precision reference.
> >>>
> >>> Signed-off-by: Janani Sunil <janani.sunil@analog.com>
> >>> ---  
> > ...
> >  
> >>> +  * Multiplexer for output voltage, load current sense and die temperature
> >>> +
> >>> +  Datasheet: https://www.analog.com/media/en/technical-documentation/data-sheets/ad5529r.pdf
> >>> +
> >>> +properties:
> >>> +  compatible:
> >>> +    const: adi,ad5529r
> >>> +
> >>> +  reg:
> >>> +    maxItems: 1
> >>> +
> >>> +  spi-max-frequency:
> >>> +    maximum: 50000000
> >>> +
> >>> +  reset-gpios:
> >>> +    maxItems: 1
> >>> +    description:
> >>> +      GPIO connected to the RESET pin. Active low. When asserted low,
> >>> +      performs a power-on reset and initializes the device to its default state.
> >>> +
> >>> +  vdd-supply:
> >>> +    description: Digital power supply (typically 3.3V)
> >>> +
> >>> +  avdd-supply:
> >>> +    description: Analog power supply (typically 5V)
> >>> +
> >>> +  hvdd-supply:
> >>> +    description: High voltage positive supply (up to 40V for output range)
> >>> +
> >>> +  hvss-supply:
> >>> +    description: High voltage negative supply (ground or negative voltage)  
> >> I don't mind doing it this way but in some similar cases where 0 is something that
> >> can be considered the 'default' we've made the supply optional.  What was
> >> your reasoning for requiring it in this case?
> >>
> >> dt-bindings should be as complete as we can make them - with that in mind...
> >>
> >> There are some more interesting corners on this device the binding doesn't
> >> currently cover such as mux_out pin.  We'd normally do that by making the
> >> driver potentially a client of an ADC
> >>
> >> Easier though is !alarm which smells like an interrupt.
> >> !clear probably a gpio. TG0-3 also GPIOs.  
> > also optional vref-supply for external vs internal reference  
> 
> I will add bindings for optional Vref supply in the next version.
> 
> Best Regards,
> Janani Sunil
Hi Janani

One of those process things.  Don't reply to a review to say you
are going to do something suggested - just save us all reading an email
by making that clear in the change log for the next version.

Lots of folk are over enthusiastic in replying like you have done initially.
They only begin to appreciate why this is a bad idea when they start trying
to keep up with the mailing list firehoses!

Jonathan

> 



* Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper
From: Amir Goldstein @ 2026-05-20  9:43 UTC (permalink / raw)
  To: Ian Kent
  Cc: Jan Kara, NeilBrown, Horst Birthelmer, Miklos Szeredi,
	Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner,
	linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer
In-Reply-To: <27a5593e-ffb8-4471-996f-7983bac0b1ab@themaw.net>

On Wed, May 20, 2026 at 9:16 AM Ian Kent <raven@themaw.net> wrote:
>
> On 19/5/26 17:12, Jan Kara wrote:
> > On Mon 18-05-26 21:39:13, Ian Kent wrote:
> >> On 18/5/26 16:19, Jan Kara wrote:
> >>> Hi Ian,
> >>>
> >>> On Mon 18-05-26 10:55:43, Ian Kent wrote:
> >>>> On 18/5/26 07:55, NeilBrown wrote:
> >>>>> On Fri, 15 May 2026, Horst Birthelmer wrote:
> >>>>> According to the email you linked, a problem arises when a directory has
> >>>>> a great many negative children.  Code which walks the list of children
> >>>>> (such as fsnotify) while holding a lock can suffer unpredictable delays
> >>>>> and result in long lock-hold times.  So maybe a limit on negative
> >>>>> dentries for any parent is what we really want.  That would be clumsy to
> >>>>> implement I imagine.
> >>>> But the notion of dropping the dentry in ->d_delete() on last dput() is
> >>>> simple enough but did see regressions (the only other place in the VFS
> >>>> besides dentry_kill() that the inode is unlinked from the dentry on
> >>>> dput()). I wonder if the regression was related to the test itself
> >>>> deliberately recreating deleted files and if that really is normal
> >>>> behaviour. By itself that should prevent almost all negative dentries
> >>>> being retained. Although file systems could do this as well (think XFS
> >>>> inode recycling) it should be reasonable to require it be left to the
> >>>> VFS.
> >>>>
> >>>> But even that's not enough given that, in my case, there would still be
> >>>> around 4 million dentries in the LRU cache and in fsnotify there are
> >>>> directory child traversals holding the parent i_lock "spinlock" that are
> >>>> going to cause problems.
> >>> Do you mean there are very many positive children of a directory?
> >> Didn't quantify that.
> >>
> >> The symptom is the "Spinlock held for more than ... seconds" occurring in
> >> the log. So there are certainly a lot of children in the list, but it's
> >> an assumption the ratio of positive to negative entries is roughly the
> >> same as the overall ratio in the dcache.
> > OK, but that's not necessarily true. I have seen these complaints from the
> > kernel but in all the cases I remember it was due to negative dentries
> > accumulating in a particular directory. There are certain apps such as
> > ElasticSearch which really do like creating huge amounts of negative
> > dentries in one directory - they use hashes as filenames and use directory
> > lookup instead of a DB table lookup and lookup lots of non-existent keys...
>
> Umm ... that's a good point, I hadn't paid much attention to ENOENT result
> lookups, I'll need to check on the life cycle of those, I think they do get
> hashed. That has to be the other source of negative dentries that I've
> neglected ...
>

Yes, it has been claimed that some real life workloads create a lot of those.

If we can keep those at the tail of the children list, it will be best
for the fsnotify iteration, which only cares about positive dentries.

> >
> >>>> so why is this traversal even retained in fsnotify?
> >>> Not sure which traversal you mean but if you set watch on a parent, you
> >>> have to walk all children to set PARENT_WATCHED flag so that you don't miss
> >>> events on children...
> >> Yes, that traversal is what I'm questioning ... again thanks.
> >>
> >> I think the function name is still fsnotify_set_children_dentry_flags()
> >> in recent kernels, the subject of commit 172e422ffea2 I mentioned above.
> > OK, thanks.
> >
> >> When you say miss events are you saying that accessing the parent dentry to
> >> work out if the child needs to respond to an event is quite expensive in the
> >> overall event processing context, that might make more sense to me ... or do
> >> I completely not yet understand the reasoning behind the need for the flag?
> > Close but not quite. The cost is the overhead of dget_parent() in
> > fsnotify_parent() which is often a couple of cache cold loads and atomic
> > instructions to find out we don't need to send any event for the current
> > write(2) or read(2) call. It gets worse if there are many IOs happening to
> > dentries in the same directory from multiple CPUs because instead of
> > cache-cold loads you get a cacheline contention on the parent.
> >
> >>>>> But what if we move dentries to the end of the list when they become
> >>>>> negative, and to the start of the list when they become positive?  Then
> >>>>> code which walks the child list could simply abort on the first
> >>>>> negative.
> >>>>>
> >>>>> I doubt that would be quite as easy as it sounds, but it would at least
> >>>>> be more focused on the observed symptom rather than some whole-system
> >>>>> number which only vaguely correlates with the observed symptom.
> >>>>>
> >>>>> Maybe a completely different approach: change children-walking code to
> >>>>> drop and retake the lock (with appropriate validation) periodically.
> >>>>> That too would address the specific symptom.
> >>>> Another good question.
> >>>>
> >>>> I have assumed that dropping and re-taking the lock cannot be done but
> >>>> this is a question I would like answered as well. Dropping and re-taking
> >>>> lock would require, as Miklos pointed out to me off-list, recording the
> >>>> list position with say a cursor, introducing unwanted complexity when it
> >>>> would be better to accept the cost of a single extra access to the parent
> >>>> flags (which I assume is one reason to set the flag in the child).
> >>> The parent access is actually more expensive than you might think. Based on
> >>> experience with past fsnotify related performance regression I expect some
> >>> 20% performance hit for small tmpfs writes if you add unconditional parent
> >>> access to the write path.
> >> That sounds like a lot for what should be a memory access of an already in
> >> memory structure since the parent must be accessed to traverse the list of
> >> child entries. I clearly don't fully understand the implications of what
> >> I'm saying but there has been mention of another context ...
> > Parent dentry is of course in memory but often cache cold - you don't need
> > the parent to do e.g. write(2) to an already open file. You seem to be
> > somewhat confused about the child dentry list traversal (or maybe I'm
> > misunderstanding) - that happens only when placing the notification mark
> > but definitely not for each IO operation.
>
> LOL, confusion is a pretty common state of mind for me!
>
> I do get your point though and I am confusing the traversal with other
> operations. I think this answers the question I've been asking (maybe
> that wasn't obvious) about the reason for the traversal (ie. the reason
> to maintain a flag in the child).
>
> While I have looked at the code here I haven't absorbed it and I
> definitely don't understand it, your continued patience is appreciated
> and will be beneficial when I get time to look at it a bit closer. I
> do still need to use a notifications mechanism to match up with Miklos's
> statmount implementation to get the full benefit of that in user space,
> if I ever get a chance to work on that again.
>
> So it sounds like it would be worth while considering a traversal that's
> based on taking a reference on each dentry rather than a spinlock for
> the duration. It would be tricky though, for obvious reasons, like
> children added during the traversal, added overhead of getting the next
> entry reference, etc.

Didn't look closely, but it feels like RCU traversal should be
possible if entries are added to the tail, or to the END_OF_POSITIVE
location.

When we discussed the "negatives at tail" idea at LSFMM, it was said that
managing the positive<->negative transitions would be challenging, but I
don't know that anyone tried to look closer at this.

At least for fsnotify, a positive->negative transition is not a problem
w.r.t. skipping an entry or observing an entry twice during positive iteration.

If a negative->positive transition inserts at the END_OF_POSITIVE
location, then that should be fine as well?

Iterators that need to iterate all children can do this under lock.

Does that make sense?
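
To make the END_OF_POSITIVE idea concrete, here is a rough userspace sketch.
It is not dcache code: `struct child`, `struct child_list` and the helpers
are hypothetical names, there is no locking or RCU, and moving an entry on a
positive<->negative transition (the hard part) is not modeled at all. It only
shows the ordering invariant and why a positive-only walk can stop early.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified child entry; not struct dentry. */
struct child {
	struct child *next;
	int positive;			/* non-zero if the entry has an inode */
};

struct child_list {
	struct child *head;
	struct child *last_positive;	/* the END_OF_POSITIVE position */
	struct child *tail;
};

/* Insert a positive entry right after the last positive one. */
static void add_positive(struct child_list *l, struct child *c)
{
	c->positive = 1;
	if (l->last_positive) {
		c->next = l->last_positive->next;
		l->last_positive->next = c;
	} else {
		c->next = l->head;
		l->head = c;
	}
	if (!c->next)
		l->tail = c;
	l->last_positive = c;
}

/* Append a negative entry at the tail of the list. */
static void add_negative(struct child_list *l, struct child *c)
{
	c->positive = 0;
	c->next = NULL;
	if (l->tail)
		l->tail->next = c;
	else
		l->head = c;
	l->tail = c;
}

/* Visit positive entries only: stop at the first negative one. */
static size_t count_positive(const struct child_list *l)
{
	const struct child *c;
	size_t n = 0;

	for (c = l->head; c && c->positive; c = c->next)
		n++;
	return n;
}
```

The cost of the positive-only walk then depends on the number of positive
children, not on how many negative dentries have piled up behind them.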

Thanks,
Amir.


* Re: [PATCH RFC v4 09/10] Documentation: ABI: testing: add docs for ad9910 sysfs entries
From: Jonathan Cameron @ 2026-05-20  9:54 UTC (permalink / raw)
  To: Rodrigo Alencar
  Cc: Rodrigo Alencar via B4 Relay, rodrigo.alencar, linux-iio,
	devicetree, linux-kernel, linux-doc, linux-hardening,
	Lars-Peter Clausen, Michael Hennerich, David Lechner,
	Andy Shevchenko, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Philipp Zabel, Jonathan Corbet, Shuah Khan, Kees Cook,
	Gustavo A. R. Silva
In-Reply-To: <pkx5v4od3wkyyzxomfrjf4ei7leboadzth262xnl55fvu76pf3@yqrezmo6gtq7>

On Mon, 18 May 2026 16:27:23 +0100
Rodrigo Alencar <455.rodrigo.alencar@gmail.com> wrote:

> On 26/05/18 02:45PM, Jonathan Cameron wrote:
> > On Sun, 17 May 2026 18:30:27 +0100
> > Rodrigo Alencar <455.rodrigo.alencar@gmail.com> wrote:
> >   
> > > On 26/05/17 03:58PM, Jonathan Cameron wrote:  
> > > > On Fri, 08 May 2026 18:00:25 +0100
> > > > Rodrigo Alencar via B4 Relay <devnull+rodrigo.alencar.analog.com@kernel.org> wrote:
> > > >     
> > > > > From: Rodrigo Alencar <rodrigo.alencar@analog.com>
> > > > > 
> > > > > Add custom ABI documentation file for the DDS AD9910 with sysfs entries to
> > > > > control Parallel Port, Digital Ramp Generator and OSK parameters.
> > > > > 
> > > > > Signed-off-by: Rodrigo Alencar <rodrigo.alencar@analog.com>    
> > > > I'm fine with phase and frequency as defined, but for the scaling it made me wonder.
> > > > > For outvoltage0 channels the assumption is that the value is the peak voltage, so if
> > > > > we know what input is to be modulated by the ramp generator, can we express them
> > > > > in volts (well, millivolts) rather than as a scaling multiplier?
> > > 
> > > The DAC output is current-based and differential. Voltage conversion would happen
> > > outside the device...  
> > 
> > Why aren't we representing this as out_altcurrentX-Y_xxxx?  
> 
> Good point! altcurrent makes more sense than altvoltage if we want to use raw to
> control the output level rather than scale, which would be a constant to convert
> raw into current units (what is the one that is used in the sysfs ABI? Ampere, mA or uA?)

Same as the non-alternating version, so mA (which is a historical design error we have
long been stuck with :()  The altvoltageY_raw docs don't give a unit either.
If you don't mind, please send a patch adding that whilst you are here.
Same mid-to-peak convention - hopefully that is what any users not modifying to RMS have
been doing!

Seems we either never had one or that particular bit of ABI doc is missing.
Please add an entry for altcurrentX_raw

> 
> Not sure about the benefits on setting "differential" in channel spec.. the name would
> become out_altcurrentX-altcurrentY_xxxxx...

Becomes a question of whether it is useful to represent that - maybe not
in this particular case.

> 
> Is there any modifier for amplitude/peak/envelope? I see IIO_MOD_RMS, which could be used
> if adding a 1/sqrt(2) factor to the fixed scale.

For altcurrent / altvoltage the assumption is that it's mid to peak, unless the
modifier switches it to RMS as you've noted.
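
For a pure sine output the two conventions differ only by the 1/sqrt(2)
factor mentioned above. A tiny sketch, with hypothetical names and no
relation to the IIO core code, assuming a sinusoidal waveform:

```c
/* 1/sqrt(2), spelled out so the sketch does not need libm. */
#define SKETCH_SQRT1_2 0.70710678118654752440

/*
 * Convert a mid-to-peak amplitude to RMS. Only valid for a sine wave;
 * other waveforms have a different crest factor.
 */
static double sketch_peak_to_rms(double peak)
{
	return peak * SKETCH_SQRT1_2;
}
```

So a channel reporting 20 mA mid to peak would report roughly 14.14 mA once
switched to RMS, with the factor folded into the fixed scale.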

> 
> Then, I would consider something like out_altcurrent_rms_xxxx as a good alternative.
> 
> "scale" would be a constant in the top-level phy channel
> 
> single tone profile channels would have:
> - frequency
> - phase
> - raw
> 
> drg ramp up/down channels:
> - frequency and frequency_roc
> - phase and phase_roc
> - raw and raw_roc
> 
> parallel port channel(s):
> - frequency_scale and frequency_offset (frequency destination)
> - phase_offset (polar destination)
> - offset (polar destination)
> 
> osk channel:
> - raw
> - raw_roc
> 
> raw_roc could be just roc, but that sounds like it carries the scale and refers to
> a current value? and maybe that breaks consistency with other destination attributes?
> I am fine with just roc if that refers to the raw value, not (raw * scale).

This is a good question.  We ran into ambiguity with events where we have to derive
if it is _raw or _processed for the thresholds based on whether the main attribute
is _raw or _processed.  Nice to avoid doing that again.

I'd be interested in others' views on this, but to me raw_roc seems fine.

> 
> With all the above, still using altvoltage is not incorrect, just a matter on how
> we want to express the units.

Agreed - but to get to directly useable values we'd need to provide info on the
external circuit - and given we are dealing with AC signals that is tricky to do
in a compact way.


> Note that using raw instead of scale to control the
> amplitude is just another option to tackle the problem. I suppose that the
> important thing here is being technically correct and consistent in terms of
> usage. Maybe out_altcurrent_rms_* is more clear in terms of amplitude level.

Agreed.  It is always (?) possible to switch between scale and raw.
For an ADC the distinction is clear as we can't control _raw. For a DAC it all gets
rather vague as we can logically control both and for an AC type of DAC / DDS it
all gets less intuitive.  As you say, consistency is key.

I'd like us to at least be consistent across DDS devices. Perhaps we need some
general documentation on whatever the outcome of this discussion is to record
some of the logic behind those decisions.

> 
> > 
> >   
> > > using a resistor load or an op-amp transimpedance stage,
> > > and I am no expert on that, but that often requires impedance matching so voltage
> > > levels may depend on the frequency. Then, I suppose that voltage is not the right
> > > unit to use.  
> > 
> > Understood that it can get complex!  
> > > 
> > > The scale here controls the amplitude of the varying signal. Assuming the peak voltage
> > > (amplitude) is constant means we have a constant envelope, but that should not mean
> > > we can't control it or it should not mean that the hardware can have other ways to
> > > control it. That said, scale behaves as a "gain multiplier".  
> > Understood. Given it's the envelope then if scale happened to be 1 always it would
> > be presented as _processed. So this is consistent with other channel types.
> >   
> > >   
> > > > 
> > > > That seems to me like it fits better with the overall ABI.
> > > >     
> > > > > +What:		/sys/bus/iio/devices/iio:deviceX/out_altvoltageY_scale_offset
> > > > > +KernelVersion:
> > > > > +Contact:	linux-iio@vger.kernel.org
> > > > > +Description:
> > > > > +		For a channel that allows amplitude control through buffers, this
> > > > > +		represents the value for a base amplitude scale. The actual output
> > > > > +		amplitude scale is a result with the sum of this value.
> > > > > +    
> > > >     
> > > > > +
> > > > > +What:		/sys/bus/iio/devices/iio:deviceX/out_altvoltageY_scale_roc    
> > > > 
> > > > Silly question perhaps, but can we work out how this relates to millivolts/sec?
> > > > That might make a more intuitive interface than a scaling multiplier per sec.
> > > > Perhaps the combination with offset makes this impossible, though maybe that
> > > > could be expressed as a voltage offset?  After all, if the amplitude being
> > > > scaled is 5V then 5 * (offset + scale) = 5 * offset + 5 * scale
> > > >      
> > > > > +KernelVersion:
> > > > > +Contact:	linux-iio@vger.kernel.org
> > > > > +Description:
> > > > > +		Amplitude scale rate of change in 1/s for channels that ramp
> > > > > +		amplitude. This value may be influenced by the channel's
> > > > > +		sampling_frequency setting.    



* Re: [PATCH RESEND bpf-next v10 2/8] bpf: clear list node owner and unlink before drop
From: Kaitao Cheng @ 2026-05-20  9:55 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: bpf, Alexei Starovoitov, linux-kernel, linux-doc, ast, memxor,
	corbet, martin.lau, daniel, andrii, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, chengkaitao,
	skhan, vmalik, linux-kselftest, martin.lau, clm, ihor.solodrai,
	bot+bpf-ci
In-Reply-To: <782833db5da77e4aa9761fc410827e7abe8583c8.camel@gmail.com>

On 2026/5/20 06:56, Eduard Zingerman wrote:
> On Mon, 2026-05-18 at 11:02 +0800, Kaitao Cheng wrote:
> 
> [...]
> 
>>>>> The patch does have a bug, however. To fix the issues we are seeing now,
>>>>> I propose the additional changes below and would appreciate feedback.
>>>>>
>>>>> --- a/kernel/bpf/helpers.c
>>>>> +++ b/kernel/bpf/helpers.c
>>>>> @@ -2263,8 +2263,10 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
>>>>>         if (!head->next || list_empty(head))
>>>>>                 goto unlock;
>>>>>         list_for_each_safe(pos, n, head) {
>>>>> -               WRITE_ONCE(container_of(pos,
>>>>> -                       struct bpf_list_node_kern, list_head)->owner, NULL);
>>>>> +               struct bpf_list_node_kern *node;
>>>>> +
>>>>> +               node = container_of(pos, struct bpf_list_node_kern, list_head);
>>>>> +               WRITE_ONCE(node->owner, BPF_PTR_POISON);
>>>>>                 list_move_tail(pos, &drain);
>>>>>         }
>>>>>  unlock:
>>>>> @@ -2272,8 +2274,12 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
>>>>>         __bpf_spin_unlock_irqrestore(spin_lock);
>>>>>
>>>>>         while (!list_empty(&drain)) {
>>>>> +               struct bpf_list_node_kern *node;
>>>>> +
>>>>>                 pos = drain.next;
>>>>> +               node = container_of(pos, struct bpf_list_node_kern, list_head);
>>>>>                 list_del_init(pos);
>>>>> +               WRITE_ONCE(node->owner, NULL);
> 
> Is CPU allowed to reorder the stores in list_del_init() and WRITE_ONCE()?
> If it is, I think there is a race here.

Thanks for taking a close look at this. You are right that there is an
ordering issue here, but I don't think the specific sequence illustrated
by the example below is problematic.

> Thread #1:
>   enter bpf_list_head_free()
>   acquire H1 lock
>   list_move_tail(pos, &drain);             // reordered
>   <-- ip here -->
>   WRITE_ONCE(node->owner, BPF_PTR_POISON); // reordered
> 
> Thread #2:
> 
>   acquire H1 lock
>   n = bpf_refcount_acquire()
>   release H1 lock
>   acquire H2 lock
>   enter __bpf_list_add()
>   <-- ip here -->
>   cmpxchg(&node->owner, NULL, BPF_PTR_POISON)

Even if the stores from list_move_tail(pos, &drain) become visible before
WRITE_ONCE(node->owner, BPF_PTR_POISON), node->owner is not NULL in that
window. Before the WRITE_ONCE(), it still points to H1. After the WRITE_ONCE(),
it is BPF_PTR_POISON. In both cases, __bpf_list_add() will fail:

	cmpxchg(&node->owner, NULL, BPF_PTR_POISON)

because the old value is neither NULL nor expected to become NULL from this
part of bpf_list_head_free().


However, I agree that your original concern about the ordering between
list_del_init() and WRITE_ONCE(node->owner, NULL) is valid for the later
drain loop:

	list_del_init(pos);
	WRITE_ONCE(node->owner, NULL);

Here owner == NULL is the signal that the node can be inserted into another
list. Since WRITE_ONCE() does not provide release ordering, another CPU may
observe owner == NULL and successfully acquire the node in __bpf_list_add()
before the list_del_init() stores are visible. In that case __bpf_list_add()
can link the node into H2, and the delayed stores from list_del_init() may
then overwrite the node's list pointers and corrupt the H2 list.

So the fix should be to publish owner == NULL with release ordering after the
node has been fully unlinked, for example:

```
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2279,7 +2279,8 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
                pos = drain.next;
                node = container_of(pos, struct bpf_list_node_kern, list_head);
                list_del_init(pos);
-               WRITE_ONCE(node->owner, NULL);
+               /* Ensure __bpf_list_add() sees the node as unlinked. */
+               smp_store_release(&node->owner, NULL);
                /* The contained type can also have resources, including a
                 * bpf_list_head which needs to be freed.
                 */
@@ -2607,7 +2608,8 @@ static struct bpf_list_node *__bpf_list_del(struct bpf_list_head *head,
                return NULL;

        list_del_init(n);
-       WRITE_ONCE(node->owner, NULL);
+       /* Ensure __bpf_list_add() sees the node as unlinked. */
+       smp_store_release(&node->owner, NULL);
        return (struct bpf_list_node *)n;
 }
```

The existing cmpxchg() in __bpf_list_add() is a successful RMW with return
value, so it is fully ordered and is sufficient on the acquire side.
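
The release/acquire pairing described above can be sketched in user-space C11
atomics. This is only an analogy, not the kernel code: struct node,
node_unlink_and_release(), node_try_claim() and PTR_POISON are hypothetical
names, a release store stands in for smp_store_release(), and a seq_cst
compare-exchange stands in for the kernel's fully ordered cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bpf_list_node_kern. */
struct node {
	struct node *next, *prev;	/* plain (non-atomic) list pointers */
	_Atomic(void *) owner;		/* NULL means "free to be inserted" */
};

/* Sentinel playing the role of BPF_PTR_POISON; the value is arbitrary. */
static char poison_obj;
#define PTR_POISON ((void *)&poison_obj)

/* Unlink the node, then publish owner == NULL with release ordering. */
static void node_unlink_and_release(struct node *n)
{
	assert(n->next && n->prev);

	/* Equivalent of list_del_init(): plain stores to the list pointers. */
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n;
	n->prev = n;

	/* Release: the stores above are ordered before owner reads as NULL. */
	atomic_store_explicit(&n->owner, (void *)NULL, memory_order_release);
}

/* Claim a free node; succeeds only if owner was NULL. */
static bool node_try_claim(struct node *n)
{
	void *expected = NULL;

	/* seq_cst RMW: acts as an acquire on success, pairing with the release. */
	return atomic_compare_exchange_strong(&n->owner, &expected, PTR_POISON);
}
```

A claimer that wins the compare-exchange is then guaranteed to see the node
already self-linked, so its own list insertion cannot be overwritten by the
unlink stores.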

-- 
Thanks
Kaitao Cheng



* Re: [PATCH net-next v2 2/2] net: ti: icssg: Add HSR and LRE PA statistics
From: MD Danish Anwar @ 2026-05-20 10:00 UTC (permalink / raw)
  To: Jakub Kicinski, Luka Gejak
  Cc: Felix Maurer, David S. Miller, Eric Dumazet, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Shuah Khan, Roger Quadros,
	Andrew Lunn, Meghana Malladi, Jacob Keller, David Carlier,
	Vadim Fedorenko, Kevin Hao, netdev, linux-doc, linux-kernel,
	linux-arm-kernel, Vladimir Oltean
In-Reply-To: <20260519165646.09b0783f@kernel.org>

Hi Jakub,

On 20/05/26 5:26 am, Jakub Kicinski wrote:
> On Tue, 19 May 2026 07:55:55 +0200 Luka Gejak wrote:
>> On May 19, 2026 3:45:06 AM GMT+02:00, Jakub Kicinski <kuba@kernel.org> wrote:
>>> On Thu, 14 May 2026 13:26:05 +0530 MD Danish Anwar wrote:  
>>>> Add new firmware PA statistics counters for HSR and LRE to the ethtool
>>>> statistics exposed by the ICSSG driver.
>>>>
>>>> New statistics added:
>>>>  - FW_HSR_FWD_CHECK_FAIL_DROP: Packets dropped on the HSR forwarding path
>>>>  - FW_HSR_HE_CHECK_FAIL_DROP: Packets dropped on the HSR host egress path
>>>>  - FW_HSR_SKIP_HOST_DUP_DISCARD_FRAMES: Frames with duplicate discard
>>>>    skipped
>>>>  - FW_LRE_CNT_UNIQUE/DUPLICATE/MULTIPLE_RX: LRE duplicate detection
>>>>    counters
>>>>  - FW_LRE_CNT_RX/TX: LRE per-port frame counters
>>>>  - FW_LRE_CNT_OWN_RX: Own HSR tagged frames received
>>>>  - FW_LRE_CNT_ERRWRONGLAN: Frames with wrong LAN identifier (PRP)
>>>>
>>>> Document the new HSR/LRE statistics in icssg_prueth.rst.  
>>>
>>> To an untrained eye these stats look like stuff that could 
>>> be standardized across drivers. 
>>>
>>> Luka, Felix, others on CC, do you think we should expose these
>>> from HSR over netlink as "standard" offload stats different drivers 
>>> can plug into or not worth it?  
>>
>> I think there is a case for standardizing part of this, but I would 
>> not standardize the whole set as-is.
>>
>> The LRE counters look generic enough to me, especially:
>>  - unique rx
>>  - duplicate rx
>>  - multiple rx
>>  - rx / tx
>>  - own rx
>>  - wrong LAN, PRP only
>>
>> Those are protocol/LRE concepts rather than TI firmware details, so
>> exposing them from the HSR/PRP layer sounds useful. I would expect 
>> both the software implementation and offloaded implementations to be 
>> able to provide at least some of them, with unsupported counters 
>> omitted or reported as not available.
>> I would not put the firmware check/drop counters in the same standard
>> bucket, though:
>>  - FW_HSR_FWD_CHECK_FAIL_DROP
>>  - FW_HSR_HE_CHECK_FAIL_DROP
>>  - FW_HSR_SKIP_HOST_DUP_DISCARD_FRAMES
> 
> Thanks for the breakdown!
> 
>> Those sound more like implementation/debug counters for the ICSSG
>> firmware pipeline. They are still useful in ethtool driver stats, but 
>> I would be hesitant to bake their exact semantics into HSR UAPI.
>> So my preference would be:
>>  1. Keep driver-private ethtool stats for the full firmware counter set.
>>  2. Add a small HSR/PRP standard stats set separately, limited to
>>     well-defined LRE counters.
>>  3. Make the HSR layer expose them, with offload drivers plugging in via
>>     an optional callback or offload stats op.
>>  4. Define the counters carefully, including whether they are per-HSR
>>     device or per-port A/B, and what PRP-only counters mean for HSR.
>>
>> I do not think this patch should blindly become the UAPI definition, 
> 
> Not at all, the unique / multiple stats gave me pause. We should
> only put in the standard API what can be easily and unambiguously
> defined given the protocol spec.
> 
>> but I do think it points at a useful follow-up. If we want to avoid 
>> adding driver-private names first and then standardizing different 
>> names later, then it may be worth asking Danish to split the 
>> protocol-level LRE counters out and route those through a common HSR 
>> stats interface.
> 
> As a general policy we ask for standard stats to be added first and
> ethtool to only contain what didn't fit in the standard ones.
> There are some technical reasons but it's mostly a mindset thing.

What should the next steps be here? Is there an existing, defined set of
stats that I could populate from the ICSSG firmware for HSR (similar
to the ndo_get_stats64 callback)? Or do we need to implement a new callback
that does this for HSR?

I agree with Luka on the categorization.

The stats below can be generic:
 - unique rx
 - duplicate rx
 - multiple rx
 - rx / tx
 - own rx
 - wrong LAN, PRP only

The stats below can be driver-specific and can be pulled using `ethtool -S`
on the child interfaces of the HSR device:

 - FW_HSR_FWD_CHECK_FAIL_DROP
 - FW_HSR_HE_CHECK_FAIL_DROP
 - FW_HSR_SKIP_HOST_DUP_DISCARD_FRAMES

Let me know if I should go ahead and implement this.

-- 
Thanks and Regards,
Danish



* Re: [PATCH v2 4/4] cpufreq: Use policy->min/max init as QoS request
From: Viresh Kumar @ 2026-05-20 10:03 UTC (permalink / raw)
  To: Pierre Gondois
  Cc: linux-kernel, Jie Zhan, Lifeng Zheng, Ionela Voinescu,
	Sumit Gupta, Zhongqiu Han, Rafael J. Wysocki, Jonathan Corbet,
	Shuah Khan, Huang Rui, Mario Limonciello, Perry Yuan,
	K Prateek Nayak, Srinivas Pandruvada, Len Brown, Saravana Kannan,
	linux-pm, linux-doc
In-Reply-To: <20260511135538.522653-5-pierre.gondois@arm.com>

On 11-05-26, 15:55, Pierre Gondois wrote:
> @@ -1399,8 +1399,16 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
>  
>  static int cpufreq_policy_init_qos(struct cpufreq_policy *policy)
>  {
> +	unsigned int min_freq, max_freq;
>  	int ret;
>  
> +	/* Use policy->min/max set by the driver as QoS requests. */
> +	min_freq = max(FREQ_QOS_MIN_DEFAULT_VALUE, policy->min);
> +	if (policy->max)
> +		max_freq = min(FREQ_QOS_MAX_DEFAULT_VALUE, policy->max);
> +	else
> +		max_freq = FREQ_QOS_MAX_DEFAULT_VALUE;
> +

Why is this required to be done before setting policy->min/max? If it
isn't, then I don't think patch 1/4 is required at all.

>  	/*
>  	 * If the driver didn't set policy->min/max, set them as
>  	 * they are used to clamp frequency requests.
> @@ -1418,12 +1426,12 @@ static int cpufreq_policy_init_qos(struct cpufreq_policy *policy)
>  	}
>  
>  	ret = freq_qos_add_request(&policy->constraints, &policy->min_freq_req,
> -				   FREQ_QOS_MIN, FREQ_QOS_MIN_DEFAULT_VALUE);
> +				   FREQ_QOS_MIN, min_freq);
>  	if (ret < 0)
>  		return ret;
>  
>  	ret = freq_qos_add_request(&policy->constraints, &policy->max_freq_req,
> -				   FREQ_QOS_MAX, FREQ_QOS_MAX_DEFAULT_VALUE);
> +				   FREQ_QOS_MAX, max_freq);
>  	if (ret < 0)
>  		return ret;
>  
> -- 
> 2.43.0

-- 
viresh


* Re: [PATCH v4 1/4] Introducing pw_lock() and per-cpu queue & flush work
From: Frederic Weisbecker @ 2026-05-20 10:08 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Jonathan Corbet, Shuah Khan, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
	Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
	Randy Dunlap, Feng Tang, Dapeng Mi, Kees Cook, Marco Elver,
	Jakub Kicinski, Li RongQing, Eric Biggers, Paul E. McKenney,
	Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
	Thomas Weißschuh, Thomas Gleixner, Douglas Anderson,
	Gary Guo, Christian Brauner, Pasha Tatashin, Coiby Xu,
	Masahiro Yamada, linux-doc, linux-kernel, linux-mm,
	linux-rt-devel, Marcelo Tosatti
In-Reply-To: <20260519012754.240804-2-leobras.c@gmail.com>

On Mon, May 18, 2026 at 10:27:47PM -0300, Leonardo Bras wrote:
> Some places in the kernel implement a parallel programming strategy
> consisting of local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.
> 
> On the other hand, for RT workloads this can represent a problem:
> scheduling work on remote cpu that are executing low latency tasks
> is undesired and can introduce unexpected deadline misses.
> 
> It's interesting, though, that local_lock()s in RT kernels become
> spinlock(). We can make use of those to avoid scheduling work on a remote
> cpu by directly updating another cpu's per_cpu structure, while holding
> its spinlock().
> 
> In order to do that, it's necessary to introduce a new set of functions to
> make it possible to get another cpu's per-cpu "local" lock (pw_{un,}lock*)
> and also do the corresponding queueing (pw_queue_on()) and flushing
> (pw_flush()) helpers to run the remote work.
> 
> Users of non-RT kernels but with low latency requirements can select
> similar functionality by using the CONFIG_PWLOCKS compile time option.
> 
> On CONFIG_PWLOCKS disabled kernels, no changes are expected, as every
> one of the introduced helpers works exactly the same as the current
> implementation:
> pw_{un,}lock*()		->  local_{un,}lock*() (ignores cpu parameter)
> pw_queue_on()  		->  queue_work_on()
> pw_flush()		->  flush_work()
> 
> For PWLOCKS enabled kernels, though, pw_{un,}lock*() will use the extra
> cpu parameter to select the correct per-cpu structure to work on,
> and acquire the spinlock for that cpu.
> 
> pw_queue_on() will just call the requested function in the current
> cpu, which will operate in another cpu's per-cpu object. Since the
> local_locks() become spinlock()s in PWLOCKS enabled kernels, we are
> safe doing that.
> 
> pw_flush() then becomes a no-op since no work is actually scheduled on a
> remote cpu.
> 
> Some minimal code rework is needed in order to make this mechanism work:
> The calls for local_{un,}lock*() on the functions that are currently
> scheduled on remote cpus need to be replaced by either pw_{un,}lock_*(),
> PWLOCKS enabled kernels they can reference a different cpu. It's also
> necessary to use a pw_struct instead of a work_struct, but it just
> contains a work struct and, in CONFIG_PWLOCKS, the target cpu.
> 
> This should have almost no impact on non-CONFIG_PWLOCKS kernels: few
> this_cpu_ptr() will become per_cpu_ptr(,smp_processor_id()) on non-hotpath
> functions.
> 
> On CONFIG_PWLOCKS kernels, this should avoid deadline misses by
> removing scheduling noise.
> 
> Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

I like it! Just a few observations:

> +#ifndef CONFIG_PWLOCKS
> +
> +typedef local_lock_t pw_lock_t;
> +typedef local_trylock_t pw_trylock_t;
> +
> +struct pw_struct {
> +	struct work_struct work;
> +};
> +
> +#define pw_lock_init(lock)				\
> +	local_lock_init(lock)
> +
> +#define pw_trylock_init(lock)				\
> +	local_trylock_init(lock)
> +
> +#define pw_lock(lock, cpu)				\
> +	local_lock(lock)

For debugging purposes, it would be nice to ensure that in the !CONFIG_PWLOCKS
case, cpu is indeed the local one. Basically all the non-local functions, those
that take a cpu, should verify:

lockdep_assert(cpu == smp_processor_id())
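
A rough sketch of how that check could be wired into the !CONFIG_PWLOCKS
variants follows. The kernel primitives are replaced by user-space stubs
(demo_this_cpu, demo_assert_failures and demo_lock_depth are hypothetical)
purely so the fragment compiles on its own. Placing the check after
local_lock() is an assumption of this sketch, made so that the CPU id is read
once the task can no longer migrate.

```c
#include <assert.h>

/* User-space stubs; in the kernel these come from the real headers. */
static int demo_this_cpu;		/* pretend "current CPU" */
static int demo_assert_failures;	/* counts failed debug checks */
static int demo_lock_depth;		/* tracks lock/unlock balance */

#define smp_processor_id()	(demo_this_cpu)
#define lockdep_assert(cond)	do { if (!(cond)) demo_assert_failures++; } while (0)
#define local_lock(lock)	do { (void)(lock); demo_lock_depth++; } while (0)
#define local_unlock(lock)	do { (void)(lock); demo_lock_depth--; } while (0)

/* !CONFIG_PWLOCKS: cpu is otherwise ignored, so verify that it is local. */
#define pw_lock(lock, cpu)					\
do {								\
	local_lock(lock);					\
	lockdep_assert((cpu) == smp_processor_id());		\
} while (0)

#define pw_unlock(lock, cpu)					\
do {								\
	lockdep_assert((cpu) == smp_processor_id());		\
	local_unlock(lock);					\
} while (0)
```

With the stubs, a call such as pw_lock(&lock, 3) made while demo_this_cpu is 2
bumps demo_assert_failures, which is where the real lockdep_assert() would warn.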

> +
> +#define pw_lock_local(lock)				\
> +	local_lock(lock)
> +
> +#define pw_lock_irqsave(lock, flags, cpu)		\
> +	local_lock_irqsave(lock, flags)
> +
> +#define pw_lock_local_irqsave(lock, flags)		\
> +	local_lock_irqsave(lock, flags)
> +
> +#define pw_trylock(lock, cpu)				\
> +	local_trylock(lock)
> +
> +#define pw_trylock_local(lock)				\
> +	local_trylock(lock)
> +
> +#define pw_trylock_irqsave(lock, flags, cpu)		\
> +	local_trylock_irqsave(lock, flags)
> +
> +#define pw_unlock(lock, cpu)				\
> +	local_unlock(lock)
> +
> +#define pw_unlock_local(lock)				\
> +	local_unlock(lock)
> +
> +#define pw_unlock_irqrestore(lock, flags, cpu)		\
> +	local_unlock_irqrestore(lock, flags)
> +
> +#define pw_unlock_local_irqrestore(lock, flags)		\
> +	local_unlock_irqrestore(lock, flags)
> +
> +#define pw_lockdep_assert_held(lock)			\
> +	lockdep_assert_held(lock)
> +
> +#define pw_queue_on(c, wq, pw)				\
> +	queue_work_on(c, wq, &(pw)->work)
> +
> +#define pw_flush(pw)					\
> +	flush_work(&(pw)->work)
> +
> +#define pw_get_cpu(pw)	smp_processor_id()
> +
> +#define pw_is_cpu_remote(cpu)		(false)
> +
> +#define INIT_PW(pw, func, c)				\
> +	INIT_WORK(&(pw)->work, (func))
> +
> +#else /* CONFIG_PWLOCKS */
> +
> +DECLARE_STATIC_KEY_MAYBE(CONFIG_PWLOCKS_DEFAULT, pw_sl);
> +
> +typedef union {
> +	spinlock_t sl;
> +	local_lock_t ll;
> +} pw_lock_t;
> +
> +typedef union {
> +	spinlock_t sl;
> +	local_trylock_t ll;
> +} pw_trylock_t;
> +
> +struct pw_struct {
> +	struct work_struct work;
> +	int cpu;
> +};
> +
> +#ifdef CONFIG_PREEMPT_RT
> +#define preempt_or_migrate_disable migrate_disable
> +#define preempt_or_migrate_enable migrate_enable
> +#else
> +#define preempt_or_migrate_disable preempt_disable
> +#define preempt_or_migrate_enable preempt_enable

This can be a no-op in !CONFIG_PREEMPT_RT because non-RT spinlocks
disable preemption already.

> +#endif
> +
> +#define pw_lock_init(lock)							\
> +do {										\
> +	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
> +		spin_lock_init(lock.sl);					\
> +	else									\
> +		local_lock_init(lock.ll);					\
> +} while (0)

It looks like all these macros could be inline functions.

> +
> +#define pw_trylock_init(lock)							\
> +do {										\
> +	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
> +		spin_lock_init(lock.sl);					\
> +	else									\
> +		local_trylock_init(lock.ll);					\
> +} while (0)
> +
> +#define pw_lock(lock, cpu)							\

And those could have the same local CPU debug check.

> +do {										\
> +	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
> +		spin_lock(per_cpu_ptr(lock.sl, cpu));				\
> +	else									\
> +		local_lock(lock.ll);						\
> +} while (0)
> +
> +#define pw_lock_local(lock)							\
> +do {										\
> +	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) {		\
> +		preempt_or_migrate_disable();					\
> +		spin_lock(this_cpu_ptr(lock.sl));				\
> +	} else {								\
> +		local_lock(lock.ll);						\
> +	}									\
> +} while (0)
> +
> +#define pw_lock_irqsave(lock, flags, cpu)					\
> +do {										\
> +	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
> +		spin_lock_irqsave(per_cpu_ptr(lock.sl, cpu), flags);	\
> +	else									\
> +		local_lock_irqsave(lock.ll, flags);				\
> +} while (0)
> +
> +#define pw_lock_local_irqsave(lock, flags)					\
> +do {										\
> +	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) {		\
> +		preempt_or_migrate_disable();					\
> +		spin_lock_irqsave(this_cpu_ptr(lock.sl), flags);		\
> +	} else {								\
> +		local_lock_irqsave(lock.ll, flags);				\
> +	}									\
> +} while (0)
> +
> +#define pw_trylock(lock, cpu)							\
> +({										\
> +	int t;									\
> +	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
> +		t = spin_trylock(per_cpu_ptr(lock.sl, cpu));			\
> +	else									\
> +		t = local_trylock(lock.ll);					\
> +	t;									\
> +})
> +
> +#define pw_trylock_local(lock)							\
> +({										\
> +	int t;									\
> +	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) {		\
> +		preempt_or_migrate_disable();					\
> +		t = spin_trylock(this_cpu_ptr(lock.sl));			\
> +		if (!t)								\
> +			preempt_or_migrate_enable();				\

This duplicates the RT logic in local_lock_internal.h, and it would be
tempting to propose a spin_local_lock_t that both pw and RT local_lock could rely
upon. But I'm afraid that would produce a less readable result:

- we would need to check CONFIG_PREEMPT_RT there before doing the
  migrate_disable/enable

- RT local locks don't take the lock on IRQ/NMI, which is fine as pw is not
  expected to be used in the non-threaded parts of IRQs nor in NMIs. Still, that's
  one more conditional to add there.

- we'll need to differentiate local/remote operations.

Well, let's stick to what you did for now (Peter might have a different opinion though).

> +	} else {								\
> +		t = local_trylock(lock.ll);					\
> +	}									\
> +	t;									\
> +})
> +
> +#define pw_trylock_irqsave(lock, flags, cpu)					\
> +({										\
> +	int t;									\
> +	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
> +		t = spin_trylock_irqsave(per_cpu_ptr(lock.sl, cpu), flags);	\
> +	else									\
> +		t = local_trylock_irqsave(lock.ll, flags);			\
> +	t;									\
> +})
> +
> +#define pw_unlock(lock, cpu)							\
> +do {										\
> +	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
> +		spin_unlock(per_cpu_ptr(lock.sl, cpu));			\
> +	else									\
> +		local_unlock(lock.ll);					\
> +} while (0)
> +
> +#define pw_unlock_local(lock)							\
> +do {										\
> +	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) {		\
> +		spin_unlock(this_cpu_ptr(lock.sl));				\
> +		preempt_or_migrate_enable();					\
> +	} else {								\
> +		local_unlock(lock.ll);						\
> +	}									\
> +} while (0)
> +
> +#define pw_unlock_irqrestore(lock, flags, cpu)					\
> +do {										\
> +	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
> +		spin_unlock_irqrestore(per_cpu_ptr(lock.sl, cpu), flags);	\
> +	else									\
> +		local_unlock_irqrestore(lock.ll, flags);			\
> +} while (0)
> +
> +#define pw_unlock_local_irqrestore(lock, flags)					\
> +do {										\
> +	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) {		\
> +		spin_unlock_irqrestore(this_cpu_ptr(lock.sl), flags);	\
> +		preempt_or_migrate_enable();					\
> +	} else {								\
> +		local_unlock_irqrestore(lock.ll, flags);			\
> +	}									\
> +} while (0)
> +
> +#define pw_lockdep_assert_held(lock)						\
> +do {										\
> +	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
> +		lockdep_assert_held(this_cpu_ptr(lock.sl));			\
> +	else									\
> +		lockdep_assert_held(this_cpu_ptr(lock.ll));			\
> +} while (0)
> +
> +#define pw_queue_on(c, wq, pw)							\
> +do {										\
> +	int __c = c;								\
> +	struct pw_struct *__pw = (pw);						\
> +	if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) {		\
> +		WARN_ON((__c) != __pw->cpu);					\
> +		__pw->work.func(&__pw->work);					\
> +	} else {								\
> +		queue_work_on(__c, wq, &(__pw)->work);				\
> +	}									\
> +} while (0)
> +
> +/*
> + * Does nothing if PWLOCKS is set to use spinlock, as the task is already done at the
> + * time pw_queue_on() returns.
> + */
> +#define pw_flush(pw)								\
> +do {										\
> +	struct pw_struct *__pw = (pw);						\
> +	if (!static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl))		\
> +		flush_work(&__pw->work);					\
> +} while (0)
> +
> +#define pw_get_cpu(w)			container_of((w), struct pw_struct, work)->cpu
> +
> +#define pw_is_cpu_remote(cpu)		((cpu) != smp_processor_id())
> +
> +#define INIT_PW(pw, func, c)							\
> +do {										\
> +	struct pw_struct *__pw = (pw);						\
> +	INIT_WORK(&__pw->work, (func));						\
> +	__pw->cpu = (c);							\
> +} while (0)
> +
> +#endif /* CONFIG_PWLOCKS */
> +#endif /* LINUX_PWLOCKS_H */
> diff --git a/kernel/pwlocks.c b/kernel/pwlocks.c
> new file mode 100644
> index 000000000000..1ebf5cb979b9
> --- /dev/null
> +++ b/kernel/pwlocks.c
> @@ -0,0 +1,47 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/export.h"
> +#include <linux/sched.h>
> +#include <linux/pwlocks.h>
> +#include <linux/string.h>
> +#include <linux/sched/isolation.h>
> +
> +DEFINE_STATIC_KEY_MAYBE(CONFIG_PWLOCKS_DEFAULT, pw_sl);
> +EXPORT_SYMBOL(pw_sl);
> +
> +static bool pwlocks_param_specified;
> +
> +static int __init pwlocks_setup(char *str)
> +{
> +	int opt;
> +
> +	if (!get_option(&str, &opt)) {
> +		pr_warn("PWLOCKS: invalid pwlocks parameter: %s, ignoring.\n", str);
> +		return 0;
> +	}
> +
> +	if (opt)
> +		static_branch_enable(&pw_sl);
> +	else
> +		static_branch_disable(&pw_sl);
> +
> +	pwlocks_param_specified = true;
> +
> +	return 1;
> +}
> +__setup("pwlocks=", pwlocks_setup);
> +
> +/*
> + * Enable PWLOCKS if CPUs want to avoid kernel noise.
> + */
> +static int __init pwlocks_init(void)
> +{
> +	if (pwlocks_param_specified)
> +		return 0;
> +
> +	if (housekeeping_enabled(HK_TYPE_KERNEL_NOISE))
> +		static_branch_enable(&pw_sl);
> +
> +	return 0;
> +}
> +
> +late_initcall(pwlocks_init);

That should be a pre-SMP initcall. Otherwise you risk some asymmetric calls.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs


* Re: [PATCH v2 1/2] tools/mm: add a standalone GUP microbenchmark
From: Sarthak Sharma @ 2026-05-20 10:15 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, David Hildenbrand, Jonathan Corbet,
	Jason Gunthorpe, John Hubbard, Peter Xu, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Shuah Khan, linux-mm, linux-kselftest, linux-kernel,
	linux-doc, Mark Brown
In-Reply-To: <ag13GbKcLMIoHOHj@kernel.org>

Hi Mike!

On 5/20/26 2:25 PM, Mike Rapoport wrote:
> (added broonie)
> 
> Hi,
> 
> On Tue, May 19, 2026 at 05:35:05PM +0530, Sarthak Sharma wrote:
>> Add a command-line tool for benchmarking get_user_pages fast-path
>> (GUP_FAST), pin_user_pages fast-path (PIN_FAST), and pin_user_pages
>> longterm (PIN_LONGTERM) via the CONFIG_GUP_TEST debugfs interface.
>>
>> When invoked without arguments, gup_bench runs the same matrix of
>> configurations as run_gup_matrix() in run_vmtests.sh: all three GUP
>> commands across read/write, private/shared mappings, and a range of
>> page counts, with THP on/off for regular mappings and hugetlb for huge
>> page mappings.
>>
>> This tool is a mix of reused and new logic. The mapping/setup path comes
>> from selftests/mm/gup_test.c, while the default benchmark matrix matches
>> run_gup_matrix() in run_vmtests.sh. The standalone CLI and tools/mm
>> integration are added here so tools/mm does not depend on kselftest.
>>
>> Add gup_bench to BUILD_TARGETS and INSTALL_TARGETS in tools/mm/Makefile,
>> and ignore the resulting binary in tools/mm/.gitignore. While here, also
>> add the missing thp_swap_allocator_test entry to .gitignore.
>>
>> Add tools/mm/gup_bench.c to the GUP entry in MAINTAINERS.
>>
>> Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
>> Signed-off-by: Sarthak Sharma <sarthak.sharma@arm.com>
>> ---
>>  MAINTAINERS          |   1 +
>>  tools/mm/.gitignore  |   2 +
>>  tools/mm/Makefile    |   6 +-
>>  tools/mm/gup_bench.c | 491 +++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 497 insertions(+), 3 deletions(-)
>>  create mode 100644 tools/mm/gup_bench.c
> 
> ...
>  
>> +/*
>> + * Local HugeTLB setup helpers for gup_bench.
>> + *
>> + * These helpers were copied from tools/testing/selftests/mm/ and adjusted to
>> + * remove the ksft formatting. Keep this copy local so tools/mm does not
>> + * depend on ksft output behavior.
>> + */
> 
> It looks like self tests of at least 5 subsystems beside mm use hugetlb:
> 
> $ git grep -l "Hugepagesize:" tools/testing/selftests/ | grep -v "selftests/mm"
> tools/testing/selftests/arm64/mte/check_hugetlb_options.c
> tools/testing/selftests/cgroup/test_hugetlb_memcg.c
> tools/testing/selftests/kvm/lib/test_util.c
> tools/testing/selftests/memfd/common.c
> tools/testing/selftests/net/tcp_mmap.c
> 
> It seems that we need to better share the common code in
> tools/testing/selftest.
> 
> And adding another copy of the hugetlb detection and setup code does not
> seem like a great idea.

Agreed, but that was the least disruptive approach I could think of.

I am thinking of doing this now: should I move the
hugepage_settings.[ch] to tools/lib/ and move the read_num(),
write_num(), read_file() and write_file() helpers to a separate file in
tools/lib/ itself without any ksft dependency? Then both
tools/testing/selftests/* and tools/mm/ could share the same code.

Please let me know if some different approach is preferred.
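
To make the idea concrete, a ksft-free numeric helper pair could look roughly
like the sketch below. The names demo_read_num() and demo_write_num() and their
signatures are hypothetical (the existing selftest helpers may differ). The
point is only that errors are returned to the caller instead of being reported
through ksft.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical ksft-free helper: read one unsigned long from a file. */
static int demo_read_num(const char *path, unsigned long *val)
{
	FILE *f;
	int ret = 0;

	assert(path && val);

	f = fopen(path, "r");
	if (!f)
		return -errno;
	if (fscanf(f, "%lu", val) != 1)
		ret = -EINVAL;
	fclose(f);
	return ret;
}

/* Hypothetical ksft-free helper: write one unsigned long to a file. */
static int demo_write_num(const char *path, unsigned long val)
{
	FILE *f;
	int ret = 0;

	assert(path);

	f = fopen(path, "w");
	if (!f)
		return -errno;
	if (fprintf(f, "%lu\n", val) < 0)
		ret = -EIO;
	/* Buffered data is flushed here, so a write error may surface late. */
	if (fclose(f) != 0 && !ret)
		ret = -errno;
	return ret;
}
```

Callers in selftests could wrap these with their own ksft reporting, while
tools/mm could print plain errors.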

> 
>> +
>> +static unsigned int psize(void)
>> +{
>> +	static unsigned int __page_size;
>> +
>> +	if (!__page_size)
>> +		__page_size = sysconf(_SC_PAGESIZE);
>> +	return __page_size;
>> +}
>> +
>> +static unsigned long default_huge_page_size(void)
>> +{
>> +	FILE *f = fopen("/proc/meminfo", "r");
>> +	unsigned long hpage_size = 0;
>> +	char buf[256];
>> +
>> +	if (!f)
>> +		return 0;
>> +	while (fgets(buf, sizeof(buf), f)) {
>> +		if (sscanf(buf, "Hugepagesize:       %lu kB", &hpage_size) == 1)
>> +			break;
>> +	}
>> +	fclose(f);
>> +	hpage_size <<= 10;
>> +	return hpage_size;
>> +}
> 



* Re: [PATCH v11 4/6] iio: adc: ad4691: add SPI offload support
From: Jonathan Cameron @ 2026-05-20 10:36 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: David Lechner, Sabau, Radu bogdan, Lars-Peter Clausen,
	Hennerich, Michael, Sa, Nuno, Andy Shevchenko, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Uwe Kleine-König,
	Liam Girdwood, Mark Brown, Linus Walleij, Bartosz Golaszewski,
	Philipp Zabel, Jonathan Corbet, Shuah Khan,
	linux-iio@vger.kernel.org, devicetree@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-pwm@vger.kernel.org,
	linux-gpio@vger.kernel.org, linux-doc@vger.kernel.org
In-Reply-To: <agtZwbeVeZdnlXTI@ashevche-desk.local>

On Mon, 18 May 2026 21:26:09 +0300
Andy Shevchenko <andriy.shevchenko@intel.com> wrote:

> On Mon, May 18, 2026 at 10:16:38AM -0500, David Lechner wrote:
> > On 5/18/26 10:14 AM, Sabau, Radu bogdan wrote:  
> > >> -----Original Message-----
> > >> From: David Lechner <dlechner@baylibre.com>
> > >> Sent: Saturday, May 16, 2026 8:53 PM  
> 
> ...
> 
> > >>> +	if (st->manual_mode && st->offload)
> > >>> +		return sysfs_emit(buf, "%llu\n", READ_ONCE(st->offload-
> > >>> trigger_hz));  
> > >>
> > >> Why do we need READ_ONCE?  
> > > 
> > > trigger_hz is u64 and if the target is 32-bit, a 64-bit access compiles to two 32-bit
> > > instructions, so show() reading it without a lock and store() writing it concurrently
> > > can produce a torn value at the compiler level. READ_ONCE/WRITE_ONCE suppress
> > > the compiler transformations that would allow that splitting or caching. We could
> > > have st->lock in show() instead, but that felt heavier than necessary for a single
> > > scalar where a transiently stale-but-whole read is fine.  
> > 
> > I would go with the mutex. It will be easier for people to understand.  
> 
> But why? READ_ONCE() here is exactly enough. We do not care about
> serialisation, we care only about integrity. With mutex it will confuse
> (some) people more, e.g., me. Because in that case I would think about
> some specific access to it that may happen. Yes, I saw many times the show
> functions that do mutex and then print the result when mutex is not held
> anymore, but for simple cases like here, mutex is overkill. Interestingly
> that using guard()() inside show makes the mentioned functions to print
> (almost) latest value of the variable in question. It narrows window down
> as printing will go inside critical section.
> 

I think it's worth noting that we are very lax in IIO wrt READ_ONCE()
usage.  It might be worth starting to tighten that up for state variable reads
etc., whether they are 64-bit or not (being 64-bit just increases the chances).
In theory compilers can do far too many evil things.  I've been scared
of pushing this because of the massive number of incorrect instances
(and the bad example I set with early drivers :(), but it would be good
to have a few examples in tree so we can start to encourage people to
do that stuff right.

Jonathan
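
The trade-off discussed in this thread (integrity of a single scalar
without serialisation) can be illustrated with a userspace analogy built
on C11 relaxed atomics. This is a sketch under stated assumptions, not
the kernel's READ_ONCE()/WRITE_ONCE() and not the driver's code: the
struct and function names below are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for driver state holding one 64-bit scalar. */
struct demo_state {
	_Atomic uint64_t trigger_hz;
};

/* Writer side: publish a new value as one indivisible store. */
static void demo_store_hz(struct demo_state *st, uint64_t hz)
{
	atomic_store_explicit(&st->trigger_hz, hz, memory_order_relaxed);
}

/*
 * Reader side: take one whole snapshot without a lock, then format it.
 * The value may be stale by the time it is printed, but it is never a
 * mix of two different stores.
 */
static int demo_show_hz(struct demo_state *st, char *buf, size_t len)
{
	uint64_t hz;

	assert(buf && len > 0);
	hz = atomic_load_explicit(&st->trigger_hz, memory_order_relaxed);
	return snprintf(buf, len, "%llu\n", (unsigned long long)hz);
}
```

Relaxed ordering is used because only the value itself matters here; no
other memory needs to be ordered against it.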


* [PATCH AUTOSEL 7.0-6.6] docs: cgroup-v1: Update charge-commit section
From: Sasha Levin @ 2026-05-20 11:18 UTC (permalink / raw)
  To: patches, stable
  Cc: T.J. Mercier, Tejun Heo, Sasha Levin, hannes, mkoutny, corbet,
	cgroups, linux-doc, linux-kernel
In-Reply-To: <20260520111944.3424570-1-sashal@kernel.org>

From: "T.J. Mercier" <tjmercier@google.com>

[ Upstream commit d8769544bde51b0ac980d10f8fe9f9fed6c95995 ]

Commit 1d8f136a421f ("memcg/hugetlb: remove memcg hugetlb
try-commit-cancel protocol") removed mem_cgroup_commit_charge() and
mem_cgroup_cancel_charge(), but the docs still refer to those functions.
There is no longer any charge cancellation.

Update the docs to match the code.

Signed-off-by: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Phase 1: Commit Message Forensics
Step 1.1 Record: subsystem `docs: cgroup-v1`; action verb `Update`;
intent: update `Documentation/admin-guide/cgroup-v1/memcg_test.rst` so
the charge-commit section matches current memcg code.

Step 1.2 Record: tags present: `Signed-off-by: T.J. Mercier
<tjmercier@google.com>` and `Signed-off-by: Tejun Heo <tj@kernel.org>`.
No `Fixes:`, `Reported-by:`, `Tested-by:`, `Reviewed-by:`, `Acked-by:`,
`Link:`, or `Cc: stable@vger.kernel.org` tags in the committed message.

Step 1.3 Record: the body says commit `1d8f136a421f` removed
`mem_cgroup_commit_charge()` and `mem_cgroup_cancel_charge()`, while the
cgroup-v1 memcg docs still name them and describe cancellation.
Symptom/failure mode: incorrect documentation only. Version information:
the referenced removal commit is from 2025-01-13 in the local object
database; candidate commit is from 2026-04-30. Root cause: docs were not
updated when the code was changed.

Step 1.4 Record: this is not a hidden runtime bug fix. It is an explicit
documentation correctness fix.

## Phase 2: Diff Analysis
Step 2.1 Record: one file changed:
`Documentation/admin-guide/cgroup-v1/memcg_test.rst`,
`2 insertions(+), 4 deletions(-)`. No functions modified. Scope:
single-file documentation-only surgical change.

Step 2.2 Record: before, the section was titled `charge-commit-cancel`,
listed `mem_cgroup_commit_charge()` or `mem_cgroup_cancel_charge()`, and
described `cancel()`. After, it is titled `charge-commit`, lists
`commit_charge()`, and removes the cancellation text.
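
As a reading aid for the two-step flow the updated section describes
(usage grows at the try step, the page is associated with the memcg at
commit), here is a toy userspace model. It is a sketch only: every type
and function name is hypothetical, the limit check is an assumption
added so the try step can fail, and none of it is kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL

/* Toy accounting group: a usage counter and an assumed hard limit. */
struct toy_memcg {
	unsigned long usage;
	unsigned long limit;
};

/* Toy page: only records which group it is associated with. */
struct toy_page {
	struct toy_memcg *memcg;
};

/*
 * Step 1: account one page against the group. The page itself is not
 * marked as charged yet. Returns 0 on success, -1 if over the limit.
 */
static int toy_try_charge(struct toy_memcg *cg)
{
	if (cg->usage + TOY_PAGE_SIZE > cg->limit)
		return -1;
	cg->usage += TOY_PAGE_SIZE;
	return 0;
}

/* Step 2: associate the already-accounted page with the group. */
static void toy_commit_charge(struct toy_page *page, struct toy_memcg *cg)
{
	assert(page->memcg == NULL);
	page->memcg = cg;
}
```

There is deliberately no cancel step in the model, matching the updated
text.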

Step 2.3 Record: bug category is documentation/comment correctness.
Specific mechanism: removes stale references to APIs that are absent in
current `6.19.y`/`7.0.y` code and in `origin/master`.

Step 2.4 Record: fix quality is obviously correct for trees whose memcg
code no longer has `mem_cgroup_commit_charge()` /
`mem_cgroup_cancel_charge()`. Runtime regression risk is zero, but
there is a branch-selection concern: `stable/linux-6.12.y` still
contains those functions, so this exact doc update would be misleading
there.

## Phase 3: Git History Investigation
Step 3.1 Record: `git blame` before the candidate shows the stale doc
lines came from `f3f5edc5e41e` in this repository’s history. `git blame
origin/master` shows the changed lines attributed to `d8769544bde51`.

Step 3.2 Record: no `Fixes:` tag. I still inspected referenced commit
`1d8f136a421f26747e58c01281cba5bffae8d289`; it removed prototypes and
implementations for `mem_cgroup_commit_charge()`,
`mem_cgroup_cancel_charge()`, and related hugetlb try/commit/cancel
helpers.

Step 3.3 Record: recent history for
`Documentation/admin-guide/cgroup-v1/memcg_test.rst` on `origin/master`
has only the file introduction/import commit and this candidate. Related
path history shows the candidate was merged via
`cgroup-for-7.1-rc2-fixes`. No prerequisite doc series found.

Step 3.4 Record: author `T.J. Mercier` has at least one other cgroup-
related commit in `origin/master`; Tejun Heo is listed in `MAINTAINERS`
as cgroup maintainer and committed the patch.

Step 3.5 Record: no code dependencies for the patch itself. It can apply
standalone to the current `stable/linux-7.0.y` checkout. Applicability
must be gated by whether the target tree’s code has removed the old
APIs.

## Phase 4: Mailing List And External Research
Step 4.1 Record: `b4 dig -c d8769544bde51...` found the original
submission at
`https://patch.msgid.link/20260430201142.240387-1-tjmercier@google.com`.
`b4 dig -a` found only v1 of a single-patch series. The mbox contains
Tejun’s reply: “Applied to cgroup/for-7.1-fixes.” No NAKs or concerns
found in the saved thread.

Step 4.2 Record: `b4 dig -w` shows recipients included cgroup
maintainers/list and docs list: `tj@kernel.org`, `hannes@cmpxchg.org`,
`mkoutny@suse.com`, `cgroups@vger.kernel.org`, `corbet@lwn.net`,
`skhan@linuxfoundation.org`, `linux-doc@vger.kernel.org`,
`linux-kernel@vger.kernel.org`.

Step 4.3 Record: no `Reported-by` or bug-report `Link:` tags. No syzbot,
bugzilla, or user-impact bug report applies.

Step 4.4 Record: b4 found a standalone single-patch series, not part of
a multi-patch dependency chain.

Step 4.5 Record: web search did not find stable-list discussion for this
exact patch. `WebFetch` to lore/patch.msgid was blocked by Anubis, but
b4 successfully fetched the mbox.

## Phase 5: Code Semantic Analysis
Step 5.1 Record: no functions modified by the diff.

Step 5.2 Record: no callers affected by the diff. For documentation
accuracy, I verified current `stable/linux-7.0.y` code has
`commit_charge()` callers in `charge_memcg()`,
`mem_cgroup_replace_folio()`, and `mem_cgroup_migrate()`.

Step 5.3 Record: no changed callees. The relevant current code has
`commit_charge()` assigning `folio->memcg_data`, and callers also invoke
`memcg1_commit_charge()` where appropriate.

Step 5.4 Record: runtime reachability is not relevant because no
executable code changes.

Step 5.5 Record: similar stale docs pattern exists in stable branches;
however code state differs by branch. `6.19.y`/`7.0.y` have stale docs
and no old API. `6.12.y` still has `mem_cgroup_commit_charge()` and
`mem_cgroup_cancel_charge()` in code.

## Phase 6: Cross-Referencing And Stable Tree Analysis
Step 6.1 Record: buggy stale documentation exists in
`stable/linux-7.0.y`, `stable/linux-6.19.y`, `stable/linux-6.18.y`,
`stable/linux-6.6.y`, `stable/linux-6.1.y`, `stable/linux-5.15.y`, and
`stable/linux-5.10.y` by exact doc grep. I verified the old APIs are
absent in several of those trees, but `stable/linux-6.12.y` still
contains `mem_cgroup_commit_charge()` and `mem_cgroup_cancel_charge()`,
so this exact upstream text is not universally correct for all
maintained stable lines.

Step 6.2 Record: `git apply --check` of the candidate diff succeeds on
the current `stable/linux-7.0.y` checkout. Expected backport difficulty:
clean for trees with matching doc context; branch-specific review needed
for `6.12.y`.

Step 6.3 Record: no related stable fix for this exact doc update found
by local stable branch ancestry checks or web search.

## Phase 7: Subsystem And Maintainer Context
Step 7.1 Record: subsystem is cgroup-v1 memcg documentation.
Criticality: peripheral/runtime-none; important only for documentation
correctness.

Step 7.2 Record: cgroup and memcg areas are active; recent
`origin/master` history includes multiple cgroup/mm fixes. The candidate
was applied by the cgroup maintainer.

## Phase 8: Impact And Risk Assessment
Step 8.1 Record: affected population is documentation readers and
developers/admins consulting old cgroup-v1 memcg internals. No kernel
runtime users are affected.

Step 8.2 Record: trigger condition is reading or relying on the stale
documentation. Unprivileged users cannot trigger a kernel failure
because there is no runtime behavior.

Step 8.3 Record: failure mode is incorrect documentation. Severity: LOW.
It can mislead developers/admins, but it does not fix crash, corruption,
leak, deadlock, or security behavior.

Step 8.4 Record: benefit is low but real for documentation correctness,
especially because the stable rules exception allows
documentation/comment fixes and runtime risk is zero. Risk is very low
for branches whose code matches the new text; risk is documentation-
regression risk if applied to a branch like `6.12.y` where cancellation
APIs still exist.

## Phase 9: Final Synthesis
Step 9.1 Record: evidence for backporting: tiny docs-only patch;
corrects verified stale references in current `7.0.y` and `6.19.y`;
applied by cgroup maintainer; no runtime regression risk; documentation
fixes are allowed. Evidence against: no runtime bug or user-visible
stability issue; not correct for every stable branch, specifically
verified `6.12.y` still has the old commit/cancel APIs. Unresolved: I
did not exhaustively validate every maintained stable branch’s memcg API
state.

Step 9.2 Stable rules checklist:
1. Obviously correct and tested? Yes for matching-code branches;
   verified by code grep, `git diff --check`, and `git apply --check` on
   `7.0.y`.
2. Fixes a real bug that affects users? It fixes incorrect
   documentation, not runtime behavior.
3. Important issue? No runtime severity; LOW documentation correctness
   issue.
4. Small and contained? Yes, 6-line docs-only diff in one file.
5. No new features or APIs? Yes.
6. Can apply to stable trees? Yes to at least current `7.0.y`; should
   not be applied blindly to `6.12.y`.

Step 9.3 Record: exception category is documentation fix.

Step 9.4 Decision: backport is acceptable for stable trees whose memcg
implementation no longer has charge cancellation APIs. It should be
excluded from trees like `6.12.y` where the old APIs still exist. With
that branch-selection caveat, this qualifies under the documentation-fix
exception.

Verification:
- [Phase 1] Parsed candidate commit
  `d8769544bde51b0ac980d10f8fe9f9fed6c95995`: only two Signed-off-by
  tags, no bug/report/stable tags.
- [Phase 2] `git show --stat --patch`: confirmed one docs file, `2
  insertions(+), 4 deletions(-)`.
- [Phase 3] `git show 1d8f136a421f...`: confirmed referenced commit
  removed `mem_cgroup_commit_charge()` and `mem_cgroup_cancel_charge()`
  from mainline code.
- [Phase 3] `git log origin/master --grep`: found candidate and merge
  via `cgroup-for-7.1-rc2-fixes`.
- [Phase 4] `b4 dig`: found lore thread and exact patch-id match.
- [Phase 4] `b4 dig -a`: only v1 single-patch series.
- [Phase 4] `b4 dig -w`: relevant cgroup and docs maintainers/lists were
  included.
- [Phase 4] Saved mbox: confirmed Tejun replied “Applied to
  cgroup/for-7.1-fixes”; no NAKs or stable nomination found.
- [Phase 5] `rg`/`git grep`: confirmed current code uses
  `commit_charge()` and no longer has old API names in `7.0.y`.
- [Phase 6] `git apply --check`: candidate applies cleanly to current
  `7.0.y`.
- [Phase 6] Branch checks: `6.12.y` still has
  `mem_cgroup_commit_charge()` and `mem_cgroup_cancel_charge()`, so this
  patch must not be applied there unchanged.
- [Phase 8] Runtime impact verified as none because only
  `Documentation/admin-guide/cgroup-v1/memcg_test.rst` changes.

**YES**

 Documentation/admin-guide/cgroup-v1/memcg_test.rst | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index 9f8e27355cba5..7c7cd457cf695 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -47,21 +47,19 @@ Please note that implementation details can be changed.
 	  Called when swp_entry's refcnt goes down to 0. A charge against swap
 	  disappears.
 
-3. charge-commit-cancel
+3. charge-commit
 =======================
 
 	Memcg pages are charged in two steps:
 
 		- mem_cgroup_try_charge()
-		- mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()
+		- commit_charge()
 
 	At try_charge(), there are no flags to say "this page is charged".
 	at this point, usage += PAGE_SIZE.
 
 	At commit(), the page is associated with the memcg.
 
-	At cancel(), simply usage -= PAGE_SIZE.
-
 Under below explanation, we assume CONFIG_SWAP=y.
 
 4. Anonymous
-- 
2.53.0



