Linux Documentation
 help / color / mirror / Atom feed
* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Sasha Levin @ 2026-05-19 20:00 UTC (permalink / raw)
  To: Paul Moore
  Cc: Song Liu, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAHC9VhTEs7rCaoPG7cWAzyVkN3ztdadHAq0g8mEy_MgCiCe=0g@mail.gmail.com>

On Mon, May 18, 2026 at 11:08:38PM -0400, Paul Moore wrote:
>On Mon, May 18, 2026 at 8:31 PM Sasha Levin <sashal@kernel.org> wrote:
>> On Mon, May 18, 2026 at 05:29:32PM -0400, Paul Moore wrote:
>> >From my perspective there are two different issues here: should
>> >killswitch be a LSM, and should killswitch leverage kprobes to be able
>> >to "kill" security related symbols.  After all, are we okay with
>> >killswitch killing capable() and friends?
>>
>> killswitch doesn't do it on it's own. It may be instructed by root to do that,
>> at which point that is root's problem.
>
>As I mentioned previously, there are cases where we can restrict
>root's privileges today, but a functional killswitch would allow that
>restriction to be bypassed.  My last email to Song has an example with
>SELinux.

This would be handled by just disabling killswitch in those scenarios like how
we do with lockdown, no?

>> >In my opinion, making killswitch an LSM is more of a procedural item
>> >that deals with how we view a capability like killswitch.  I
>> >personally view killswitch as somewhat similar to Lockdown, which is
>> >why I made the suggestion.
>>
>> Maybe I'm not all that familiar with LSMs, but we would need to be able to stop
>> "random" code paths from executing, and I don't think we can create LSM hooks
>> at that granularity, no?
>
>I don't see any LSM hooks in this revision of killswitch, and as long
>as it is based on a kprobes I can't imagine it would ever use any.  As
>I mentioned above, my killswitch-as-a-LSM comment is primarily about
>killswitch filling a role very similar to Lockdown.

My question was more about how to structure killswitch as an LSM. I want to be
able to poke at pretty much any function in the kernel, rather than restrict
access to a known list of functions.

>> >The use of kprobes, while an interesting idea, presents problems as
>> >allowing any kernel symbol to be killed introduces the potential for
>> >security regressions.  As a reminder, some LSMs, as well as other
>> >kernel subsystems, have mechanisms in place to restrict root and/or
>> >enforce one-way configuration locks; while many people equate "root"
>> >with full control, in many cases today that is not strictly correct.
>>
>> killswitch "complies" with lockdown. Is there a different scenario which we
>> should be blocking?
>
>See the SELinux example I mentioned in my email to Song.
>
>> >Yes, kprobes have been around for some time, this is not a new
>> >problem, but killswitch makes it far more convenient and accessible to
>> >do dangerous things with kprobes.  If killswitch makes it past the RFC
>> >stage without any significant changes to its kill mechanism, we may
>> >need to start considering more liberal usage of NOKPROBE_SYMBOL()
>> >which I think would be an unfortunate casualty.
>>
>> Why? If I don't really mind the security impact, I want to be able to have a
>> killswitch-like interface on my systems. If an attacker is in my systems,
>> killswitch is the least of my concerns I think.
>>
>> If you are security concious, just don't enable CONFIG_KILLSWITCH?
>
>Isn't the whole point of killswitch to have it enabled everywhere
>because you never know when you might want/need it?

Right. We have different usecases. If you want selinux/lockdown/etc and a
really crippled root, that should be an option. If you choose to allow
something like killswitch, it should be an option too.

-- 
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH v3] killswitch: add per-function short-circuit mitigation primitive
From: Sasha Levin @ 2026-05-19 19:57 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Song Liu, linux-kernel, linux-doc, linux-kselftest, bpf,
	live-patching, Greg Kroah-Hartman, Andrew Morton, Jonathan Corbet,
	Mathieu Desnoyers, Joshua Peisach, Florian Weimer, Breno Leitao,
	Anthony Iliopoulos, Michal Hocko, Jiri Olsa, John Fastabend,
	Christian Brauner
In-Reply-To: <b342c38b-7323-4b72-a239-8a574d6bc36b@iogearbox.net>

On Tue, May 19, 2026 at 02:13:26PM +0200, Daniel Borkmann wrote:
>On 5/19/26 1:59 AM, Song Liu wrote:
>>On Mon, May 18, 2026 at 6:33 AM Sasha Levin <sashal@kernel.org> wrote:
>>>On Sun, May 17, 2026 at 11:37:36PM -0700, Song Liu wrote:
>>>>On Sun, May 17, 2026 at 6:49 AM Sasha Levin <sashal@kernel.org> wrote:
>>>>>* fail_function (CONFIG_FUNCTION_ERROR_INJECTION) is disabled in
>>>>>   most production kernels. Even where enabled, it only works on
>>>>>   functions pre-annotated with ALLOW_ERROR_INJECTION() in source -
>>>>>   no help for a freshly-disclosed CVE. The debugfs UI is blocked by
>>>>>   lockdown=integrity and the override is probabilistic.
>>>>>
>>>>>* BPF override (bpf_override_return) honors the same
>>>>>   ALLOW_ERROR_INJECTION() whitelist, and BPF itself is off in many
>>>>>   production kernels. Even where on, the operator interface is
>>>>>   "load a verified BPF program," not a one-line write.
>>>>
>>>>If it is OK for killswitch to attach to any kernel functions, do we still
>>>>need ALLOW_ERROR_INJECTION() for fail_function and BPF
>>>>override? Shall we instead also allow fail_function and BPF override
>>>>to attach to any kernel functions?
>>>
>>>I don't think so. ALLOW_ERROR_INJECTION is not a security mechanism, it's an
>>>integrity/safety mechanism for both bpf and fault injection.
>>>
>>>It protects against a "developer or CI script doing legitimate fault injection
>>>accidentally panics the box" scenario, not an "attacker gets in" one.
>>
>>There really isn't a clear boundary between "security mechanism" and
>>"non-security mechanism". As we are making killswitch available
>>everywhere under root, users will soon learn to use it to do fault injection,
>>and potentially much more scary things. (Think about agents with sudo
>>access).
>
>Fully agree with Song here that there is no clear boundary, and that the
>killswitch could lead to arbitrary, hard to debug breakage if applied to
>the wrong function.. introducing worse bugs than the one being mitigated
>or even /short-circuit LSM enforcement/ (engage security_file_open 0,
>engage cap_capable 0, engage apparmor_* etc).

This is similar to livepatch, right? Do we need guardrails there too?

Or do we just trust root to do the right thing for it's systems without needing
to be it's babysitter?

>The ALLOW_ERROR_INJECTION() provides a curated white-list where you may
>return with an error without causing more severe damage (assuming the
>error handling code is right). The right thing would be to more widely
>apply ALLOW_ERROR_INJECTION() or to figure out a better way to safely
>enable the latter without explicit function annotation.

Sure, this would also work. How do you see this happening? Can we let a certain
user/pid/etc disable the allowlist if they choose to?

>Wrt BPF:
>
>>>>>* BPF override (bpf_override_return) honors the same
>>>>>   ALLOW_ERROR_INJECTION() whitelist, and BPF itself is off in many
>>>>>   production kernels. Even where on, the operator interface is
>>>>>   "load a verified BPF program," not a one-line write.
>
>The claim that BPF itself is off in many production kernels is not really
>true, where did you get that from? All the major distros and cloud providers
>have BPF enabled these days, and even systemd ships BPF programs for
>custom service firewalling etc.

The world is a bit bigger than home distros and cloud providers, but sure - bpf
is enabled widely enough at this point.

How do you see this working with the allowlist?

-- 
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH mm-unstable v17 00/14] khugepaged: mTHP support
From: Nico Pache @ 2026-05-19 19:20 UTC (permalink / raw)
  To: Wei Yang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260518125007.a4z3pw4r73uuwja4@master>

On Mon, May 18, 2026 at 6:50 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Mon, May 11, 2026 at 12:58:00PM -0600, Nico Pache wrote:
> >The following series provides khugepaged with the capability to collapse
> >anonymous memory regions to mTHPs.
> >
> >To achieve this we generalize the khugepaged functions to no longer depend
> >on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> >pages that are occupied (!none/zero). After the PMD scan is done, we use
> >the bitmap to find the optimal mTHP sizes for the PMD range. The
> >restriction on max_ptes_none is removed during the scan, to make sure we
> >account for the whole PMD range in the bitmap. When no mTHP size is
> >enabled, the legacy behavior of khugepaged is maintained.
> >
> >We currently only support max_ptes_none values of 0 or HPAGE_PMD_NR - 1
> >(ie 511). If any other value is specified, the kernel will emit a warning
> >and no mTHP collapse will be attempted. If a mTHP collapse is attempted,
> >but contains swapped out, or shared pages, we don't perform the collapse.
> >It is now also possible to collapse to mTHPs without requiring the PMD THP
> >size to be enabled. These limitations are to prevent collapse "creep"
> >behavior. This prevents constantly promoting mTHPs to the next available
> >size, which would occur because a collapse introduces more non-zero pages
> >that would satisfy the promotion condition on subsequent scans.
> >
> >Patch 1-2:   Generalize hugepage_vma_revalidate and alloc_charge_folio
> >            for arbitrary orders.
> >Patch 3:     Rework max_ptes_* handling into helper functions
> >Patch 4:     Generalize __collapse_huge_page_* for mTHP support
> >Patch 5:     Require collapse_huge_page to enter/exit with the lock dropped
> >Patch 6:     Generalize collapse_huge_page for mTHP collapse
> >Patch 7:     Skip collapsing mTHP to smaller orders
> >Patch 8-9:   Add per-order mTHP statistics and tracepoints
> >Patch 10:    Introduce collapse_allowable_orders helper function
> >Patch 11-13: Introduce bitmap and mTHP collapse support, fully enabled
> >Patch 14:    Documentation
> >
> >Testing:
> >- Built for x86_64, aarch64, ppc64le, and s390x
> >- ran all arches on test suites provided by the kernel-tests project
> >- internal testing suites: functional testing and performance testing
> >- selftests mm
> >- I created a test script that I used to push khugepaged to its limits
> >   while monitoring a number of stats and tracepoints. The code is
> >   available here[1] (Run in legacy mode for these changes and set mthp
> >   sizes to inherit)
> >   The summary from my testings was that there was no significant
> >   regression noticed through this test. In some cases my changes had
> >   better collapse latencies, and was able to scan more pages in the same
> >   amount of time/work, but for the most part the results were consistent.
> >- redis testing. I did some testing with these changes along with my defer
> >  changes (see followup [2] post for more details). We've decided to get
> >  the mTHP changes merged first before attempting the defer series.
> >- some basic testing on 64k page size.
> >- lots of general use.
> >
>
> Two links are missing. I got them from previous version.
>
> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> [2] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/

Oh whoops, ill make sure they are there in the followup

>
> And the test in [1] is a performance test. I am thinking whether we want a
> functional test in selftests.

It also works as a functional test in some regards. The reason i never
pursued self-tests is that I naively thought this was getting merged
6(?) months ago and at the time the selftests infrastructure didn't
support it well. Baolin included patches to clean that up in his shmem
mTHP support patches and added tests for both features. Let's repost
and re-merge this first; then, I will follow up in one or two weeks
regarding self-tests. I'm currently on PTO and only have time to
complete, test, and return the v18 changes to Andrew before they
create a huge merge headache and we miss yet another window.

>
> I did a quick try with following change and some hack.

Thanks Ill use that as a base!

>
> @@ -744,6 +765,51 @@ static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *o
>         ksft_test_result_report(exit_status, "%s\n", __func__);
>  }
>
> +static void collapse_mth_ptes(struct collapse_context *c, struct mem_ops *ops)
> +{
> +       struct thp_settings settings = *thp_current_settings();
> +       void *p;
> +       int i;
> +
> +       /* Disable mthp on fault */
> +       for (i = 0; i < NR_ORDERS; i++) {
> +               settings.hugepages[i].enabled = THP_NEVER;
> +       }
> +       thp_push_settings(&settings);
> +
> +       p = ops->setup_area(1);
> +
> +       ops->fault(p, 0, hpage_pmd_size);
> +
> +       /* Expect all order-0 folio after fault */
> +       memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
> +       expected_orders[0] = hpage_pmd_nr;
> +       if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
> +                                          kpageflags_fd, expected_orders,
> +                                          (pmd_order + 1)))
> +               ksft_exit_fail_msg("Unexpected huge page at fault\n");
> +
> +       /* Enable mthp before collapse */
> +       thp_pop_settings();
> +       settings.hugepages[2].enabled = THP_ALWAYS;
> +       thp_push_settings(&settings);
> +
> +       c->collapse("Collapse fully populated PTE table with order 2", p, 1,
> +                   ops, true);
> +
> +       /* Expect all order-2 folio after collapse */
> +       memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
> +       expected_orders[2] = 1 << (pmd_order - 2);
> +       if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
> +                                          kpageflags_fd, expected_orders,
> +                                          (pmd_order + 1)))
> +               ksft_exit_fail_msg("Unexpected page order\n");
> +
> +       ops->cleanup_area(p, hpage_pmd_size);
> +       thp_pop_settings();
> +       ksft_test_result_report(exit_status, "%s\n", __func__);
> +}
> +
>  static void collapse_swapin_single_pte(struct collapse_context *c, struct mem_ops *ops)
>  {
>         void *p;
>
> This leverage check_after_split_folio_orders() in split_huge_page_test.c to
> check folio order in PMD range.
>
> --
> Wei Yang
> Help you, Help me
>


^ permalink raw reply

* Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Nico Pache @ 2026-05-19 19:05 UTC (permalink / raw)
  To: Lorenzo Stoakes, David Hildenbrand (Arm), Wei Yang, Lance Yang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <agtpK1x27B-E7mMo@lucifer>

On Mon, May 18, 2026 at 1:33 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 03:16:11PM +0200, David Hildenbrand (Arm) wrote:
> > On 5/14/26 05:10, Wei Yang wrote:
> > > On Tue, May 12, 2026 at 03:42:02PM +0800, Lance Yang wrote:
> > >>
> > >> On Mon, May 11, 2026 at 12:58:04PM -0600, Nico Pache wrote:
> > >>> generalize the order of the __collapse_huge_page_* and collapse_max_*
> > >>> functions to support future mTHP collapse.
> > >>>
> > >>> The current mechanism for determining collapse with the
> > >>> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
> > >>> raises a key design issue: if we support user defined max_pte_none values
> > >>> (even those scaled by order), a collapse of a lower order can introduces
> > >>> an feedback loop, or "creep", when max_ptes_none is set to a value greater
> > >>> than HPAGE_PMD_NR / 2. [1]
> > >>>
> > >>> With this configuration, a successful collapse to order N will populate
> > >>> enough pages to satisfy the collapse condition on order N+1 on the next
> > >>> scan. This leads to unnecessary work and memory churn.
> > >>>
> > >>> To fix this issue introduce a helper function that will limit mTHP
> > >>> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> > >>> This effectively supports two modes: [2]
> > >>>
> > >>> - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
> > >>>  that maps the shared zeropage. Consequently, no memory bloat.
> > >>> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
> > >>>  available mTHP order.
> > >>>
> > >>> This removes the possiblilty of "creep", while not modifying any uAPI
> > >>> expectations. A warning will be emitted if any non-supported
> > >>> max_ptes_none value is configured with mTHP enabled.
> > >>>
> > >>> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> > >>> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> > >>> shared or swapped entry.
> > >>>
> > >>> No functional changes in this patch; however it defines future behavior
> > >>> for mTHP collapse.
> > >>>
> > >>> [1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
> > >>> [2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
> > >>>
> > >>> Co-developed-by: Dev Jain <dev.jain@arm.com>
> > >>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> > >>> Signed-off-by: Nico Pache <npache@redhat.com>
> > >>> ---
> > >>> include/trace/events/huge_memory.h |   3 +-
> > >>> mm/khugepaged.c                    | 117 ++++++++++++++++++++---------
> > >>> 2 files changed, 85 insertions(+), 35 deletions(-)
> > >>>
> > >>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > >>> index bcdc57eea270..443e0bd13fdb 100644
> > >>> --- a/include/trace/events/huge_memory.h
> > >>> +++ b/include/trace/events/huge_memory.h
> > >>> @@ -39,7 +39,8 @@
> > >>>   EM( SCAN_STORE_FAILED,          "store_failed")                 \
> > >>>   EM( SCAN_COPY_MC,               "copy_poisoned_page")           \
> > >>>   EM( SCAN_PAGE_FILLED,           "page_filled")                  \
> > >>> - EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
> > >>> + EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")     \
> > >>> + EMe(SCAN_INVALID_PTES_NONE,     "invalid_ptes_none")
> > >>>
> > >>> #undef EM
> > >>> #undef EMe
> > >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > >>> index f68853b3caa7..27465161fa6d 100644
> > >>> --- a/mm/khugepaged.c
> > >>> +++ b/mm/khugepaged.c
> > >>> @@ -61,6 +61,7 @@ enum scan_result {
> > >>>   SCAN_COPY_MC,
> > >>>   SCAN_PAGE_FILLED,
> > >>>   SCAN_PAGE_DIRTY_OR_WRITEBACK,
> > >>> + SCAN_INVALID_PTES_NONE,
> > >>> };
> > >>>
> > >>> #define CREATE_TRACE_POINTS
> > >>> @@ -353,37 +354,60 @@ static bool pte_none_or_zero(pte_t pte)
> > >>>  * PTEs for the given collapse operation.
> > >>>  * @cc: The collapse control struct
> > >>>  * @vma: The vma to check for userfaultfd
> > >>> + * @order: The folio order being collapsed to
> > >>>  *
> > >>>  * Return: Maximum number of none-page or zero-page PTEs allowed for the
> > >>>  * collapse operation.
> > >>>  */
> > >>> -static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> > >>> -         struct vm_area_struct *vma)
> > >>> +static int collapse_max_ptes_none(struct collapse_control *cc,
> > >>> +         struct vm_area_struct *vma, unsigned int order)
> > >>> {
> > >>> + unsigned int max_ptes_none = khugepaged_max_ptes_none;
> > >>>   // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
> > >>
> > >> One thing I still want to call out: kernel code usually uses C-style
> > >> comments :)
> > >>
> > >>>   if (vma && userfaultfd_armed(vma))
> > >>>           return 0;
> > >>>   // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> > >>>   if (!cc->is_khugepaged)
> > >>>           return HPAGE_PMD_NR;
> > >>> - // For all other cases repect the user defined maximum.
> > >>> - return khugepaged_max_ptes_none;
> > >>> + // for PMD collapse, respect the user defined maximum.
> > >>> + if (is_pmd_order(order))
> > >>> +         return max_ptes_none;
> > >>> + /* Zero/non-present collapse disabled. */
> > >>> + if (!max_ptes_none)
> > >>> +         return 0;
> > >>> + // for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
> > >>> + // scale the maximum number of PTEs to the order of the collapse.
> > >>> + if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
> > >>> +         return (1 << order) - 1;
> > >>> +
> > >>> + // We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
> > >>> + // Emit a warning and return -EINVAL.
> > >>> + pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
> > >>> +               KHUGEPAGED_MAX_PTES_LIMIT);
> > >>
> > >> Maybe fallback to 0 instead, as David suggested earlier?
> > >>
> > >
> > > It looks reasonable to fallback to 0.
> > >
> > > But as the updated Document says in patch 14:
> > >
> > >   For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. Any other
> > >   value will emit a warning and no mTHP collapse will be attempted.
> > >
> > > This is why it does like this now.
> > >
> > >     mthp_collapse()
> > >         max_ptes_none = collapse_max_ptes_none();
> > >         if (max_ptes_none < 0)
> > >             return collapsed;
> > >
> > >> max_ptes_none is mostly legacy PMD THP behavior. mTHP is new, and any
> > >> intermediate value in (0, KHUGEPAGED_MAX_PTES_LIMIT) would implicitly
> > >> disable it :(
> > >>
> > >
> > > So it depends on what we want to do here :-)
> > >
> > > For me, I would vote for fallback to 0.
> >
> > At this point I'll prefer to not return errors from collapse_max_ptes_none().
> > It's just rather awkward to return an error deep down in collapse code for a
> > configuration problem.
> >
> > For mthp collapse, we only support max_ptes_none==0 and
> > max_ptes_none=="HPAGE_PMD_NR - 1" (default).
> >
> > If another value is specified while collapsing mTHP, print a warning and treat
> > it as 0 (save value, no creep, no memory waste).
> >
> > In a sense, this is similar to how we handle max_ptes_shared + max_ptes_swap:
> > for mTHP: we always treat them as being 0 for mTHP collapse (and don't issue a
> > warning, because we would issue a warning with the default settings).
> >
> > @Lorenzo, fine with you?
>
> Yes 100%, this sounds sensible both in terms of the error and the default. Let's
> keep our lives simple(-ish) please :)

Ok thank you im glad we finally came to consensus on this! phew!

>
> >
> > --
> > Cheers,
> >
> > David
>
> Cheers, Lorenzo
>


^ permalink raw reply

* Re: [PATCH mm-unstable v17 02/14] mm/khugepaged: generalize alloc_charge_folio()
From: Nico Pache @ 2026-05-19 19:03 UTC (permalink / raw)
  To: Lance Yang, Usama Arif
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <51db205d-77cf-416f-bfe5-fd9d0b12c433@linux.dev>

On Mon, May 18, 2026 at 8:50 AM Lance Yang <lance.yang@linux.dev> wrote:
>
>
>
> On 2026/5/18 19:55, Usama Arif wrote:
> [...]
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index 979885694351..f0e29d5c7b1f 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -1068,21 +1068,26 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> >>   }
> >>
> >>   static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> >> -            struct collapse_control *cc)
> >> +            struct collapse_control *cc, unsigned int order)
> >>   {
> >>      gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> >>                   GFP_TRANSHUGE);
> >>      int node = collapse_find_target_node(cc);
> >>      struct folio *folio;
> >>
> >> -    folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
> >> +    folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
> >>      if (!folio) {
> >>              *foliop = NULL;
> >> -            count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> >> +            if (is_pmd_order(order))
> >> +                    count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> >> +            count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
> >>              return SCAN_ALLOC_HUGE_PAGE_FAIL;
> >>      }
> >>
> >> -    count_vm_event(THP_COLLAPSE_ALLOC);
> >> +    if (is_pmd_order(order))
> >> +            count_vm_event(THP_COLLAPSE_ALLOC);
> >> +    count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
> >> +
> >
> > The vmstat THP_COLLAPSE_ALLOC counter is pmd order only.
> > But after this we have
> >
> >       count_memcg_folio_events(folio, THP_COLLAPSE_ALLOC, 1);
> >
> > which is not being guarded with is_pmd_order().
>
> Good catch!
>
> >
> > I think we want this to be pmd order only as well so that
> > the meaning of the vmstat and cgroup counter remains the same?
>
> Agreed. THP_COLLAPSE_ALLOC should remain PMD order only for
> vmstat and memcg events.
>
> So this should be guarded with is_pmd_order() as well :)

Thanks Usama, I added that.

>
> Cheers, Lance
>


^ permalink raw reply

* Re: [PATCH mm-unstable v17 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
From: Nico Pache @ 2026-05-19 18:21 UTC (permalink / raw)
  To: Lance Yang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <20260512044444.71798-1-lance.yang@linux.dev>

On Mon, May 11, 2026 at 10:45 PM Lance Yang <lance.yang@linux.dev> wrote:
>
>
> On Mon, May 11, 2026 at 12:58:03PM -0600, Nico Pache wrote:
> >The following cleanup reworks all the max_ptes_* handling into helper
> >functions. This increases the code readability and will later be used to
> >implement the mTHP handling of these variables.
> >
> >With these changes we abstract all the madvise_collapse() special casing
> >(dont respect the sysctls) away from the functions that utilize them. And
>
> Nit: s/dont/do not/
>
> >will be used later in this series to cleanly restrict the mTHP collapse
> >behavior.
> >
> >No functional change is intended; however, we are now only reading the
> >sysfs variables once per scan, whereas before these variables were being
> >read on each loop iteration.
> >
> >Suggested-by: David Hildenbrand <david@kernel.org>
> >Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> >Acked-by: Usama Arif <usama.arif@linux.dev>
> >Signed-off-by: Nico Pache <npache@redhat.com>
> >---
> > mm/khugepaged.c | 118 +++++++++++++++++++++++++++++++++---------------
> > 1 file changed, 82 insertions(+), 36 deletions(-)
> >
> >diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >index f0e29d5c7b1f..f68853b3caa7 100644
> >--- a/mm/khugepaged.c
> >+++ b/mm/khugepaged.c
> >@@ -348,6 +348,62 @@ static bool pte_none_or_zero(pte_t pte)
> >       return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
> > }
> >
> >+/**
> >+ * collapse_max_ptes_none - Calculate maximum allowed none-page or zero-page
> >+ * PTEs for the given collapse operation.
> >+ * @cc: The collapse control struct
> >+ * @vma: The vma to check for userfaultfd
> >+ *
> >+ * Return: Maximum number of none-page or zero-page PTEs allowed for the
> >+ * collapse operation.
> >+ */
> >+static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> >+              struct vm_area_struct *vma)
> >+{
> >+      // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
> >+      if (vma && userfaultfd_armed(vma))
> >+              return 0;
> >+      // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> >+      if (!cc->is_khugepaged)
> >+              return HPAGE_PMD_NR;
> >+      // For all other cases repect the user defined maximum.
> >+      return khugepaged_max_ptes_none;
>
> Nit: kernel code usually uses C-style comments. This could be:
>
> /* For all other cases, respect the user-defined maximum. */
>
> Also, s/repect/respect/.
>
> >+}
> >+
> >+/**
> >+ * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
> >+ * anonymous pages for the given collapse operation.
> >+ * @cc: The collapse control struct
> >+ *
> >+ * Return: Maximum number of PTEs that map shared anonymous pages for the
> >+ * collapse operation
> >+ */
> >+static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
> >+{
> >+      // for MADV_COLLAPSE, do not restrict the number of PTEs that map shared
> >+      // anonymous pages.
>
> Ditto.
>
> >+      if (!cc->is_khugepaged)
> >+              return HPAGE_PMD_NR;
> >+      return khugepaged_max_ptes_shared;
> >+}
> >+
> >+/**
> >+ * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
> >+ * maximum allowed non-present pagecache entries for the given collapse operation.
> >+ * @cc: The collapse control struct
> >+ *
> >+ * Return: Maximum number of non-present PTEs or the maximum allowed non-present
> >+ * pagecache entries for the collapse operation.
> >+ */
> >+static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
> >+{
> >+      // for MADV_COLLAPSE, do not restrict the number PTEs entries or
> >+      // pagecache entries that are non-present.
>
> Same here.
>
> >+      if (!cc->is_khugepaged)
> >+              return HPAGE_PMD_NR;
> >+      return khugepaged_max_ptes_swap;
> >+}
> >+
> > int hugepage_madvise(struct vm_area_struct *vma,
> >                    vm_flags_t *vm_flags, int advice)
> > {
> >@@ -546,21 +602,19 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >       pte_t *_pte;
> >       int none_or_zero = 0, shared = 0, referenced = 0;
> >       enum scan_result result = SCAN_FAIL;
> >+      unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> >+      unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>
> Nit: could these be const, as David suggested earlier?
>
> Nothing else jumped out at me. LGTM!
>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>

Ack on all the above thank you !

>


^ permalink raw reply

* Re: [PATCH mm-unstable v17 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
From: Nico Pache @ 2026-05-19 18:21 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Usama Arif
In-Reply-To: <4566170a-7e3d-49e4-baab-ba2790c198db@kernel.org>

On Tue, May 12, 2026 at 1:30 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 5/11/26 20:58, Nico Pache wrote:
> > The following cleanup reworks all the max_ptes_* handling into helper
> > functions. This increases the code readability and will later be used to
> > implement the mTHP handling of these variables.
> >
> > With these changes we abstract all the madvise_collapse() special casing
> > (dont respect the sysctls) away from the functions that utilize them. And
> > will be used later in this series to cleanly restrict the mTHP collapse
> > behavior.
> >
> > No functional change is intended; however, we are now only reading the
> > sysfs variables once per scan, whereas before these variables were being
> > read on each loop iteration.
> >
> > Suggested-by: David Hildenbrand <david@kernel.org>
> > Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>
> Some nits when re-reading:
>
> > Acked-by: Usama Arif <usama.arif@linux.dev>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 118 +++++++++++++++++++++++++++++++++---------------
> >  1 file changed, 82 insertions(+), 36 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index f0e29d5c7b1f..f68853b3caa7 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -348,6 +348,62 @@ static bool pte_none_or_zero(pte_t pte)
> >       return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
> >  }
> >
> > +/**
> > + * collapse_max_ptes_none - Calculate maximum allowed none-page or zero-page
>
> I know, it's painful, but ...
>
> There is no "none-page".

Yeah I think you mentioned that last review... sorry!

>
> Calculate maximum allowed empty PTEs or PTEs mapping the shared zeropage ... ?
>
> > + * PTEs for the given collapse operation.
>
> We usually indent here (second line of subject), I think. Same applies to the
> other doc below.

Hmm tbh I couldn't find a example of what you meant here. There are
some that put a space between the first sentence and the @ list.

>
> > + * @cc: The collapse control struct
> > + * @vma: The vma to check for userfaultfd
> > + *
> > + * Return: Maximum number of none-page or zero-page PTEs allowed for the
> > + * collapse operation.
>
> Same here.
>
> > + */
> > +static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> > +             struct vm_area_struct *vma)
> > +{
> > +     // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
>
> Lance commented on the comment style.
>
> Is this comment really required? It's pretty self-documenting already.

Dropped it, thanks.

>
> > +     if (vma && userfaultfd_armed(vma))
> > +             return 0;
> > +     // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> > +     if (!cc->is_khugepaged)
> > +             return HPAGE_PMD_NR;
> > +     // For all other cases repect the user defined maximum.
> > +     return khugepaged_max_ptes_none;
> > +}
> > +
> --
> Cheers,
>
> David
>


^ permalink raw reply

* Re: [PATCH v2 0/2] selftests/mm: separate GUP microbenchmarking from functional testing
From: Andrew Morton @ 2026-05-19 18:20 UTC (permalink / raw)
  To: Sarthak Sharma
  Cc: David Hildenbrand, Jonathan Corbet, Jason Gunthorpe, John Hubbard,
	Peter Xu, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	linux-mm, linux-kselftest, linux-kernel, linux-doc
In-Reply-To: <20260519120506.184512-1-sarthak.sharma@arm.com>

On Tue, 19 May 2026 17:35:04 +0530 Sarthak Sharma <sarthak.sharma@arm.com> wrote:

> gup_test.c currently serves two distinct purposes: microbenchmarking
> (GUP_FAST_BENCHMARK, PIN_FAST_BENCHMARK, PIN_LONGTERM_BENCHMARK) and
> functional correctness testing (GUP_BASIC_TEST, PIN_BASIC_TEST,
> DUMP_USER_PAGES_TEST). Mixing these in a single binary means functional
> tests cannot be run or reported individually, and run_vmtests.sh must
> invoke the binary multiple times with different flag combinations to
> cover all configurations. This patch series separates the two concerns:
> tools/mm/gup_bench for benchmarking and tools/testing/selftests/mm/gup_test
> for functional testing.
> 
> Patch 1 adds tools/mm/gup_bench.c, a standalone microbenchmark for
> GUP_FAST, PIN_FAST and PIN_LONGTERM via the CONFIG_GUP_TEST debugfs
> interface. It runs the same matrix of configurations as the old
> run_gup_matrix() shell function (all three commands, read/write,
> private/shared, four page counts, THP on/off, hugetlb), but as a
> standalone C program under tools/mm with no dependency on kselftest.
> 
> Patch 2 rewrites gup_test.c as a kselftest harness-based selftest. It
> covers all five GUP kernel functions (get_user_pages, get_user_pages_fast,
> pin_user_pages, pin_user_pages_fast, pin_user_pages with FOLL_LONGTERM)
> plus DUMP_USER_PAGES_TEST, across 12 mapping configurations (THP on,
> THP off and hugetlb, each across private/shared and read/write variants)
> and four batch sizes (1, 512, 123, all pages). Results are reported as
> standard TAP output with no command-line arguments required.

Thanks.  AI review asked a few things which seem fairly minor to me,
but probably legitimate:
	https://sashiko.dev/#/patchset/20260519120506.184512-1-sarthak.sharma@arm.com

^ permalink raw reply

* Re: [PATCH v5 00/14] module: Introduce hash-based integrity checking
From: Thomas Weißschuh @ 2026-05-19 18:19 UTC (permalink / raw)
  To: Sami Tolvanen
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Nathan Chancellor,
	Nicolas Schier, Arnd Bergmann, Luis Chamberlain, Petr Pavlu,
	Daniel Gomez, Paul Moore, James Morris, Serge E. Hallyn,
	Jonathan Corbet, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Naveen N Rao, Mimi Zohar, Roberto Sassu,
	Dmitry Kasatkin, Eric Snowberg, Nicolas Schier, Daniel Gomez,
	Aaron Tomlin, Christophe Leroy (CS GROUP), Nicolas Bouchinet,
	Xiu Jianfeng, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jiri Olsa, bpf, Fabian Grünbichler, Arnout Engelen,
	Mattia Rizzolo, kpcyrd, Christian Heusel, Câju Mihai-Drosi,
	Eric Biggers, Sebastian Andrzej Siewior, linux-kbuild,
	linux-kernel, linux-arch, linux-modules, linux-security-module,
	linux-doc, linuxppc-dev, linux-integrity, debian-kernel
In-Reply-To: <20260518215543.GA1878854@google.com>

Hi Sami,

On 2026-05-18 21:55:43+0000, Sami Tolvanen wrote:
> On Tue, May 05, 2026 at 11:05:04AM +0200, Thomas Weißschuh wrote:
> > The current signature-based module integrity checking has some drawbacks
> > in combination with reproducible builds. Either the module signing key
> > is generated at build time, which makes the build unreproducible, or a
> > static signing key is used, which precludes rebuilds by third parties
> > and makes the whole build and packaging process much more complicated.
> > 
> > The goal is to reach bit-for-bit reproducibility. Excluding certain
> > parts of the build output from the reproducibility analysis would be
> > error-prone and force each downstream consumer to introduce new tooling.
> > 
> > Introduce a new mechanism to ensure only well-known modules are loaded
> > by embedding a merkle tree root of all modules built as part of the full
> > kernel build into vmlinux.
> 
> I noticed Sashiko had a few concerns about the build changes. Would you
> mind taking a look to see if they're valid?
> 
> https://sashiko.dev/#/patchset/20260505-module-hashes-v5-0-e174a5a49fce%40weissschuh.net

I definitively have these on my list. Unfortunately I am busy with
something else right now. But this series and the Sashiko comments
are next.


Thomas

^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-19 18:07 UTC (permalink / raw)
  To: Christian König
  Cc: Albert Esteve, Christian Brauner, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
	Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <01b6eefc-c107-4f8c-9d7c-3b86f54cabaa@amd.com>

On Tue, May 19, 2026 at 12:19 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/19/26 01:39, T.J. Mercier wrote:
> > On Mon, May 18, 2026 at 7:07 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >> On 5/18/26 14:50, Albert Esteve wrote:
> >>> On Mon, May 18, 2026 at 9:20 AM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>>
> >>>> On 5/15/26 19:06, T.J. Mercier wrote:
> >>>>> On Fri, May 15, 2026 at 6:53 AM Christian Brauner <brauner@kernel.org> wrote:
> >>>>>>
> >>>>>> On Tue, May 12, 2026 at 11:10:44AM +0200, Albert Esteve wrote:
> >>>>>>> On embedded platforms a central process often allocates dma-buf
> >>>>>>> memory on behalf of client applications. Without a way to
> >>>>>>> attribute the charge to the requesting client's cgroup, the
> >>>>>>> cost lands on the allocator, making per-cgroup memory limits
> >>>>>>> ineffective for the actual consumers.
> >>>>>>>
> >>>>>>> Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> >>>>>>
> >>>>>> Please be aware that pidfds come in two flavors:
> >>>>>>
> >>>>>> thread-group pidfds and thread-specific pidfds. Make sure that your API
> >>>>>> doesn't implicitly depend on this distinction not existing.
> >>>>>
> >>>>> Hi Christian,
> >>>>>
> >>>>> Memcg is not a controller that supports "thread mode" so all threads
> >>>>> in a group should belong to the same memcg.
> >>>>
> >>>> BTW: Exactly that is the requirement automotive has with their native context use case.
> >>>>
> >>>> The use case is that you have a deamon which has multiple threads were each one is acting on behalve of some other process.
> >>>>
> >>>> At the moment we basically say they are simply not using cgroups for that use case, but it would be really nice if we could handle that as well.
> >>>>
> >>>> Summarizing the requirement of that use case: You need a different cgroup for each thread of a process.
> >>>
> >>> Hi Christian,
> >>>
> >>> Thanks for sharing this atuomotive usecase. If I understand correctly,
> >>> the actual requirement is attributing dma-buf charges to the right
> >>> client, not putting each daemon thread in a different cgroup?
> >>
> >> Nope, exactly that's the difference.
> >>
> >> The thread acts as a filtering agent for both memory allocation and command submission for somebody else, the process on which behalve the daemon does things can even be in a client VM, completely remote over some network or even something like a microcontroller.
> >>
> >> Everything the thread does regarding CPU time, GPU driver memory allocation as well as resources like GPU processing and I/O time etc.. needs to be accounted to one client which can be different for each thread of the process.
> >>
> >> The only thing which is shared with the main process thread is CPU memory resources, e.g. malloc() because that is basically just needed for housekeeping and pretty much irrelevant for this kind of use case.
> >>
> >> The problem is now you can't do that with cgroups at the moment but unfortunately only the kernel has the information you need to know to do this.
> >>
> >> So what you end up with is to define tons of interfaces just to get the necessary information from the kernel into userspace and then essentially duplicate the same infrastructure cgroup provides in the kernel in userspace again.
> >>
> >>> If so,
> >>> the `charge_pid_fd` approach achieves this directly by passing the
> >>> client's `pid_fd`, without needing to add per-thread cgroup
> >>> infrastructure.
> >>
> >> Well it's already a massive improvemt, we could basically stop doing the whole duplication part for the GPU driver stack and just use cgroups for this part.
> >>
> >> Doing that automatically for CPU and I/O time would just be nice to have additionally.
> >>
> >> Regards,
> >> Christian.
> >
> > Hopefully I'm following correctly here.... So you are duplicating the
> > GPU driver stack to achieve remote accounting on a per-thread basis?
>
> Not quite, we are duplicating the handling cgroup provides in the kernel in userspace.
>
> For this memory usage information as well as execution times of the GPU kernel driver is exposed in fdinfo for example.

Oh I see, thanks.

> > Does this mean for GPU allocations you currently have some GFP_ACCOUNT
> > magic in your driver to attribute GPU memory to the correct remote
> > client?
>
> No, we just expose what the kernel driver has allocated for itself. E.g. page tables, buffers etc...
>
> When userspace allocates something using memfd_create() for example we just ignore that.
>
> > So this series would close the gap for dma-buf allocations,
> > but what about private GPU driver memory allocated on behalf of a
> > client?
>
> Well we would need a cgroup which isn't associated with any process were we could charge the GPU driver allocations against.
>
> But good point, charging against a pid wouldn't work in this use case.

It would be pretty low overhead to put a process doing while(1)
pause(); in a separate cgroup for this purpose, but I guess a fd for
the actual cgroup would be a little cleaner in this case.

> Regards,
> Christian.

^ permalink raw reply

* Re: [Linaro-mm-sig] Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-19 18:06 UTC (permalink / raw)
  To: Christian König
  Cc: Barry Song, Albert Esteve, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <8a13b1ad-f1be-4ef4-905e-0d9828ae8cb5@amd.com>

On Tue, May 19, 2026 at 12:10 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/19/26 01:00, Barry Song wrote:
> > On Mon, May 18, 2026 at 3:34 PM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >> On 5/16/26 11:19, Barry Song wrote:
> >>> On Thu, May 14, 2026 at 12:35 AM T.J. Mercier <tjmercier@google.com> wrote:
> >>> [...]
> >>>>>> I have a question about this part. Albert I guess you are interested
> >>>>>> only in accounting dmabuf-heap allocations, or do you expect to add
> >>>>>> __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other
> >>>>>> non-dmabuf-heap exporters?
> >>>>>
> >>>>> We're scoping this to dma-buf heaps for now. CMA heaps and the dmem
> >>>>> controller are on the radar for follow-up/parallel work (there will be
> >>>>> dragons and will surely need discussion). For DRM and V4L2 the
> >>>>> long-term intent is migration to heaps, which would make direct
> >>>>> accounting on those paths unnecessary.
> >>>>
> >>>> Ah I see. GEM buffers exported to dmabufs are what I had in mind. I
> >>>> guess this would only leave the odd non-DRM driver with the need to
> >>>> add their own accounting calls, which I don't expect would be a big
> >>>> problem.
> >>>>
> >>>
> >>> sounds like we still have a long way to go to correctly account for
> >>> various v4l2, drm, GEM, CMA, etc. In patch 1, the charging is done in
> >>> dma_buf_export(), so I guess it covers all dma-buf types except
> >>> dma_heap, but the problem is that it has no remote charging support at
> >>> all?
> >>
> >> No, just the other way around
> >>
> >> DMA-buf heaps can be handled here because we know that it is pure system memory and nothing special so memcg always applies.
> >>
> >> dma_buf_export() on the other hand handles tons of different use cases, ranging from buffer accounted to dmem, over special resources which aren't even memory all the way to buffers which can migrate from dmem to memcg and back during their lifetime.
> >>
> >
> > Hi Christian,
> >
> > Thanks very much for your explanation. So basically it seems that
> > dma_buf_export() is not the proper place to charge, since it may end up
> > mixing in non-system-memory accounting?
>
> Yes, exactly that.
>
> > My question is also about the global view for both heap and non-heap cases.
> > After reading the discussion, I’ve tried to summarize it—please let me know
> > if my understanding is correct.
> >
> > for dma_heap, we have the ioctl DMA_HEAP_IOCTL_ALLOC, where users can pass a
> > remote pidfd or similar information to indicate where the dma-buf should be
> > charged, as in Albert's patchset.
>
> Well that's the current proposal, but I think we need to come up with something more general.
>
> > For non-dma_heap dma-bufs, we don’t have an obvious userspace entry point that
> > triggers the allocation. So we likely need other approaches. We could either
> > move more drivers over to dma-heap, or introduce something like
> > DMA_BUF_IOCTL_XFER_CHARGE, as you are discussing, to let userspace explicitly
> > declare a charge.
>
> Yeah but that's not only for DMA-buf, we need that for file descriptors returned by memfd_create() as well.

memfds get charged on fault, so an allocator shouldn't currently be
charged just for creating the fd. Unlike system/CMA heap buffers, the
shmem backing a memfd / udmabuf is LRU memory, and swapping the memcg
owner of those pages is a more-involved process which is not supported
by memcg v2. There used to be some support in memcg v1, but it was
removed. Commit e548ad4a7cbf ("mm: memcg: move charge migration code
to memcontrol-v1.c ") said, "It's a fairly large and complicated code
which created a number of problems in the past." So I'm not sure how
much appetite there would be to support it in v2 for this.

^ permalink raw reply

* Re: [PATCH 1/6] alloc_tag: add ioctl to /proc/allocinfo
From: Suren Baghdasaryan @ 2026-05-19 17:42 UTC (permalink / raw)
  To: Hao Ge
  Cc: Abhishek Bapat, Shuah Khan, Jonathan Corbet, linux-doc,
	linux-kernel, linux-mm, Sourav Panda, Kent Overstreet,
	Andrew Morton
In-Reply-To: <c627136d-8060-4e2d-8473-0fe322ce1e6c@linux.dev>

On Mon, May 18, 2026 at 7:53 PM Hao Ge <hao.ge@linux.dev> wrote:
>
> Hi Abhishek
>
>
> Thanks for the follow-up.
>
>
> On 2026/5/19 07:41, Abhishek Bapat wrote:
> > On Wed, May 13, 2026 at 9:38 PM Hao Ge<hao.ge@linux.dev>  wrote:
> >> Hi Suren and Abhishek
> >>
> >>
> >> Thanks for the patch! A couple of minor comments below.
> >>
> >>
> >> On 2026/5/5 07:36, Abhishek Bapat wrote:
> >>> From: Suren Baghdasaryan<surenb@google.com>
> >>>
> >>> Add the following ioctl commands for /proc/allocinfo file:
> >>>
> >>> ALLOCINFO_IOC_CONTENT_ID - gets content identifier which can be used
> >>> to check whether the file content has changed specifically due to module
> >>> load/unload. Every time a module is loaded / unloaded, the returned
> >>> value will be different. By comparing the identifier value at the
> >>> beginning and at the end of the content retrieval operation, users can
> >>> validate retrieved information for consistency.
> >>>
> >>> ALLOCINFO_IOC_GET_AT - gets the record at the specified position. This
> >>> is the position of a record in /proc/allocinfo.
> >>>
> >>> ALLOCINFO_IOC_GET_NEXT - gets the record next to the last retrieved
> >>> one. If no records were previously retrieved, returns the first
> >>> record.
> >>>
> >>> Signed-off-by: Suren Baghdasaryan<surenb@google.com>
> >>> Signed-off-by: Abhishek Bapat<abhishekbapat@google.com>
> >>> ---
> >>>    .../userspace-api/ioctl/ioctl-number.rst      |   2 +
> >>>    include/linux/codetag.h                       |   1 +
> >>>    include/uapi/linux/alloc_tag.h                |  54 ++++++
> >>>    lib/alloc_tag.c                               | 178 +++++++++++++++++-
> >>>    lib/codetag.c                                 |  11 ++
> >>>    5 files changed, 244 insertions(+), 2 deletions(-)
> >>>    create mode 100644 include/uapi/linux/alloc_tag.h
> >>>
> >>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> index 331223761fff..84f6808a8578 100644
> >>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> @@ -349,6 +349,8 @@ Code  Seq#    Include File                                             Comments
> >>>                                                                           <mailto:luzmaximilian@gmail.com>
> >>>    0xA5  20-2F  linux/surface_aggregator/dtx.h                            Microsoft Surface DTX driver
> >>>                                                                           <mailto:luzmaximilian@gmail.com>
> >>> +0xA6  00-0F  uapi/linux/alloc_tag.h                                    Memory allocation profiling
> >>> +<mailto:surenb@google.com>
> >>>    0xAA  00-3F  linux/uapi/linux/userfaultfd.h
> >>>    0xAB  00-1F  linux/nbd.h
> >>>    0xAC  00-1F  linux/raw.h
> >>> diff --git a/include/linux/codetag.h b/include/linux/codetag.h
> >>> index 8ea2a5f7c98a..2bcd4e7c809e 100644
> >>> --- a/include/linux/codetag.h
> >>> +++ b/include/linux/codetag.h
> >>> @@ -76,6 +76,7 @@ struct codetag_iterator {
> >>>
> >>>    void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
> >>>    bool codetag_trylock_module_list(struct codetag_type *cttype);
> >>> +unsigned long codetag_get_content_id(struct codetag_type *cttype);
> >>>    struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
> >>>    struct codetag *codetag_next_ct(struct codetag_iterator *iter);
> >>>
> >>> diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
> >>> new file mode 100644
> >>> index 000000000000..e9a5b55fcc7a
> >>> --- /dev/null
> >>> +++ b/include/uapi/linux/alloc_tag.h
> >>> @@ -0,0 +1,54 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> >>> +/*
> >>> + *  include/linux/alloc_tag.h
> >>> + */
> >>> +
> >>> +#ifndef _UAPI_ALLOC_TAG_H
> >>> +#define _UAPI_ALLOC_TAG_H
> >>> +
> >>> +#include <linux/types.h>
> >>> +
> >>> +#define ALLOCINFO_STR_SIZE   64
> >>> +
> >>> +struct allocinfo_content_id {
> >>> +     __u64 id;
> >>> +};
> >>> +
> >>> +struct allocinfo_tag {
> >>> +     /* Longer names are trimmed */
> >>> +     char modname[ALLOCINFO_STR_SIZE];
> >>> +     char function[ALLOCINFO_STR_SIZE];
> >>> +     char filename[ALLOCINFO_STR_SIZE];
> >>> +     __u64 lineno;
> >>> +};
> >>> +
> >>> +struct allocinfo_counter {
> >>> +     __u64 bytes;
> >>> +     __u64 calls;
> >>> +     __u8 accurate;
> >>> +     __u8 pad[7]; /* Add alignment to not break the 32-bit compatible interface */
> >>> +};
> >>> +
> >>> +struct allocinfo_tag_data {
> >>> +     struct allocinfo_tag tag;
> >>> +     struct allocinfo_counter counter;
> >>> +};
> >>> +
> >>> +struct allocinfo_get_at {
> >>> +     __u64 pos;      /* input */
> >>> +     struct allocinfo_tag_data data;
> >>> +};
> >>> +
> >>> +#define _ALLOCINFO_IOC_CONTENT_ID    0
> >>> +#define _ALLOCINFO_IOC_GET_AT                1
> >>> +#define _ALLOCINFO_IOC_GET_NEXT              2
> >>> +
> >>> +#define ALLOCINFO_IOC_BASE           0xA6
> >>> +#define ALLOCINFO_IOC_CONTENT_ID     _IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_CONTENT_ID,     \
> >>> +                                          struct allocinfo_content_id)
> >>> +#define ALLOCINFO_IOC_GET_AT         _IOWR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_AT,        \
> >>> +                                           struct allocinfo_get_at)
> >>> +#define ALLOCINFO_IOC_GET_NEXT               _IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_NEXT,       \
> >>> +                                          struct allocinfo_tag_data)
> >>> +
> >>> +#endif /* _UAPI_ALLOC_TAG_H */
> >>> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> >>> index ed1bdcf1f8ab..5c24d2f954d4 100644
> >>> --- a/lib/alloc_tag.c
> >>> +++ b/lib/alloc_tag.c
> >>> @@ -14,6 +14,7 @@
> >>>    #include <linux/string_choices.h>
> >>>    #include <linux/vmalloc.h>
> >>>    #include <linux/kmemleak.h>
> >>> +#include <uapi/linux/alloc_tag.h>
> >>>
> >>>    #define ALLOCINFO_FILE_NAME         "allocinfo"
> >>>    #define MODULE_ALLOC_TAG_VMAP_SIZE  (100000UL * sizeof(struct alloc_tag))
> >>> @@ -46,6 +47,9 @@ int alloc_tag_ref_offs;
> >>>    struct allocinfo_private {
> >>>        struct codetag_iterator iter;
> >>>        bool print_header;
> >>> +     /* ioctl uses a separate iterator not to interfere with reads */
> >>> +     struct codetag_iterator ioctl_iter;
> >>> +     bool positioned; /* seq_open_private() sets to 0 */
> >>>    };
> >>>
> >>>    static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> >>> @@ -125,6 +129,177 @@ static const struct seq_operations allocinfo_seq_op = {
> >>>        .show   = allocinfo_show,
> >>>    };
> >>>
> >>> +static int allocinfo_open(struct inode *inode, struct file *file)
> >>> +{
> >>> +     return seq_open_private(file, &allocinfo_seq_op,
> >>> +                             sizeof(struct allocinfo_private));
> >>> +}
> >>> +
> >>> +static int allocinfo_release(struct inode *inode, struct file *file)
> >>> +{
> >>> +     return seq_release_private(inode, file);
> >>> +}
> >>> +
> >>> +static const char *allocinfo_str(const char *str)
> >>> +{
> >>> +     size_t len = strlen(str);
> >>> +
> >>> +     /* Keep an extra space for the trailing NULL. */
> >>> +     if (len >= ALLOCINFO_STR_SIZE)
> >>> +             str += (len - ALLOCINFO_STR_SIZE) + 1;
> >>> +     return str;
> >>> +}
> >>> +
> >>> +/* Copy a string and trim from the beginning if it's too long */
> >>> +static void allocinfo_copy_str(char *dest, const char *src)
> >>> +{
> >>> +     strscpy(dest, allocinfo_str(src), ALLOCINFO_STR_SIZE);
> >>> +}
> >>> +
> >>> +static void allocinfo_to_params(struct codetag *ct,
> >>> +                             struct allocinfo_tag_data *data)
> >>> +{
> >>> +     struct alloc_tag *tag = ct_to_alloc_tag(ct);
> >>> +     struct alloc_tag_counters counter = alloc_tag_read(tag);
> >>> +
> >>> +     if (ct->modname)
> >>> +             allocinfo_copy_str(data->tag.modname, ct->modname);
> >>> +     else
> >>> +             data->tag.modname[0] = '\0';
> >> Minor nit about allocinfo_to_params():
> >>
> >> When modname is NULL (built-in kernel code), the current code sets it
> >>
> >> to an empty string:
> >>
> >>       if (ct->modname)
> >>
> >>           allocinfo_copy_str(data->tag.modname, ct->modname);
> >>
> >>       else
> >>
> >>           data->tag.modname[0] = '\0';
> >>
> >> This is of course workable in userspace by checking for an empty
> >>
> >> string, but I was wondering if it would be cleaner to use "vmlinux"
> >>
> >> as a default:
> >>
> >> else
> >>
> >>             allocinfo_copy_str(data->tag.modname, "vmlinux");
> >>
> >>
> >> For some context, in our memory analysis workflow we often group
> >>
> >> allocations by module to get a quick overview of where memory goes,
> >>
> >> for example:
> >>
> >> vmlinux:    2.1 GB    (kernel core)
> >>
> >> nvidia:     1.2 GB    (GPU driver)
> >>
> >> iwlwifi:    800 MB    (WiFi driver)
> >>
> >> ext4:       500 MB    (filesystem)
> >>
> >> Having a consistent identifier for kernel built-in allocations would
> >>
> >> avoid each userspace tool needing to handle the empty string as a
> >>
> >> special case. Totally fine if this is intentional though.
> >>
> > Thanks for bringing this up, I can certainly make this change.
> > However, the information is not currently exposed this way through
> > /proc/allocinfo. /proc/allocinfo does not categorize kernel non-module
> > allocations as vmlinux, so there will a delta between how IOCTL and
> > /proc/allocinfo behave. Suren, could you comment on whether this
> > recommendation is fine by you?
> >
> Right, /proc/allocinfo indeed doesn't categorize them as vmlinux currently.
>
> It's just that in practice we often group allocations by module, so
> having "vmlinux" as a default
>
> would be convenient. Let's wait for Suren's input.

Hi Folks,
I would prefer to keep it empty because vmlinux is not really a module
and hardcoding this name also seems suboptimal (in case it ever
changes). Empty string also aligns with how we output /proc/allocinfo
data. If the symbol is in the kernel itself, we do not display the
module name at all. So, all in all, unless there is a strong reason
against it, I think we should keep it empty.

>
> >>> +     allocinfo_copy_str(data->tag.function, ct->function);
> >>> +     allocinfo_copy_str(data->tag.filename, ct->filename);
> >>> +     data->tag.lineno = ct->lineno;
> >>> +     data->counter.bytes = counter.bytes;
> >>> +     data->counter.calls = counter.calls;
> >>> +     data->counter.accurate = !alloc_tag_is_inaccurate(tag);
> >>> +}
> >>> +
> >>> +static int allocinfo_ioctl_get_content_id(struct seq_file *m, void __user *arg)
> >>> +{
> >>> +     struct allocinfo_content_id params;
> >>> +
> >>> +     codetag_lock_module_list(alloc_tag_cttype, true);
> >>> +     params.id = codetag_get_content_id(alloc_tag_cttype);
> >>> +     codetag_lock_module_list(alloc_tag_cttype, false);
> >>> +     if (copy_to_user(arg, &params, sizeof(params)))
> >>> +             return -EFAULT;
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static int allocinfo_ioctl_get_at(struct seq_file *m, void __user *arg)
> >>> +{
> >>> +     struct allocinfo_private *priv;
> >>> +     struct codetag *ct;
> >>> +     __u64 pos;
> >>> +     struct allocinfo_get_at params = {0};
> >>> +
> >>> +     if (copy_from_user(&params, arg, sizeof(params)))
> >>> +             return -EFAULT;
> >>> +
> >>> +     priv = (struct allocinfo_private *)m->private;
> >>> +     pos = params.pos;
> >>> +
> >>> +     codetag_lock_module_list(alloc_tag_cttype, true);
> >>> +
> >>> +     /* Find the codetag */
> >>> +     priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> >>> +     ct = codetag_next_ct(&priv->ioctl_iter);
> >>> +     while (ct && pos--)
> >>> +             ct = codetag_next_ct(&priv->ioctl_iter);
> >> I noticed that codetag_next_ct(&priv->ioctl_iter) and
> >>
> >> priv->positioned are accessed without serialization in the ioctl
> >>
> >> path. Concurrent ioctl calls on the same fd could race on these
> >>
> >> fields. Just something I spotted while reading the code.
> >>
> >>
> >> Thanks
> >>
> >> Best Regards
> >>
> >> Hao
> >>
> > I believe this should be prevented by `codetag_lock_module_list`; am I
> > wrong in my understanding?
>
> Thanks for the explanation! codetag_lock_module_list is designed to
> protect the module list from concurrent load/unload, which it does
>
> correctly. However, it doesn't cover the race between concurrent ioctl
> calls on the same fd, since it acquires cttype->mod_lock via
>
> down_read() and rwsem read locks allow multiple readers to proceed
> concurrently:
>
> Thread A: ALLOCINFO_IOC_GET_AT
>
> down_read(&cttype->mod_lock)              // read lock acquired
>
> priv->ioctl_iter = codetag_get_ct_iter(...)
>
> ct = codetag_next_ct(&priv->ioctl_iter)
>
> priv->positioned = true;
>
> Thread B: ALLOCINFO_IOC_GET_NEXT            // concurrent ioctl on same fd
>
> down_read(&cttype->mod_lock)              // read locks don't exclude
> each other
>
> if (!priv->positioned) {                  // sees partial state from
> Thread A
>
> priv->ioctl_iter = ...                // overwrites Thread A's iterator
>
> }
>
> ct = codetag_next_ct(&priv->ioctl_iter)   // corrupted iterator
>
> priv->ioctl_iter and priv->positioned are per-fd state with no
> serialization in the ioctl path.

Yep, you are right. codetag_lock_module_list() is not enough here to
protect from such races. I guess allocinfo_private would need another
lock.
Thanks,
Suren.


>
> Just something I spotted.
>
> Thanks
>
> Best Regards
>
> Hao
>
> >>> +     if (ct) {
> >>> +             allocinfo_to_params(ct, &params.data);
> >>> +             priv->positioned = true;
> >>> +     }
> >>> +
> >>> +     codetag_lock_module_list(alloc_tag_cttype, false);
> >>> +
> >>> +     if (!ct)
> >>> +             return -ENOENT;
> >>> +
> >>> +     if (copy_to_user(arg, &params, sizeof(params)))
> >>> +             return -EFAULT;
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static int allocinfo_ioctl_get_next(struct seq_file *m, void __user *arg)
> >>> +{
> >>> +     struct allocinfo_private *priv;
> >>> +     struct codetag *ct;
> >>> +     struct allocinfo_tag_data params = {0};
> >>> +     int ret = 0;
> >>> +
> >>> +     priv = (struct allocinfo_private *)m->private;
> >>> +
> >>> +     codetag_lock_module_list(alloc_tag_cttype, true);
> >>> +
> >>> +     if (!priv->positioned) {
> >>> +             priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> >>> +             priv->positioned = true;
> >>> +     }
> >>> +
> >>> +     ct = codetag_next_ct(&priv->ioctl_iter);
> >>> +     if (ct)
> >>> +             allocinfo_to_params(ct, &params);
> >>> +
> >>> +     if (!ct) {
> >>> +             priv->positioned = false;
> >>> +             ret = -ENOENT;
> >>> +     }
> >>> +     codetag_lock_module_list(alloc_tag_cttype, false);
> >>> +
> >>> +     if (ret == 0) {
> >>> +             if (copy_to_user(arg, &params, sizeof(params)))
> >>> +                     return -EFAULT;
> >>> +     }
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static long allocinfo_ioctl(struct file *file, unsigned int cmd,
> >>> +                         unsigned long __arg)
> >>> +{
> >>> +     void __user *arg = (void __user *)__arg;
> >>> +     int ret;
> >>> +
> >>> +     switch (cmd) {
> >>> +     case ALLOCINFO_IOC_CONTENT_ID:
> >>> +             ret = allocinfo_ioctl_get_content_id(file->private_data, arg);
> >>> +             break;
> >>> +     case ALLOCINFO_IOC_GET_AT:
> >>> +             ret = allocinfo_ioctl_get_at(file->private_data, arg);
> >>> +             break;
> >>> +     case ALLOCINFO_IOC_GET_NEXT:
> >>> +             ret = allocinfo_ioctl_get_next(file->private_data, arg);
> >>> +             break;
> >>> +     default:
> >>> +             ret = -ENOIOCTLCMD;
> >>> +             break;
> >>> +     }
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +#ifdef CONFIG_COMPAT
> >>> +static long allocinfo_compat_ioctl(struct file *file, unsigned int cmd,
> >>> +                                unsigned long arg)
> >>> +{
> >>> +     return allocinfo_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
> >>> +}
> >>> +#endif
> >>> +
> >>> +static const struct proc_ops allocinfo_proc_ops = {
> >>> +     .proc_open              = allocinfo_open,
> >>> +     .proc_read_iter         = seq_read_iter,
> >>> +     .proc_lseek             = seq_lseek,
> >>> +     .proc_release           = allocinfo_release,
> >>> +     .proc_ioctl             = allocinfo_ioctl,
> >>> +#ifdef CONFIG_COMPAT
> >>> +     .proc_compat_ioctl      = allocinfo_compat_ioctl,
> >>> +#endif
> >>> +
> >>> +};
> >>> +
> >>>    size_t alloc_tag_top_users(struct codetag_bytes *tags, size_t count, bool can_sleep)
> >>>    {
> >>>        struct codetag_iterator iter;
> >>> @@ -946,8 +1121,7 @@ static int __init alloc_tag_init(void)
> >>>                return 0;
> >>>        }
> >>>
> >>> -     if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
> >>> -                                  sizeof(struct allocinfo_private), NULL)) {
> >>> +     if (!proc_create(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_proc_ops)) {
> >>>                pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
> >>>                shutdown_mem_profiling(false);
> >>>                return -ENOMEM;
> >>> diff --git a/lib/codetag.c b/lib/codetag.c
> >>> index 304667897ad4..93aa30991563 100644
> >>> --- a/lib/codetag.c
> >>> +++ b/lib/codetag.c
> >>> @@ -48,6 +48,17 @@ bool codetag_trylock_module_list(struct codetag_type *cttype)
> >>>        return down_read_trylock(&cttype->mod_lock) != 0;
> >>>    }
> >>>
> >>> +unsigned long codetag_get_content_id(struct codetag_type *cttype)
> >>> +{
> >>> +     lockdep_assert_held(&cttype->mod_lock);
> >>> +
> >>> +     /*
> >>> +      * next_mod_seq is updated on every load, so can be used to identify
> >>> +      * content changes.
> >>> +      */
> >>> +     return cttype->next_mod_seq;
> >>> +}
> >>> +
> >>>    struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
> >>>    {
> >>>        struct codetag_iterator iter = {
> > Note, I will be following up with a v2 patchset with your feedback
> > included. Please bring up any other points you'd want to clarify so
> > that I can include all the changes in the v2 patchset. Thanks for
> > reviewing!
>

^ permalink raw reply

* Re: [PATCH 4/8] drm/panthor: Add support for protected memory allocation in panthor
From: Boris Brezillon @ 2026-05-19 17:29 UTC (permalink / raw)
  To: Chia-I Wu
  Cc: Ketil Johnsen, Liviu Dudau, Marcin Ślusarz, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian König, Steven Price, Daniel Almeida, Alice Ryhl,
	Matthias Brugger, AngeloGioacchino Del Regno, dri-devel,
	linux-doc, linux-kernel, linux-media, linaro-mm-sig,
	linux-arm-kernel, linux-mediatek, Florent Tomasin, nd
In-Reply-To: <CAPaKu7T7JZRmsS+D_3zFZtyhJk9mNXjL=xpAQ-UNGbm0vztyRg@mail.gmail.com>

On Tue, 19 May 2026 10:07:02 -0700
Chia-I Wu <olvaffe@gmail.com> wrote:

> On Tue, May 19, 2026 at 1:49 AM Ketil Johnsen <ketil.johnsen@arm.com> wrote:
> >
> > On 19/05/2026 09:39, Boris Brezillon wrote:  
> > > On Mon, 18 May 2026 17:36:40 -0700
> > > Chia-I Wu <olvaffe@gmail.com> wrote:
> > >  
> > >> On Mon, May 18, 2026 at 12:16 AM Boris Brezillon
> > >> <boris.brezillon@collabora.com> wrote:  
> > >>>
> > >>> On Wed, 13 May 2026 12:31:32 -0700
> > >>> Chia-I Wu <olvaffe@gmail.com> wrote:
> > >>>  
> > >>>> On Tue, May 12, 2026 at 8:39 AM Liviu Dudau <liviu.dudau@arm.com> wrote:  
> > >>>>>
> > >>>>> On Tue, May 12, 2026 at 04:11:11PM +0200, Boris Brezillon wrote:  
> > >>>>>> On Tue, 12 May 2026 14:47:27 +0100
> > >>>>>> Liviu Dudau <liviu.dudau@arm.com> wrote:
> > >>>>>>  
> > >>>>>>> On Thu, May 07, 2026 at 01:53:56PM +0200, Boris Brezillon wrote:  
> > >>>>>>>> On Thu, 7 May 2026 11:02:26 +0200
> > >>>>>>>> Marcin Ślusarz <marcin.slusarz@arm.com> wrote:
> > >>>>>>>>  
> > >>>>>>>>> On Tue, May 05, 2026 at 06:15:23PM +0200, Boris Brezillon wrote:  
> > >>>>>>>>>>> @@ -277,9 +286,21 @@ int panthor_device_init(struct panthor_device *ptdev)
> > >>>>>>>>>>>                      return ret;
> > >>>>>>>>>>>      }
> > >>>>>>>>>>>
> > >>>>>>>>>>> +   /* If a protected heap name is specified but not found, defer the probe until created */
> > >>>>>>>>>>> +   if (protected_heap_name && strlen(protected_heap_name)) {  
> > >>>>>>>>>>
> > >>>>>>>>>> Do we really need this strlen() > 0? Won't dma_heap_find() fail is the
> > >>>>>>>>>> name is "" already?  
> > >>>>>>>>>
> > >>>>>>>>> If dma_heap_find() will fail, then the whole probe with fail too.
> > >>>>>>>>> This check prevents that.  
> > >>>>>>>>
> > >>>>>>>> Yeah, that's also a questionable design choice. I mean, we can
> > >>>>>>>> currently probe and boot the FW even though we never setup the
> > >>>>>>>> protected FW sections, so why should we defer the probe here? Can't we
> > >>>>>>>> just retry the next time a group with the protected bit is created and
> > >>>>>>>> fail if we can find a protected heap?  
> > >>>>>>>
> > >>>>>>> The problem we have with the current firmware is that it does a number of setup steps at "boot"
> > >>>>>>> time only. One of the steps is preparing its internal structures for when it enters protected
> > >>>>>>> mode and it stores them in the buffer passed in at firmware loading. We cannot later run the
> > >>>>>>> process when we have a group with protected mode set.  
> > >>>>>>
> > >>>>>> No, but we can force a full/slow reset and have that thing
> > >>>>>> re-initialized, can't we? I mean, that's basically what we do when a
> > >>>>>> fast reset fails: we re-initialize all the sections and reset again, at
> > >>>>>> which point the FW should start from a fresh state, and be able to
> > >>>>>> properly initialize the protected-related stuff if protected sections
> > >>>>>> are populated. Am I missing something?  
> > >>>>>
> > >>>>> Right, we can do that. For some reason I keep associating the reset with the
> > >>>>> error handling and not with "normal" operations.  
> > >>>> I kind of hope we end up with either
> > >>>>
> > >>>>   - panthor knows the exact heap to use and fails with EPROBE_DEFER if
> > >>>> the heap is missing, or
> > >>>>   - panthor gets a dma-buf from userspace and does the full reset
> > >>>>     - userspace also needs to provide a dma-buf for each protected
> > >>>> group for the suspend buffer
> > >>>>
> > >>>> than something in-between. The latter is more ad-hoc and basically
> > >>>> kicks the issue to the userspace.  
> > >>>
> > >>> Indeed, the second option is more ad-hoc, but when you think about it,
> > >>> userspace has to have this knowledge, because it needs to know the
> > >>> dma-heap to use for buffer allocation that cross a device boundary
> > >>> anyway. Think about frames produced by a video decoder, and composited
> > >>> by the GPU into a protected scanout buffer that's passed to the KMS
> > >>> device. Why would the GPU driver be source of truth when it comes to
> > >>> choosing the heap to use to allocate protected buffers for the video
> > >>> decoder or those used for the display?  
> > >> I don't think the GPU driver is ever the source of truth. If the
> > >> system integrator wants to specify the source of truth (SoT) from
> > >> kernel space, they should use the device tree (or module params /
> > >> config options). If they want to specify the SoT in userspace, then we
> > >> don't really care how it is done other than providing an ioctl.
> > >> Panthor is always on the receiving end.  
> > >
> > > Okay, we're on the same page then.
> > >  
> > >>
> > >> If we don't want to delay this functionality, but it takes time to
> > >> converge on SoT, maybe a solution that is not a long-term promise can
> > >> work? Of the options on the table (dt, module params, kconfig options,
> > >> ioctls), a kconfig option, potentially marked as experimental, seems
> > >> like a good candidate.  
> > >
> > > If Panthor is only a consumer, I actually think it'd be easier to just
> > > let userspace pass the protected FW section as an imported buffer
> > > through an ioctl for now. It means we don't need any of the
> > > modifications to the dma_heap API in this series, and userspace is free
> > > to choose its SoT (efuse, DT, ...) and pass the info back to mesa/GBM
> > > somehow (envvar, driconf, ...). The only thing we need to ensure is if
> > > lazy protected FW section allocation is going to work, but given the
> > > current code purely and simply ignores those sections, and the FW is
> > > still able to boot and act properly (at least on v10-v13), I'm pretty
> > > confident this is okay, unless there's some trick the MCU can do to
> > > detect that the protected section isn't mapped (which I doubt, because
> > > the MCU doesn't know it lives behind an MMU).  
> I set up MMU to map non-protected memory to the protected section the
> other day. The FW still booted fine. I didn't get access violation
> until the FW executed PROT_REGION and panthor requested
> GLB_PROTM_ENTER in response.

Ah, thanks for testing! We still don't have a setup with proper
protected heap, but that was on my list of things to test.

> 
> This was on v13, but I also doubt it will become an issue. Can ARM help clarify?
> 
> > >
> > > Of course, once we have a consensus on how to describe this in the DT,
> > > we can switch Panthor over to "protected dma_heap selection through DT",
> > > and reflect that through the ioctl that exposes whether protected
> > > support is ready or not (would be a DEV_QUERY), such that userspace can
> > > skip this "PROTM initialization" step.
> > >
> > > We're talking about an extra ioctl to set those buffers, and a
> > > DEV_QUERY to query the state (ready or not), the size of the global
> > > protected buffer (protected FW section) and the size of the protected
> > > suspend buffer. The protected suspend buffer would be allocated and
> > > passed at group creation time (extra arg passed to the existing
> > > GROUP_CREATE ioctl). So, overall, I don't consider it a huge liability
> > > in term of maintenance cost.  
> >
> > If we can avoid the dma-heap changes, then that would surely help!
> > I can try to implement this in the next version unless someone finds a
> > reason why it is a bad idea.  
> Yeah, that sounds good to me too.
> 
> Will the extra ioctl require root?

The PROTM_INIT ioctl will certainly require high privilege
CAP_SYS_<something>, dunno yet what that <something> would be though.

> On a system with true protected
> memory, the FW cannot write to non-protected memory. It seems ok to
> allow any client to make the ioctl call. But on systems without true
> protected memory, it can be problematic.

Yep, I agree we shouldn't let random users pretend they initialized
protected mode if the system as a whole doesn't have proper the proper
bit hooked up to set that up.

^ permalink raw reply

* Re: [PATCH 00/12] misc/syncobj: add /dev/syncobj device
From: Xaver Hugl @ 2026-05-19 17:08 UTC (permalink / raw)
  To: Christian König
  Cc: Julian Orth, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Sumit Semwal, Jonathan Corbet,
	Shuah Khan, Arnd Bergmann, Greg Kroah-Hartman, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, linux-doc,
	wayland-devel, Michel Dänzer
In-Reply-To: <dff60378-4e47-4753-8878-feec6e1c2690@amd.com>

> > The part where we get this independent of attached hardware is quite
> > important for us though, since we can't just ignore explicit sync once
> > the device we previously imported the syncobj into is disconnected.
>
> Can you elaborate more on this?

In Wayland, the client is allowed to attach dmabuf and syncobj
independently, they don't have to be from the same device (and the
compositor wouldn't be able to verify the opposite anyways). The
compositor will usually import both into the same drm device, but
especially with compositors that render on multiple devices, that's
not necessarily the case either.

If for example we had a system with one internal GPU and one external
GPU, the client renders on the internal GPU and the compositor uses
the external one. Now when the user yanks the USB C cable, afaiu
- the buffers from the client stay valid
- the syncobj stays valid on the client side
- the syncobj becomes invalid on the compositor side

"invalid" there means either
- the acquire point of the client is marked as signaled, before
rendering on the client side is completed
- the acquire point of the client is never signaled. Since the
compositor waits for the acquire point, the Wayland surface is stuck
forever

Afaik the latter is currently the case. The former wouldn't be much
better though, not when it's preventable.

This is admittedly an edge case, but GPU hotunplug is something we try
to support as well as possible in Plasma, and all the edge cases cause
a lot of problems in combination and are a lot of headaches to handle
(or really work around) in the compositor.
Another edge case is when the client asks the compositor to import the
syncobj, which can fail when a hotunplug is in process, and ends up
disconnecting the client for no fault of either client or compositor.

> >>> 3. It removes the need to translate between syncobjs fds and handles.
> >>
> >> That's a pretty big no-go as well. The differentiation between FDs and handles is completely intentional.
> > Could you expand on why it's needed? For compositors, the handle is
> > just an intermediary thing when translating between file descriptors.
>
> Well what we could do is to add an IOCTL to directly attach an syncobj file descriptor to an eventfd.
That would be nice.

- Xaver

^ permalink raw reply

* Re: [PATCH 4/8] drm/panthor: Add support for protected memory allocation in panthor
From: Chia-I Wu @ 2026-05-19 17:07 UTC (permalink / raw)
  To: Ketil Johnsen
  Cc: Boris Brezillon, Liviu Dudau, Marcin Ślusarz, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian König, Steven Price, Daniel Almeida, Alice Ryhl,
	Matthias Brugger, AngeloGioacchino Del Regno, dri-devel,
	linux-doc, linux-kernel, linux-media, linaro-mm-sig,
	linux-arm-kernel, linux-mediatek, Florent Tomasin, nd
In-Reply-To: <8f0b1750-a853-4895-9672-73a75f6dbd84@arm.com>

On Tue, May 19, 2026 at 1:49 AM Ketil Johnsen <ketil.johnsen@arm.com> wrote:
>
> On 19/05/2026 09:39, Boris Brezillon wrote:
> > On Mon, 18 May 2026 17:36:40 -0700
> > Chia-I Wu <olvaffe@gmail.com> wrote:
> >
> >> On Mon, May 18, 2026 at 12:16 AM Boris Brezillon
> >> <boris.brezillon@collabora.com> wrote:
> >>>
> >>> On Wed, 13 May 2026 12:31:32 -0700
> >>> Chia-I Wu <olvaffe@gmail.com> wrote:
> >>>
> >>>> On Tue, May 12, 2026 at 8:39 AM Liviu Dudau <liviu.dudau@arm.com> wrote:
> >>>>>
> >>>>> On Tue, May 12, 2026 at 04:11:11PM +0200, Boris Brezillon wrote:
> >>>>>> On Tue, 12 May 2026 14:47:27 +0100
> >>>>>> Liviu Dudau <liviu.dudau@arm.com> wrote:
> >>>>>>
> >>>>>>> On Thu, May 07, 2026 at 01:53:56PM +0200, Boris Brezillon wrote:
> >>>>>>>> On Thu, 7 May 2026 11:02:26 +0200
> >>>>>>>> Marcin Ślusarz <marcin.slusarz@arm.com> wrote:
> >>>>>>>>
> >>>>>>>>> On Tue, May 05, 2026 at 06:15:23PM +0200, Boris Brezillon wrote:
> >>>>>>>>>>> @@ -277,9 +286,21 @@ int panthor_device_init(struct panthor_device *ptdev)
> >>>>>>>>>>>                      return ret;
> >>>>>>>>>>>      }
> >>>>>>>>>>>
> >>>>>>>>>>> +   /* If a protected heap name is specified but not found, defer the probe until created */
> >>>>>>>>>>> +   if (protected_heap_name && strlen(protected_heap_name)) {
> >>>>>>>>>>
> >>>>>>>>>> Do we really need this strlen() > 0? Won't dma_heap_find() fail is the
> >>>>>>>>>> name is "" already?
> >>>>>>>>>
> >>>>>>>>> If dma_heap_find() will fail, then the whole probe with fail too.
> >>>>>>>>> This check prevents that.
> >>>>>>>>
> >>>>>>>> Yeah, that's also a questionable design choice. I mean, we can
> >>>>>>>> currently probe and boot the FW even though we never setup the
> >>>>>>>> protected FW sections, so why should we defer the probe here? Can't we
> >>>>>>>> just retry the next time a group with the protected bit is created and
> >>>>>>>> fail if we can find a protected heap?
> >>>>>>>
> >>>>>>> The problem we have with the current firmware is that it does a number of setup steps at "boot"
> >>>>>>> time only. One of the steps is preparing its internal structures for when it enters protected
> >>>>>>> mode and it stores them in the buffer passed in at firmware loading. We cannot later run the
> >>>>>>> process when we have a group with protected mode set.
> >>>>>>
> >>>>>> No, but we can force a full/slow reset and have that thing
> >>>>>> re-initialized, can't we? I mean, that's basically what we do when a
> >>>>>> fast reset fails: we re-initialize all the sections and reset again, at
> >>>>>> which point the FW should start from a fresh state, and be able to
> >>>>>> properly initialize the protected-related stuff if protected sections
> >>>>>> are populated. Am I missing something?
> >>>>>
> >>>>> Right, we can do that. For some reason I keep associating the reset with the
> >>>>> error handling and not with "normal" operations.
> >>>> I kind of hope we end up with either
> >>>>
> >>>>   - panthor knows the exact heap to use and fails with EPROBE_DEFER if
> >>>> the heap is missing, or
> >>>>   - panthor gets a dma-buf from userspace and does the full reset
> >>>>     - userspace also needs to provide a dma-buf for each protected
> >>>> group for the suspend buffer
> >>>>
> >>>> than something in-between. The latter is more ad-hoc and basically
> >>>> kicks the issue to the userspace.
> >>>
> >>> Indeed, the second option is more ad-hoc, but when you think about it,
> >>> userspace has to have this knowledge, because it needs to know the
> >>> dma-heap to use for buffer allocation that cross a device boundary
> >>> anyway. Think about frames produced by a video decoder, and composited
> >>> by the GPU into a protected scanout buffer that's passed to the KMS
> >>> device. Why would the GPU driver be source of truth when it comes to
> >>> choosing the heap to use to allocate protected buffers for the video
> >>> decoder or those used for the display?
> >> I don't think the GPU driver is ever the source of truth. If the
> >> system integrator wants to specify the source of truth (SoT) from
> >> kernel space, they should use the device tree (or module params /
> >> config options). If they want to specify the SoT in userspace, then we
> >> don't really care how it is done other than providing an ioctl.
> >> Panthor is always on the receiving end.
> >
> > Okay, we're on the same page then.
> >
> >>
> >> If we don't want to delay this functionality, but it takes time to
> >> converge on SoT, maybe a solution that is not a long-term promise can
> >> work? Of the options on the table (dt, module params, kconfig options,
> >> ioctls), a kconfig option, potentially marked as experimental, seems
> >> like a good candidate.
> >
> > If Panthor is only a consumer, I actually think it'd be easier to just
> > let userspace pass the protected FW section as an imported buffer
> > through an ioctl for now. It means we don't need any of the
> > modifications to the dma_heap API in this series, and userspace is free
> > to choose its SoT (efuse, DT, ...) and pass the info back to mesa/GBM
> > somehow (envvar, driconf, ...). The only thing we need to ensure is if
> > lazy protected FW section allocation is going to work, but given the
> > current code purely and simply ignores those sections, and the FW is
> > still able to boot and act properly (at least on v10-v13), I'm pretty
> > confident this is okay, unless there's some trick the MCU can do to
> > detect that the protected section isn't mapped (which I doubt, because
> > the MCU doesn't know it lives behind an MMU).
I set up MMU to map non-protected memory to the protected section the
other day. The FW still booted fine. I didn't get access violation
until the FW executed PROT_REGION and panthor requested
GLB_PROTM_ENTER in response.

This was on v13, but I also doubt it will become an issue. Can ARM help clarify?

> >
> > Of course, once we have a consensus on how to describe this in the DT,
> > we can switch Panthor over to "protected dma_heap selection through DT",
> > and reflect that through the ioctl that exposes whether protected
> > support is ready or not (would be a DEV_QUERY), such that userspace can
> > skip this "PROTM initialization" step.
> >
> > We're talking about an extra ioctl to set those buffers, and a
> > DEV_QUERY to query the state (ready or not), the size of the global
> > protected buffer (protected FW section) and the size of the protected
> > suspend buffer. The protected suspend buffer would be allocated and
> > passed at group creation time (extra arg passed to the existing
> > GROUP_CREATE ioctl). So, overall, I don't consider it a huge liability
> > in term of maintenance cost.
>
> If we can avoid the dma-heap changes, then that would surely help!
> I can try to implement this in the next version unless someone finds a
> reason why it is a bad idea.
Yeah, that sounds good to me too.

Will the extra ioctl require root? On a system with true protected
memory, the FW cannot write to non-protected memory. It seems ok to
allow any client to make the ioctl call. But on systems without true
protected memory, it can be problematic.

>
> >>>> For the former, expressing the relation in DT seems to be the best,
> >>>> but only if possible :-). Otherwise, a kconfig option (instead of
> >>>> module param) should be easier to work with.
> >>>>
> >>>> Looking at the userspace implementation, can we also have an panthor
> >>>> ioctl to return the heap to userspace?
> >>>
> >>> Yes, it's something we can add, but again, I'm questioning the
> >>> usefulness of this: how can we ensure the heap used by panthor to
> >>> allocate its protected FW buffers is suitable for scanout buffers
> >>> (buffers that can be used by display drivers). There needs to be a glue
> >>> leaving in usersland and taking the decision, and I'm not too sure
> >>> trusting any of the component in the chain (vdec, gpu, display) is the
> >>> right thing to do.
> >> The heap returned by panthor is only for panfrost/panvk. It says
> >> nothing about compatibility with other components on the system.
> >
> > Okay, if it's used only for internal buffers, I guess that's fine.
>
> --
> Ketil

^ permalink raw reply

* Re: [PATCH v1 1/1] kernel-doc: Issue warnings that were silently discarded
From: Randy Dunlap @ 2026-05-19 16:55 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Jonathan Corbet, Mauro Carvalho Chehab, linux-doc, linux-kernel
In-Reply-To: <aRC5NjhOmuGIpdPA@smile.fi.intel.com>

Hi,

I'm still seeing duplicated warning (logging) messages coming from
kernel-doc. Is there any progress on this?
I thought that there were some patches for this...

Thanks.

On 11/9/25 7:54 AM, Andy Shevchenko wrote:
> On Sat, Nov 08, 2025 at 04:03:15PM -0800, Randy Dunlap wrote:
>> On 11/5/25 10:12 AM, Jonathan Corbet wrote:
>>> [Heads up to Stephen: this change will add a bunch of warnings that had
>>> been dropped before.]
>>> Andy Shevchenko <andriy.shevchenko@linux.intel.com> writes:
>>>
>>>> When kernel-doc parses the sections for the documentation some errors
>>>> may occur. In many cases the warning is simply stored to the current
>>>> "entry" object. However, in the most of such cases this object gets
>>>> discarded and there is no way for the output engine to even know about
>>>> that. To avoid that, check if the "entry" is going to be discarded and
>>>> if there warnings have been collected, issue them to the current logger
>>>> as is and then flush the "entry". This fixes the problem that original
>>>> Perl implementation doesn't have.
>>>
>>> I would really like to redo how some of that logging is done, but that
>>> is an exercise for another day.  For now, I have applied this one,
>>> thanks.
>>
>> I think that this patch is causing a (large) problem.
>>
>> With this patch:
>> $ make mandocs &>mandocs.out
>>
>> Without this patch:
>> $ make mandocs &>mandocsnoas.out
>>
>> $ wc mandocs.out mandocsnoas.out
>>   29544  267393 3229456 mandocs.out
>>   10052   95948 1208101 mandocsnoas.out
>>
>> so it appears that this patch causes lots of extra output.
>> Some of that may be what the patch was trying to do, but
>> with this patch, "mandocs.out" above has lots of duplicated
>> Warning: lines.
>>
>> $ sort mandocs.out | uniq > mandocsuq.out
>> $ wc mandocsuq.out
>>   18012  167689 1994145 mandocsuq.out
>>
>> $ grep -c "^Warning:"  mandocs.out mandocsnoas.out  mandocsuq.out 
>> mandocs.out:25273
>> mandocsnoas.out:10022
>> mandocsuq.out:15252
> 
> Yes, that's what Mauro explained, that we may have the dups.
> 
>> In mandocs.out above (29544 lines), this line:
>> Warning: ../sound/soc/sprd/sprd-mcdt.h:48 struct member 'dma_chan' not described in 'sprd_mcdt_chan'
>>
>> is found at lines 7 and 29122.
>>
>> So maybe the logging output needs to be repaired sooner
>> than later.
> 
> Right! But I'm not familiar with this, so I can help only with testing,
> and not with real fix development.
> 

-- 
~Randy


^ permalink raw reply

* [PATCH] docs: pt_BR: Translate process/kernel-docs.rst into Portuguese
From: Daniel Pereira @ 2026-05-19 16:34 UTC (permalink / raw)
  To: linux-doc; +Cc: corbet, Daniel Pereira

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=y, Size: 17550 bytes --]

Translate Documentation/process/kernel-docs.rst into Portuguese (pt_BR)
and update the main index.

The content was adapted following the RST formatting rules and the
appropriate technical terminology for Brazilian Portuguese.

Signed-off-by: Daniel Pereira <danielmaraboo@gmail.com>
---
 Documentation/translations/pt_BR/index.rst    |   1 +
 .../pt_BR/process/kernel-docs.rst             | 373 ++++++++++++++++++
 2 files changed, 374 insertions(+)
 create mode 100644 Documentation/translations/pt_BR/process/kernel-docs.rst

diff --git a/Documentation/translations/pt_BR/index.rst b/Documentation/translations/pt_BR/index.rst
index 77c1a1cdc..76936710b 100644
--- a/Documentation/translations/pt_BR/index.rst
+++ b/Documentation/translations/pt_BR/index.rst
@@ -67,6 +67,7 @@ kernel e sobre como ver seu trabalho integrado.
    :maxdepth: 1
 
    Introdução <process/1.Intro>
+   Index de documentos do Kernel <process/kernel-docs>
    Regras de licenciamento <process/license-rules>
    Como começar <process/howto>
    Requisitos mínimos <process/changes>
diff --git a/Documentation/translations/pt_BR/process/kernel-docs.rst b/Documentation/translations/pt_BR/process/kernel-docs.rst
new file mode 100644
index 000000000..3c8d80ffa
--- /dev/null
+++ b/Documentation/translations/pt_BR/process/kernel-docs.rst
@@ -0,0 +1,373 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Índice de Documentação Adicional do Kernel
+==========================================
+
+A necessidade de um documento como este tornou-se evidente na lista de discussão
+linux-kernel, uma vez que as mesmas perguntas, solicitando referências de
+informações, apareciam repetidamente.
+
+Felizmente, à medida que cada vez mais pessoas chegam ao GNU/Linux, mais pessoas
+se interessam pelo Kernel. No entanto, ler o código-fonte nem sempre é o
+suficiente. É fácil entender o código, mas perder os conceitos, a filosofia
+e as decisões de design por trás dele.
+
+Infelizmente, não há muitos documentos disponíveis para iniciantes começarem.
+E, mesmo quando existem, não havia um local "bem conhecido" que os centralizasse.
+Estas linhas tentam suprir essa falta.
+
+POR FAVOR, se você conhece algum artigo não listado aqui ou se escrever um novo
+documento, inclua uma referência a ele aqui, seguindo o processo de envio de
+patches do kernel. Quaisquer correções, ideias ou comentários também são
+bem-vindos.
+
+Todos os documentos estão catalogados com os seguintes campos: o "Título" do
+documento, o(s) "Autor(es)", a "URL" onde podem ser encontrados, algumas
+"Palavras-chave" úteis para pesquisar tópicos específicos e uma breve
+"Descrição" do documento.
+
+.. note::
+
+   Os documentos em cada seção deste documento estão ordenados por sua data de
+   publicação, do mais recente para o mais antigo. O(s) mantenedor(es) deve(m)
+   remover periodicamente recursos à medida que se tornem obsoletos ou
+   desatualizados; com exceção de livros fundamentais.
+
+Documentação na árvore do Kernel
+--------------------------------
+
+Os manuais Sphinx devem ser compilados com ``make {htmldocs | pdfdocs | epubdocs}``.
+
+    * Nome: **linux/Documentation**
+
+      :Autor: Muitos.
+      :Localização: Documentation/
+      :Palavras-chave: arquivos de texto, Sphinx.
+      :Descrição: Documentação que acompanha o código-fonte do kernel,
+        dentro do diretório Documentation. Algumas páginas deste documento
+        (incluindo este próprio documento) foram movidas para lá e podem
+        estar mais atualizadas do que a versão web.
+
+Documentação on-line
+--------------------
+
+    * Título: **Linux Kernel Mailing List Glossary**
+
+      :Autor: diversos
+      :URL: https://kernelnewbies.org/KernelGlossary
+      :Data: versão contínua (rolling)
+      :Palavras-chave: glossário, termos, linux-kernel.
+      :Descrição: Da introdução: "Este glossário destina-se a ser uma breve
+        descrição de algumas das siglas e termos que você poderá ouvir durante
+        as discussões sobre o kernel Linux".
+
+    * Título: **The Linux Kernel Module Programming Guide**
+
+      :Autor: Peter Jay Salzman, Michael Burian, Ori Pomerantz, Bob Mottram,
+        Jim Huang.
+      :URL: https://sysprog21.github.io/lkmpg/
+      :Data: 2021
+      :Palavras-chave: módulos, livro GPL, /proc, ioctls, chamadas de sistema,
+        manipuladores de interrupção.
+      :Descrição: Um excelente livro sob licença GPL sobre o tópico de
+        programação de módulos. Repleto de exemplos. Atualmente, a nova versão
+        está sendo mantida ativamente em https://github.com/sysprog21/lkmpg.
+
+Livros Publicados
+-----------------
+
+    * Title: **The Linux Memory Manager**
+
+      :Autor: Lorenzo Stoakes
+      :Editora: No Starch Press
+      :Data: Fevereiro 2025
+      :Páginas: 1300
+      :ISBN: 978-1718504462
+      :Notas: Gerenciamento de memória. Rascunho completo disponível como acesso
+        antecipado para ré-venda, lançamento completo agendado para o
+        outono de 2025. Veja https://nostarch.com/linux-memory-manager
+        para mais informações.
+
+    * Title: **Practical Linux System Administration: A Guide to Installation, Configuration, and Management, 1st Edition**
+
+      :Autor: Kenneth Hess
+      :Editora: O'Reilly Media
+      :Data: Maio, 2023
+      :Páginas: 246
+      :ISBN: 978-1098109035
+      :Notas: Administração de sistemas
+
+    * Title: **Linux Kernel Debugging: Leverage proven tools and advanced techniques to effectively debug Linux kernels and kernel modules**
+
+      :Autor: Kaiwan N Billimoria
+      :Editora: Packt Publishing Ltd
+      :Data: Agosto, 2022
+      :Páginas: 638
+      :ISBN: 978-1801075039
+      :Notas: Livro sobre depuração (debugging)
+
+    * Title: **Linux Kernel Programming: A Comprehensive Guide to Kernel Internals, Writing Kernel Modules, and Kernel Synchronization**
+
+      :Autor: Kaiwan N Billimoria
+      :Editora: Packt Publishing Ltd
+      :Data: Março, 2021 (Segunda edição publicada em 2024)
+      :Páginas: 754
+      :ISBN: 978-1789953435 (O ISBN da segunda edição é 978-1803232225)
+
+    * Title: **Linux Kernel Programming Part 2 - Char Device Drivers and Kernel Synchronization: Create user-kernel interfaces, work with peripheral I/O, and handle hardware interrupts**
+
+      :Autor: Kaiwan N Billimoria
+      :Editora: Packt Publishing Ltd
+      :Data: Março, 2021
+      :Páginas: 452
+      :ISBN: 978-1801079518
+
+    * Title: **Linux System Programming: Talking Directly to the Kernel and C Library**
+
+      :Autor: Robert Love
+      :Editora: O'Reilly Media
+      :Data: Junho, 2013
+      :Páginas: 456
+      :ISBN: 978-1449339531
+      :Notas: Livro fundamental
+
+    * Título: **Linux Kernel Development, 3rd Edition**
+
+      :Autor: Robert Love
+      :Editora: Addison-Wesley
+      :Data: Julho de 2010
+      :Páginas: 440
+      :ISBN: 978-0672329463
+      :Notas: Livro fundamental
+
+    * Título: **Linux Device Drivers, 3rd Edition**
+
+      :Autores: Jonathan Corbet, Alessandro Rubini e Greg Kroah-Hartman
+      :Editora: O'Reilly & Associates
+      :Data: 2005
+      :Páginas: 636
+      :ISBN: 0-596-00590-3
+      :Notas: Livro fundamental. Mais informações em
+        http://www.oreilly.com/catalog/linuxdrive3/
+        Formato PDF, URL: https://lwn.net/Kernel/LDD3/
+
+    * Título: **The Design of the UNIX Operating System**
+
+      :Autor: Maurice J. Bach
+      :Editora: Prentice Hall
+      :Data: 1986
+      :Páginas: 471
+      :ISBN: 0-13-201757-1
+      :Notas: Livro fundamental
+
+Diversos
+--------
+
+    * Nome: **Cross-Referencing Linux**
+
+      :URL: https://elixir.bootlin.com/
+      :Palavras-chave: Navegação em código-fonte.
+      :Descrição: Outro navegador web para o código-fonte do kernel Linux.
+        Possui muitas referências cruzadas para variáveis e funções. Você pode
+        ver onde elas são definidas e onde são utilizadas.
+
+    * Nome: **Linux Weekly News**
+
+      :URL: https://lwn.net
+      :Palavras-chave: últimas notícias do kernel.
+      :Descrição: O título diz tudo. Há uma seção fixa sobre o kernel que
+        resume o trabalho dos desenvolvedores, correções de bugs, novos recursos
+        e versões produzidas durante a semana.
+
+    * Nome: **The home page of Linux-MM**
+
+      :Autor: A equipe Linux-MM.
+      :URL: https://linux-mm.org/
+      :Palavras-chave: gerenciamento de memória, Linux-MM, mm patches, TODO,
+        docs, mailing list.
+      :Descrição: Site dedicado ao desenvolvimento do Gerenciamento de Memória
+        do Linux. Patches relacionados à memória, HOWTOs, links, desenvolvedores
+        mm... Não perca se você estiver interessado no desenvolvimento do
+        gerenciamento de memória!
+
+    * Nome: **Kernel Newbies IRC Channel and Website**
+
+      :URL: https://www.kernelnewbies.org
+      :Palavras-chave: IRC, novatos, canal, tirar dúvidas.
+      :Descrição: #kernelnewbies em irc.oftc.net.
+        O canal #kernelnewbies é uma rede de IRC dedicada ao hacker de kernel
+        "novato" (newbie). O público consiste principalmente de pessoas que estão
+        aprendendo sobre o kernel, trabalhando em projetos do kernel ou hackers
+        profissionais que desejam ajudar pessoas menos experientes.
+        O #kernelnewbies está na rede de IRC OFTC.
+        Tente acessar irc.oftc.net como seu servidor e então digite /join #kernelnewbies.
+        O site kernelnewbies também hospeda artigos, documentos, FAQs...
+
+    * Nome: **linux-kernel mailing list archives and search engines**
+
+      :URL: https://subspace.kernel.org
+      :URL: https://lore.kernel.org
+      :Palavras-chave: linux-kernel, arquivos, busca.
+      :Descrição: Alguns dos arquivadores da lista de discussão linux-kernel.
+        Se você conhece algum outro (ou um melhor), por favor, me avise.
+
+    * Nome: **The Linux Foundation YouTube channel**
+
+      :URL: https://www.youtube.com/user/thelinuxfoundation
+      :Palavras-chave: linux, vídeos, linux-foundation, youtube.
+      :Descrição: A Linux Foundation faz o upload de gravações de vídeo de seus
+        eventos colaborativos, conferências de Linux (incluindo a LinuxCon) e
+        outras pesquisas originais e conteúdos relacionados ao Linux e ao
+        desenvolvimento de software.
+
+Rust
+----
+
+    * Título: **Rust for Linux**
+
+      :Autor: diversos
+      :URL: https://rust-for-linux.com/
+      :Data: versão contínua (rolling)
+      :Palavras-chave: glossário, termos, linux-kernel, rust.
+      :Descrição Do site: "Rust for Linux é o projeto que adiciona suporte à
+        linguagem Rust ao kernel Linux. Este site pretende ser um hub de links,
+        documentação e recursos relacionados ao projeto".
+
+    * Título: **Learn Rust the Dangerous Way**
+
+      :Autor: Cliff L. Biffle
+      :URL: https://cliffle.com/p/dangerust/
+      :Data: Acessado em 11 de setembro de 2024
+      :Palavras-chave: rust, blog.
+      :Descrição: Do site: "LRtDW é uma série de artigos que coloca os recursos
+        do Rust em contexto para programadores C de baixo nível que talvez não
+        tenham uma formação formal em Ciência da Computação, o tipo de pessoa
+        que trabalha com firmware, engines de jogos, kernels de SO e afins.
+        Basicamente, pessoas como eu.". O site ilustra conversões de linha por
+        linha de C para Rust.
+
+    * Título: **The Rust Book**
+
+      :Autor: Steve Klabnik e Carol Nichols, com contribuições da comunidade Rust
+      :URL: https://doc.rust-lang.org/book/
+      :Data: Acessado em 11 de setembro de 2024
+      :Palavras-chave: rust, livro.
+      :Descrição: Do site: "Este livro abraça totalmente o potencial do Rust para
+        capacitar seus usuários. É um texto amigável e acessível destinado a
+        ajudá-lo a elevar não apenas seu conhecimento de Rust, mas também seu
+        alcance e confiança como programador em geral. Então mergulhe de cabeça,
+        prepare-se para aprender e bem-vindo à comunidade Rust!".
+
+    * Título: **Rust for the Polyglot Programmer**
+
+      :Autor: Ian Jackson
+      :URL: https://www.chiark.greenend.org.uk/~ianmdlvl/rust-polyglot/index.html
+      :Data: Dezembro de 2022
+      :Palavras-chave: rust, blog, tooling.
+      :Descrição: Do site: "Existem muitos guias e introduções ao Rust. Este é
+        algo diferente: destina-se ao programador experiente que já conhece
+        muitas outras linguagens de programação. Tento ser abrangente o suficiente
+        para servir de ponto de partida para qualquer área do Rust, mas evito
+        entrar em detalhes excessivos, exceto onde as coisas não são como você
+        poderia esperar. Além disso, este guia não é inteiramente isento de
+        opiniões, incluindo recomendações de bibliotecas (crates), ferramentas, etc.".
+
+    * Título: **Fasterthanli.me**
+
+      :Autor: Amos Wenger
+      :URL: https://fasterthanli.me/
+      :Data: Acessado em 11 de setembro de 2024
+      :Palavras-chave: rust, blog, notícias.
+      :Descrição: Do site: "Eu crio artigos e vídeos sobre como os computadores
+        funcionam. Meu conteúdo é de formato longo, didático e exploratório
+        e frequentemente uma desculpa para ensinar Rust!".
+
+    * Título: **Comprehensive Rust**
+
+      :Autor: Equipe Android do Google
+      :URL: https://google.github.io/comprehensive-rust/
+      :Data: Acessado em 13 de setembro de 2024
+      :Palavras-chave: rust, blog.
+      :Descrição: Do site: "O curso cobre todo o espectro do Rust, desde a
+        sintaxe básica até tópicos avançados como genéricos e tratamento de erros".
+
+    * Título: **The Embedded Rust Book**
+
+      :Autor: Múltiplos colaboradores, principalmente Jorge Aparicio
+      :URL: https://docs.rust-embedded.org/book/
+      :Data: Acessado em 13 de setembro de 2024
+      :Palavras-chave: rust, blog.
+      :Descrição: Do site: "Um livro introdutório sobre o uso da linguagem de
+        programação Rust em sistemas embarcados 'Bare Metal', como microcontroladores".
+
+    * Título: **Experiment: Improving the Rust Book**
+
+      :Autor: Cognitive Engineering Lab na Brown University
+      :URL: https://rust-book.cs.brown.edu/
+      :Data: Acessado em 22 de setembro de 2024
+      :Palavras-chave: rust, blog.
+      :Descrição: Do site: "O objetivo deste experimento é avaliar e melhorar o
+        conteúdo do Rust Book para ajudar as pessoas a aprenderem Rust de forma
+        mais eficaz".
+
+    * Título: **New Rustacean** (podcast)
+
+      :Autor: Chris Krycho
+      :URL: https://newrustacean.com/
+      :Data: Acessado em 22 de setembro de 2024
+      :Palavras-chave: rust, podcast.
+      :Descrição: Do site: "Este é um podcast sobre aprender a linguagem de
+        programação Rust do zero! Além desta página inicial elegante, todo o
+        conteúdo do site é construído com as próprias ferramentas de documentação
+        do Rust".
+
+    * Título: **Opsem-team** (repositório)
+
+      :Autor: Equipe de semântica operacional (Operational semantics team)
+      :URL: https://github.com/rust-lang/opsem-team/tree/main
+      :Data: Acessado em 22 de setembro de 2024
+      :Palavras-chave: rust, repositório.
+      :Descrição: Do README: "A equipe opsem é a sucessora do grupo de trabalho
+        unsafe-code-guidelines e é responsável por responder a muitas das perguntas
+        difíceis sobre a semântica do Rust inseguro (unsafe Rust)".
+
+    * Título: **You Can't Spell Trust Without Rust**
+
+      :Autor: Alexis Beingessner
+      :URL: https://repository.library.carleton.ca/downloads/1j92g820w?locale=en
+      :Data: 2015
+      :Palavras-chave: rust, mestrado, tese.
+      :Descrição: Esta tese foca no sistema de propriedade (ownership) do Rust,
+        que garante a segurança de memória ao controlar a manipulação de dados e
+        o tempo de vida, enquanto também destaca suas limitações e o compara a
+        sistemas semelhantes no Cyclone e C++.
+
+    * Nome: **Apresentações de Rust no Linux Plumbers (LPC) 2024**
+
+      :Título: Rust microconference
+      :URL: https://lpc.events/event/18/sessions/186/#20240918
+      :Título: Rust for Linux
+      :URL: https://lpc.events/event/18/contributions/1912/
+      :Título: Journey of a C kernel engineer starting a Rust driver project
+      :URL: https://lpc.events/event/18/contributions/1911/
+      :Título: Crafting a Linux kernel scheduler that runs in user-space using Rust
+      :URL: https://lpc.events/event/18/contributions/1723/
+      :Título: openHCL: A Linux and Rust based paravisor
+      :URL: https://lpc.events/event/18/contributions/1956/
+      :Palavras-chave: rust, lpc, apresentações.
+      :Descrição: Uma série de palestras do LPC relacionadas ao Rust.
+
+    * Nome: **The Rustacean Station Podcast**
+
+      :URL: https://rustacean-station.org/
+      :Palavras-chave: rust, podcasts.
+      :Descrição: Um projeto comunitário para a criação de conteúdo em podcast
+        sobre a linguagem de programação Rust.
+
+-------
+
+Este documento foi originalmente baseado em:
+
+https://www.dit.upm.es/~jmseyas/linux/kernel/hackers-docs.html
+
+e escrito por Juan-Mariano de Goyeneche.
-- 
2.47.3


^ permalink raw reply related

* Re: [Linaro-mm-sig] Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Maxime Ripard @ 2026-05-19 16:32 UTC (permalink / raw)
  To: Christian König
  Cc: Albert Esteve, Barry Song, T.J. Mercier, Tejun Heo,
	Johannes Weiner, Michal Koutný, Jonathan Corbet, Shuah Khan,
	Sumit Semwal, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Benjamin Gaignard, Brian Starkey,
	John Stultz, Christian Brauner, Paul Moore, James Morris,
	Serge E. Hallyn, Stephen Smalley, Ondrej Mosnacek, Shuah Khan,
	cgroups, linux-doc, linux-kernel, linux-media, dri-,
	linaro-mm-sig, linux-mm, linux-security-module, selinux,
	linux-kselftest, echanude
In-Reply-To: <9cc79977-9a42-40eb-bfa7-460881c1e10f@amd.com>

[-- Attachment #1: Type: text/plain, Size: 6159 bytes --]

Hi Chritian,

On Tue, May 19, 2026 at 09:53:19AM +0200, Christian König wrote:
> On 5/18/26 14:06, Albert Esteve wrote:
> >>>>> udmabufs are already
> >>>>> memcg-charged, so adding a separate MEMCG_DMABUF would double count.
> >>>>> Are there any other exporters you had in mind that would benefit from
> >>>>> this approach?
> >>
> >> Well apart from DMA-buf memfd_create() is one of the things which as broken our neck in the past a couple of times.
> >>
> >> But thinking more about it what if instead of making this DMA-buf heaps specific what if we have a general cgroups function which allows to change accounting of a buffer referenced by a file descriptor to a different process?
> >>
> >> That would cover not only the DMA-buf heaps use case, but also all other DMA-buf with dmem and whatever we come up in the future as well.
> > 
> > I removed a draft adding an ioctl for charge transfer from the series
> > before sending because I wanted to focus on the charge_pid_fd approach
> > and keep things simple, deferring the recharge path to a follow-up
> > depending on feedback.
> > 
> > The main difference between my removed draft and what you're
> > describing, iiuc, is scope and layer: my draft was an explicit ioctl
> > on the dma-buf fd that the consumer calls to claim the charge (see
> > below), while you seem to be suggesting a more general kernel-internal
> > function that could work across buffer types and cgroup controllers,
> > so not necessarily userspace-initiated? A kernel-internal function
> > will need a way to identify the target process, which sounds similar
> > to the binder-backed approach from TJ [1]. For everything else, the
> > receiver still needs to declare itself, which the ioctl accomplishes.
> > 
> > ```
> > # When an app imports a daemon-allocated buffer, it can transfer the
> > charge to itself:
> > int buf_fd = receive_dmabuf_from_daemon();
> > ioctl(buf_fd, DMA_BUF_IOCTL_XFER_CHARGE); /* charge now attributed to
> > apps's cgroup */
> 
> Well that thinking goes into the right direction, but the requirements are still not completely
> covered as far as I can see.
> 
> Let me explain below a bit more.
> 
> > 
> > [1] https://lore.kernel.org/cgroups/20230109213809.418135-1-tjmercier@google.com/
> > 
> >>
> >> The only drawback I can see is that DMA-buf heap allocations would be temporarily accounted to the memory allocation daemon, but I don't think that this would be a problem.
> > 
> > The main reasons we moved away from TJ's transfer-based approach
> > toward `charge_pid_fd` are: avoid the transient charge window on the
> > daemon's cgroup; and to decouple from Binder, allowing any allocator
> > to use it.
> 
> Yeah those concerns are completely correct.
> 
> The application should not volunteering says 'Charge that buffer to
> me.', but rather that the daemon says force charge that buffer to this
> application and tell me when the application is over its limit.

I would agree, but with a caveat: how do we want to deal with malicious
applications here? The application should have expressed that it's okay
for it to be charged by a different process, otherwise it becomes
trivial for a malicious app to create arbitrary charges against another
application in the system and DoS it.

But then, that means that an application could arbitrarily charge the
daemon as well if it doesn't opt-in but asks for allocations.

So maybe we should have an opt-in for the caller, and a way for the
daemon to check if the caller has indeed opted in before performing the
allocation (and the charge transfer)?

> > Technically, both approaches could coexist, though. Of the three
> > scenarios TJ described:
> > - Scenario 2 is directly addressed by charge_pid_fd approach without
> > any transient charge on the daemon at the cost of one extra field in
> > the heap ioctl uAPI struct.
> 
> Yeah extending the uAPI to pass in the pid on allocation time is not
> much of a problem, but you also need to modify the whole stack above
> it and that is a bit more trickier.
> 
> > - Scenario 3 can be handled by the charge transfer function without
> > changes to SurfaceFlinger. The app or dequeueBuffer claims the charge
> > for itself or the app, respectively (depending on whether we include a
> > pid_fd field in the transfer ioctl). It also covers non-heap
> > exporters. The con in both variants is the transient charge window on
> > the daemon.
> 
> It should be trivial for the deamon to charge the buffer to an
> application before handing it out.
> 
> > Both approaches shift the responsibility for correct charging
> > attribution to userspace: first, 'charge_pid_fd` on the allocator's
> > side, and the transfer charge on the consumer's side.
> 
> Yeah that's why I said it would be better if we do that without any
> uAPI change, but with all the uAPI we have to transfer file
> descriptors (dup(), fork(), passing FDs over sockets etc...) it could
> be really tricky to implement that.
> 
> > Deciding on one, the other or both depends on how much we value
> > avoiding transient attribution, and how much we need a non-heap
> > generic solution. With the XFER_CHARGE we can cover both. Thus, the
> > `charge_pid_fd` approach in this RFC can be seen as a
> > performance/strictness optimisation, eliminating transient charges to
> > the daemon at the cost of a permanent uAPI addition to the heap ioctl
> > struct, but not strictly required for correctness.
> 
> Well all we need is a uAPI which says charge this buffer (file
> descriptor) to that cgroup (pidfd).
> 
> With this at hand we should be able to handle all use cases at the
> same time.
> 
> > On the other hand, if we agree on the end goal of migrating other
> > exporters to use dma-buf heaps
> 
> That won't work. DMA-buf heaps is actually only a rather small and
> Anroid specific use case.

I don't think that's true anymore. heaps are used in lots of different
use cases now in the embedded space, including in regular, generic,
components not specifically used for embedded systems.

Maxime

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 273 bytes --]

^ permalink raw reply

* Re: [PATCH v2 1/2] arm64/cpufeature: Define hwcaps for 2025 dpISA features
From: Mark Brown @ 2026-05-19 16:03 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Jonathan Corbet, Shuah Khan, linux-arm-kernel,
	linux-kernel, linux-doc, linux-kselftest
In-Reply-To: <agyAs0UXGulhFXga@willie-the-truck>

[-- Attachment #1: Type: text/plain, Size: 1049 bytes --]

On Tue, May 19, 2026 at 04:24:35PM +0100, Will Deacon wrote:
> On Mon, May 18, 2026 at 04:07:29PM +0100, Mark Brown wrote:

> > +HWCAP3_F16F32MM
> > +    Functionality implied by ID_AA64ISAR0_EL1.FHM == 0b0011

> > +HWCAP3_SVE_LUT6
> > +    Functionality implied by ID_AA64ISAR2_EL1.LUT == 0b0010 and
> > +    ID_AA64PFR0_EL1.SVE == 0b0001.

> I've queued this, but I'm curious why you've called out the
> 'ID_AA64PFR0_EL1.SVE == 0b0001' part here and not for any of the other
> SVE caps you're adding?

It was mostly due to the possibility of ID_AA64ISAR2_EL1.LUT getting a
new non-SVE value, now you mention it I should go back and add the same
restriction for the others due to the use of ID_AA64ZFR0_EL1 for SME
only systems.  It's the implemented behaviour.

>                         It's also formatted inconsistently from
> pre-existing entries (such as HWCAP2_SVE_B16B16) which put the
> ID_AA64PFR0_EL1.SVE part of the antecedent first.

No real reason for that, there just weren't other examples on screen at
the time I was editing this.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH v2 1/2] mm/memcontrol: add dmem charge/uncharge functions
From: Eric Chanudet @ 2026-05-19 15:59 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Maarten Lankhorst, Maxime Ripard,
	Natalie Vock, Tejun Heo, Michal Koutný, Jonathan Corbet,
	Shuah Khan
  Cc: cgroups, linux-mm, linux-kernel, dri-devel, T.J. Mercier,
	Christian König, Maxime Ripard, Albert Esteve, Dave Airlie,
	linux-doc, Eric Chanudet
In-Reply-To: <20260519-cgroup-dmem-memcg-double-charge-v2-0-db4d1407062b@redhat.com>

Add mem_cgroup_dmem_charge() and mem_cgroup_dmem_uncharge() to allow
dmem pool allocations to optionally be double-charged against the memory
controller. Take the struct cgroup from the dmem pool's css as there is
no convenient object exported to represent these allocations. These will
resolve the effective memory css from that cgroup and perform the
charge.

Introduce a MEMCG_DMEM stat counter to memory.stat to make the cgroup's
dmem charge visible.

Signed-off-by: Eric Chanudet <echanude@redhat.com>
---
 include/linux/memcontrol.h | 16 ++++++++++++
 mm/memcontrol.c            | 65 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 81 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dc3fa687759b45748b2acee6d7f43da325eb50c1..8e1d49b87fb64e6114f3eb920293e14920290fe7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -39,6 +39,7 @@ enum memcg_stat_item {
 	MEMCG_ZSWAP_B,
 	MEMCG_ZSWAPPED,
 	MEMCG_ZSWAP_INCOMP,
+	MEMCG_DMEM,
 	MEMCG_NR_STAT,
 };
 
@@ -1872,6 +1873,21 @@ static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
 }
 #endif
 
+#if defined(CONFIG_MEMCG) && defined(CONFIG_CGROUP_DMEM)
+bool mem_cgroup_dmem_charge(struct cgroup *cgrp, unsigned int nr_pages,
+			    gfp_t gfp_mask);
+void mem_cgroup_dmem_uncharge(struct cgroup *cgrp, unsigned int nr_pages);
+#else
+static inline bool mem_cgroup_dmem_charge(struct cgroup *cgrp,
+					  unsigned int nr_pages, gfp_t gfp_mask)
+{
+	return true;
+}
+static inline void mem_cgroup_dmem_uncharge(struct cgroup *cgrp,
+					    unsigned int nr_pages)
+{
+}
+#endif
 
 /* Cgroup v1-related declarations */
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c03d4787d466803db49cdaa90e6d6ba426b7afe2..91a7ac16b6eac2d6c3700b6885a068bf8b640706 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -433,6 +433,7 @@ static const unsigned int memcg_stat_items[] = {
 	MEMCG_ZSWAP_B,
 	MEMCG_ZSWAPPED,
 	MEMCG_ZSWAP_INCOMP,
+	MEMCG_DMEM,
 };
 
 #define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items)
@@ -1606,6 +1607,9 @@ static const struct memory_stat memory_stats[] = {
 #ifdef CONFIG_NUMA_BALANCING
 	{ "pgpromote_success",		PGPROMOTE_SUCCESS	},
 #endif
+#ifdef CONFIG_CGROUP_DMEM
+	{ "dmem",			MEMCG_DMEM		},
+#endif
 };
 
 /* The actual unit of the state item, not the same as the output unit */
@@ -5909,6 +5913,67 @@ static struct cftype zswap_files[] = {
 };
 #endif /* CONFIG_ZSWAP */
 
+#ifdef CONFIG_CGROUP_DMEM
+/**
+ * mem_cgroup_dmem_charge - charge memcg for a dmem pool allocation
+ * @cgrp: cgroup of the dmem pool
+ * @nr_pages: number of pages to charge
+ * @gfp_mask: reclaim mode
+ *
+ * Charges @nr_pages to @memcg. Returns %true if the charge fit within
+ * @memcg's configured limit, %false if it doesn't.
+ */
+bool mem_cgroup_dmem_charge(struct cgroup *cgrp, unsigned int nr_pages,
+			    gfp_t gfp_mask)
+{
+	struct cgroup_subsys_state *mem_css;
+	struct mem_cgroup *memcg;
+
+	/* CGROUP_DMEM and MEMCG guarantees this cannot be NULL. */
+	mem_css = cgroup_get_e_css(cgrp, &memory_cgrp_subsys);
+
+	/* Use the memcg, if any, of the dmem cgroup. */
+	memcg = mem_cgroup_from_css(mem_css);
+	if (!memcg || mem_cgroup_is_root(memcg)) {
+		css_put(mem_css);
+		return false;
+	}
+
+	if (try_charge_memcg(memcg, gfp_mask, nr_pages)) {
+		css_put(mem_css);
+		return false;
+	}
+
+	mod_memcg_state(memcg, MEMCG_DMEM, nr_pages);
+	css_put(mem_css);
+	return true;
+}
+
+/**
+ * mem_cgroup_dmem_uncharge - uncharge memcg from a dmem pool allocation
+ * @cgrp: cgroup of the dmem pool
+ * @nr_pages: number of pages to uncharge
+ */
+void mem_cgroup_dmem_uncharge(struct cgroup *cgrp, unsigned int nr_pages)
+{
+	struct cgroup_subsys_state *mem_css;
+	struct mem_cgroup *memcg;
+
+	/* CGROUP_DMEM and MEMCG guarantees this cannot be NULL. */
+	mem_css = cgroup_get_e_css(cgrp, &memory_cgrp_subsys);
+
+	memcg = mem_cgroup_from_css(mem_css);
+	if (!memcg || mem_cgroup_is_root(memcg)) {
+		css_put(mem_css);
+		return;
+	}
+
+	mod_memcg_state(memcg, MEMCG_DMEM, -nr_pages);
+	refill_stock(memcg, nr_pages);
+	css_put(mem_css);
+}
+#endif /* CONFIG_CGROUP_DMEM */
+
 static int __init mem_cgroup_swap_init(void)
 {
 	if (mem_cgroup_disabled())

-- 
2.52.0


^ permalink raw reply related

* [PATCH v2 0/2] cgroup/dmem: allow double-charging dmem allocations to memcg
From: Eric Chanudet @ 2026-05-19 15:59 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Maarten Lankhorst, Maxime Ripard,
	Natalie Vock, Tejun Heo, Michal Koutný, Jonathan Corbet,
	Shuah Khan
  Cc: cgroups, linux-mm, linux-kernel, dri-devel, T.J. Mercier,
	Christian König, Maxime Ripard, Albert Esteve, Dave Airlie,
	linux-doc, Eric Chanudet

Following suggestion[1], offer a cgroupfs entry to allow an
administrator to request that a dmem controlled region also charges to
the memory controller.

Add mem_cgroup_dmem_charge/uncharge helpers to resolve the effective
cgroup from a dmem pool's cgroup, perform the charge and update a
MEMCG_DMEM stat counter.

Add a "dmem.memcg" control file at the root level to configure memcg
charging per region. The setting is disabled by default and locked on
first charge attempt.

[1] https://lore.kernel.org/all/a446b598-5041-450b-aaa9-3c39a09ff6a0@amd.com/

Signed-off-by: Eric Chanudet <echanude@redhat.com>
---
Changes in v2:
- Use mem_cgroup_dmem_{,un}charge to account for memcg pages instead of
  exposing raw nr_pages functions. Use it to centralize where to find
  the effective cgroup from the pool's cgroup (Johannes)
- Set depends_on for cgrp_memory if CONFIG_MEMCG by having a memory
  controller in children cgroup (Michal)
- Move dmem.memcg to the root level as it applies by region for all
  cgroups
- Add a dmem memory.stats entry for reporting memcg charges for dmem
  allocations.
- Wrap the memcg enable/disable/lock configuration under a single state
  to avoid toctou races and simplify transitions.
- Link to v1: https://lore.kernel.org/r/20260403-cgroup-dmem-memcg-double-charge-v1-0-c371d155de2a@redhat.com

---
Eric Chanudet (2):
      mm/memcontrol: add dmem charge/uncharge functions
      cgroup/dmem: add dmem.memcg control file for double-charging to memcg

 Documentation/admin-guide/cgroup-v2.rst |  23 +++++
 include/linux/memcontrol.h              |  16 ++++
 kernel/cgroup/dmem.c                    | 158 +++++++++++++++++++++++++++++++-
 mm/memcontrol.c                         |  65 +++++++++++++
 4 files changed, 259 insertions(+), 3 deletions(-)
---
base-commit: d989f135f71699294bb2ffd4726b526456e2db68
change-id: 20260327-cgroup-dmem-memcg-double-charge-0f100a9ffbf2

Best regards,
-- 
Eric Chanudet <echanude@redhat.com>


^ permalink raw reply

* Re: [PATCH 00/12] misc/syncobj: add /dev/syncobj device
From: Christian König @ 2026-05-19 16:00 UTC (permalink / raw)
  To: Xaver Hugl
  Cc: Julian Orth, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Sumit Semwal, Jonathan Corbet,
	Shuah Khan, Arnd Bergmann, Greg Kroah-Hartman, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, linux-doc,
	wayland-devel, Michel Dänzer
In-Reply-To: <CAFZQkGwmeipZnvmBkcE7KhvUSMkSE=fzLBZtiMyhv3mM04Vudg@mail.gmail.com>

On 5/19/26 17:31, Xaver Hugl wrote:
> Am Di., 19. Mai 2026 um 15:29 Uhr schrieb Christian König
> <christian.koenig@amd.com>:
>>> 1. This series makes the ability to manipulate syncobjs available
>>> independently of attached hardware.
>>> 2. It makes it available under a consistent path /dev/syncobj.
>>
>> Exactly that is a big no-go. This has to be under /dev/dri.
> FWIW udmabuf is also under /dev directly, but I don't think any
> compositor developer would complain about a different path.
> What are the rules for that? Could this simply be put in /dev/dri/syncobj?

The syncobj are actually the DRM specific way of doing things. The general kernel wide way is to use sync files (see drivers/dma-buf/sync_file.c).

But there has already been tons of problems with those sync files. E.g. they doesn't support your use case at all since they don't have wait before submit behavior.

So there are already ways to do this, but the Linux kernel so far told everybody that this is forbidden. The DRM syncobj wait before signal functionality is much better, but then basically the second try to do this.

> The part where we get this independent of attached hardware is quite
> important for us though, since we can't just ignore explicit sync once
> the device we previously imported the syncobj into is disconnected.

Can you elaborate more on this?

> Buffers can be from any device or allocated in system memory and
> access should be synchronized properly in all cases.
> 
> How exactly it's made available isn't all that critical.
> 
>>> 3. It removes the need to translate between syncobjs fds and handles.
>>
>> That's a pretty big no-go as well. The differentiation between FDs and handles is completely intentional.
> Could you expand on why it's needed? For compositors, the handle is
> just an intermediary thing when translating between file descriptors.

Well what we could do is to add an IOCTL to directly attach an syncobj file descriptor to an eventfd.

> FTR for me at least, this part would be merely nice to have, since it
> slightly reduces the amount of ioctls a compositor needs to call, but
> it's not important.
> 
>>>> What about using VGEM for this?
>>>
>>> If the vgem render node were made available unconditionally under,
>>
>> Software rendering is a complete corner case, I don't think that this will be enabled by default.
> That simply makes vgem unsuitable for solving the problems we face in
> compositors.

Thinking more about it vgem also has the same issues as sync file mentioned above. So that is really also not doable.

Maybe Simona or David have another idea.

Regards,
Christian.

> 
> - Xaver


^ permalink raw reply

* [PATCH v2 2/2] cgroup/dmem: add dmem.memcg control file for double-charging to memcg
From: Eric Chanudet @ 2026-05-19 15:59 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Maarten Lankhorst, Maxime Ripard,
	Natalie Vock, Tejun Heo, Michal Koutný, Jonathan Corbet,
	Shuah Khan
  Cc: cgroups, linux-mm, linux-kernel, dri-devel, T.J. Mercier,
	Christian König, Maxime Ripard, Albert Esteve, Dave Airlie,
	linux-doc, Eric Chanudet
In-Reply-To: <20260519-cgroup-dmem-memcg-double-charge-v2-0-db4d1407062b@redhat.com>

Add a root-only cgroupfs file "dmem.memcg" that lets an administrator
configure whether allocations in a dmem region should also be charged to
the memory controller.

To handle inheritance, dmem adds a depends_on the memory controller,
unless MEMCG isn't configured in.

Double-charging is disabled by default. Once a charge is attempted, the
setting is locked to prevent inconsistent accounting by a small 4-state
machine (off, on, locked off, locked on).

The memcg to charge is derived from the pool's cgroup, since the pool
holds a reference to the dmem cgroup state that keeps the cgroup alive
until it gets uncharged.

Signed-off-by: Eric Chanudet <echanude@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  23 +++++
 kernel/cgroup/dmem.c                    | 158 +++++++++++++++++++++++++++++++-
 2 files changed, 178 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed995b1550317662bc1b56c7a7f3db23..1d2fa55ddf0faa17baa916a8914d3033e8e42359 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2828,6 +2828,29 @@ DMEM Interface Files
 	  drm/0000:03:00.0/vram0 12550144
 	  drm/0000:03:00.0/stolen 8650752
 
+  dmem.memcg
+	A readwrite nested-keyed file that exists only on the root
+	cgroup. It configures whether allocations in a dmem region
+	should also be charged to the memory controller.
+
+	Upon the first charge to a region, its setting can no longer be changed
+	and is reported as "[true|false] (locked)".
+
+	Charges to the memory controller are visible in ``memory.stat`` as the
+	``dmem`` entry, reported in bytes.
+
+	An example read output follows::
+
+	  drm/0000:03:00.0/vram0 false
+	  drm/0000:03:00.0/stolen false (locked)
+
+	Writing uses the same nested-keyed format::
+
+	  echo "drm/0000:03:00.0/vram0 true" > dmem.memcg
+
+	This file is only available when the kernel is built with
+	``CONFIG_MEMCG``.
+
 HugeTLB
 -------
 
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 1ab1fb47f2711ecc60dd13e611a8a4920b48f3e9..e07b20b8025c528f190f84c76b088cb8a32a7f5e 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -17,6 +17,14 @@
 #include <linux/refcount.h>
 #include <linux/rculist.h>
 #include <linux/slab.h>
+#include <linux/memcontrol.h>
+
+enum dmem_memcg_status {
+	DMEM_MEMCG_OFF,
+	DMEM_MEMCG_ON,
+	DMEM_MEMCG_LOCKED_OFF,
+	DMEM_MEMCG_LOCKED_ON,
+};
 
 struct dmem_cgroup_region {
 	/**
@@ -51,6 +59,14 @@ struct dmem_cgroup_region {
 	 * No new pools should be added to the region afterwards.
 	 */
 	bool unregistered;
+
+	/**
+	 * @memcg_status: Whether allocation in this region should charge memcg.
+	 * DMEM_MEMCG_OFF/DMEM_MEMCG_ON or
+	 * DMEM_MEMCG_LOCKED_OFF/DMEM_MEMCG_LOCKED_ON, frozen after first allocation.
+	 * Transitions to a locked state are one-way.
+	 */
+	atomic_t memcg_status;
 };
 
 struct dmemcg_state {
@@ -609,6 +625,34 @@ get_cg_pool_unlocked(struct dmemcg_state *cg, struct dmem_cgroup_region *region)
 	return pool;
 }
 
+static bool apply_memcg_charge(atomic_t *status)
+{
+	int state = atomic_read(status);
+
+	for (;;) {
+		switch (state) {
+		case DMEM_MEMCG_OFF:
+			state = atomic_cmpxchg(status, DMEM_MEMCG_OFF,
+					       DMEM_MEMCG_LOCKED_OFF);
+			if (state != DMEM_MEMCG_OFF)
+				continue;
+			return false;
+		case DMEM_MEMCG_LOCKED_OFF:
+			return false;
+		case DMEM_MEMCG_ON:
+			state = atomic_cmpxchg(status, DMEM_MEMCG_ON,
+					       DMEM_MEMCG_LOCKED_ON);
+			if (state != DMEM_MEMCG_ON)
+				continue;
+			return true;
+		case DMEM_MEMCG_LOCKED_ON:
+			return true;
+		}
+		WARN_ONCE(1, "Invalid memcg_status (%#x).\n", state);
+		return false;
+	}
+}
+
 /**
  * dmem_cgroup_uncharge() - Uncharge a pool.
  * @pool: Pool to uncharge.
@@ -624,6 +668,12 @@ void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64 size)
 		return;
 
 	page_counter_uncharge(&pool->cnt, size);
+
+	if (atomic_read(&pool->region->memcg_status) == DMEM_MEMCG_LOCKED_ON &&
+	    !WARN_ON_ONCE(size > (u64)UINT_MAX << PAGE_SHIFT))
+		mem_cgroup_dmem_uncharge(pool->cs->css.cgroup,
+					 PAGE_ALIGN(size) >> PAGE_SHIFT);
+
 	css_put(&pool->cs->css);
 	dmemcg_pool_put(pool);
 }
@@ -655,6 +705,8 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
 	struct dmemcg_state *cg;
 	struct dmem_cgroup_pool_state *pool;
 	struct page_counter *fail;
+	unsigned long nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
+	bool charge_memcg;
 	int ret;
 
 	*ret_pool = NULL;
@@ -670,7 +722,28 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
 	pool = get_cg_pool_unlocked(cg, region);
 	if (IS_ERR(pool)) {
 		ret = PTR_ERR(pool);
-		goto err;
+		goto err_css_put;
+	}
+
+	charge_memcg = apply_memcg_charge(&region->memcg_status);
+	if (charge_memcg) {
+		/* mem_cgroup_dmem_charge limitation from try_charge_memcg */
+		if (size > (u64)UINT_MAX << PAGE_SHIFT) {
+			ret = -EINVAL;
+			dmemcg_pool_put(pool);
+			goto err_css_put;
+		}
+
+		if (!mem_cgroup_dmem_charge(pool->cs->css.cgroup, nr_pages,
+					    GFP_KERNEL)) {
+			/*
+			 * No dmem_cgroup_state_evict_valuable() could help,
+			 * there's no ret_limit_pool to return.
+			 */
+			ret = -ENOMEM;
+			dmemcg_pool_put(pool);
+			goto err_css_put;
+		}
 	}
 
 	if (!page_counter_try_charge(&pool->cnt, size, &fail)) {
@@ -681,14 +754,17 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
 		}
 		dmemcg_pool_put(pool);
 		ret = -EAGAIN;
-		goto err;
+		goto err_uncharge_memcg;
 	}
 
 	/* On success, reference from get_current_dmemcs is transferred to *ret_pool */
 	*ret_pool = pool;
 	return 0;
 
-err:
+err_uncharge_memcg:
+	if (charge_memcg)
+		mem_cgroup_dmem_uncharge(pool->cs->css.cgroup, nr_pages);
+err_css_put:
 	css_put(&cg->css);
 	return ret;
 }
@@ -845,6 +921,71 @@ static ssize_t dmem_cgroup_region_max_write(struct kernfs_open_file *of,
 	return dmemcg_limit_write(of, buf, nbytes, off, set_resource_max);
 }
 
+#ifdef CONFIG_MEMCG
+static int dmem_cgroup_memcg_show(struct seq_file *sf, void *v)
+{
+	struct dmem_cgroup_region *region;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(region, &dmem_cgroup_regions, region_node) {
+		int state = atomic_read(&region->memcg_status);
+
+		seq_printf(sf, "%s %s\n", region->name,
+			   state == DMEM_MEMCG_ON ? "true" :
+			   state == DMEM_MEMCG_OFF ? "false" :
+			   state == DMEM_MEMCG_LOCKED_ON ? "true (locked)" :
+			   state == DMEM_MEMCG_LOCKED_OFF ? "false (locked)" :
+			   "(invalid)");
+	}
+	rcu_read_unlock();
+	return 0;
+}
+
+static ssize_t dmem_cgroup_memcg_write(struct kernfs_open_file *of, char *buf,
+				       size_t nbytes, loff_t off)
+{
+	while (buf) {
+		struct dmem_cgroup_region *region;
+		char *options, *name;
+		bool flag;
+
+		options = buf;
+		buf = strchr(buf, '\n');
+		if (buf)
+			*buf++ = '\0';
+
+		options = strstrip(options);
+		if (!options[0])
+			continue;
+
+		name = strsep(&options, " \t");
+		if (!name[0])
+			continue;
+
+		if (!options || !options[0])
+			return -EINVAL;
+
+		if (kstrtobool(options, &flag))
+			return -EINVAL;
+
+		rcu_read_lock();
+		region = dmemcg_get_region_by_name(name);
+		rcu_read_unlock();
+		if (!region)
+			return -ENODEV;
+
+		atomic_cmpxchg(&region->memcg_status,
+			       flag ? DMEM_MEMCG_OFF : DMEM_MEMCG_ON,
+			       flag ? DMEM_MEMCG_ON : DMEM_MEMCG_OFF);
+		/* Continue if a region is already locked. */
+
+		kref_put(&region->ref, dmemcg_free_region);
+	}
+
+	return nbytes;
+}
+#endif
+
 static struct cftype files[] = {
 	{
 		.name = "capacity",
@@ -873,6 +1014,14 @@ static struct cftype files[] = {
 		.seq_show = dmem_cgroup_region_max_show,
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
+#ifdef CONFIG_MEMCG
+	{
+		.name = "memcg",
+		.write = dmem_cgroup_memcg_write,
+		.seq_show = dmem_cgroup_memcg_show,
+		.flags = CFTYPE_ONLY_ON_ROOT,
+	},
+#endif
 	{ } /* Zero entry terminates. */
 };
 
@@ -882,4 +1031,7 @@ struct cgroup_subsys dmem_cgrp_subsys = {
 	.css_offline	= dmemcs_offline,
 	.legacy_cftypes	= files,
 	.dfl_cftypes	= files,
+#ifdef CONFIG_MEMCG
+	.depends_on	= 1 << memory_cgrp_id,
+#endif
 };

-- 
2.52.0


^ permalink raw reply related

* [PATCH v2] tpm: tpm_tis: Add optional delay after relinquish
From: Jim Broadus @ 2026-05-19 15:45 UTC (permalink / raw)
  To: linux-integrity, linux-kernel, linux-doc
  Cc: peterhuewe, jarkko, jgg, Jim Broadus

Some TPMs fail to grant locality when requested immediately after being
relinquished. In this case, the TPM_ACCESS_REQUEST_USE bit of the
TPM_ACCESS register is cleared immediately without setting
TPM_ACCESS_ACTIVE_LOCALITY.

This issue can be seen at boot since tpm_chip_start, called right
after locality is relinquished, fails. This causes the probe to fail:

tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -1

This occurs on some older Dell Latitudes and maybe others. To work
around this, add a "settle" boolean param to tpm_tis. When this is
enabled, a delay is added after locality is relinquished.

Signed-off-by: Jim Broadus <jbroadus@gmail.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 7 +++++++
 drivers/char/tpm/tpm_tis.c                      | 7 +++++++
 drivers/char/tpm/tpm_tis_core.c                 | 3 +++
 drivers/char/tpm/tpm_tis_core.h                 | 1 +
 4 files changed, 18 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 4d0f545fb3ec..5b7111033fbb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -7651,6 +7651,13 @@ Kernel parameters
 			defined by Trusted Computing Group (TCG) see
 			https://trustedcomputinggroup.org/resource/pc-client-platform-tpm-profile-ptp-specification/
 
+	tpm_tis.settle= [HW,TPM]
+			Format: <bool>
+			When enabled, this adds a delay after locality is
+			relinquished. Some TPMs will fail to grant locality if
+			requested immediately after being relinquished. This
+			causes the probe to fail.
+
 	tp_printk	[FTRACE]
 			Have the tracepoints sent to printk as well as the
 			tracing ring buffer. This is useful for early boot up
diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c
index 9aa230a63616..8ac0ea78570e 100644
--- a/drivers/char/tpm/tpm_tis.c
+++ b/drivers/char/tpm/tpm_tis.c
@@ -101,6 +101,10 @@ module_param(force, bool, 0444);
 MODULE_PARM_DESC(force, "Force device probe rather than using ACPI entry");
 #endif
 
+static bool settle;
+module_param(settle, bool, 0444);
+MODULE_PARM_DESC(settle, "Add settle time after relinquish");
+
 #if defined(CONFIG_PNP) && defined(CONFIG_ACPI)
 static int has_hid(struct acpi_device *dev, const char *hid)
 {
@@ -242,6 +246,9 @@ static int tpm_tis_init(struct device *dev, struct tpm_info *tpm_info)
 	if (itpm || is_itpm(ACPI_COMPANION(dev)))
 		set_bit(TPM_TIS_ITPM_WORKAROUND, &phy->priv.flags);
 
+	if (settle)
+		set_bit(TPM_TIS_SETTLE_AFTER_RELINQUISH, &phy->priv.flags);
+
 	return tpm_tis_core_init(dev, &phy->priv, irq, &tpm_tcg,
 				 ACPI_HANDLE(dev));
 }
diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
index 21d79ad3b164..fbeee085098e 100644
--- a/drivers/char/tpm/tpm_tis_core.c
+++ b/drivers/char/tpm/tpm_tis_core.c
@@ -171,6 +171,9 @@ static int __tpm_tis_relinquish_locality(struct tpm_tis_data *priv, int l)
 {
 	tpm_tis_write8(priv, TPM_ACCESS(l), TPM_ACCESS_ACTIVE_LOCALITY);
 
+	if (test_bit(TPM_TIS_SETTLE_AFTER_RELINQUISH, &priv->flags))
+		tpm_msleep(TPM_TIMEOUT);
+
 	return 0;
 }
 
diff --git a/drivers/char/tpm/tpm_tis_core.h b/drivers/char/tpm/tpm_tis_core.h
index 6c3aa480396b..413cac5e0f31 100644
--- a/drivers/char/tpm/tpm_tis_core.h
+++ b/drivers/char/tpm/tpm_tis_core.h
@@ -90,6 +90,7 @@ enum tpm_tis_flags {
 	TPM_TIS_DEFAULT_CANCELLATION	= 2,
 	TPM_TIS_IRQ_TESTED		= 3,
 	TPM_TIS_STATUS_VALID_RETRY	= 4,
+	TPM_TIS_SETTLE_AFTER_RELINQUISH	= 5,
 };
 
 struct tpm_tis_data {
-- 
2.54.0


^ permalink raw reply related

* [PATCH v3 2/2] iio: dac: Add AD5529R DAC driver support
From: Janani Sunil @ 2026-05-19 15:42 UTC (permalink / raw)
  To: Lars-Peter Clausen, Michael Hennerich, Jonathan Cameron,
	David Lechner, Nuno Sá, Andy Shevchenko, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Philipp Zabel, Jonathan Corbet,
	Shuah Khan
  Cc: linux-iio, devicetree, linux-kernel, linux-doc, Janani Sunil,
	Janani Sunil
In-Reply-To: <20260519-ad5529r-driver-v3-0-267c0731aa68@analog.com>

Add support for AD5529R 16-channel, 12/16 bit Digital to Analog Converter

Signed-off-by: Janani Sunil <janani.sunil@analog.com>
---
 MAINTAINERS               |   1 +
 drivers/iio/dac/Kconfig   |  17 ++
 drivers/iio/dac/Makefile  |   1 +
 drivers/iio/dac/ad5529r.c | 527 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 546 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 320e84765ce6..143714e27d51 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1513,6 +1513,7 @@ L:	linux-iio@vger.kernel.org
 S:	Supported
 W:	https://ez.analog.com/linux-software-drivers
 F:	Documentation/devicetree/bindings/iio/dac/adi,ad5529r.yaml
+F:	drivers/iio/dac/ad5529r.c
 
 ANALOG DEVICES INC AD5706R DRIVER
 M:	Alexis Czezar Torreno <alexisczezar.torreno@analog.com>
diff --git a/drivers/iio/dac/Kconfig b/drivers/iio/dac/Kconfig
index 657c68e75542..bb1d59889a2a 100644
--- a/drivers/iio/dac/Kconfig
+++ b/drivers/iio/dac/Kconfig
@@ -134,6 +134,23 @@ config AD5449
 	  To compile this driver as a module, choose M here: the
 	  module will be called ad5449.
 
+config AD5529R
+	tristate "Analog Devices AD5529R High Voltage DAC driver"
+	depends on SPI_MASTER
+	select REGMAP_SPI
+	help
+	  Say yes here to build support for Analog Devices AD5529R
+	  16-Channel, 12-Bit/16-Bit, 40V High Voltage Precision Digital to Analog
+	  Converter.
+
+	  The device features multiple output voltage ranges from -20V to +20V,
+	  built-in 4.096V voltage reference, and digital functions including
+	  toggle, dither, and ramp modes. Supports both 12-bit and 16-bit
+	  resolution variants.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called ad5529r.
+
 config AD5592R_BASE
 	tristate
 
diff --git a/drivers/iio/dac/Makefile b/drivers/iio/dac/Makefile
index 003431798498..f35e060b3643 100644
--- a/drivers/iio/dac/Makefile
+++ b/drivers/iio/dac/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_AD5446) += ad5446.o
 obj-$(CONFIG_AD5446_SPI) += ad5446-spi.o
 obj-$(CONFIG_AD5446_I2C) += ad5446-i2c.o
 obj-$(CONFIG_AD5449) += ad5449.o
+obj-$(CONFIG_AD5529R) += ad5529r.o
 obj-$(CONFIG_AD5592R_BASE) += ad5592r-base.o
 obj-$(CONFIG_AD5592R) += ad5592r.o
 obj-$(CONFIG_AD5593R) += ad5593r.o
diff --git a/drivers/iio/dac/ad5529r.c b/drivers/iio/dac/ad5529r.c
new file mode 100644
index 000000000000..9bb63030db95
--- /dev/null
+++ b/drivers/iio/dac/ad5529r.c
@@ -0,0 +1,527 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AD5529R Digital-to-Analog Converter Driver
+ * 16-Channel, 12/16-Bit, 40V High Voltage Precision DAC
+ *
+ * Copyright 2026 Analog Devices Inc.
+ * Author: Janani Sunil <janani.sunil@analog.com>
+ */
+
+#include <linux/array_size.h>
+#include <linux/bits.h>
+#include <linux/delay.h>
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/iio/iio.h>
+#include <linux/mod_devicetable.h>
+#include <linux/module.h>
+#include <linux/property.h>
+#include <linux/regmap.h>
+#include <linux/regulator/consumer.h>
+#include <linux/reset.h>
+#include <linux/spi/spi.h>
+
+#define AD5529R_REG_INTERFACE_CONFIG_A		0x00
+#define AD5529R_REG_DEVICE_CONFIG		0x02
+#define AD5529R_REG_CHIP_GRADE			0x06
+#define AD5529R_REG_SCRATCH_PAD			0x0A
+#define AD5529R_REG_SPI_REVISION		0x0B
+#define AD5529R_REG_VENDOR_H			0x0D
+#define AD5529R_REG_STREAM_MODE			0x0E
+#define AD5529R_REG_INTERFACE_STATUS_A		0x11
+#define AD5529R_REG_MULTI_DAC_CH_SEL		0x14
+#define AD5529R_REG_OUT_RANGE_BASE		0x3C
+#define AD5529R_REG_OUT_RANGE(ch)		(AD5529R_REG_OUT_RANGE_BASE + (ch) * 2)
+#define AD5529R_REG_DAC_INPUT_A_BASE		0x148
+#define AD5529R_REG_DAC_INPUT_A(ch)		(AD5529R_REG_DAC_INPUT_A_BASE + (ch) * 2)
+#define AD5529R_REG_DAC_DATA_READBACK_BASE	0x16A
+#define AD5529R_REG_TSENS_ALERT_FLAG		0x18C
+#define AD5529R_REG_TSENS_SHTD_FLAG		0x18E
+#define AD5529R_REG_FUNC_BUSY			0x1A0
+#define AD5529R_REG_REF_SEL			0x1A2
+#define AD5529R_REG_INIT_CRC_ERR_STAT		0x1A4
+#define AD5529R_REG_MULTI_DAC_HOTPATH_SW_LDAC	0x1A8
+
+#define   AD5529R_INTERFACE_CONFIG_A_SW_RESET	(BIT(7) | BIT(0))
+#define   AD5529R_INTERFACE_CONFIG_A_ADDR_ASCENSION	BIT(5)
+#define   AD5529R_INTERFACE_CONFIG_A_SDO_ENABLE	BIT(4)
+#define   AD5529R_REF_SEL_MASK			BIT(0)
+#define   AD5529R_MAX_REGISTER			0x232
+#define   AD5529R_8BIT_REG_MAX			0x13
+#define   AD5529R_SPI_READ_FLAG			0x80
+
+struct ad5529r_model_data {
+	const char *model_name;
+	unsigned int resolution;
+	const struct iio_chan_spec *channels;
+	unsigned int num_channels;
+};
+
+#define AD5529R_DAC_CHANNEL(chan, bits) {			\
+	.type = IIO_VOLTAGE,					\
+	.indexed = 1,						\
+	.output = 1,						\
+	.channel = (chan),					\
+	.info_mask_separate = BIT(IIO_CHAN_INFO_RAW) |		\
+			      BIT(IIO_CHAN_INFO_SCALE) |	\
+			      BIT(IIO_CHAN_INFO_OFFSET),	\
+	.scan_type = {						\
+		.format = 'u',					\
+		.realbits = (bits),				\
+		.storagebits = 16,				\
+	},							\
+}
+
+static const char * const ad5529r_supply_names[] = {
+	"vdd",
+	"avdd",
+	"hvdd",
+};
+
+static const struct iio_chan_spec ad5529r_channels_16bit[] = {
+	AD5529R_DAC_CHANNEL(0, 16),
+	AD5529R_DAC_CHANNEL(1, 16),
+	AD5529R_DAC_CHANNEL(2, 16),
+	AD5529R_DAC_CHANNEL(3, 16),
+	AD5529R_DAC_CHANNEL(4, 16),
+	AD5529R_DAC_CHANNEL(5, 16),
+	AD5529R_DAC_CHANNEL(6, 16),
+	AD5529R_DAC_CHANNEL(7, 16),
+	AD5529R_DAC_CHANNEL(8, 16),
+	AD5529R_DAC_CHANNEL(9, 16),
+	AD5529R_DAC_CHANNEL(10, 16),
+	AD5529R_DAC_CHANNEL(11, 16),
+	AD5529R_DAC_CHANNEL(12, 16),
+	AD5529R_DAC_CHANNEL(13, 16),
+	AD5529R_DAC_CHANNEL(14, 16),
+	AD5529R_DAC_CHANNEL(15, 16),
+};
+
+static const struct iio_chan_spec ad5529r_channels_12bit[] = {
+	AD5529R_DAC_CHANNEL(0, 12),
+	AD5529R_DAC_CHANNEL(1, 12),
+	AD5529R_DAC_CHANNEL(2, 12),
+	AD5529R_DAC_CHANNEL(3, 12),
+	AD5529R_DAC_CHANNEL(4, 12),
+	AD5529R_DAC_CHANNEL(5, 12),
+	AD5529R_DAC_CHANNEL(6, 12),
+	AD5529R_DAC_CHANNEL(7, 12),
+	AD5529R_DAC_CHANNEL(8, 12),
+	AD5529R_DAC_CHANNEL(9, 12),
+	AD5529R_DAC_CHANNEL(10, 12),
+	AD5529R_DAC_CHANNEL(11, 12),
+	AD5529R_DAC_CHANNEL(12, 12),
+	AD5529R_DAC_CHANNEL(13, 12),
+	AD5529R_DAC_CHANNEL(14, 12),
+	AD5529R_DAC_CHANNEL(15, 12),
+};
+
+static const struct ad5529r_model_data ad5529r_16bit_model_data = {
+	.model_name = "ad5529r-16",
+	.resolution = 16,
+	.channels = ad5529r_channels_16bit,
+	.num_channels = ARRAY_SIZE(ad5529r_channels_16bit),
+};
+
+static const struct ad5529r_model_data ad5529r_12bit_model_data = {
+	.model_name = "ad5529r-12",
+	.resolution = 12,
+	.channels = ad5529r_channels_12bit,
+	.num_channels = ARRAY_SIZE(ad5529r_channels_12bit),
+};
+
+enum ad5529r_output_range {
+	AD5529R_RANGE_0V_5V,
+	AD5529R_RANGE_0V_10V,
+	AD5529R_RANGE_0V_20V,
+	AD5529R_RANGE_0V_40V,
+	AD5529R_RANGE_NEG5V_5V,
+	AD5529R_RANGE_NEG10V_10V,
+	AD5529R_RANGE_NEG15V_15V,
+	AD5529R_RANGE_NEG20V_20V,
+};
+
+static const s32 ad5529r_output_ranges_mv[8][2] = {
+	[AD5529R_RANGE_0V_5V] = { 0, 5000 },
+	[AD5529R_RANGE_0V_10V] = { 0, 10000 },
+	[AD5529R_RANGE_0V_20V] = { 0, 20000 },
+	[AD5529R_RANGE_0V_40V] = { 0, 40000 },
+	[AD5529R_RANGE_NEG5V_5V] = { -5000, 5000 },
+	[AD5529R_RANGE_NEG10V_10V] = { -10000, 10000 },
+	[AD5529R_RANGE_NEG15V_15V] = { -15000, 15000 },
+	[AD5529R_RANGE_NEG20V_20V] = { -20000, 20000 },
+};
+
+struct ad5529r_state {
+	struct spi_device *spi;
+	const struct ad5529r_model_data *model_data;
+	struct regmap *regmap_8bit;
+	struct regmap *regmap_16bit;
+	struct regulator *vref_regulator;
+	enum ad5529r_output_range output_range_idx[16];
+};
+
+static const struct regmap_range ad5529r_8bit_readable_ranges[] = {
+	regmap_reg_range(AD5529R_REG_INTERFACE_CONFIG_A, AD5529R_REG_CHIP_GRADE),
+	regmap_reg_range(AD5529R_REG_SCRATCH_PAD, AD5529R_REG_VENDOR_H),
+	regmap_reg_range(AD5529R_REG_STREAM_MODE, AD5529R_REG_INTERFACE_STATUS_A),
+};
+
+static const struct regmap_range ad5529r_16bit_readable_ranges[] = {
+	regmap_reg_range(AD5529R_REG_MULTI_DAC_CH_SEL, AD5529R_REG_INIT_CRC_ERR_STAT),
+	regmap_reg_range(AD5529R_REG_MULTI_DAC_HOTPATH_SW_LDAC, AD5529R_MAX_REGISTER),
+};
+
+static const struct regmap_access_table ad5529r_8bit_readable_table = {
+	.yes_ranges = ad5529r_8bit_readable_ranges,
+	.n_yes_ranges = ARRAY_SIZE(ad5529r_8bit_readable_ranges),
+};
+
+static const struct regmap_access_table ad5529r_16bit_readable_table = {
+	.yes_ranges = ad5529r_16bit_readable_ranges,
+	.n_yes_ranges = ARRAY_SIZE(ad5529r_16bit_readable_ranges),
+};
+
+static const struct regmap_range ad5529r_8bit_read_only_ranges[] = {
+	regmap_reg_range(AD5529R_REG_DEVICE_CONFIG, AD5529R_REG_CHIP_GRADE),
+	regmap_reg_range(AD5529R_REG_SPI_REVISION, AD5529R_REG_VENDOR_H),
+};
+
+static const struct regmap_range ad5529r_16bit_read_only_ranges[] = {
+	regmap_reg_range(AD5529R_REG_DAC_DATA_READBACK_BASE,
+			 (AD5529R_REG_DAC_DATA_READBACK_BASE + 15 * 2)),
+	regmap_reg_range(AD5529R_REG_TSENS_ALERT_FLAG, AD5529R_REG_TSENS_SHTD_FLAG),
+	regmap_reg_range(AD5529R_REG_FUNC_BUSY, AD5529R_REG_FUNC_BUSY),
+	regmap_reg_range(AD5529R_REG_INIT_CRC_ERR_STAT, AD5529R_REG_INIT_CRC_ERR_STAT),
+};
+
+static const struct regmap_access_table ad5529r_8bit_writeable_table = {
+	.no_ranges = ad5529r_8bit_read_only_ranges,
+	.n_no_ranges = ARRAY_SIZE(ad5529r_8bit_read_only_ranges),
+};
+
+static const struct regmap_access_table ad5529r_16bit_writeable_table = {
+	.no_ranges = ad5529r_16bit_read_only_ranges,
+	.n_no_ranges = ARRAY_SIZE(ad5529r_16bit_read_only_ranges),
+};
+
+static const struct regmap_config ad5529r_regmap_8bit_config = {
+	.name = "ad5529r-8bit",
+	.reg_bits = 16,
+	.val_bits = 8,
+	.max_register = AD5529R_8BIT_REG_MAX,
+	.read_flag_mask = AD5529R_SPI_READ_FLAG,
+	.rd_table = &ad5529r_8bit_readable_table,
+	.wr_table = &ad5529r_8bit_writeable_table,
+};
+
+static const struct regmap_config ad5529r_regmap_16bit_config = {
+	.name = "ad5529r-16bit",
+	.reg_bits = 16,
+	.val_bits = 16,
+	.max_register = AD5529R_MAX_REGISTER,
+	.read_flag_mask = AD5529R_SPI_READ_FLAG,
+	.val_format_endian = REGMAP_ENDIAN_LITTLE,
+	.rd_table = &ad5529r_16bit_readable_table,
+	.wr_table = &ad5529r_16bit_writeable_table,
+	.reg_stride = 2,
+};
+
+static struct regmap *ad5529r_get_regmap(struct ad5529r_state *st,
+					 unsigned int reg)
+{
+	if (reg <= AD5529R_8BIT_REG_MAX)
+		return st->regmap_8bit;
+
+	return st->regmap_16bit;
+}
+
+static int ad5529r_reset(struct ad5529r_state *st)
+{
+	struct reset_control *rst;
+	int ret;
+
+	rst = devm_reset_control_get_optional_exclusive(&st->spi->dev, NULL);
+	if (IS_ERR(rst))
+		return PTR_ERR(rst);
+
+	if (rst) {
+		ret = reset_control_deassert(rst);
+		if (ret)
+			return ret;
+	} else {
+		ret = regmap_write(st->regmap_8bit, AD5529R_REG_INTERFACE_CONFIG_A,
+				   AD5529R_INTERFACE_CONFIG_A_SW_RESET);
+		if (ret)
+			return ret;
+	}
+
+	fsleep(10000);
+
+	return regmap_write(st->regmap_8bit, AD5529R_REG_INTERFACE_CONFIG_A,
+			    AD5529R_INTERFACE_CONFIG_A_SDO_ENABLE |
+			    AD5529R_INTERFACE_CONFIG_A_ADDR_ASCENSION);
+}
+
+static int ad5529r_read_raw(struct iio_dev *indio_dev,
+			    struct iio_chan_spec const *chan,
+			    int *val, int *val2, long mask)
+{
+	struct ad5529r_state *st = iio_priv(indio_dev);
+	unsigned int reg_addr, reg_val_h;
+	int ret, range_idx, span_mv;
+
+	switch (mask) {
+	case IIO_CHAN_INFO_RAW:
+		reg_addr = AD5529R_REG_DAC_INPUT_A(chan->channel);
+		ret = regmap_read(st->regmap_16bit, reg_addr, &reg_val_h);
+		if (ret)
+			return ret;
+
+		*val = reg_val_h;
+
+		return IIO_VAL_INT;
+	case IIO_CHAN_INFO_SCALE:
+		range_idx = st->output_range_idx[chan->channel];
+
+		span_mv = ad5529r_output_ranges_mv[range_idx][1] -
+			  ad5529r_output_ranges_mv[range_idx][0];
+		*val = span_mv;
+		*val2 = st->model_data->resolution;
+
+		return IIO_VAL_FRACTIONAL_LOG2;
+	case IIO_CHAN_INFO_OFFSET:
+		range_idx = st->output_range_idx[chan->channel];
+
+		if (ad5529r_output_ranges_mv[range_idx][0] < 0)
+			*val = -(1 << (st->model_data->resolution - 1));
+		else
+			*val = 0;
+
+		return IIO_VAL_INT;
+	default:
+		return -EINVAL;
+	}
+}
+
+static int ad5529r_write_raw(struct iio_dev *indio_dev,
+			     struct iio_chan_spec const *chan,
+			     int val, int val2, long mask)
+{
+	struct ad5529r_state *st = iio_priv(indio_dev);
+	unsigned int reg_addr;
+
+	switch (mask) {
+	case IIO_CHAN_INFO_RAW:
+		if (val < 0 || val > GENMASK(st->model_data->resolution - 1, 0))
+			return -EINVAL;
+
+		reg_addr = AD5529R_REG_DAC_INPUT_A(chan->channel);
+
+		return regmap_write(st->regmap_16bit, reg_addr, val);
+	default:
+		return -EINVAL;
+	}
+}
+
+static int ad5529r_find_output_range(const s32 *vals)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(ad5529r_output_ranges_mv); i++) {
+		if (vals[0] == ad5529r_output_ranges_mv[i][0] * 1000 &&
+		    vals[1] == ad5529r_output_ranges_mv[i][1] * 1000)
+			return i;
+	}
+
+	return -EINVAL;
+}
+
+static int ad5529r_parse_channel_ranges(struct device *dev,
+					struct ad5529r_state *st)
+{
+	int ret, ch, range_idx;
+	s32 vals[2];
+
+	device_for_each_child_node_scoped(dev, child) {
+		range_idx = AD5529R_RANGE_0V_5V;
+
+		ret = fwnode_property_read_u32(child, "reg", &ch);
+		if (ret)
+			return dev_err_probe(dev, ret,
+					     "Missing reg property in channel node\n");
+
+		if (ch >= 16)
+			return dev_err_probe(dev, -EINVAL,
+					     "Invalid channel number: %d\n", ch);
+
+		if (!fwnode_property_read_u32_array(child,
+						    "adi,output-range-microvolt",
+						    vals, 2)) {
+			range_idx = ad5529r_find_output_range(vals);
+			if (range_idx < 0)
+				return dev_err_probe(dev, range_idx,
+						     "Invalid range [%d %d] for ch %d\n",
+						     vals[0], vals[1], ch);
+		}
+
+		st->output_range_idx[ch] = range_idx;
+		ret = regmap_write(st->regmap_16bit,
+				   AD5529R_REG_OUT_RANGE(ch), range_idx);
+		if (ret)
+			return dev_err_probe(dev, ret,
+					     "Failed to configure range for ch %d\n",
+					     ch);
+	}
+
+	return 0;
+}
+
+static int ad5529r_debugfs_reg_read(struct ad5529r_state *st, unsigned int reg,
+				    unsigned int *val)
+{
+	return regmap_read(ad5529r_get_regmap(st, reg), reg, val);
+}
+
+static int ad5529r_debugfs_reg_write(struct ad5529r_state *st, unsigned int reg,
+				     unsigned int val)
+{
+	return regmap_write(ad5529r_get_regmap(st, reg), reg, val);
+}
+
+static int ad5529r_reg_access(struct iio_dev *indio_dev,
+			      unsigned int reg,
+			      unsigned int writeval,
+			      unsigned int *readval)
+{
+	struct ad5529r_state *st = iio_priv(indio_dev);
+
+	if (readval)
+		return ad5529r_debugfs_reg_read(st, reg, readval);
+
+	return ad5529r_debugfs_reg_write(st, reg, writeval);
+}
+
+static void ad5529r_disable_regulator(void *regulator)
+{
+	regulator_disable(regulator);
+}
+
+static const struct iio_info ad5529r_info = {
+	.read_raw = ad5529r_read_raw,
+	.write_raw = ad5529r_write_raw,
+	.debugfs_reg_access = ad5529r_reg_access,
+};
+
+static int ad5529r_probe(struct spi_device *spi)
+{
+	struct device *dev = &spi->dev;
+	struct iio_dev *indio_dev;
+	struct ad5529r_state *st;
+	int ret;
+
+	indio_dev = devm_iio_device_alloc(dev, sizeof(*st));
+	if (!indio_dev)
+		return -ENOMEM;
+
+	st = iio_priv(indio_dev);
+
+	st->spi = spi;
+
+	st->model_data = spi_get_device_match_data(spi);
+	if (!st->model_data)
+		return dev_err_probe(dev, -EINVAL, "Failed to identify device variant\n");
+
+	ret = devm_regulator_bulk_get_enable(dev, ARRAY_SIZE(ad5529r_supply_names),
+					     ad5529r_supply_names);
+	if (ret)
+		return dev_err_probe(dev, ret,
+				     "Failed to get and enable regulators\n");
+
+	ret = devm_regulator_get_enable_optional(dev, "hvss");
+	if (ret)
+		return dev_err_probe(dev, ret,
+				     "Failed to get and enable hvss regulator\n");
+
+	st->vref_regulator = devm_regulator_get_optional(dev, "vref");
+	if (IS_ERR(st->vref_regulator)) {
+		if (PTR_ERR(st->vref_regulator) != -ENODEV)
+			return dev_err_probe(dev, PTR_ERR(st->vref_regulator),
+					     "Failed to get vref regulator\n");
+		st->vref_regulator = NULL;
+	}
+
+	if (st->vref_regulator) {
+		ret = regulator_enable(st->vref_regulator);
+		if (ret)
+			return dev_err_probe(dev, ret,
+					     "Failed to enable vref regulator\n");
+
+		ret = devm_add_action_or_reset(dev, ad5529r_disable_regulator,
+					       st->vref_regulator);
+		if (ret)
+			return dev_err_probe(dev, ret,
+					     "Failed to add vref regulator cleanup\n");
+	}
+
+	st->regmap_8bit = devm_regmap_init_spi(spi, &ad5529r_regmap_8bit_config);
+	if (IS_ERR(st->regmap_8bit))
+		return dev_err_probe(dev, PTR_ERR(st->regmap_8bit),
+				     "Failed to initialize 8-bit regmap\n");
+
+	st->regmap_16bit = devm_regmap_init_spi(spi, &ad5529r_regmap_16bit_config);
+	if (IS_ERR(st->regmap_16bit))
+		return dev_err_probe(dev, PTR_ERR(st->regmap_16bit),
+				     "Failed to initialize 16-bit regmap\n");
+
+	ret = ad5529r_reset(st);
+	if (ret)
+		return dev_err_probe(dev, ret, "Failed to reset device\n");
+
+	ret = regmap_update_bits(st->regmap_16bit, AD5529R_REG_REF_SEL,
+				 AD5529R_REF_SEL_MASK,
+				 st->vref_regulator ? 0 : AD5529R_REF_SEL_MASK);
+	if (ret)
+		return dev_err_probe(dev, ret, "Failed to configure reference\n");
+
+	ret = ad5529r_parse_channel_ranges(dev, st);
+	if (ret)
+		return ret;
+
+	indio_dev->name = st->model_data->model_name;
+	indio_dev->info = &ad5529r_info;
+	indio_dev->modes = INDIO_DIRECT_MODE;
+	indio_dev->channels = st->model_data->channels;
+	indio_dev->num_channels = st->model_data->num_channels;
+
+	return devm_iio_device_register(dev, indio_dev);
+}
+
+static const struct of_device_id ad5529r_of_match[] = {
+	{ .compatible = "adi,ad5529r-16", .data = &ad5529r_16bit_model_data },
+	{ .compatible = "adi,ad5529r-12", .data = &ad5529r_12bit_model_data },
+	{ }
+};
+MODULE_DEVICE_TABLE(of, ad5529r_of_match);
+
+static const struct spi_device_id ad5529r_id[] = {
+	{ "ad5529r-16", .driver_data = (kernel_ulong_t)&ad5529r_16bit_model_data },
+	{ "ad5529r-12", .driver_data = (kernel_ulong_t)&ad5529r_12bit_model_data },
+	{ }
+};
+MODULE_DEVICE_TABLE(spi, ad5529r_id);
+
+static struct spi_driver ad5529r_driver = {
+	.driver = {
+		.name = "ad5529r",
+		.of_match_table = ad5529r_of_match,
+	},
+	.probe = ad5529r_probe,
+	.id_table = ad5529r_id,
+};
+module_spi_driver(ad5529r_driver);
+
+MODULE_AUTHOR("Janani Sunil <janani.sunil@analog.com>");
+MODULE_DESCRIPTION("Analog Devices AD5529R 12/16-bit DAC driver");
+MODULE_LICENSE("GPL");

-- 
2.43.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox