Linux Security Modules development
* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Paul Moore @ 2026-05-18 23:57 UTC
  To: Song Liu
  Cc: Sasha Levin, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAPhsuW5jQOzRTi1ea+=UPhx5W9bkBdivPagRE=O=nx0zf_vb8w@mail.gmail.com>

On Mon, May 18, 2026 at 7:23 PM Song Liu <song@kernel.org> wrote:
> On Mon, May 18, 2026 at 2:29 PM Paul Moore <paul@paul-moore.com> wrote:
> [...]
> > In my opinion, making killswitch an LSM is more of a procedural item
> > that deals with how we view a capability like killswitch.  I
> > personally view killswitch as somewhat similar to Lockdown, which is
> > why I made the suggestion.
> >
> > The use of kprobes, while an interesting idea, presents problems as
> > allowing any kernel symbol to be killed introduces the potential for
> > security regressions.  As a reminder, some LSMs, as well as other
> > kernel subsystems, have mechanisms in place to restrict root and/or
> > enforce one-way configuration locks; while many people equate "root"
> > with full control, in many cases today that is not strictly correct.
> >
> > Yes, kprobes have been around for some time, this is not a new
> > problem, but killswitch makes it far more convenient and accessible to
> > do dangerous things with kprobes.  If killswitch makes it past the RFC
> > stage without any significant changes to its kill mechanism, we may
> > need to start considering more liberal usage of NOKPROBE_SYMBOL()
> > which I think would be an unfortunate casualty.
>
> I don't think we can use NOKPROBE_SYMBOL(). There are functions
> that we don't want to killswitch, but still want to trace.

That was exactly my point, but we need to figure something out so
killswitch doesn't make it easier to cause a regression.

-- 
paul-moore.com


* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-18 23:39 UTC
  To: Barry Song
  Cc: Albert Esteve, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Sumit Semwal, Christian König,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CAGsJ_4xwJ7SAhKPJyRtMTw6psTO7H1EcFFpDw0po1W8PX4FE8g@mail.gmail.com>

On Mon, May 18, 2026 at 3:43 PM Barry Song <baohua@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 8:16 PM Albert Esteve <aesteve@redhat.com> wrote:
> >
> > On Sat, May 16, 2026 at 9:37 AM Barry Song <baohua@kernel.org> wrote:
> > >
> > > On Tue, May 12, 2026 at 5:18 PM Albert Esteve <aesteve@redhat.com> wrote:
> > > >
> > > > On embedded platforms a central process often allocates dma-buf
> > > > memory on behalf of client applications. Without a way to
> > > > attribute the charge to the requesting client's cgroup, the
> > > > cost lands on the allocator, making per-cgroup memory limits
> > > > ineffective for the actual consumers.
> > > >
> > > > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > > > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > > > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > > > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > > > the mem_accounting module parameter enabled, the buffer is charged
> > > > to the allocator's own cgroup.
> > > >
> > > > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > > > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > > > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > > > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > > > all accounting through a single MEMCG_DMABUF path.
> > > >
> > > [...]
> > >
> > > > -               if (mem_accounting)
> > > > -                       flags |= __GFP_ACCOUNT;
> > >
> > > Hi Albert,
> > >
> > > would it be better to move this and its description to patch 1? It
> > > looks like patch 1 already introduces the double accounting changes,
> > > and patch 2 is mainly just supporting remote charging.
> >
> > Hi Barry,
> >
> > Thanks for looking into this series! Yes, in my head I was trying to
> > keep patch 1, which was taken from a previous, different series, and
> > then diverge from it starting with patch 2. This would clarify the
> > difference between the two. But I can see it just added some confusion
> > (for example, patch 1 charges on dma_buf_export() and then it is moved
> > to dma_heap_buffer_alloc() in patch 2). I will reorganize it better
> > for the next version, including your suggestion.
>
> Yep, I understand the situation now. I also understand
> that you were referring to T.J.'s patch, which caused
> some back-and-forth confusion for readers when reading
> patches 1 and 2.

Albert, please don't feel obligated to keep my patch intact if
integrating it into other patches simplifies the series.

> > > Also, mem_accounting is only used by system_heap.c; has this patchset
> > > also eliminated its need?
> >
> > No, mem_accounting is still handled in this patch for the general case
> > where no `charge_pid_fd` is used. See dma_heap_buffer_alloc() code:
> >
> > +       if (memcg)
> > +               css_get(&memcg->css);
> > +       else if (mem_accounting)
> > +               memcg = get_mem_cgroup_from_mm(current->mm);
>
> I see. What feels a bit odd to me is that mem_accounting
> could either be dropped (with unconditional charging), or
> it should cover both remote and local charge cases.
>
> I don’t have a strong opinion here—it just feels a bit
> strange, since its description is quite generic for memcg:
>
> "Enable cgroup-based memory accounting for dma-buf heap
> allocations (default=false)."
>
> Best Regards
> Barry


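The fallback quoted above from dma_heap_buffer_alloc() can be modeled in plain userspace C. This is only a sketch of the selection order as it reads in the quoted hunk, not the kernel code itself: struct cgroup_model and pick_charge_target() are hypothetical names introduced for illustration, and the real patch works on struct mem_cgroup with css reference counting.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; only identity matters here. */
struct cgroup_model {
	const char *name;
};

/*
 * Models the selection order in the quoted hunk:
 *   1. a target resolved from charge_pid_fd wins;
 *   2. otherwise, with mem_accounting set, the allocator's own cgroup;
 *   3. otherwise the buffer is not charged (NULL).
 */
static const struct cgroup_model *
pick_charge_target(const struct cgroup_model *from_pidfd,
		   const struct cgroup_model *allocator,
		   bool mem_accounting)
{
	if (from_pidfd)
		return from_pidfd;
	if (mem_accounting)
		return allocator;
	return NULL;
}
```

Read this way, mem_accounting gates only the local case, while a remote target is charged regardless of the parameter. That appears to be the asymmetry Barry describes as odd.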
* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-18 23:39 UTC
  To: Barry Song
  Cc: Christian König, Albert Esteve, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CAGsJ_4y=Gsv=FSUjJ5+99Gg6ULUnv0LRexCGOGetzChR3YA44Q@mail.gmail.com>

On Mon, May 18, 2026 at 3:19 PM Barry Song <baohua@kernel.org> wrote:
>
> On Tue, May 19, 2026 at 5:17 AM T.J. Mercier <tjmercier@google.com> wrote:
> [...]
> > > > > Yeah I think this might work. I know of 3 cases, and it trivially
> > > > > solves the first two. The third requires some work on our end to
> > > > > extend our userspace interfaces to include the pidfd but it seems
> > > > > doable. I'm checking with our graphics folks.
> > > > >
> > > > > 1) Direct allocation from user (e.g. app -> allocation ioctl on
> > > > > /dev/dma_heap/foo)
> > > > > No changes required to userspace. mem_accounting=1 charges the app.
> > > > >
> > > > > 2) Single hop remote allocation (e.g. app -> AHardwareBuffer_allocate
> > > > > -> gralloc)
> > > > > gralloc has the caller's pid as described in the commit message. Open
> > > > > a pidfd and pass it in the dma_heap_allocation_data.
> > > > >
> > > > > 3) Double hop remote allocation (e.g. app -> dequeueBuffer ->
> > > > > SurfaceFlinger -> gralloc)
> > > > > In this case gralloc knows SurfaceFlinger's pid, but not the app's. So
> > > > > we need to add the app's pidfd to the SurfaceFlinger -> gralloc
> > > > > interface, or transfer the memcg charge from SurfaceFlinger to the app
> > > > > after the allocation.
> > > > > It'd be nice to avoid the charge transfer option entirely, but if we
> > > > > need it that doesn't seem so bad in this case because it's a bulk
> > > > > charge for the entire dmabuf rather than per-page. So the exporter
> > > > > doesn't need to get involved (we wouldn't need a new dma_buf_op) and
> > > > > we wouldn't have to worry about looping and locking for each page.
> > > > >
> > > >
> > > > Hi T.J.,
> > > >
> > > > Your description of the three different cases sounds very interesting.
> > > > It helps me understand how difficult it can be to correctly charge
> > > > dma-buf in the current user scenarios.
> > > >
> > > > I’m wondering where I can find Android userspace code that transfers
> > > > the PID of RPC callers. Do we have any existing sample code in Android
> > > > for this?
> > >
> > > Hi Barry,
> > >
> > > In Java android.os.Binder.getCallingPid() will provide it. Here
> >
> > ... let me try again
> >
> > Here are some examples from the framework code:
> >
> > https://cs.android.com/search?q=getCallingPid%20f:ActivityManager&sq=&ss=android%2Fplatform%2Fsuperproject
> >
> > In native code we have AIBinder_getCallingPid and
> > android::IPCThreadState::self()->getCallingPid() (or
> > android::hardware::IPCThreadState::self()->getCallingPid() for HIDL)
> >
> > https://cs.android.com/search?q=getCallingPid%20l:cpp%20-f:prebuilt&ss=android%2Fplatform%2Fsuperproject
>
> Thanks very much, T.J. That is very helpful. I guess
> that would require user space to understand the RPC
> procedure, including single-hop and two-hop cases, and
> make the corresponding changes.

Yes, this is solvable by having a policy in allocator services where
the caller is implicitly charged, while also supporting cases where
the RPC includes additional explicit information about who to charge.
This needs security checks to prevent arbitrary remote charges at both
the ioctl() level (selinux charge_to from patch 4), and at the RPC
level (not sure yet but maybe a private interface between system
components and gralloc), so that only privileged components can
initiate remote charges.

> You pointed out the SurfaceFlinger cases, which are
> two hops. It seems that AI models are also using
> dma_heap, at least from what I have observed on MTK
> and Qualcomm phones. Likely, we need to understand
> those RPC relationships in userspace and make the
> corresponding changes.
> I assume AI models are a single-hop case?

It's currently a mix because AI model loading is largely controlled by
vendor code right now. Some implementations use
AHardwareBuffer_allocate, but that comes with unnecessary RPC overhead
for the AI use case. So I think we should be trending towards direct
allocations from dma-buf heaps because model loading time is
important.


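The allocator-side policy T.J. outlines above (charge the caller implicitly, accept an explicit target only from privileged components) could be sketched roughly as follows. Everything here is an assumption made for illustration: struct alloc_request, its fields, and resolve_charge_pid() are hypothetical, and the thread leaves open how privilege would actually be established at the RPC level.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical request as seen by an allocator service such as gralloc. */
struct alloc_request {
	pid_t caller_pid;        /* pid of the direct RPC caller */
	pid_t charge_pid;        /* explicit pid to charge, or 0 if absent */
	bool caller_privileged;  /* caller may request remote charges */
};

/*
 * Returns the pid to charge, or -1 if the request must be rejected.
 * The caller is charged implicitly; an explicit target is honoured
 * only when the caller is privileged.
 */
static pid_t resolve_charge_pid(const struct alloc_request *req)
{
	if (req->charge_pid == 0 || req->charge_pid == req->caller_pid)
		return req->caller_pid;
	if (!req->caller_privileged)
		return -1;
	return req->charge_pid;
}
```

In the single-hop case the request carries no explicit target and the app is charged. In the double-hop case a privileged intermediary such as SurfaceFlinger would pass the app's pid along, and the service would then open a pidfd for the returned pid before calling the allocation ioctl.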
* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-18 23:39 UTC
  To: Christian König
  Cc: Albert Esteve, Christian Brauner, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
	Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <88efe10a-8b93-4a81-8279-4a5559d0f17c@amd.com>

On Mon, May 18, 2026 at 7:07 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/18/26 14:50, Albert Esteve wrote:
> > On Mon, May 18, 2026 at 9:20 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >> On 5/15/26 19:06, T.J. Mercier wrote:
> >>> On Fri, May 15, 2026 at 6:53 AM Christian Brauner <brauner@kernel.org> wrote:
> >>>>
> >>>> On Tue, May 12, 2026 at 11:10:44AM +0200, Albert Esteve wrote:
> >>>>> On embedded platforms a central process often allocates dma-buf
> >>>>> memory on behalf of client applications. Without a way to
> >>>>> attribute the charge to the requesting client's cgroup, the
> >>>>> cost lands on the allocator, making per-cgroup memory limits
> >>>>> ineffective for the actual consumers.
> >>>>>
> >>>>> Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> >>>>
> >>>> Please be aware that pidfds come in two flavors:
> >>>>
> >>>> thread-group pidfds and thread-specific pidfds. Make sure that your API
> >>>> doesn't implicitly depend on this distinction not existing.
> >>>
> >>> Hi Christian,
> >>>
> >>> Memcg is not a controller that supports "thread mode" so all threads
> >>> in a group should belong to the same memcg.
> >>
> >> BTW: Exactly that is the requirement automotive has with their native context use case.
> >>
> >> The use case is that you have a daemon which has multiple threads, where each one is acting on behalf of some other process.
> >>
> >> At the moment we basically say they are simply not using cgroups for that use case, but it would be really nice if we could handle that as well.
> >>
> >> Summarizing the requirement of that use case: You need a different cgroup for each thread of a process.
> >
> > Hi Christian,
> >
> > Thanks for sharing this automotive use case. If I understand correctly,
> > the actual requirement is attributing dma-buf charges to the right
> > client, not putting each daemon thread in a different cgroup?
>
> Nope, exactly that's the difference.
>
> The thread acts as a filtering agent for both memory allocation and command submission for somebody else. The process on whose behalf the daemon acts can even be in a client VM, completely remote over some network, or even something like a microcontroller.
>
> Everything the thread does (CPU time, GPU driver memory allocation, and resources such as GPU processing and I/O time) needs to be accounted to one client, which can be different for each thread of the process.
>
> The only thing which is shared with the main process thread is CPU memory resources, e.g. malloc() because that is basically just needed for housekeeping and pretty much irrelevant for this kind of use case.
>
> The problem is that you can't do that with cgroups at the moment, and unfortunately only the kernel has the information needed to do it.
>
> So you end up defining tons of interfaces just to get the necessary information from the kernel into userspace, and then essentially duplicating in userspace the infrastructure that cgroups already provide in the kernel.
>
> > If so,
> > the `charge_pid_fd` approach achieves this directly by passing the
> > client's `pid_fd`, without needing to add per-thread cgroup
> > infrastructure.
>
> Well, it's already a massive improvement: we could basically stop doing the whole duplication part for the GPU driver stack and just use cgroups for this part.
>
> Doing that automatically for CPU and I/O time would just be nice to have additionally.
>
> Regards,
> Christian.

Hopefully I'm following correctly here.... So you are duplicating the
GPU driver stack to achieve remote accounting on a per-thread basis?
Does this mean for GPU allocations you currently have some GFP_ACCOUNT
magic in your driver to attribute GPU memory to the correct remote
client? So this series would close the gap for dma-buf allocations,
but what about private GPU driver memory allocated on behalf of a
client?


* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Song Liu @ 2026-05-18 23:22 UTC
  To: Paul Moore
  Cc: Sasha Levin, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAHC9VhS1DJNs9gDB6gD9WKhL08giSVajBskZ+=mY0AWRCAsw7Q@mail.gmail.com>

On Mon, May 18, 2026 at 2:29 PM Paul Moore <paul@paul-moore.com> wrote:
[...]
> In my opinion, making killswitch an LSM is more of a procedural item
> that deals with how we view a capability like killswitch.  I
> personally view killswitch as somewhat similar to Lockdown, which is
> why I made the suggestion.
>
> The use of kprobes, while an interesting idea, presents problems as
> allowing any kernel symbol to be killed introduces the potential for
> security regressions.  As a reminder, some LSMs, as well as other
> kernel subsystems, have mechanisms in place to restrict root and/or
> enforce one-way configuration locks; while many people equate "root"
> with full control, in many cases today that is not strictly correct.
>
> Yes, kprobes have been around for some time, this is not a new
> problem, but killswitch makes it far more convenient and accessible to
> do dangerous things with kprobes.  If killswitch makes it past the RFC
> stage without any significant changes to its kill mechanism, we may
> need to start considering more liberal usage of NOKPROBE_SYMBOL()
> which I think would be an unfortunate casualty.

I don't think we can use NOKPROBE_SYMBOL(). There are functions
that we don't want to killswitch, but still want to trace.

Thanks,
Song


* Re: [Linaro-mm-sig] Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Barry Song @ 2026-05-18 23:00 UTC
  To: Christian König
  Cc: T.J. Mercier, Albert Esteve, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <cb84c2ee-9de1-4565-b2e0-60984721228f@amd.com>

On Mon, May 18, 2026 at 3:34 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/16/26 11:19, Barry Song wrote:
> > On Thu, May 14, 2026 at 12:35 AM T.J. Mercier <tjmercier@google.com> wrote:
> > [...]
> >>>> I have a question about this part. Albert I guess you are interested
> >>>> only in accounting dmabuf-heap allocations, or do you expect to add
> >>>> __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other
> >>>> non-dmabuf-heap exporters?
> >>>
> >>> We're scoping this to dma-buf heaps for now. CMA heaps and the dmem
> >>> controller are on the radar for follow-up/parallel work (there will be
> >>> dragons and will surely need discussion). For DRM and V4L2 the
> >>> long-term intent is migration to heaps, which would make direct
> >>> accounting on those paths unnecessary.
> >>
> >> Ah I see. GEM buffers exported to dmabufs are what I had in mind. I
> >> guess this would only leave the odd non-DRM driver with the need to
> >> add their own accounting calls, which I don't expect would be a big
> >> problem.
> >>
> >
> > sounds like we still have a long way to go to correctly account for
> > various v4l2, drm, GEM, CMA, etc. In patch 1, the charging is done in
> > dma_buf_export(), so I guess it covers all dma-buf types except
> > dma_heap, but the problem is that it has no remote charging support at
> > all?
>
> No, just the other way around.
>
> DMA-buf heaps can be handled here because we know that they are pure system memory and nothing special, so memcg always applies.
>
> dma_buf_export(), on the other hand, handles tons of different use cases, ranging from buffers accounted to dmem, through special resources which aren't even memory, all the way to buffers which can migrate from dmem to memcg and back during their lifetime.
>

Hi Christian,

Thanks very much for your explanation. So basically it seems that
dma_buf_export() is not the proper place to charge, since it may end up
mixing in non-system-memory accounting?

My question is also about the global view for both heap and non-heap cases.
After reading the discussion, I’ve tried to summarize it—please let me know
if my understanding is correct.

For dma_heap, we have the ioctl DMA_HEAP_IOCTL_ALLOC, where users can pass a
remote pidfd or similar information to indicate where the dma-buf should be
charged, as in Albert's patchset.

For non-dma_heap dma-bufs, we don’t have an obvious userspace entry point that
triggers the allocation. So we likely need other approaches. We could either
move more drivers over to dma-heap, or introduce something like
DMA_BUF_IOCTL_XFER_CHARGE, as you are discussing, to let userspace explicitly
declare a charge.

Best Regards
Barry


* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Barry Song @ 2026-05-18 22:43 UTC
  To: Albert Esteve
  Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CADSE00LjJcL8P5M-UPEpzZijU70uEmUirnin29N8YR5W5D-oFg@mail.gmail.com>

On Mon, May 18, 2026 at 8:16 PM Albert Esteve <aesteve@redhat.com> wrote:
>
> On Sat, May 16, 2026 at 9:37 AM Barry Song <baohua@kernel.org> wrote:
> >
> > On Tue, May 12, 2026 at 5:18 PM Albert Esteve <aesteve@redhat.com> wrote:
> > >
> > > On embedded platforms a central process often allocates dma-buf
> > > memory on behalf of client applications. Without a way to
> > > attribute the charge to the requesting client's cgroup, the
> > > cost lands on the allocator, making per-cgroup memory limits
> > > ineffective for the actual consumers.
> > >
> > > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > > the mem_accounting module parameter enabled, the buffer is charged
> > > to the allocator's own cgroup.
> > >
> > > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > > all accounting through a single MEMCG_DMABUF path.
> > >
> > [...]
> >
> > > -               if (mem_accounting)
> > > -                       flags |= __GFP_ACCOUNT;
> >
> > Hi Albert,
> >
> > would it be better to move this and its description to patch 1? It
> > looks like patch 1 already introduces the double accounting changes,
> > and patch 2 is mainly just supporting remote charging.
>
> Hi Barry,
>
> Thanks for looking into this series! Yes, in my head I was trying to
> keep patch 1, which was taken from a previous, different series, and
> then diverge from it starting with patch 2. This would clarify the
> difference between the two. But I can see it just added some confusion
> (for example, patch 1 charges on dma_buf_export() and then it is moved
> to dma_heap_buffer_alloc() in patch 2). I will reorganize it better
> for the next version, including your suggestion.

Yep, I understand the situation now. I also understand
that you were referring to T.J.'s patch, which caused
some back-and-forth confusion for readers when reading
patches 1 and 2.

>
> >
> > Also, mem_accounting is only used by system_heap.c; has this patchset
> > also eliminated its need?
>
> No, mem_accounting is still handled in this patch for the general case
> where no `charge_pid_fd` is used. See dma_heap_buffer_alloc() code:
>
> +       if (memcg)
> +               css_get(&memcg->css);
> +       else if (mem_accounting)
> +               memcg = get_mem_cgroup_from_mm(current->mm);

I see. What feels a bit odd to me is that mem_accounting
could either be dropped (with unconditional charging), or
it should cover both remote and local charge cases.

I don’t have a strong opinion here—it just feels a bit
strange, since its description is quite generic for memcg:

"Enable cgroup-based memory accounting for dma-buf heap
allocations (default=false)."

Best Regards
Barry


* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Barry Song @ 2026-05-18 22:19 UTC
  To: T.J. Mercier
  Cc: Christian König, Albert Esteve, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CABdmKX3wwgovwS-V8rVC3=+EZcTvPs_cttpQb1w6WemwLAVhsw@mail.gmail.com>

On Tue, May 19, 2026 at 5:17 AM T.J. Mercier <tjmercier@google.com> wrote:
[...]
> > > > Yeah I think this might work. I know of 3 cases, and it trivially
> > > > solves the first two. The third requires some work on our end to
> > > > extend our userspace interfaces to include the pidfd but it seems
> > > > doable. I'm checking with our graphics folks.
> > > >
> > > > 1) Direct allocation from user (e.g. app -> allocation ioctl on
> > > > /dev/dma_heap/foo)
> > > > No changes required to userspace. mem_accounting=1 charges the app.
> > > >
> > > > 2) Single hop remote allocation (e.g. app -> AHardwareBuffer_allocate
> > > > -> gralloc)
> > > > gralloc has the caller's pid as described in the commit message. Open
> > > > a pidfd and pass it in the dma_heap_allocation_data.
> > > >
> > > > 3) Double hop remote allocation (e.g. app -> dequeueBuffer ->
> > > > SurfaceFlinger -> gralloc)
> > > > In this case gralloc knows SurfaceFlinger's pid, but not the app's. So
> > > > we need to add the app's pidfd to the SurfaceFlinger -> gralloc
> > > > interface, or transfer the memcg charge from SurfaceFlinger to the app
> > > > after the allocation.
> > > > It'd be nice to avoid the charge transfer option entirely, but if we
> > > > need it that doesn't seem so bad in this case because it's a bulk
> > > > charge for the entire dmabuf rather than per-page. So the exporter
> > > > doesn't need to get involved (we wouldn't need a new dma_buf_op) and
> > > > we wouldn't have to worry about looping and locking for each page.
> > > >
> > >
> > > Hi T.J.,
> > >
> > > Your description of the three different cases sounds very interesting.
> > > It helps me understand how difficult it can be to correctly charge
> > > dma-buf in the current user scenarios.
> > >
> > > I’m wondering where I can find Android userspace code that transfers
> > > the PID of RPC callers. Do we have any existing sample code in Android
> > > for this?
> >
> > Hi Barry,
> >
> > In Java android.os.Binder.getCallingPid() will provide it. Here
>
> ... let me try again
>
> Here are some examples from the framework code:
>
> https://cs.android.com/search?q=getCallingPid%20f:ActivityManager&sq=&ss=android%2Fplatform%2Fsuperproject
>
> In native code we have AIBinder_getCallingPid and
> android::IPCThreadState::self()->getCallingPid() (or
> android::hardware::IPCThreadState::self()->getCallingPid() for HIDL)
>
> https://cs.android.com/search?q=getCallingPid%20l:cpp%20-f:prebuilt&ss=android%2Fplatform%2Fsuperproject

Thanks very much, T.J. That is very helpful. I guess
that would require user space to understand the RPC
procedure, including single-hop and two-hop cases, and
make the corresponding changes.

You pointed out the SurfaceFlinger cases, which are
two hops. It seems that AI models are also using
dma_heap, at least from what I have observed on MTK
and Qualcomm phones. Likely, we need to understand
those RPC relationships in userspace and make the
corresponding changes.
I assume AI models are a single-hop case?

Best Regards
Barry


* Re: [PATCH v5 00/14] module: Introduce hash-based integrity checking
From: Sami Tolvanen @ 2026-05-18 21:55 UTC
  To: Thomas Weißschuh
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Nathan Chancellor,
	Nicolas Schier, Arnd Bergmann, Luis Chamberlain, Petr Pavlu,
	Daniel Gomez, Paul Moore, James Morris, Serge E. Hallyn,
	Jonathan Corbet, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Naveen N Rao, Mimi Zohar, Roberto Sassu,
	Dmitry Kasatkin, Eric Snowberg, Nicolas Schier, Daniel Gomez,
	Aaron Tomlin, Christophe Leroy (CS GROUP), Nicolas Bouchinet,
	Xiu Jianfeng, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jiri Olsa, bpf, Fabian Grünbichler, Arnout Engelen,
	Mattia Rizzolo, kpcyrd, Christian Heusel, Câju Mihai-Drosi,
	Eric Biggers, Sebastian Andrzej Siewior, linux-kbuild,
	linux-kernel, linux-arch, linux-modules, linux-security-module,
	linux-doc, linuxppc-dev, linux-integrity, debian-kernel
In-Reply-To: <20260505-module-hashes-v5-0-e174a5a49fce@weissschuh.net>

Hi Thomas,

On Tue, May 05, 2026 at 11:05:04AM +0200, Thomas Weißschuh wrote:
> The current signature-based module integrity checking has some drawbacks
> in combination with reproducible builds. Either the module signing key
> is generated at build time, which makes the build unreproducible, or a
> static signing key is used, which precludes rebuilds by third parties
> and makes the whole build and packaging process much more complicated.
> 
> The goal is to reach bit-for-bit reproducibility. Excluding certain
> parts of the build output from the reproducibility analysis would be
> error-prone and force each downstream consumer to introduce new tooling.
> 
> Introduce a new mechanism to ensure only well-known modules are loaded
> by embedding a merkle tree root of all modules built as part of the full
> kernel build into vmlinux.

I noticed Sashiko had a few concerns about the build changes. Would you
mind taking a look to see if they're valid?

https://sashiko.dev/#/patchset/20260505-module-hashes-v5-0-e174a5a49fce%40weissschuh.net

Sami


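For readers unfamiliar with the idea in the quoted cover letter, a Merkle root over a set of module images can be sketched as below. This is a toy under stated assumptions and not the series' implementation: toy_hash() is FNV-1a standing in for a real cryptographic hash, the promotion rule for unpaired nodes is one common choice among several, and the names and MAX_LEAVES limit are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a, used ONLY as a stand-in; it is not a cryptographic hash. */
static uint64_t toy_hash(const void *data, size_t len, uint64_t seed)
{
	const unsigned char *p = data;
	uint64_t h = 0xcbf29ce484222325ULL ^ seed;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* Parent node: hash of the two child hashes, in order. */
static uint64_t hash_pair(uint64_t left, uint64_t right)
{
	uint64_t pair[2] = { left, right };

	return toy_hash(pair, sizeof(pair), 1);
}

#define MAX_LEAVES 64

/*
 * Computes a Merkle root over n module images (1 <= n <= MAX_LEAVES),
 * each given here as a NUL-terminated string for simplicity.
 * An unpaired node at the end of a level is promoted unchanged.
 * Returns 0 when n is out of range.
 */
static uint64_t merkle_root(const char *const modules[], size_t n)
{
	uint64_t level[MAX_LEAVES];

	if (n == 0 || n > MAX_LEAVES)
		return 0;
	for (size_t i = 0; i < n; i++)
		level[i] = toy_hash(modules[i], strlen(modules[i]), 0);
	while (n > 1) {
		size_t out = 0;

		for (size_t i = 0; i + 1 < n; i += 2)
			level[out++] = hash_pair(level[i], level[i + 1]);
		if (n % 2)
			level[out++] = level[n - 1];
		n = out;
	}
	return level[0];
}
```

Because the root depends on every leaf, embedding only that one value in vmlinux is enough to detect a change to any module, and no signing key is involved, which is the property the cover letter relies on for reproducible builds.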
* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Paul Moore @ 2026-05-18 21:29 UTC
  To: Song Liu
  Cc: Sasha Levin, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAPhsuW4TJRqQKXgcBYog8YgFLU2h2Zq9ReahxTYp_zpDyvO8AA@mail.gmail.com>

On Mon, May 18, 2026 at 2:31 AM Song Liu <song@kernel.org> wrote:
> On Thu, May 14, 2026 at 8:48 PM Paul Moore <paul@paul-moore.com> wrote:
> > On Thu, May 7, 2026 at 3:05 AM Sasha Levin <sashal@kernel.org> wrote:
> > >
> > > When a (security) issue goes public, fleets stay exposed until a patched kernel
> > > is built, distributed, and rebooted into.
> > >
> > > For many such issues the simplest mitigation is to stop calling the buggy
> > > function. Killswitch provides that. An admin writes:
> > >
> > >     echo "engage af_alg_sendmsg -1" \
> > >         > /sys/kernel/security/killswitch/control
> > >
> > > After this, af_alg_sendmsg() returns -EPERM on every call without
> > > running its body. The mitigation takes effect immediately, and is dropped on
> > > the next reboot.
> > >
> > > A lot of recent kernel issues sit in code paths most installs only have enabled
> > > to support a relative minority of users: AF_ALG, ksmbd, nf_tables, vsock, ax25,
> > > and friends.
> > >
> > > For most users, the cost of "this socket family stops working for the day" is
> > > much smaller than the cost of running a known vulnerable kernel until the fix
> > > > lands.
> > >
> > > Assisted-by: Claude:claude-opus-4-7
> > > Signed-off-by: Sasha Levin <sashal@kernel.org>
> > > ---
> > >  Documentation/admin-guide/index.rst           |   1 +
> > >  Documentation/admin-guide/killswitch.rst      | 159 ++++
> > >  Documentation/admin-guide/tainted-kernels.rst |   8 +
> > >  MAINTAINERS                                   |  11 +
> > >  include/linux/killswitch.h                    |  19 +
> > >  include/linux/panic.h                         |   3 +-
> > >  init/Kconfig                                  |   2 +
> > >  kernel/Kconfig.killswitch                     |  31 +
> > >  kernel/Makefile                               |   1 +
> > >  kernel/killswitch.c                           | 798 ++++++++++++++++++
> > >  kernel/panic.c                                |   1 +
> > >  lib/Kconfig.debug                             |  13 +
> > >  lib/Makefile                                  |   1 +
> > >  lib/test_killswitch.c                         |  85 ++
> > >  tools/testing/selftests/Makefile              |   1 +
> > >  tools/testing/selftests/killswitch/.gitignore |   1 +
> > >  tools/testing/selftests/killswitch/Makefile   |   8 +
> > >  .../selftests/killswitch/cve_31431_test.c     | 162 ++++
> > >  .../selftests/killswitch/killswitch_test.sh   | 147 ++++
> > >  19 files changed, 1451 insertions(+), 1 deletion(-)
> > >  create mode 100644 Documentation/admin-guide/killswitch.rst
> > >  create mode 100644 include/linux/killswitch.h
> > >  create mode 100644 kernel/Kconfig.killswitch
> > >  create mode 100644 kernel/killswitch.c
> > >  create mode 100644 lib/test_killswitch.c
> > >  create mode 100644 tools/testing/selftests/killswitch/.gitignore
> > >  create mode 100644 tools/testing/selftests/killswitch/Makefile
> > >  create mode 100644 tools/testing/selftests/killswitch/cve_31431_test.c
> > >  create mode 100755 tools/testing/selftests/killswitch/killswitch_test.sh
> >
> > If we made Lockdown an LSM, we should probably also make killswitch an LSM.
>
> I don't think killswitch can stack with other LSMs. In fact, killswitch
> can be used to bypass other LSMs, for example:
>
> echo engage security_file_open 0 > /sys/kernel/security/killswitch/control
>
> will bypass all hooks on security_file_open.

From my perspective there are two different issues here: should
killswitch be an LSM, and should killswitch leverage kprobes to be able
to "kill" security-related symbols.  After all, are we okay with
killswitch killing capable() and friends?

In my opinion, making killswitch an LSM is more of a procedural item
that deals with how we view a capability like killswitch.  I
personally view killswitch as somewhat similar to Lockdown, which is
why I made the suggestion.

The use of kprobes, while an interesting idea, presents problems as
allowing any kernel symbol to be killed introduces the potential for
security regressions.  As a reminder, some LSMs, as well as other
kernel subsystems, have mechanisms in place to restrict root and/or
enforce one-way configuration locks; while many people equate "root"
with full control, in many cases today that is not strictly correct.

Yes, kprobes have been around for some time, this is not a new
problem, but killswitch makes it far more convenient and accessible to
do dangerous things with kprobes.  If killswitch makes it past the RFC
stage without any significant changes to its kill mechanism, we may
need to start considering more liberal usage of NOKPROBE_SYMBOL()
which I think would be an unfortunate casualty.

-- 
paul-moore.com

^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-18 21:17 UTC (permalink / raw)
  To: Barry Song
  Cc: Christian König, Albert Esteve, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CABdmKX0gqg309hcXcOHSj_yTg0h1zwDL34GDk8mX3wp4YoyfDg@mail.gmail.com>

On Mon, May 18, 2026 at 2:12 PM T.J. Mercier <tjmercier@google.com> wrote:
>
> On Sat, May 16, 2026 at 1:40 AM Barry Song <baohua@kernel.org> wrote:
> >
> > On Wed, May 13, 2026 at 2:54 AM T.J. Mercier <tjmercier@google.com> wrote:
> > >
> > > On Tue, May 12, 2026 at 3:14 AM Christian König
> > > <christian.koenig@amd.com> wrote:
> > > >
> > > > On 5/12/26 11:10, Albert Esteve wrote:
> > > > > On embedded platforms a central process often allocates dma-buf
> > > > > memory on behalf of client applications. Without a way to
> > > > > attribute the charge to the requesting client's cgroup, the
> > > > > cost lands on the allocator, making per-cgroup memory limits
> > > > > ineffective for the actual consumers.
> > > > >
> > > > > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > > > > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > > > > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > > > > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > > > > the mem_accounting module parameter enabled, the buffer is charged
> > > > > to the allocator's own cgroup.
> > > > >
> > > > > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > > > > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > > > > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > > > > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > > > > all accounting through a single MEMCG_DMABUF path.
> > > > >
> > > > > Usage examples:
> > > > >
> > > > >   1. Central allocator charging to a client at allocation time.
> > > > >      The allocator knows the client's PID (e.g., from binder's
> > > > >      sender_pid) and uses pidfd to attribute the charge:
> > > > >
> > > > >        pid_t client_pid = txn->sender_pid;
> > > > >        int pidfd = pidfd_open(client_pid, 0);
> > > > >
> > > > >        struct dma_heap_allocation_data alloc = {
> > > > >            .len             = buffer_size,
> > > > >            .fd_flags        = O_RDWR | O_CLOEXEC,
> > > > >            .charge_pid_fd   = pidfd,
> > > > >        };
> > > > >        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> > > > >        close(pidfd);
> > > > >        /* alloc.fd is now charged to client's cgroup */
> > > > >
> > > > >   2. Default allocation (no pidfd, mem_accounting=1).
> > > > >      When charge_pid_fd is not set and the mem_accounting module
> > > > >      parameter is enabled, the buffer is charged to the allocator's
> > > > >      own cgroup:
> > > > >
> > > > >        struct dma_heap_allocation_data alloc = {
> > > > >            .len      = buffer_size,
> > > > >            .fd_flags = O_RDWR | O_CLOEXEC,
> > > > >        };
> > > > >        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> > > > >        /* charged to current process's cgroup */
> > > > >
> > > > > Current limitations:
> > > > >
> > > > >  - Single-owner model: a dma-buf carries one memcg charge regardless of
> > > > >    how many processes share it. Means only the first owner (and exporter)
> > > > >    of the shared buffer bears the charge.
> > > > >  - Only memcg accounting supported. While this makes sense for system
> > > > >    heap buffers, other heaps (e.g., CMA heaps) will require selectively
> > > > >    charging also for the dmem controller.
> > > >
> > > > > Well, that doesn't look so bad; it at least seems to tackle the problem at hand for Android and some other embedded use cases.
> > >
> > > Yeah I think this might work. I know of 3 cases, and it trivially
> > > solves the first two. The third requires some work on our end to
> > > extend our userspace interfaces to include the pidfd but it seems
> > > doable. I'm checking with our graphics folks.
> > >
> > > 1) Direct allocation from user (e.g. app -> allocation ioctl on
> > > /dev/dma_heap/foo)
> > > No changes required to userspace. mem_accounting=1 charges the app.
> > >
> > > 2) Single hop remote allocation (e.g. app -> AHardwareBuffer_allocate
> > > -> gralloc)
> > > gralloc has the caller's pid as described in the commit message. Open
> > > a pidfd and pass it in the dma_heap_allocation_data.
> > >
> > > 3) Double hop remote allocation (e.g. app -> dequeueBuffer ->
> > > SurfaceFlinger -> gralloc)
> > > In this case gralloc knows SurfaceFlinger's pid, but not the app's. So
> > > we need to add the app's pidfd to the SurfaceFlinger -> gralloc
> > > interface, or transfer the memcg charge from SurfaceFlinger to the app
> > > after the allocation.
> > > It'd be nice to avoid the charge transfer option entirely, but if we
> > > need it that doesn't seem so bad in this case because it's a bulk
> > > charge for the entire dmabuf rather than per-page. So the exporter
> > > doesn't need to get involved (we wouldn't need a new dma_buf_op) and
> > > we wouldn't have to worry about looping and locking for each page.
> > >
> >
> > Hi T.J.,
> >
> > Your description of the three different cases sounds very interesting.
> > It helps me understand how difficult it can be to correctly charge
> > dma-buf in the current user scenarios.
> >
> > I’m wondering where I can find Android userspace code that transfers
> > the PID of RPC callers. Do we have any existing sample code in Android
> > for this?
>
> Hi Barry,
>
> In Java android.os.Binder.getCallingPid() will provide it. Here

... let me try again

Here are some examples from the framework code:

https://cs.android.com/search?q=getCallingPid%20f:ActivityManager&sq=&ss=android%2Fplatform%2Fsuperproject

In native code we have AIBinder_getCallingPid and
android::IPCThreadState::self()->getCallingPid() (or
android::hardware::IPCThreadState::self()->getCallingPid() for HIDL)

https://cs.android.com/search?q=getCallingPid%20l:cpp%20-f:prebuilt&ss=android%2Fplatform%2Fsuperproject
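
For the single-hop case, the step from "I have the caller's PID" to "I have a pidfd to pass along" could look roughly like the sketch below. This is only an illustration: open_caller_pidfd() is a hypothetical helper name, not something from the patch or from Android, and it stops at obtaining the pidfd because charge_pid_fd exists only in this RFC and not in the mainline uapi header. Note also that a PID can be recycled between the binder call and pidfd_open(), so real code would have to think about that window.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch or from Android): turn the PID
 * reported by binder into a pidfd that refers to the caller's process.
 * Returns the pidfd on success, or -1 with errno set on failure.
 */
static int open_caller_pidfd(pid_t caller_pid)
{
	/* Reject obviously invalid PIDs before asking the kernel. */
	if (caller_pid <= 0) {
		errno = EINVAL;
		return -1;
	}

	/* flags == 0 requests a thread-group (process) pidfd. */
	return (int)syscall(SYS_pidfd_open, caller_pid, 0);
}
```

The returned descriptor is what the RFC's usage example stores in .charge_pid_fd before calling DMA_HEAP_IOCTL_ALLOC; the caller closes it afterwards.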

^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-18 21:12 UTC (permalink / raw)
  To: Barry Song
  Cc: Christian König, Albert Esteve, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CAGsJ_4zjrFJYQQsLThTGXR6g+2PXzeAhjyDpLHfDFqVViWvyBQ@mail.gmail.com>

On Sat, May 16, 2026 at 1:40 AM Barry Song <baohua@kernel.org> wrote:
>
> On Wed, May 13, 2026 at 2:54 AM T.J. Mercier <tjmercier@google.com> wrote:
> >
> > On Tue, May 12, 2026 at 3:14 AM Christian König
> > <christian.koenig@amd.com> wrote:
> > >
> > > On 5/12/26 11:10, Albert Esteve wrote:
> > > > On embedded platforms a central process often allocates dma-buf
> > > > memory on behalf of client applications. Without a way to
> > > > attribute the charge to the requesting client's cgroup, the
> > > > cost lands on the allocator, making per-cgroup memory limits
> > > > ineffective for the actual consumers.
> > > >
> > > > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > > > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > > > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > > > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > > > the mem_accounting module parameter enabled, the buffer is charged
> > > > to the allocator's own cgroup.
> > > >
> > > > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > > > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > > > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > > > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > > > all accounting through a single MEMCG_DMABUF path.
> > > >
> > > > Usage examples:
> > > >
> > > >   1. Central allocator charging to a client at allocation time.
> > > >      The allocator knows the client's PID (e.g., from binder's
> > > >      sender_pid) and uses pidfd to attribute the charge:
> > > >
> > > >        pid_t client_pid = txn->sender_pid;
> > > >        int pidfd = pidfd_open(client_pid, 0);
> > > >
> > > >        struct dma_heap_allocation_data alloc = {
> > > >            .len             = buffer_size,
> > > >            .fd_flags        = O_RDWR | O_CLOEXEC,
> > > >            .charge_pid_fd   = pidfd,
> > > >        };
> > > >        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> > > >        close(pidfd);
> > > >        /* alloc.fd is now charged to client's cgroup */
> > > >
> > > >   2. Default allocation (no pidfd, mem_accounting=1).
> > > >      When charge_pid_fd is not set and the mem_accounting module
> > > >      parameter is enabled, the buffer is charged to the allocator's
> > > >      own cgroup:
> > > >
> > > >        struct dma_heap_allocation_data alloc = {
> > > >            .len      = buffer_size,
> > > >            .fd_flags = O_RDWR | O_CLOEXEC,
> > > >        };
> > > >        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> > > >        /* charged to current process's cgroup */
> > > >
> > > > Current limitations:
> > > >
> > > >  - Single-owner model: a dma-buf carries one memcg charge regardless of
> > > >    how many processes share it. Means only the first owner (and exporter)
> > > >    of the shared buffer bears the charge.
> > > >  - Only memcg accounting supported. While this makes sense for system
> > > >    heap buffers, other heaps (e.g., CMA heaps) will require selectively
> > > >    charging also for the dmem controller.
> > >
> > > > Well, that doesn't look so bad; it at least seems to tackle the problem at hand for Android and some other embedded use cases.
> >
> > Yeah I think this might work. I know of 3 cases, and it trivially
> > solves the first two. The third requires some work on our end to
> > extend our userspace interfaces to include the pidfd but it seems
> > doable. I'm checking with our graphics folks.
> >
> > 1) Direct allocation from user (e.g. app -> allocation ioctl on
> > /dev/dma_heap/foo)
> > No changes required to userspace. mem_accounting=1 charges the app.
> >
> > 2) Single hop remote allocation (e.g. app -> AHardwareBuffer_allocate
> > -> gralloc)
> > gralloc has the caller's pid as described in the commit message. Open
> > a pidfd and pass it in the dma_heap_allocation_data.
> >
> > 3) Double hop remote allocation (e.g. app -> dequeueBuffer ->
> > SurfaceFlinger -> gralloc)
> > In this case gralloc knows SurfaceFlinger's pid, but not the app's. So
> > we need to add the app's pidfd to the SurfaceFlinger -> gralloc
> > interface, or transfer the memcg charge from SurfaceFlinger to the app
> > after the allocation.
> > It'd be nice to avoid the charge transfer option entirely, but if we
> > need it that doesn't seem so bad in this case because it's a bulk
> > charge for the entire dmabuf rather than per-page. So the exporter
> > doesn't need to get involved (we wouldn't need a new dma_buf_op) and
> > we wouldn't have to worry about looping and locking for each page.
> >
>
> Hi T.J.,
>
> Your description of the three different cases sounds very interesting.
> It helps me understand how difficult it can be to correctly charge
> dma-buf in the current user scenarios.
>
> I’m wondering where I can find Android userspace code that transfers
> the PID of RPC callers. Do we have any existing sample code in Android
> for this?

Hi Barry,

In Java android.os.Binder.getCallingPid() will provide it. Here


> > > > I'm just not sure if this is future-proof and will work for all use cases, e.g. cloud gaming, native context for automotive etc...
>
> Thanks
> Barry

^ permalink raw reply

* Re: [PATCH v2 02/16] security/Kconfig.hardening: Remove tautological condition from CC_HAS_ZERO_CALL_USED_REGS
From: Nathan Chancellor @ 2026-05-18 21:05 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Nicolas Schier, Bill Wendling, Justin Stitt, Nick Desaulniers,
	linux-kernel, llvm, linux-kbuild, Kees Cook, Gustavo A. R. Silva,
	linux-hardening, linux-security-module
In-Reply-To: <55186588-0cff-4908-923a-d5611707a3b0@app.fastmail.com>

On Mon, May 18, 2026 at 09:48:47AM +0200, Arnd Bergmann wrote:
> On Mon, May 18, 2026, at 01:05, Nathan Chancellor wrote:
> > Now that the minimum supported version of LLVM for building the kernel
> > has been raised to 17.0.1, the '!Clang || Clang > 15.0.6' dependency for
> > CONFIG_CC_HAS_ZERO_CALL_USED_REGS is always true, so it can be removed.
> >
> > Reviewed-by: Nicolas Schier <nsc@kernel.org>
> > Signed-off-by: Nathan Chancellor <nathan@kernel.org>
> 
> Acked-by: Arnd Bergmann <arnd@arndb.de>

Thanks for taking a look!

> >  config CC_HAS_ZERO_CALL_USED_REGS
> >  	def_bool $(cc-option,-fzero-call-used-regs=used-gpr)
> > -	# https://github.com/ClangBuiltLinux/linux/issues/1766
> > -	# https://github.com/llvm/llvm-project/issues/59242
> > -	depends on !CC_IS_CLANG || CLANG_VERSION > 150006
> > 
> 
> Maybe add a comment to mention that this now requires gcc-11,
> that way we have it easier to remove the check when that becomes
> the minimum version.

Sure, I can add

  # supported by gcc-11 or newer and all supported versions of clang

when I apply it.

-- 
Cheers,
Nathan

^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Christian König @ 2026-05-18 14:06 UTC (permalink / raw)
  To: Albert Esteve
  Cc: T.J. Mercier, Christian Brauner, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
	Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CADSE00Lh95ygoXGKJGsYvQGEsFV8sVmwEC3uvh8M6r3ERzaJwg@mail.gmail.com>

On 5/18/26 14:50, Albert Esteve wrote:
> On Mon, May 18, 2026 at 9:20 AM Christian König
> <christian.koenig@amd.com> wrote:
>>
>> On 5/15/26 19:06, T.J. Mercier wrote:
>>> On Fri, May 15, 2026 at 6:53 AM Christian Brauner <brauner@kernel.org> wrote:
>>>>
>>>> On Tue, May 12, 2026 at 11:10:44AM +0200, Albert Esteve wrote:
>>>>> On embedded platforms a central process often allocates dma-buf
>>>>> memory on behalf of client applications. Without a way to
>>>>> attribute the charge to the requesting client's cgroup, the
>>>>> cost lands on the allocator, making per-cgroup memory limits
>>>>> ineffective for the actual consumers.
>>>>>
>>>>> Add charge_pid_fd to struct dma_heap_allocation_data. When set to
>>>>
>>>> Please be aware that pidfds come in two flavors:
>>>>
>>>> thread-group pidfds and thread-specific pidfds. Make sure that your API
>>>> doesn't implicitly depend on this distinction not existing.
>>>
>>> Hi Christian,
>>>
>>> Memcg is not a controller that supports "thread mode" so all threads
>>> in a group should belong to the same memcg.
>>
>> BTW: Exactly that is the requirement automotive has with their native context use case.
>>
>> The use case is that you have a daemon which has multiple threads where each one is acting on behalf of some other process.
>>
>> At the moment we basically say they are simply not using cgroups for that use case, but it would be really nice if we could handle that as well.
>>
>> Summarizing the requirement of that use case: You need a different cgroup for each thread of a process.
> 
> Hi Christian,
> 
> Thanks for sharing this automotive use case. If I understand correctly,
> the actual requirement is attributing dma-buf charges to the right
> client, not putting each daemon thread in a different cgroup?

Nope, exactly that's the difference.

The thread acts as a filtering agent for both memory allocation and command submission for somebody else; the process on whose behalf the daemon acts can even be in a client VM, completely remote over some network, or even something like a microcontroller.

Everything the thread does regarding CPU time, GPU driver memory allocation, as well as resources like GPU processing and I/O time etc. needs to be accounted to one client, which can be different for each thread of the process.

The only thing which is shared with the main process thread is CPU memory resources, e.g. malloc() because that is basically just needed for housekeeping and pretty much irrelevant for this kind of use case.

The problem is that you can't do that with cgroups at the moment, and unfortunately only the kernel has the information needed to do this.

So you end up defining tons of interfaces just to get the necessary information from the kernel into userspace, and then essentially duplicating in userspace the same infrastructure cgroups provide in the kernel.

> If so,
> the `charge_pid_fd` approach achieves this directly by passing the
> client's `pid_fd`, without needing to add per-thread cgroup
> infrastructure.

Well, it's already a massive improvement; we could basically stop doing the whole duplication part for the GPU driver stack and just use cgroups for this part.

Doing that automatically for CPU and I/O time as well would be a nice-to-have.

Regards,
Christian.

> 
>>
>> Regards,
>> Christian.
>>
>>>
>>> Checking the flags from pidfd_get_pid would be the best way for an
>>> explicit check of the pidfd type?
>>>
>>>>> a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
>>>>> memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
>>>>> inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
>>>>> the mem_accounting module parameter enabled, the buffer is charged
>>>>> to the allocator's own cgroup.
>>>>>
>>>>> Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
>>>>> system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
>>>>> page allocations. Keeping __GFP_ACCOUNT would charge the same pages
>>>>> twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
>>>>> all accounting through a single MEMCG_DMABUF path.
>>>>>
>>>>> Usage examples:
>>>>>
>>>>>   1. Central allocator charging to a client at allocation time.
>>>>>      The allocator knows the client's PID (e.g., from binder's
>>>>>      sender_pid) and uses pidfd to attribute the charge:
>>>>>
>>>>>        pid_t client_pid = txn->sender_pid;
>>>>>        int pidfd = pidfd_open(client_pid, 0);
>>>>>
>>>>>        struct dma_heap_allocation_data alloc = {
>>>>>            .len             = buffer_size,
>>>>>            .fd_flags        = O_RDWR | O_CLOEXEC,
>>>>>            .charge_pid_fd   = pidfd,
>>>>>        };
>>>>>        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
>>>>>        close(pidfd);
>>>>>        /* alloc.fd is now charged to client's cgroup */
>>>>>
>>>>>   2. Default allocation (no pidfd, mem_accounting=1).
>>>>>      When charge_pid_fd is not set and the mem_accounting module
>>>>>      parameter is enabled, the buffer is charged to the allocator's
>>>>>      own cgroup:
>>>>>
>>>>>        struct dma_heap_allocation_data alloc = {
>>>>>            .len      = buffer_size,
>>>>>            .fd_flags = O_RDWR | O_CLOEXEC,
>>>>>        };
>>>>>        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
>>>>>        /* charged to current process's cgroup */
>>>>>
>>>>> Current limitations:
>>>>>
>>>>>  - Single-owner model: a dma-buf carries one memcg charge regardless of
>>>>>    how many processes share it. Means only the first owner (and exporter)
>>>>>    of the shared buffer bears the charge.
>>>>>  - Only memcg accounting supported. While this makes sense for system
>>>>>    heap buffers, other heaps (e.g., CMA heaps) will require selectively
>>>>>    charging also for the dmem controller.
>>>>>
>>>>> Signed-off-by: Albert Esteve <aesteve@redhat.com>
>>>>> ---
>>>>>  Documentation/admin-guide/cgroup-v2.rst |  5 ++--
>>>>>  drivers/dma-buf/dma-buf.c               | 16 ++++---------
>>>>>  drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
>>>>>  drivers/dma-buf/heaps/system_heap.c     |  2 --
>>>>>  include/uapi/linux/dma-heap.h           |  6 +++++
>>>>>  5 files changed, 53 insertions(+), 18 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>>>>> index 8bdbc2e866430..824d269531eb1 100644
>>>>> --- a/Documentation/admin-guide/cgroup-v2.rst
>>>>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>>>>> @@ -1636,8 +1636,9 @@ The following nested keys are defined.
>>>>>               structures.
>>>>>
>>>>>         dmabuf (npn)
>>>>> -             Amount of memory used for exported DMA buffers allocated by the cgroup.
>>>>> -             Stays with the allocating cgroup regardless of how the buffer is shared.
>>>>> +             Amount of memory used for exported DMA buffers allocated by or on
>>>>> +             behalf of the cgroup. Stays with the allocating cgroup regardless
>>>>> +             of how the buffer is shared.
>>>>>
>>>>>         workingset_refault_anon
>>>>>               Number of refaults of previously evicted anonymous pages.
>>>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>>>> index ce02377f48908..23fb758b78297 100644
>>>>> --- a/drivers/dma-buf/dma-buf.c
>>>>> +++ b/drivers/dma-buf/dma-buf.c
>>>>> @@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
>>>>>        */
>>>>>       BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
>>>>>
>>>>> -     mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
>>>>> -     mem_cgroup_put(dmabuf->memcg);
>>>>> +     if (dmabuf->memcg) {
>>>>> +             mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
>>>>> +                                       PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
>>>>> +             mem_cgroup_put(dmabuf->memcg);
>>>>> +     }
>>>>>
>>>>>       dmabuf->ops->release(dmabuf);
>>>>>
>>>>> @@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
>>>>>               dmabuf->resv = resv;
>>>>>       }
>>>>>
>>>>> -     dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
>>>>> -     if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
>>>>> -                                   GFP_KERNEL)) {
>>>>> -             ret = -ENOMEM;
>>>>> -             goto err_memcg;
>>>>> -     }
>>>>> -
>>>>>       file->private_data = dmabuf;
>>>>>       file->f_path.dentry->d_fsdata = dmabuf;
>>>>>       dmabuf->file = file;
>>>>> @@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
>>>>>
>>>>>       return dmabuf;
>>>>>
>>>>> -err_memcg:
>>>>> -     mem_cgroup_put(dmabuf->memcg);
>>>>>  err_file:
>>>>>       fput(file);
>>>>>  err_module:
>>>>> diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
>>>>> index ac5f8685a6494..ff6e259afcdc0 100644
>>>>> --- a/drivers/dma-buf/dma-heap.c
>>>>> +++ b/drivers/dma-buf/dma-heap.c
>>>>> @@ -7,13 +7,17 @@
>>>>>   */
>>>>>
>>>>>  #include <linux/cdev.h>
>>>>> +#include <linux/cgroup.h>
>>>>>  #include <linux/device.h>
>>>>>  #include <linux/dma-buf.h>
>>>>>  #include <linux/dma-heap.h>
>>>>> +#include <linux/memcontrol.h>
>>>>> +#include <linux/sched/mm.h>
>>>>>  #include <linux/err.h>
>>>>>  #include <linux/export.h>
>>>>>  #include <linux/list.h>
>>>>>  #include <linux/nospec.h>
>>>>> +#include <linux/pidfd.h>
>>>>>  #include <linux/syscalls.h>
>>>>>  #include <linux/uaccess.h>
>>>>>  #include <linux/xarray.h>
>>>>> @@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
>>>>>                "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
>>>>>
>>>>>  static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
>>>>> -                              u32 fd_flags,
>>>>> -                              u64 heap_flags)
>>>>> +                              u32 fd_flags, u64 heap_flags,
>>>>> +                              struct mem_cgroup *charge_to)
>>>>>  {
>>>>>       struct dma_buf *dmabuf;
>>>>> +     unsigned int nr_pages;
>>>>> +     struct mem_cgroup *memcg = charge_to;
>>>>>       int fd;
>>>>>
>>>>>       /*
>>>>> @@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
>>>>>       if (IS_ERR(dmabuf))
>>>>>               return PTR_ERR(dmabuf);
>>>>>
>>>>> +     nr_pages = len / PAGE_SIZE;
>>>>> +
>>>>> +     if (memcg)
>>>>> +             css_get(&memcg->css);
>>>>> +     else if (mem_accounting)
>>>>> +             memcg = get_mem_cgroup_from_mm(current->mm);
>>>>> +
>>>>> +     if (memcg) {
>>>>> +             if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
>>>>> +                     mem_cgroup_put(memcg);
>>>>> +                     dma_buf_put(dmabuf);
>>>>> +                     return -ENOMEM;
>>>>> +             }
>>>>> +             dmabuf->memcg = memcg;
>>>>> +     }
>>>>> +
>>>>>       fd = dma_buf_fd(dmabuf, fd_flags);
>>>>>       if (fd < 0) {
>>>>>               dma_buf_put(dmabuf);
>>>>> @@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
>>>>>  {
>>>>>       struct dma_heap_allocation_data *heap_allocation = data;
>>>>>       struct dma_heap *heap = file->private_data;
>>>>> +     struct mem_cgroup *memcg = NULL;
>>>>> +     struct task_struct *task;
>>>>> +     unsigned int pidfd_flags;
>>>>>       int fd;
>>>>>
>>>>>       if (heap_allocation->fd)
>>>>> @@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
>>>>>       if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
>>>>>               return -EINVAL;
>>>>>
>>>>> +     if (heap_allocation->charge_pid_fd) {
>>>>> +             task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
>>>>
>>>> Will always get a thread-group leader pidfd and will fail if this is a
>>>> thread-specific pidfd. pidfd_open(1234, PIDFD_THREAD) can be used to
>>>> open a thread-specific pidfd.
>>>>
>>>>> +             if (IS_ERR(task))
>>>>> +                     return PTR_ERR(task);
>>>>> +
>>>>> +             memcg = get_mem_cgroup_from_mm(task->mm);
>>>>> +             put_task_struct(task);
>>>>> +     }
>>>>> +
>>>>>       fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
>>>>>                                  heap_allocation->fd_flags,
>>>>> -                                heap_allocation->heap_flags);
>>>>> +                                heap_allocation->heap_flags,
>>>>> +                                memcg);
>>>>> +     mem_cgroup_put(memcg);
>>>>>       if (fd < 0)
>>>>>               return fd;
>>>>>
>>>>> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
>>>>> index 03c2b87cb1112..95d7688167b93 100644
>>>>> --- a/drivers/dma-buf/heaps/system_heap.c
>>>>> +++ b/drivers/dma-buf/heaps/system_heap.c
>>>>> @@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
>>>>>               if (max_order < orders[i])
>>>>>                       continue;
>>>>>               flags = order_flags[i];
>>>>> -             if (mem_accounting)
>>>>> -                     flags |= __GFP_ACCOUNT;
>>>>>               page = alloc_pages(flags, orders[i]);
>>>>>               if (!page)
>>>>>                       continue;
>>>>> diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
>>>>> index a4cf716a49fa6..e02b0f8cbc6a1 100644
>>>>> --- a/include/uapi/linux/dma-heap.h
>>>>> +++ b/include/uapi/linux/dma-heap.h
>>>>> @@ -29,6 +29,10 @@
>>>>>   *                   handle to the allocated dma-buf
>>>>>   * @fd_flags:                file descriptor flags used when allocating
>>>>>   * @heap_flags:              flags passed to heap
>>>>> + * @charge_pid_fd:   optional pidfd of the process whose cgroup should be
>>>>> + *                   charged for this allocation; 0 means charge the calling
>>>>> + *                   process's cgroup
>>>>> + * @__padding:               reserved, must be zero
>>>>>   *
>>>>>   * Provided by userspace as an argument to the ioctl
>>>>>   */
>>>>> @@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
>>>>>       __u32 fd;
>>>>>       __u32 fd_flags;
>>>>>       __u64 heap_flags;
>>>>> +     __u32 charge_pid_fd;
>>>>> +     __u32 __padding;
>>>>>  };
>>>>>
>>>>>  #define DMA_HEAP_IOC_MAGIC           'H'
>>>>>
>>>>> --
>>>>> 2.53.0
>>>>>
>>
> 


^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Albert Esteve @ 2026-05-18 12:50 UTC (permalink / raw)
  To: Christian König
  Cc: T.J. Mercier, Christian Brauner, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
	Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <208fb820-d8eb-4832-a343-ef8b360e8120@amd.com>

On Mon, May 18, 2026 at 9:20 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/15/26 19:06, T.J. Mercier wrote:
> > On Fri, May 15, 2026 at 6:53 AM Christian Brauner <brauner@kernel.org> wrote:
> >>
> >> On Tue, May 12, 2026 at 11:10:44AM +0200, Albert Esteve wrote:
> >>> On embedded platforms a central process often allocates dma-buf
> >>> memory on behalf of client applications. Without a way to
> >>> attribute the charge to the requesting client's cgroup, the
> >>> cost lands on the allocator, making per-cgroup memory limits
> >>> ineffective for the actual consumers.
> >>>
> >>> Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> >>
> >> Please be aware that pidfds come in two flavors:
> >>
> >> thread-group pidfds and thread-specific pidfds. Make sure that your API
> >> doesn't implicitly depend on this distinction not existing.
> >
> > Hi Christian,
> >
> > Memcg is not a controller that supports "thread mode" so all threads
> > in a group should belong to the same memcg.
>
> BTW: Exactly that is the requirement automotive has with their native context use case.
>
> The use case is that you have a daemon which has multiple threads, where each one is acting on behalf of some other process.
>
> At the moment we basically say they are simply not using cgroups for that use case, but it would be really nice if we could handle that as well.
>
> Summarizing the requirement of that use case: You need a different cgroup for each thread of a process.

Hi Christian,

Thanks for sharing this automotive use case. If I understand correctly,
the actual requirement is attributing dma-buf charges to the right
client, not putting each daemon thread in a different cgroup? If so,
the `charge_pid_fd` approach achieves this directly by passing the
client's pidfd, without needing to add per-thread cgroup
infrastructure.
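
A minimal sketch of what a daemon worker thread would fill in for one
client, assuming the struct layout proposed in this RFC. The struct name
`rfc_dma_heap_allocation_data`, the field name `reserved` (the patch
calls it `__padding`) and the helper `make_client_request()` are made up
for illustration; released uapi headers do not have the last two fields.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical userspace mirror of the struct proposed in this RFC.
 * Only the first four fields exist in released uapi headers.
 */
struct rfc_dma_heap_allocation_data {
	uint64_t len;
	uint32_t fd;
	uint32_t fd_flags;
	uint64_t heap_flags;
	uint32_t charge_pid_fd;	/* 0: charge the caller's cgroup */
	uint32_t reserved;	/* __padding in the patch, must be zero */
};

/*
 * Build the request a worker thread would pass to DMA_HEAP_IOCTL_ALLOC
 * on behalf of one client. client_pidfd is a pidfd the daemon opened
 * for that client.
 */
static struct rfc_dma_heap_allocation_data
make_client_request(uint64_t len, uint32_t fd_flags, int client_pidfd)
{
	struct rfc_dma_heap_allocation_data req;

	memset(&req, 0, sizeof(req));	/* fd and reserved stay zero */
	req.len = len;
	req.fd_flags = fd_flags;
	req.charge_pid_fd = (uint32_t)client_pidfd;
	return req;
}
```

Each worker thread builds its own request with the pidfd of the client
it is serving, so threads of one daemon can charge different cgroups
without the daemon itself being split across cgroups. One thing the
sketch makes visible: since 0 means "unset", a pidfd that happens to be
fd 0 cannot be expressed.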

>
> Regards,
> Christian.
>
> >
> > Checking the flags from pidfd_get_pid would be the best way for an
> > explicit check of the pidfd type?
> >
> >>> a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> >>> memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> >>> inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> >>> the mem_accounting module parameter enabled, the buffer is charged
> >>> to the allocator's own cgroup.
> >>>
> >>> Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> >>> system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> >>> page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> >>> twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> >>> all accounting through a single MEMCG_DMABUF path.
> >>>
> >>> Usage examples:
> >>>
> >>>   1. Central allocator charging to a client at allocation time.
> >>>      The allocator knows the client's PID (e.g., from binder's
> >>>      sender_pid) and uses pidfd to attribute the charge:
> >>>
> >>>        pid_t client_pid = txn->sender_pid;
> >>>        int pidfd = pidfd_open(client_pid, 0);
> >>>
> >>>        struct dma_heap_allocation_data alloc = {
> >>>            .len             = buffer_size,
> >>>            .fd_flags        = O_RDWR | O_CLOEXEC,
> >>>            .charge_pid_fd   = pidfd,
> >>>        };
> >>>        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> >>>        close(pidfd);
> >>>        /* alloc.fd is now charged to client's cgroup */
> >>>
> >>>   2. Default allocation (no pidfd, mem_accounting=1).
> >>>      When charge_pid_fd is not set and the mem_accounting module
> >>>      parameter is enabled, the buffer is charged to the allocator's
> >>>      own cgroup:
> >>>
> >>>        struct dma_heap_allocation_data alloc = {
> >>>            .len      = buffer_size,
> >>>            .fd_flags = O_RDWR | O_CLOEXEC,
> >>>        };
> >>>        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> >>>        /* charged to current process's cgroup */
> >>>
> >>> Current limitations:
> >>>
> >>>  - Single-owner model: a dma-buf carries one memcg charge regardless of
> >>>    how many processes share it. Means only the first owner (and exporter)
> >>>    of the shared buffer bears the charge.
> >>>  - Only memcg accounting supported. While this makes sense for system
> >>>    heap buffers, other heaps (e.g., CMA heaps) will require selectively
> >>>    charging also for the dmem controller.
> >>>
> >>> Signed-off-by: Albert Esteve <aesteve@redhat.com>
> >>> ---
> >>>  Documentation/admin-guide/cgroup-v2.rst |  5 ++--
> >>>  drivers/dma-buf/dma-buf.c               | 16 ++++---------
> >>>  drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
> >>>  drivers/dma-buf/heaps/system_heap.c     |  2 --
> >>>  include/uapi/linux/dma-heap.h           |  6 +++++
> >>>  5 files changed, 53 insertions(+), 18 deletions(-)
> >>>
> >>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> >>> index 8bdbc2e866430..824d269531eb1 100644
> >>> --- a/Documentation/admin-guide/cgroup-v2.rst
> >>> +++ b/Documentation/admin-guide/cgroup-v2.rst
> >>> @@ -1636,8 +1636,9 @@ The following nested keys are defined.
> >>>               structures.
> >>>
> >>>         dmabuf (npn)
> >>> -             Amount of memory used for exported DMA buffers allocated by the cgroup.
> >>> -             Stays with the allocating cgroup regardless of how the buffer is shared.
> >>> +             Amount of memory used for exported DMA buffers allocated by or on
> >>> +             behalf of the cgroup. Stays with the allocating cgroup regardless
> >>> +             of how the buffer is shared.
> >>>
> >>>         workingset_refault_anon
> >>>               Number of refaults of previously evicted anonymous pages.
> >>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> >>> index ce02377f48908..23fb758b78297 100644
> >>> --- a/drivers/dma-buf/dma-buf.c
> >>> +++ b/drivers/dma-buf/dma-buf.c
> >>> @@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
> >>>        */
> >>>       BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
> >>>
> >>> -     mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> >>> -     mem_cgroup_put(dmabuf->memcg);
> >>> +     if (dmabuf->memcg) {
> >>> +             mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
> >>> +                                       PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> >>> +             mem_cgroup_put(dmabuf->memcg);
> >>> +     }
> >>>
> >>>       dmabuf->ops->release(dmabuf);
> >>>
> >>> @@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> >>>               dmabuf->resv = resv;
> >>>       }
> >>>
> >>> -     dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
> >>> -     if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
> >>> -                                   GFP_KERNEL)) {
> >>> -             ret = -ENOMEM;
> >>> -             goto err_memcg;
> >>> -     }
> >>> -
> >>>       file->private_data = dmabuf;
> >>>       file->f_path.dentry->d_fsdata = dmabuf;
> >>>       dmabuf->file = file;
> >>> @@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> >>>
> >>>       return dmabuf;
> >>>
> >>> -err_memcg:
> >>> -     mem_cgroup_put(dmabuf->memcg);
> >>>  err_file:
> >>>       fput(file);
> >>>  err_module:
> >>> diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
> >>> index ac5f8685a6494..ff6e259afcdc0 100644
> >>> --- a/drivers/dma-buf/dma-heap.c
> >>> +++ b/drivers/dma-buf/dma-heap.c
> >>> @@ -7,13 +7,17 @@
> >>>   */
> >>>
> >>>  #include <linux/cdev.h>
> >>> +#include <linux/cgroup.h>
> >>>  #include <linux/device.h>
> >>>  #include <linux/dma-buf.h>
> >>>  #include <linux/dma-heap.h>
> >>> +#include <linux/memcontrol.h>
> >>> +#include <linux/sched/mm.h>
> >>>  #include <linux/err.h>
> >>>  #include <linux/export.h>
> >>>  #include <linux/list.h>
> >>>  #include <linux/nospec.h>
> >>> +#include <linux/pidfd.h>
> >>>  #include <linux/syscalls.h>
> >>>  #include <linux/uaccess.h>
> >>>  #include <linux/xarray.h>
> >>> @@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
> >>>                "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
> >>>
> >>>  static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> >>> -                              u32 fd_flags,
> >>> -                              u64 heap_flags)
> >>> +                              u32 fd_flags, u64 heap_flags,
> >>> +                              struct mem_cgroup *charge_to)
> >>>  {
> >>>       struct dma_buf *dmabuf;
> >>> +     unsigned int nr_pages;
> >>> +     struct mem_cgroup *memcg = charge_to;
> >>>       int fd;
> >>>
> >>>       /*
> >>> @@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> >>>       if (IS_ERR(dmabuf))
> >>>               return PTR_ERR(dmabuf);
> >>>
> >>> +     nr_pages = len / PAGE_SIZE;
> >>> +
> >>> +     if (memcg)
> >>> +             css_get(&memcg->css);
> >>> +     else if (mem_accounting)
> >>> +             memcg = get_mem_cgroup_from_mm(current->mm);
> >>> +
> >>> +     if (memcg) {
> >>> +             if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
> >>> +                     mem_cgroup_put(memcg);
> >>> +                     dma_buf_put(dmabuf);
> >>> +                     return -ENOMEM;
> >>> +             }
> >>> +             dmabuf->memcg = memcg;
> >>> +     }
> >>> +
> >>>       fd = dma_buf_fd(dmabuf, fd_flags);
> >>>       if (fd < 0) {
> >>>               dma_buf_put(dmabuf);
> >>> @@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> >>>  {
> >>>       struct dma_heap_allocation_data *heap_allocation = data;
> >>>       struct dma_heap *heap = file->private_data;
> >>> +     struct mem_cgroup *memcg = NULL;
> >>> +     struct task_struct *task;
> >>> +     unsigned int pidfd_flags;
> >>>       int fd;
> >>>
> >>>       if (heap_allocation->fd)
> >>> @@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> >>>       if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
> >>>               return -EINVAL;
> >>>
> >>> +     if (heap_allocation->charge_pid_fd) {
> >>> +             task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
> >>
> >> Will always get a thread-group leader pidfd and will fail if this is a
> >> thread-specific pidfd. pidfd_open(1234, PIDFD_THREAD) can be used to
> >> open a thread-specific pidfd.
> >>
> >>> +             if (IS_ERR(task))
> >>> +                     return PTR_ERR(task);
> >>> +
> >>> +             memcg = get_mem_cgroup_from_mm(task->mm);
> >>> +             put_task_struct(task);
> >>> +     }
> >>> +
> >>>       fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
> >>>                                  heap_allocation->fd_flags,
> >>> -                                heap_allocation->heap_flags);
> >>> +                                heap_allocation->heap_flags,
> >>> +                                memcg);
> >>> +     mem_cgroup_put(memcg);
> >>>       if (fd < 0)
> >>>               return fd;
> >>>
> >>> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> >>> index 03c2b87cb1112..95d7688167b93 100644
> >>> --- a/drivers/dma-buf/heaps/system_heap.c
> >>> +++ b/drivers/dma-buf/heaps/system_heap.c
> >>> @@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
> >>>               if (max_order < orders[i])
> >>>                       continue;
> >>>               flags = order_flags[i];
> >>> -             if (mem_accounting)
> >>> -                     flags |= __GFP_ACCOUNT;
> >>>               page = alloc_pages(flags, orders[i]);
> >>>               if (!page)
> >>>                       continue;
> >>> diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
> >>> index a4cf716a49fa6..e02b0f8cbc6a1 100644
> >>> --- a/include/uapi/linux/dma-heap.h
> >>> +++ b/include/uapi/linux/dma-heap.h
> >>> @@ -29,6 +29,10 @@
> >>>   *                   handle to the allocated dma-buf
> >>>   * @fd_flags:                file descriptor flags used when allocating
> >>>   * @heap_flags:              flags passed to heap
> >>> + * @charge_pid_fd:   optional pidfd of the process whose cgroup should be
> >>> + *                   charged for this allocation; 0 means charge the calling
> >>> + *                   process's cgroup
> >>> + * @__padding:               reserved, must be zero
> >>>   *
> >>>   * Provided by userspace as an argument to the ioctl
> >>>   */
> >>> @@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
> >>>       __u32 fd;
> >>>       __u32 fd_flags;
> >>>       __u64 heap_flags;
> >>> +     __u32 charge_pid_fd;
> >>> +     __u32 __padding;
> >>>  };
> >>>
> >>>  #define DMA_HEAP_IOC_MAGIC           'H'
> >>>
> >>> --
> >>> 2.53.0
> >>>
>


^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Albert Esteve @ 2026-05-18 12:16 UTC (permalink / raw)
  To: Barry Song
  Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CAGsJ_4xfznffbjOaNKwnN6oZk_H6pqOzYqd1zx4Q9XrocdzV8A@mail.gmail.com>

On Sat, May 16, 2026 at 9:37 AM Barry Song <baohua@kernel.org> wrote:
>
> On Tue, May 12, 2026 at 5:18 PM Albert Esteve <aesteve@redhat.com> wrote:
> >
> > On embedded platforms a central process often allocates dma-buf
> > memory on behalf of client applications. Without a way to
> > attribute the charge to the requesting client's cgroup, the
> > cost lands on the allocator, making per-cgroup memory limits
> > ineffective for the actual consumers.
> >
> > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > the mem_accounting module parameter enabled, the buffer is charged
> > to the allocator's own cgroup.
> >
> > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > all accounting through a single MEMCG_DMABUF path.
> >
> [...]
>
> > -               if (mem_accounting)
> > -                       flags |= __GFP_ACCOUNT;
>
> Hi Albert,
>
> would it be better to move this and its description to patch 1? It
> looks like patch 1 already introduces the double accounting changes,
> and patch 2 is mainly just supporting remote charging.

Hi Barry,

Thanks for looking into this series! Yes, in my head I was trying to
keep patch 1, which was taken from a previous, different series, and
then diverge from it starting with patch 2. This would clarify the
difference between the two. But I can see it just added some confusion
(for example, patch 1 charges on dma_buf_export() and then it is moved
to dma_heap_buffer_alloc() in patch 2). I will reorganize it better
for the next version, including your suggestion.

>
> Also, mem_accounting is only used by system_heap.c; has this patchset
> also eliminated its need?

No, mem_accounting is still handled in this patch for the general case
where no `charge_pid_fd` is used. See dma_heap_buffer_alloc() code:

+       if (memcg)
+               css_get(&memcg->css);
+       else if (mem_accounting)
+               memcg = get_mem_cgroup_from_mm(current->mm);
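
To spell out the precedence in that hunk, here is a small userspace
model. It is only a sketch: `struct model_memcg` and
`pick_charge_target()` are stand-ins invented for this mail, and the
reference counting done by css_get()/get_mem_cgroup_from_mm() is left
out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct mem_cgroup; only identity matters here. */
struct model_memcg {
	const char *name;
};

/*
 * Mirror of the selection order in the hunk above:
 *   1. an explicit target (resolved from charge_pid_fd) always wins;
 *   2. otherwise the caller's memcg, but only if mem_accounting is set;
 *   3. otherwise NULL, meaning the buffer is not charged at all.
 */
static struct model_memcg *
pick_charge_target(struct model_memcg *explicit_target,
		   struct model_memcg *callers_memcg,
		   bool mem_accounting)
{
	if (explicit_target)
		return explicit_target;
	if (mem_accounting)
		return callers_memcg;
	return NULL;
}
```

Note that in this model, as in the hunk, an explicit target is charged
even when mem_accounting is off; the module parameter only gates the
fallback to the allocator's own cgroup.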

>
> Thanks
> Barry
>


^ permalink raw reply

* Re: [Linaro-mm-sig] Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Albert Esteve @ 2026-05-18 12:06 UTC (permalink / raw)
  To: Christian König
  Cc: Barry Song, T.J. Mercier, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <cb84c2ee-9de1-4565-b2e0-60984721228f@amd.com>

On Mon, May 18, 2026 at 9:34 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/16/26 11:19, Barry Song wrote:
> > On Thu, May 14, 2026 at 12:35 AM T.J. Mercier <tjmercier@google.com> wrote:
> > [...]
> >>>> I have a question about this part. Albert I guess you are interested
> >>>> only in accounting dmabuf-heap allocations, or do you expect to add
> >>>> __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other
> >>>> non-dmabuf-heap exporters?
> >>>
> >>> We're scoping this to dma-buf heaps for now. CMA heaps and the dmem
> >>> controller are on the radar for follow-up/parallel work (there will be
> >>> dragons and will surely need discussion). For DRM and V4L2 the
> >>> long-term intent is migration to heaps, which would make direct
> >>> accounting on those paths unnecessary.
> >>
> >> Ah I see. GEM buffers exported to dmabufs are what I had in mind. I
> >> guess this would only leave the odd non-DRM driver with the need to
> >> add their own accounting calls, which I don't expect would be a big
> >> problem.
> >>
> >
> > sounds like we still have a long way to go to correctly account for
> > various v4l2, drm, GEM, CMA, etc. In patch 1, the charging is done in
> > dma_buf_export(), so I guess it covers all dma-buf types except
> > dma_heap, but the problem is that it has no remote charging support at
> > all?
>
> No, just the other way around
>
> DMA-buf heaps can be handled here because we know that it is pure system memory and nothing special so memcg always applies.
>
> dma_buf_export() on the other hand handles tons of different use cases, ranging from buffers accounted to dmem, over special resources which aren't even memory, all the way to buffers which can migrate from dmem to memcg and back during their lifetime.
>
> >>> udmabufs are already
> >>> memcg-charged, so adding a separate MEMCG_DMABUF would double count.
> >>> Are there any other exporters you had in mind that would benefit from
> >>> this approach?
>
> Well, apart from DMA-buf, memfd_create() is one of the things which has broken our neck a couple of times in the past.
>
> But thinking more about it: instead of making this DMA-buf heaps specific, what if we have a general cgroups function which allows changing the accounting of a buffer referenced by a file descriptor to a different process?
>
> That would cover not only the DMA-buf heaps use case, but also all other DMA-bufs with dmem and whatever we come up with in the future as well.

I removed a draft adding an ioctl for charge transfer from the series
before sending because I wanted to focus on the charge_pid_fd approach
and keep things simple, deferring the recharge path to a follow-up
depending on feedback.

The main difference between my removed draft and what you're
describing, iiuc, is scope and layer: my draft was an explicit ioctl
on the dma-buf fd that the consumer calls to claim the charge (see
below), while you seem to be suggesting a more general kernel-internal
function that could work across buffer types and cgroup controllers,
so not necessarily userspace-initiated? A kernel-internal function
will need a way to identify the target process, which sounds similar
to the binder-backed approach from TJ [1]. For everything else, the
receiver still needs to declare itself, which the ioctl accomplishes.

```
/* When an app imports a daemon-allocated buffer, it can transfer the
 * charge to itself. */
int buf_fd = receive_dmabuf_from_daemon();
ioctl(buf_fd, DMA_BUF_IOCTL_XFER_CHARGE); /* charge now attributed to the app's cgroup */
```

[1] https://lore.kernel.org/cgroups/20230109213809.418135-1-tjmercier@google.com/

>
> The only drawback I can see is that DMA-buf heap allocations would be temporarily accounted to the memory allocation daemon, but I don't think that this would be a problem.

The main reasons we moved away from TJ's transfer-based approach
toward `charge_pid_fd` are to avoid the transient charge window on the
daemon's cgroup and to decouple from Binder, allowing any allocator
to use it.

Technically, both approaches could coexist, though. Of the three
scenarios TJ described:
- Scenario 2 is directly addressed by the `charge_pid_fd` approach without
any transient charge on the daemon, at the cost of one extra field in
the heap ioctl uAPI struct.
- Scenario 3 can be handled by the charge transfer function without
changes to SurfaceFlinger. The app or dequeueBuffer claims the charge
for itself or the app, respectively (depending on whether we include a
pid_fd field in the transfer ioctl). It also covers non-heap
exporters. The con in both variants is the transient charge window on
the daemon.
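
The transient charge window can be made concrete with a toy model. This
is not kernel code: `struct toy_cgroup` and the helpers below are
invented here, and the transfer is assumed to be "uncharge the daemon,
then charge the client", which a real implementation may well order
differently.

```c
/* Toy cgroup: current charge and the highest charge ever observed. */
struct toy_cgroup {
	long charged;
	long peak;
};

static void toy_charge(struct toy_cgroup *cg, long pages)
{
	cg->charged += pages;
	if (cg->charged > cg->peak)
		cg->peak = cg->charged;
}

static void toy_uncharge(struct toy_cgroup *cg, long pages)
{
	cg->charged -= pages;
}

/* charge_pid_fd style: the client is charged at allocation time. */
static void alloc_direct(struct toy_cgroup *client, long pages)
{
	toy_charge(client, pages);
}

/* Transfer style: the daemon is charged first, then the charge moves. */
static void alloc_then_transfer(struct toy_cgroup *daemon_cg,
				struct toy_cgroup *client, long pages)
{
	toy_charge(daemon_cg, pages);	/* transient window starts */
	toy_uncharge(daemon_cg, pages);	/* transient window ends */
	toy_charge(client, pages);
}
```

With alloc_direct() the daemon's peak never moves. With
alloc_then_transfer() the daemon ends at zero but its peak records the
full buffer size, which is the window a tight limit on the daemon's
cgroup could trip over.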

Both approaches shift the responsibility for correct charging
attribution to userspace: `charge_pid_fd` on the allocator's side, and
the transfer charge on the consumer's side.

Deciding on one, the other or both depends on how much we value
avoiding transient attribution, and how much we need a non-heap
generic solution. With the XFER_CHARGE we can cover both. Thus, the
`charge_pid_fd` approach in this RFC can be seen as a
performance/strictness optimisation, eliminating transient charges to
the daemon at the cost of a permanent uAPI addition to the heap ioctl
struct, but not strictly required for correctness. On the other hand,
if we agree on the end goal of migrating other exporters to use
dma-buf heaps, and scenario 3 is addressed by adding the app's pid_fd
to SurfaceFlinger, then `charge_pid_fd` alone is a coherent/sufficient
approach despite the uAPI change.

>
> Regards,
> Christian.
>
> >
> > Thanks
> > Barry
>


^ permalink raw reply

* Re: [PATCH v2 05/17] tracing: Add __print_untrusted_str()
From: Mickaël Salaün @ 2026-05-18 10:26 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
  Cc: Christian Brauner, Günther Noack, Jann Horn, Jeff Xu,
	Justin Suess, Kees Cook, Mathieu Desnoyers, Matthieu Buffet,
	Mikhail Ivanov, Tingmao Wang, kernel-team, linux-fsdevel,
	linux-security-module, linux-trace-kernel, Andrii Nakryiko
In-Reply-To: <20260406143717.1815792-6-mic@digikod.net>

Steve, Masami, Mathieu, are you ok with this new helper?

On Mon, Apr 06, 2026 at 04:37:03PM +0200, Mickaël Salaün wrote:
> Landlock tracepoints expose filesystem paths and process names
> that may contain spaces, equal signs, or other characters that
> break ftrace field parsing.
> 
> Add a new __print_untrusted_str() helper to safely print strings after
> escaping all special characters, including common separators (space,
> equal sign), quotes, and backslashes.  This transforms a string from an
> untrusted source (e.g. user space) to make it:
> - safe to parse,
> - easy to read (for simple strings),
> - easy to get back the original.
> 
> Cc: Günther Noack <gnoack@google.com>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Tingmao Wang <m@maowtm.org>
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> ---
> 
> Changes since v1:
> https://lore.kernel.org/r/20250523165741.693976-4-mic@digikod.net
> - Remove WARN_ON() (pointed out by Steven Rostedt).
> ---
>  include/linux/trace_events.h               |  2 ++
>  include/trace/stages/stage3_trace_output.h |  4 +++
>  include/trace/stages/stage7_class_define.h |  1 +
>  kernel/trace/trace_output.c                | 41 ++++++++++++++++++++++
>  4 files changed, 48 insertions(+)
> 
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index 37eb2f0f3dd8..7f4325d327ee 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -57,6 +57,8 @@ trace_print_hex_dump_seq(struct trace_seq *p, const char *prefix_str,
>  			 int prefix_type, int rowsize, int groupsize,
>  			 const void *buf, size_t len, bool ascii);
>  
> +const char *trace_print_untrusted_str_seq(struct trace_seq *s, const char *str);
> +
>  int trace_raw_output_prep(struct trace_iterator *iter,
>  			  struct trace_event *event);
>  extern __printf(2, 3)
> diff --git a/include/trace/stages/stage3_trace_output.h b/include/trace/stages/stage3_trace_output.h
> index fce85ea2df1c..62e98babb969 100644
> --- a/include/trace/stages/stage3_trace_output.h
> +++ b/include/trace/stages/stage3_trace_output.h
> @@ -133,6 +133,10 @@
>  	trace_print_hex_dump_seq(p, prefix_str, prefix_type,		\
>  				 rowsize, groupsize, buf, len, ascii)
>  
> +#undef __print_untrusted_str
> +#define __print_untrusted_str(str)							\
> +		trace_print_untrusted_str_seq(p, __get_str(str))
> +
>  #undef __print_ns_to_secs
>  #define __print_ns_to_secs(value)			\
>  	({						\
> diff --git a/include/trace/stages/stage7_class_define.h b/include/trace/stages/stage7_class_define.h
> index fcd564a590f4..1164aacd550f 100644
> --- a/include/trace/stages/stage7_class_define.h
> +++ b/include/trace/stages/stage7_class_define.h
> @@ -24,6 +24,7 @@
>  #undef __print_array
>  #undef __print_dynamic_array
>  #undef __print_hex_dump
> +#undef __print_untrusted_str
>  #undef __get_buf
>  
>  /*
> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index 1996d7aba038..9d14c7cc654d 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -16,6 +16,7 @@
>  #include <linux/btf.h>
>  #include <linux/bpf.h>
>  #include <linux/hashtable.h>
> +#include <linux/string_helpers.h>
>  
>  #include "trace_output.h"
>  #include "trace_btf.h"
> @@ -321,6 +322,46 @@ trace_print_hex_dump_seq(struct trace_seq *p, const char *prefix_str,
>  }
>  EXPORT_SYMBOL(trace_print_hex_dump_seq);
>  
> +/**
> + * trace_print_untrusted_str_seq - print a string after escaping characters
> + * @s: trace seq struct to write to
> + * @src: The string to print
> + *
> + * Prints a string to a trace seq after escaping all special characters,
> + * including common separators (space, equal sign), quotes, and backslashes.
> + * This transforms a string from an untrusted source (e.g. user space) to make
> + * it:
> + * - safe to parse,
> + * - easy to read (for simple strings),
> + * - easy to get back the original.
> + */
> +const char *trace_print_untrusted_str_seq(struct trace_seq *s,
> +					   const char *src)
> +{
> +	int escaped_size;
> +	char *buf;
> +	size_t buf_size = seq_buf_get_buf(&s->seq, &buf);
> +	const char *ret = trace_seq_buffer_ptr(s);
> +
> +	/* Buffer exhaustion is normal when the trace buffer is full. */
> +	if (!src || buf_size == 0)
> +		return NULL;
> +
> +	escaped_size = string_escape_mem(src, strlen(src), buf, buf_size,
> +		ESCAPE_SPACE | ESCAPE_SPECIAL | ESCAPE_NAP | ESCAPE_APPEND |
> +		ESCAPE_OCTAL, " ='\"\\");
> +	if (unlikely(escaped_size >= buf_size)) {
> +		/* We need some room for the final '\0'. */
> +		seq_buf_set_overflow(&s->seq);
> +		s->full = 1;
> +		return NULL;
> +	}
> +	seq_buf_commit(&s->seq, escaped_size);
> +	trace_seq_putc(s, 0);
> +	return ret;
> +}
> +EXPORT_SYMBOL(trace_print_untrusted_str_seq);
> +
>  int trace_raw_output_prep(struct trace_iterator *iter,
>  			  struct trace_event *trace_event)
>  {
> -- 
> 2.53.0
> 
> 
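
For readers who want to see what the escaped output looks like, here is
a userspace model of the escaping described in the quoted commit
message. It is a sketch only, not the kernel implementation: it assumes
that this flag combination passes printable ASCII through unchanged
except for the characters in the " ='\"\\" set, and that everything
else is escaped as a two-character sequence where one exists and as
three-digit octal otherwise. `model_escape()` is a name made up here.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns the number of bytes the full result needs (excluding the
 * terminating '\0'), similar to snprintf(). Output is truncated at a
 * whole escape sequence and always terminated if dst_size > 0.
 */
static size_t model_escape(const char *src, char *dst, size_t dst_size)
{
	static const char extra[] = " ='\"\\";	/* the "only" set */
	size_t need = 0, written = 0;

	for (; *src; src++) {
		unsigned char c = (unsigned char)*src;
		const char *two = NULL;
		char tmp[8];
		size_t n;

		if (c < 0x80 && isprint(c) && !strchr(extra, c)) {
			tmp[0] = (char)c;	/* plain printable ASCII */
			n = 1;
		} else {
			switch (c) {
			case '\n': two = "\\n"; break;
			case '\r': two = "\\r"; break;
			case '\t': two = "\\t"; break;
			case '\v': two = "\\v"; break;
			case '\f': two = "\\f"; break;
			case '\a': two = "\\a"; break;
			case 0x1b: two = "\\e"; break;
			case '"':  two = "\\\""; break;
			case '\\': two = "\\\\"; break;
			}
			if (two) {
				memcpy(tmp, two, 2);
				n = 2;
			} else {
				/* octal fallback, e.g. ' ' -> \040 */
				n = (size_t)snprintf(tmp, sizeof(tmp),
						     "\\%03o", (unsigned int)c);
			}
		}
		/* Copy only while nothing was dropped and '\0' still fits. */
		if (need == written && written + n < dst_size) {
			memcpy(dst + written, tmp, n);
			written += n;
		}
		need += n;
	}
	if (dst_size > 0)
		dst[written] = '\0';
	return need;
}
```

A caller can detect truncation the same way the patch does, by checking
whether the returned size is greater than or equal to the buffer size.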

^ permalink raw reply

* Re: [linus:master] [selftests]  465b05bae5: kernel-selftests.landlock.audit_test.audit.tsync_override_log_subdomains_off.fail
From: Thomas Weißschuh @ 2026-05-18 10:01 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Günther Noack, kernel test robot, linux-security-module,
	oe-lkp, lkp, linux-kernel, Shuah Khan, Kees Cook, linux-kselftest
In-Reply-To: <20260518.ohn9DahGhui6@digikod.net>

On Mon, May 18, 2026 at 11:30:42AM +0200, Mickaël Salaün wrote:
> On Mon, May 18, 2026 at 10:48:27AM +0200, Thomas Weißschuh wrote:
> > On Wed, May 13, 2026 at 12:52:35PM +0200, Mickaël Salaün wrote:
> > (...)
> > 
> > > > > config: x86_64-rhel-9.4-kselftests
> > > > > compiler: gcc-14
> > > > > test machine: 16 threads Intel(R) Core(TM) i7-13620H (Raptor Lake) with 32G memory
> > > > > 
> > > > > (please refer to attached dmesg/kmsg for entire log/backtrace)
> > > > > 
> > > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > > > the same patch/commit), kindly add following tags
> > > > > | Reported-by: kernel test robot <oliver.sang@intel.com>
> > > > > | Closes: https://lore.kernel.org/oe-lkp/202605111649.a8b30a62-lkp@intel.com
> > > > 
> > > > I was unable to run the landlock selftests myself, on my machines they are
> > > > failing at runtime with all kinds of colorful errors. Are the requirements
> > > > explained somewhere?
> > > 
> > > I'm curious about the errors you get.  They are standard kselftests that
> > > should work following this workflow:
> > > 
> > >   make TARGETS=landlock O=build kselftest-gen_tar
> > > 
> > > and then running ./build/kselftests/kselftest_install/run_kselftest.sh
> > > as root in a VM.  The required kernel configuration is listed in
> > > tools/testing/selftests/landlock/config
> > 
> > So there are two root issues I ran into:
> > 
> > 1) The tests can not be executed from virtiofs (as set up by virtme-ng):
> 
> Most filesystem tests initially set up tmpfs and then use it.
> 
> I'm using virtme-ng too, see the
> https://github.com/landlock-lsm/landlock-test-tools
> 
>  ARCH=x86_64 .../check-linux.sh build_light kselftest
> 
> > 
> >  #  RUN           audit.layers ...
> > # audit_test.c:52:layers:Expected 0 (0) <= self->audit_fd (-13)
> > # audit_test.c:61:layers:Failed to initialize audit: Permission denied
> > # layers: Test failed
> > #          FAIL  audit.layers
> > not ok 1 audit.layers
> > 
> > (The same for all other testcases)
> 
> It looks like the tests are not run with enough privileges.  Do you run
> them as root?  Does the kernel have the required config set?

Yes. The exact same setup works when executed from a tmpfs.

> > 2) $PWD needs to be the test binary directory for "./wait-pipe-sandbox" to work.
> 
> Yes.  run_kselftest.sh should handle that.

Fair enough. The selftests I have used so far worked just fine when executed
directly. Maybe I am the only one using them this way. Better diagnostics
would have saved me some time. Consider it a suggestion.

> > > To make it easier, we wrote a wrapper to test everything with UML:
> > > https://github.com/landlock-lsm/landlock-test-tools (see check-linux.sh)
> > > 
> > > > 
> > > > > # #  RUN           audit.tsync_override_log_subdomains_off ...
> > > > > # # audit_test.c:591:tsync_override_log_subdomains_off:Expected 0 (0) == matches_log_signal(_metadata, self->audit_fd, child_data.parent_pid, NULL) (-11)
> > > > 
> > > > This error number means "EAGAIN 11 Resource temporarily unavailable",
> > > > so it could be a temporary error.
> > > 
> > > Yes, the test is flaky under pressure.
> > > 
> > > > 
> > > > Can you reproduce this issue? Is it really dependent on my patch as
> > > > blamed above? If so, does the selftest rely on the previous, incorrect order?
> > > 
> > > I don't think it directly depends on your patch but it might be a side
> > > effect.  Anyway, I've been working on fixing this kind of issue and just
> > > sent a fix:
> > > https://lore.kernel.org/r/20260513105112.140137-2-mic@digikod.net
> > 
> > Thanks, unfortunately I can't validate that it will fix the issue at hand.
> 
> I pushed it to -next, we'll see but I'm pretty sure this is the issue.

Nice. I just wanted to make clear that I won't be able to provide a Tested-by.


Thanks again,
Thomas

^ permalink raw reply

* Re: [linus:master] [selftests]  465b05bae5: kernel-selftests.landlock.audit_test.audit.tsync_override_log_subdomains_off.fail
From: Mickaël Salaün @ 2026-05-18  9:30 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Günther Noack, kernel test robot, linux-security-module,
	oe-lkp, lkp, linux-kernel, Shuah Khan, Kees Cook, linux-kselftest
In-Reply-To: <20260518100602-5b161e99-83fa-4170-bb7b-1642df6b5a3d@linutronix.de>

On Mon, May 18, 2026 at 10:48:27AM +0200, Thomas Weißschuh wrote:
> On Wed, May 13, 2026 at 12:52:35PM +0200, Mickaël Salaün wrote:
> (...)
> 
> > > > config: x86_64-rhel-9.4-kselftests
> > > > compiler: gcc-14
> > > > test machine: 16 threads Intel(R) Core(TM) i7-13620H (Raptor Lake) with 32G memory
> > > > 
> > > > (please refer to attached dmesg/kmsg for entire log/backtrace)
> > > > 
> > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > > the same patch/commit), kindly add following tags
> > > > | Reported-by: kernel test robot <oliver.sang@intel.com>
> > > > | Closes: https://lore.kernel.org/oe-lkp/202605111649.a8b30a62-lkp@intel.com
> > > 
> > > I was unable to run the landlock selftests myself, on my machines they are
> > > failing at runtime with all kinds of colorful errors. Are the requirements
> > > explained somewhere?
> > 
> > I'm curious about the errors you get.  They are standard kselftests that
> > should work following this workflow:
> > 
> >   make TARGETS=landlock O=build kselftest-gen_tar
> > 
> > and then running ./build/kselftests/kselftest_install/run_kselftest.sh
> > as root in a VM.  The required kernel configuration is listed in
> > tools/testing/selftests/landlock/config
> 
> So there are two root issues I ran into:
> 
> 1) The tests can not be executed from virtiofs (as set up by virtme-ng):

Most filesystem tests initially set up tmpfs and then use it.

I'm using virtme-ng too, see the
https://github.com/landlock-lsm/landlock-test-tools

 ARCH=x86_64 .../check-linux.sh build_light kselftest

> 
>  #  RUN           audit.layers ...
> # audit_test.c:52:layers:Expected 0 (0) <= self->audit_fd (-13)
> # audit_test.c:61:layers:Failed to initialize audit: Permission denied
> # layers: Test failed
> #          FAIL  audit.layers
> not ok 1 audit.layers
> 
> (The same for all other testcases)

It looks like the tests are not run with enough privileges.  Do you run
them as root?  Does the kernel have the required config set?

> 
> 2) $PWD needs to be the test binary directory for "./wait-pipe-sandbox" to work.

Yes.  run_kselftest.sh should handle that.

> 
> > To make it easier, we wrote a wrapper to test everything with UML:
> > https://github.com/landlock-lsm/landlock-test-tools (see check-linux.sh)
> > 
> > > 
> > > > # #  RUN           audit.tsync_override_log_subdomains_off ...
> > > > # # audit_test.c:591:tsync_override_log_subdomains_off:Expected 0 (0) == matches_log_signal(_metadata, self->audit_fd, child_data.parent_pid, NULL) (-11)
> > > 
> > > This error number means "EAGAIN 11 Resource temporarily unavailable",
> > > so it could be a temporary error.
> > 
> > Yes, the test is flaky under pressure.
> > 
> > > 
> > > Can you reproduce this issue? Is it really dependent on my patch as
> > > blamed above? If so, does the selftest rely on the previous, incorrect order?
> > 
> > I don't think it directly depends on your patch but it might be a side
> > effect.  Anyway, I've been working on fixing this kind of issue and just
> > sent a fix:
> > https://lore.kernel.org/r/20260513105112.140137-2-mic@digikod.net
> 
> Thanks, unfortunately I can't validate that it will fix the issue at hand.

I pushed it to -next, we'll see but I'm pretty sure this is the issue.

^ permalink raw reply

* Re: [linus:master] [selftests]  465b05bae5: kernel-selftests.landlock.audit_test.audit.tsync_override_log_subdomains_off.fail
From: Thomas Weißschuh @ 2026-05-18  8:48 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Günther Noack, kernel test robot, linux-security-module,
	oe-lkp, lkp, linux-kernel, Shuah Khan, Kees Cook, linux-kselftest
In-Reply-To: <20260513.eeboh9zooQuu@digikod.net>

On Wed, May 13, 2026 at 12:52:35PM +0200, Mickaël Salaün wrote:
(...)

> > > config: x86_64-rhel-9.4-kselftests
> > > compiler: gcc-14
> > > test machine: 16 threads Intel(R) Core(TM) i7-13620H (Raptor Lake) with 32G memory
> > > 
> > > (please refer to attached dmesg/kmsg for entire log/backtrace)
> > > 
> > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > the same patch/commit), kindly add following tags
> > > | Reported-by: kernel test robot <oliver.sang@intel.com>
> > > | Closes: https://lore.kernel.org/oe-lkp/202605111649.a8b30a62-lkp@intel.com
> > 
> > I was unable to run the landlock selftests myself, on my machines they are
> > failing at runtime with all kinds of colorful errors. Are the requirements
> > explained somewhere?
> 
> I'm curious about the errors you get.  They are standard kselftests that
> should work following this workflow:
> 
>   make TARGETS=landlock O=build kselftest-gen_tar
> 
> and then running ./build/kselftests/kselftest_install/run_kselftest.sh
> as root in a VM.  The required kernel configuration is listed in
> tools/testing/selftests/landlock/config

So there are two root issues I ran into:

1) The tests can not be executed from virtiofs (as set up by virtme-ng):

 #  RUN           audit.layers ...
# audit_test.c:52:layers:Expected 0 (0) <= self->audit_fd (-13)
# audit_test.c:61:layers:Failed to initialize audit: Permission denied
# layers: Test failed
#          FAIL  audit.layers
not ok 1 audit.layers

(The same for all other testcases)

2) $PWD needs to be the test binary directory for "./wait-pipe-sandbox" to work.

> To make it easier, we wrote a wrapper to test everything with UML:
> https://github.com/landlock-lsm/landlock-test-tools (see check-linux.sh)
> 
> > 
> > > # #  RUN           audit.tsync_override_log_subdomains_off ...
> > > # # audit_test.c:591:tsync_override_log_subdomains_off:Expected 0 (0) == matches_log_signal(_metadata, self->audit_fd, child_data.parent_pid, NULL) (-11)
> > 
> > This error number means "EAGAIN 11 Resource temporarily unavailable",
> > so it could be a temporary error.
> 
> Yes, the test is flaky under pressure.
> 
> > 
> > Can you reproduce this issue? Is it really dependent on my patch as
> > blamed above? If so, does the selftest rely on the previous, incorrect order?
> 
> I don't think it directly depends on your patch but it might be a side
> effect.  Anyway, I've been working on fixing this kind of issue and just
> sent a fix:
> https://lore.kernel.org/r/20260513105112.140137-2-mic@digikod.net

Thanks, unfortunately I can't validate that it will fix the issue at hand.


Thomas

^ permalink raw reply

* Re: [PATCH v2 02/16] security/Kconfig.hardening: Remove tautological condition from CC_HAS_ZERO_CALL_USED_REGS
From: Arnd Bergmann @ 2026-05-18  7:48 UTC (permalink / raw)
  To: Nathan Chancellor, Nicolas Schier, Bill Wendling, Justin Stitt,
	Nick Desaulniers
  Cc: linux-kernel, llvm, linux-kbuild, Kees Cook, Gustavo A. R. Silva,
	linux-hardening, linux-security-module
In-Reply-To: <20260517-bump-minimum-supported-llvm-version-to-17-v2-2-b3b8cda46bdd@kernel.org>

On Mon, May 18, 2026, at 01:05, Nathan Chancellor wrote:
> Now that the minimum supported version of LLVM for building the kernel
> has been raised to 17.0.1, the '!Clang || Clang > 15.0.6' dependency for
> CONFIG_CC_HAS_ZERO_CALL_USED_REGS is always true, so it can be removed.
>
> Reviewed-by: Nicolas Schier <nsc@kernel.org>
> Signed-off-by: Nathan Chancellor <nathan@kernel.org>

Acked-by: Arnd Bergmann <arnd@arndb.de>

>  config CC_HAS_ZERO_CALL_USED_REGS
>  	def_bool $(cc-option,-fzero-call-used-regs=used-gpr)
> -	# https://github.com/ClangBuiltLinux/linux/issues/1766
> -	# https://github.com/llvm/llvm-project/issues/59242
> -	depends on !CC_IS_CLANG || CLANG_VERSION > 150006
> 

Maybe add a comment mentioning that this now requires gcc-11; that
way it is easier to remove the check once that becomes the minimum
version.

       Arnd

^ permalink raw reply

* Re: [Linaro-mm-sig] Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Christian König @ 2026-05-18  7:34 UTC (permalink / raw)
  To: Barry Song, T.J. Mercier
  Cc: Albert Esteve, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Sumit Semwal, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, Christian Brauner,
	Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
	Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-mm, linux-security-module,
	selinux, linux-kselftest, mripard, echanude
In-Reply-To: <CAGsJ_4zyecY6E-=Tm4_couT7uoM9LMcFdTMUPkZAjj4zUKE-dQ@mail.gmail.com>

On 5/16/26 11:19, Barry Song wrote:
> On Thu, May 14, 2026 at 12:35 AM T.J. Mercier <tjmercier@google.com> wrote:
> [...]
>>>> I have a question about this part. Albert I guess you are interested
>>>> only in accounting dmabuf-heap allocations, or do you expect to add
>>>> __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other
>>>> non-dmabuf-heap exporters?
>>>
>>> We're scoping this to dma-buf heaps for now. CMA heaps and the dmem
>>> controller are on the radar for follow-up/parallel work (there will be
>>> dragons and will surely need discussion). For DRM and V4L2 the
>>> long-term intent is migration to heaps, which would make direct
>>> accounting on those paths unnecessary.
>>
>> Ah I see. GEM buffers exported to dmabufs are what I had in mind. I
>> guess this would only leave the odd non-DRM driver with the need to
>> add their own accounting calls, which I don't expect would be a big
>> problem.
>>
> 
> sounds like we still have a long way to go to correctly account for
> various v4l2, drm, GEM, CMA, etc. In patch 1, the charging is done in
> dma_buf_export(), so I guess it covers all dma-buf types except
> dma_heap, but the problem is that it has no remote charging support at
> all?

No, just the other way around.

DMA-buf heaps can be handled here because we know that it is pure system memory and nothing special, so memcg always applies.

dma_buf_export(), on the other hand, handles tons of different use cases, ranging from buffers accounted to dmem, over special resources which aren't even memory, all the way to buffers which can migrate from dmem to memcg and back during their lifetime.

>>> udmabufs are already
>>> memcg-charged, so adding a separate MEMCG_DMABUF would double count.
>>> Are there any other exporters you had in mind that would benefit from
>>> this approach?

Well, apart from DMA-buf, memfd_create() is one of the things which has broken our neck a couple of times in the past.

But thinking more about it: instead of making this DMA-buf heaps specific, what if we had a general cgroups function which allows changing the accounting of a buffer referenced by a file descriptor to a different process?

That would cover not only the DMA-buf heaps use case, but also all other DMA-bufs with dmem and whatever we come up with in the future as well.

The only drawback I can see is that DMA-buf heap allocations would be temporarily accounted to the memory allocation daemon, but I don't think that this would be a problem.

Regards,
Christian.

> 
> Thanks
> Barry


^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Christian König @ 2026-05-18  7:19 UTC (permalink / raw)
  To: T.J. Mercier, Christian Brauner
  Cc: Albert Esteve, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Sumit Semwal, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, Paul Moore,
	James Morris, Serge E. Hallyn, Stephen Smalley, Ondrej Mosnacek,
	Shuah Khan, cgroups, linux-doc, linux-kernel, linux-media,
	dri-devel, linaro-mm-sig, linux-mm, linux-security-module,
	selinux, linux-kselftest, mripard, echanude
In-Reply-To: <CABdmKX0d6Zsg+_TxXjB80UZR23ZvXzxYoWzORgwmx=ZiuE+Nzw@mail.gmail.com>

On 5/15/26 19:06, T.J. Mercier wrote:
> On Fri, May 15, 2026 at 6:53 AM Christian Brauner <brauner@kernel.org> wrote:
>>
>> On Tue, May 12, 2026 at 11:10:44AM +0200, Albert Esteve wrote:
>>> On embedded platforms a central process often allocates dma-buf
>>> memory on behalf of client applications. Without a way to
>>> attribute the charge to the requesting client's cgroup, the
>>> cost lands on the allocator, making per-cgroup memory limits
>>> ineffective for the actual consumers.
>>>
>>> Add charge_pid_fd to struct dma_heap_allocation_data. When set to
>>
>> Please be aware that pidfds come in two flavors:
>>
>> thread-group pidfds and thread-specific pidfds. Make sure that your API
>> doesn't implicitly depend on this distinction not existing.
> 
> Hi Christian,
> 
> Memcg is not a controller that supports "thread mode" so all threads
> in a group should belong to the same memcg.

BTW: Exactly that is the requirement automotive has with their native context use case.

The use case is that you have a daemon with multiple threads, where each one is acting on behalf of some other process.

At the moment we basically say they are simply not using cgroups for that use case, but it would be really nice if we could handle that as well.

Summarizing the requirement of that use case: You need a different cgroup for each thread of a process.

Regards,
Christian.

> 
> Checking the flags from pidfd_get_pid would be the best way for an
> explicit check of the pidfd type?
> 
>>> a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
>>> memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
>>> inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
>>> the mem_accounting module parameter enabled, the buffer is charged
>>> to the allocator's own cgroup.
>>>
>>> Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
>>> system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
>>> page allocations. Keeping __GFP_ACCOUNT would charge the same pages
>>> twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
>>> all accounting through a single MEMCG_DMABUF path.
>>>
>>> Usage examples:
>>>
>>>   1. Central allocator charging to a client at allocation time.
>>>      The allocator knows the client's PID (e.g., from binder's
>>>      sender_pid) and uses pidfd to attribute the charge:
>>>
>>>        pid_t client_pid = txn->sender_pid;
>>>        int pidfd = pidfd_open(client_pid, 0);
>>>
>>>        struct dma_heap_allocation_data alloc = {
>>>            .len             = buffer_size,
>>>            .fd_flags        = O_RDWR | O_CLOEXEC,
>>>            .charge_pid_fd   = pidfd,
>>>        };
>>>        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
>>>        close(pidfd);
>>>        /* alloc.fd is now charged to client's cgroup */
>>>
>>>   2. Default allocation (no pidfd, mem_accounting=1).
>>>      When charge_pid_fd is not set and the mem_accounting module
>>>      parameter is enabled, the buffer is charged to the allocator's
>>>      own cgroup:
>>>
>>>        struct dma_heap_allocation_data alloc = {
>>>            .len      = buffer_size,
>>>            .fd_flags = O_RDWR | O_CLOEXEC,
>>>        };
>>>        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
>>>        /* charged to current process's cgroup */
>>>
>>> Current limitations:
>>>
>>>  - Single-owner model: a dma-buf carries one memcg charge regardless of
>>>    how many processes share it. Means only the first owner (and exporter)
>>>    of the shared buffer bears the charge.
>>>  - Only memcg accounting supported. While this makes sense for system
>>>    heap buffers, other heaps (e.g., CMA heaps) will require selectively
>>>    charging also for the dmem controller.
>>>
>>> Signed-off-by: Albert Esteve <aesteve@redhat.com>
>>> ---
>>>  Documentation/admin-guide/cgroup-v2.rst |  5 ++--
>>>  drivers/dma-buf/dma-buf.c               | 16 ++++---------
>>>  drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
>>>  drivers/dma-buf/heaps/system_heap.c     |  2 --
>>>  include/uapi/linux/dma-heap.h           |  6 +++++
>>>  5 files changed, 53 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>>> index 8bdbc2e866430..824d269531eb1 100644
>>> --- a/Documentation/admin-guide/cgroup-v2.rst
>>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>>> @@ -1636,8 +1636,9 @@ The following nested keys are defined.
>>>               structures.
>>>
>>>         dmabuf (npn)
>>> -             Amount of memory used for exported DMA buffers allocated by the cgroup.
>>> -             Stays with the allocating cgroup regardless of how the buffer is shared.
>>> +             Amount of memory used for exported DMA buffers allocated by or on
>>> +             behalf of the cgroup. Stays with the allocating cgroup regardless
>>> +             of how the buffer is shared.
>>>
>>>         workingset_refault_anon
>>>               Number of refaults of previously evicted anonymous pages.
>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>> index ce02377f48908..23fb758b78297 100644
>>> --- a/drivers/dma-buf/dma-buf.c
>>> +++ b/drivers/dma-buf/dma-buf.c
>>> @@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
>>>        */
>>>       BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
>>>
>>> -     mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
>>> -     mem_cgroup_put(dmabuf->memcg);
>>> +     if (dmabuf->memcg) {
>>> +             mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
>>> +                                       PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
>>> +             mem_cgroup_put(dmabuf->memcg);
>>> +     }
>>>
>>>       dmabuf->ops->release(dmabuf);
>>>
>>> @@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
>>>               dmabuf->resv = resv;
>>>       }
>>>
>>> -     dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
>>> -     if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
>>> -                                   GFP_KERNEL)) {
>>> -             ret = -ENOMEM;
>>> -             goto err_memcg;
>>> -     }
>>> -
>>>       file->private_data = dmabuf;
>>>       file->f_path.dentry->d_fsdata = dmabuf;
>>>       dmabuf->file = file;
>>> @@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
>>>
>>>       return dmabuf;
>>>
>>> -err_memcg:
>>> -     mem_cgroup_put(dmabuf->memcg);
>>>  err_file:
>>>       fput(file);
>>>  err_module:
>>> diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
>>> index ac5f8685a6494..ff6e259afcdc0 100644
>>> --- a/drivers/dma-buf/dma-heap.c
>>> +++ b/drivers/dma-buf/dma-heap.c
>>> @@ -7,13 +7,17 @@
>>>   */
>>>
>>>  #include <linux/cdev.h>
>>> +#include <linux/cgroup.h>
>>>  #include <linux/device.h>
>>>  #include <linux/dma-buf.h>
>>>  #include <linux/dma-heap.h>
>>> +#include <linux/memcontrol.h>
>>> +#include <linux/sched/mm.h>
>>>  #include <linux/err.h>
>>>  #include <linux/export.h>
>>>  #include <linux/list.h>
>>>  #include <linux/nospec.h>
>>> +#include <linux/pidfd.h>
>>>  #include <linux/syscalls.h>
>>>  #include <linux/uaccess.h>
>>>  #include <linux/xarray.h>
>>> @@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
>>>                "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
>>>
>>>  static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
>>> -                              u32 fd_flags,
>>> -                              u64 heap_flags)
>>> +                              u32 fd_flags, u64 heap_flags,
>>> +                              struct mem_cgroup *charge_to)
>>>  {
>>>       struct dma_buf *dmabuf;
>>> +     unsigned int nr_pages;
>>> +     struct mem_cgroup *memcg = charge_to;
>>>       int fd;
>>>
>>>       /*
>>> @@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
>>>       if (IS_ERR(dmabuf))
>>>               return PTR_ERR(dmabuf);
>>>
>>> +     nr_pages = len / PAGE_SIZE;
>>> +
>>> +     if (memcg)
>>> +             css_get(&memcg->css);
>>> +     else if (mem_accounting)
>>> +             memcg = get_mem_cgroup_from_mm(current->mm);
>>> +
>>> +     if (memcg) {
>>> +             if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
>>> +                     mem_cgroup_put(memcg);
>>> +                     dma_buf_put(dmabuf);
>>> +                     return -ENOMEM;
>>> +             }
>>> +             dmabuf->memcg = memcg;
>>> +     }
>>> +
>>>       fd = dma_buf_fd(dmabuf, fd_flags);
>>>       if (fd < 0) {
>>>               dma_buf_put(dmabuf);
>>> @@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
>>>  {
>>>       struct dma_heap_allocation_data *heap_allocation = data;
>>>       struct dma_heap *heap = file->private_data;
>>> +     struct mem_cgroup *memcg = NULL;
>>> +     struct task_struct *task;
>>> +     unsigned int pidfd_flags;
>>>       int fd;
>>>
>>>       if (heap_allocation->fd)
>>> @@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
>>>       if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
>>>               return -EINVAL;
>>>
>>> +     if (heap_allocation->charge_pid_fd) {
>>> +             task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
>>
>> Will always get a thread-group leader pidfd and will fail if this is a
>> thread-specific pidfd. pidfd_open(1234, PIDFD_THREAD) can be used to
>> open a thread-specific pidfd.
>>
>>> +             if (IS_ERR(task))
>>> +                     return PTR_ERR(task);
>>> +
>>> +             memcg = get_mem_cgroup_from_mm(task->mm);
>>> +             put_task_struct(task);
>>> +     }
>>> +
>>>       fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
>>>                                  heap_allocation->fd_flags,
>>> -                                heap_allocation->heap_flags);
>>> +                                heap_allocation->heap_flags,
>>> +                                memcg);
>>> +     mem_cgroup_put(memcg);
>>>       if (fd < 0)
>>>               return fd;
>>>
>>> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
>>> index 03c2b87cb1112..95d7688167b93 100644
>>> --- a/drivers/dma-buf/heaps/system_heap.c
>>> +++ b/drivers/dma-buf/heaps/system_heap.c
>>> @@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
>>>               if (max_order < orders[i])
>>>                       continue;
>>>               flags = order_flags[i];
>>> -             if (mem_accounting)
>>> -                     flags |= __GFP_ACCOUNT;
>>>               page = alloc_pages(flags, orders[i]);
>>>               if (!page)
>>>                       continue;
>>> diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
>>> index a4cf716a49fa6..e02b0f8cbc6a1 100644
>>> --- a/include/uapi/linux/dma-heap.h
>>> +++ b/include/uapi/linux/dma-heap.h
>>> @@ -29,6 +29,10 @@
>>>   *                   handle to the allocated dma-buf
>>>   * @fd_flags:                file descriptor flags used when allocating
>>>   * @heap_flags:              flags passed to heap
>>> + * @charge_pid_fd:   optional pidfd of the process whose cgroup should be
>>> + *                   charged for this allocation; 0 means charge the calling
>>> + *                   process's cgroup
>>> + * @__padding:               reserved, must be zero
>>>   *
>>>   * Provided by userspace as an argument to the ioctl
>>>   */
>>> @@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
>>>       __u32 fd;
>>>       __u32 fd_flags;
>>>       __u64 heap_flags;
>>> +     __u32 charge_pid_fd;
>>> +     __u32 __padding;
>>>  };
>>>
>>>  #define DMA_HEAP_IOC_MAGIC           'H'
>>>
>>> --
>>> 2.53.0
>>>


^ permalink raw reply

* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Song Liu @ 2026-05-18  6:31 UTC (permalink / raw)
  To: Paul Moore
  Cc: Sasha Levin, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAHC9VhTwDt2Bx8n0io9Qge_fUEnrHsxrFAQY+KaemKWqJqBQxw@mail.gmail.com>

On Thu, May 14, 2026 at 8:48 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Thu, May 7, 2026 at 3:05 AM Sasha Levin <sashal@kernel.org> wrote:
> >
> > When a (security) issue goes public, fleets stay exposed until a patched kernel
> > is built, distributed, and rebooted into.
> >
> > For many such issues the simplest mitigation is to stop calling the buggy
> > function. Killswitch provides that. An admin writes:
> >
> >     echo "engage af_alg_sendmsg -1" \
> >         > /sys/kernel/security/killswitch/control
> >
> > After this, af_alg_sendmsg() returns -EPERM on every call without
> > running its body. The mitigation takes effect immediately, and is dropped on
> > the next reboot.
> >
> > A lot of recent kernel issues sit in code paths most installs only have enabled
> > to support a relative minority of users: AF_ALG, ksmbd, nf_tables, vsock, ax25,
> > and friends.
> >
> > For most users, the cost of "this socket family stops working for the day" is
> > much smaller than the cost of running a known vulnerable kernel until the fix
> > land.
> >
> > Assisted-by: Claude:claude-opus-4-7
> > Signed-off-by: Sasha Levin <sashal@kernel.org>
> > ---
> >  Documentation/admin-guide/index.rst           |   1 +
> >  Documentation/admin-guide/killswitch.rst      | 159 ++++
> >  Documentation/admin-guide/tainted-kernels.rst |   8 +
> >  MAINTAINERS                                   |  11 +
> >  include/linux/killswitch.h                    |  19 +
> >  include/linux/panic.h                         |   3 +-
> >  init/Kconfig                                  |   2 +
> >  kernel/Kconfig.killswitch                     |  31 +
> >  kernel/Makefile                               |   1 +
> >  kernel/killswitch.c                           | 798 ++++++++++++++++++
> >  kernel/panic.c                                |   1 +
> >  lib/Kconfig.debug                             |  13 +
> >  lib/Makefile                                  |   1 +
> >  lib/test_killswitch.c                         |  85 ++
> >  tools/testing/selftests/Makefile              |   1 +
> >  tools/testing/selftests/killswitch/.gitignore |   1 +
> >  tools/testing/selftests/killswitch/Makefile   |   8 +
> >  .../selftests/killswitch/cve_31431_test.c     | 162 ++++
> >  .../selftests/killswitch/killswitch_test.sh   | 147 ++++
> >  19 files changed, 1451 insertions(+), 1 deletion(-)
> >  create mode 100644 Documentation/admin-guide/killswitch.rst
> >  create mode 100644 include/linux/killswitch.h
> >  create mode 100644 kernel/Kconfig.killswitch
> >  create mode 100644 kernel/killswitch.c
> >  create mode 100644 lib/test_killswitch.c
> >  create mode 100644 tools/testing/selftests/killswitch/.gitignore
> >  create mode 100644 tools/testing/selftests/killswitch/Makefile
> >  create mode 100644 tools/testing/selftests/killswitch/cve_31431_test.c
> >  create mode 100755 tools/testing/selftests/killswitch/killswitch_test.sh
>
> If we made Lockdown an LSM, we should probably also make killswitch an LSM.

I don't think killswitch can stack with other LSMs. In fact, killswitch
can be used to bypass other LSMs, for example:

echo engage security_file_open 0 > /sys/kernel/security/killswitch/control

will bypass all hooks on security_file_open.

Thanks,
Song

> For the LSM crowd who might be seeing this for the first time, the
> original thread can be found on lore via the link below:
> https://lore.kernel.org/all/20260507070547.2268452-1-sashal@kernel.org
>
> --
> paul-moore.com
>

^ permalink raw reply

