Linux Security Modules development

Linux Security Modules development
 help / color / mirror / Atom feed

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Barry Song @ 2026-05-18 22:43 UTC (permalink / raw)
  To: Albert Esteve
  Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CADSE00LjJcL8P5M-UPEpzZijU70uEmUirnin29N8YR5W5D-oFg@mail.gmail.com>

On Mon, May 18, 2026 at 8:16 PM Albert Esteve <aesteve@redhat.com> wrote:
>
> On Sat, May 16, 2026 at 9:37 AM Barry Song <baohua@kernel.org> wrote:
> >
> > On Tue, May 12, 2026 at 5:18 PM Albert Esteve <aesteve@redhat.com> wrote:
> > >
> > > On embedded platforms a central process often allocates dma-buf
> > > memory on behalf of client applications. Without a way to
> > > attribute the charge to the requesting client's cgroup, the
> > > cost lands on the allocator, making per-cgroup memory limits
> > > ineffective for the actual consumers.
> > >
> > > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > > the mem_accounting module parameter enabled, the buffer is charged
> > > to the allocator's own cgroup.
> > >
> > > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > > all accounting through a single MEMCG_DMABUF path.
> > >
> > [...]
> >
> > > -               if (mem_accounting)
> > > -                       flags |= __GFP_ACCOUNT;
> >
> > Hi Albert,
> >
> > would it be better to move this and its description to patch 1? It
> > looks like patch 1 already introduces the double accounting changes,
> > and patch 2 is mainly just supporting remote charging.
>
> Hi Barry,
>
> Thanks for looking into this series! Yes, in my head I was trying to
> keep patch 1, which was taken from a previous, different series, and
> then diverge from it starting with patch 2. This would clarify the
> difference between the two. But I can see it just added some confusion
> (for example, patch 1 charges on dma_buf_export() and then it is moved
> to dma_heap_buffer_alloc() in patch 2). I will reorganize it better
> for the next version, including your suggestion.

Yep, I understand the situation now. I also understand
that you were referring to T.J.'s patch, which caused
some back-and-forth confusion for readers when reading
patches 1 and 2.

>
> >
> > Also, mem_accounting is only used by system_heap.c; has this patchset
> > also eliminated its need?
>
> No, mem_accounting is still handled in this patch for the general case
> where no `charge_pid_fd` is used. See dma_heap_buffer_alloc() code:
>
> +       if (memcg)
> +               css_get(&memcg->css);
> +       else if (mem_accounting)
> +               memcg = get_mem_cgroup_from_mm(current->mm);

I see. What feels a bit odd to me is that mem_accounting
could either be dropped (with unconditional charging), or
it should cover both remote and local charge cases.

I don’t have a strong opinion here—it just feels a bit
strange, since its description is quite generic for memcg:

"Enable cgroup-based memory accounting for dma-buf heap
allocations (default=false)."

Best Regards
Barry

^ permalink raw reply

* Re: [Linaro-mm-sig] Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Barry Song @ 2026-05-18 23:00 UTC (permalink / raw)
  To: Christian König
  Cc: T.J. Mercier, Albert Esteve, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <cb84c2ee-9de1-4565-b2e0-60984721228f@amd.com>

On Mon, May 18, 2026 at 3:34 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/16/26 11:19, Barry Song wrote:
> > On Thu, May 14, 2026 at 12:35 AM T.J. Mercier <tjmercier@google.com> wrote:
> > [...]
> >>>> I have a question about this part. Albert I guess you are interested
> >>>> only in accounting dmabuf-heap allocations, or do you expect to add
> >>>> __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other
> >>>> non-dmabuf-heap exporters?
> >>>
> >>> We're scoping this to dma-buf heaps for now. CMA heaps and the dmem
> >>> controller are on the radar for follow-up/parallel work (there will be
> >>> dragons and will surely need discussion). For DRM and V4L2 the
> >>> long-term intent is migration to heaps, which would make direct
> >>> accounting on those paths unnecessary.
> >>
> >> Ah I see. GEM buffers exported to dmabufs are what I had in mind. I
> >> guess this would only leave the odd non-DRM driver with the need to
> >> add their own accounting calls, which I don't expect would be a big
> >> problem.
> >>
> >
> > sounds like we still have a long way to go to correctly account for
> > various v4l2, drm, GEM, CMA, etc. In patch 1, the charging is done in
> > dma_buf_export(), so I guess it covers all dma-buf types except
> > dma_heap, but the problem is that it has no remote charging support at
> > all?
>
> No, just the other way around
>
> DMA-buf heaps can be handled here because we know that it is pure system memory and nothing special so memcg always applies.
>
> dma_buf_export() on the other hand handles tons of different use cases, ranging from buffer accounted to dmem, over special resources which aren't even memory all the way to buffers which can migrate from dmem to memcg and back during their lifetime.
>

Hi Christian,

Thanks very much for your explanation. So basically it seems that
dma_buf_export() is not the proper place to charge, since it may end up
mixing in non-system-memory accounting?

My question is also about the global view for both heap and non-heap cases.
After reading the discussion, I’ve tried to summarize it—please let me know
if my understanding is correct.

for dma_heap, we have the ioctl DMA_HEAP_IOCTL_ALLOC, where users can pass a
remote pidfd or similar information to indicate where the dma-buf should be
charged, as in Albert's patchset.

For non-dma_heap dma-bufs, we don’t have an obvious userspace entry point that
triggers the allocation. So we likely need other approaches. We could either
move more drivers over to dma-heap, or introduce something like
DMA_BUF_IOCTL_XFER_CHARGE, as you are discussing, to let userspace explicitly
declare a charge.

Best Regards
Barry

^ permalink raw reply

* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Song Liu @ 2026-05-18 23:22 UTC (permalink / raw)
  To: Paul Moore
  Cc: Sasha Levin, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAHC9VhS1DJNs9gDB6gD9WKhL08giSVajBskZ+=mY0AWRCAsw7Q@mail.gmail.com>

On Mon, May 18, 2026 at 2:29 PM Paul Moore <paul@paul-moore.com> wrote:
[...]
> In my opinion, making killswitch an LSM is more of a procedural item
> that deals with how we view a capability like killswitch.  I
> personally view killswitch as somewhat similar to Lockdown, which is
> why I made the suggestion.
>
> The use of kprobes, while an interesting idea, presents problems as
> allowing any kernel symbol to be killed introduces the potential for
> security regressions.  As a reminder, some LSMs, as well as other
> kernel subsystems, have mechanisms in place to restrict root and/or
> enforce one-way configuration locks; while many people equate "root"
> with full control, in many cases today that is not strictly correct.
>
> Yes, kprobes have been around for some time, this is not a new
> problem, but killswitch makes it far more convenient and accessible to
> do dangerous things with kprobes.  If killswitch makes it past the RFC
> stage without any significant changes to its kill mechanism, we may
> need to start considering more liberal usage of NOKPROBE_SYMBOL()
> which I think would be an unfortunate casualty.

I don't think we can use NOKPROBE_SYMBOL(). There are functions
that we don't want to killswitch, but still want to trace.

Thanks,
Song

^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-18 23:39 UTC (permalink / raw)
  To: Christian König
  Cc: Albert Esteve, Christian Brauner, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
	Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <88efe10a-8b93-4a81-8279-4a5559d0f17c@amd.com>

On Mon, May 18, 2026 at 7:07 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/18/26 14:50, Albert Esteve wrote:
> > On Mon, May 18, 2026 at 9:20 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >> On 5/15/26 19:06, T.J. Mercier wrote:
> >>> On Fri, May 15, 2026 at 6:53 AM Christian Brauner <brauner@kernel.org> wrote:
> >>>>
> >>>> On Tue, May 12, 2026 at 11:10:44AM +0200, Albert Esteve wrote:
> >>>>> On embedded platforms a central process often allocates dma-buf
> >>>>> memory on behalf of client applications. Without a way to
> >>>>> attribute the charge to the requesting client's cgroup, the
> >>>>> cost lands on the allocator, making per-cgroup memory limits
> >>>>> ineffective for the actual consumers.
> >>>>>
> >>>>> Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> >>>>
> >>>> Please be aware that pidfds come in two flavors:
> >>>>
> >>>> thread-group pidfds and thread-specific pidfds. Make sure that your API
> >>>> doesn't implicitly depend on this distinction not existing.
> >>>
> >>> Hi Christian,
> >>>
> >>> Memcg is not a controller that supports "thread mode" so all threads
> >>> in a group should belong to the same memcg.
> >>
> >> BTW: Exactly that is the requirement automotive has with their native context use case.
> >>
> >> The use case is that you have a deamon which has multiple threads were each one is acting on behalve of some other process.
> >>
> >> At the moment we basically say they are simply not using cgroups for that use case, but it would be really nice if we could handle that as well.
> >>
> >> Summarizing the requirement of that use case: You need a different cgroup for each thread of a process.
> >
> > Hi Christian,
> >
> > Thanks for sharing this atuomotive usecase. If I understand correctly,
> > the actual requirement is attributing dma-buf charges to the right
> > client, not putting each daemon thread in a different cgroup?
>
> Nope, exactly that's the difference.
>
> The thread acts as a filtering agent for both memory allocation and command submission for somebody else, the process on which behalve the daemon does things can even be in a client VM, completely remote over some network or even something like a microcontroller.
>
> Everything the thread does regarding CPU time, GPU driver memory allocation as well as resources like GPU processing and I/O time etc.. needs to be accounted to one client which can be different for each thread of the process.
>
> The only thing which is shared with the main process thread is CPU memory resources, e.g. malloc() because that is basically just needed for housekeeping and pretty much irrelevant for this kind of use case.
>
> The problem is now you can't do that with cgroups at the moment but unfortunately only the kernel has the information you need to know to do this.
>
> So what you end up with is to define tons of interfaces just to get the necessary information from the kernel into userspace and then essentially duplicate the same infrastructure cgroup provides in the kernel in userspace again.
>
> > If so,
> > the `charge_pid_fd` approach achieves this directly by passing the
> > client's `pid_fd`, without needing to add per-thread cgroup
> > infrastructure.
>
> Well it's already a massive improvemt, we could basically stop doing the whole duplication part for the GPU driver stack and just use cgroups for this part.
>
> Doing that automatically for CPU and I/O time would just be nice to have additionally.
>
> Regards,
> Christian.

Hopefully I'm following correctly here.... So you are duplicating the
GPU driver stack to achieve remote accounting on a per-thread basis?
Does this mean for GPU allocations you currently have some GFP_ACCOUNT
magic in your driver to attribute GPU memory to the correct remote
client? So this series would close the gap for dma-buf allocations,
but what about private GPU driver memory allocated on behalf of a
client?

^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-18 23:39 UTC (permalink / raw)
  To: Barry Song
  Cc: Christian König, Albert Esteve, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CAGsJ_4y=Gsv=FSUjJ5+99Gg6ULUnv0LRexCGOGetzChR3YA44Q@mail.gmail.com>

On Mon, May 18, 2026 at 3:19 PM Barry Song <baohua@kernel.org> wrote:
>
> On Tue, May 19, 2026 at 5:17 AM T.J. Mercier <tjmercier@google.com> wrote:
> [...]
> > > > > Yeah I think this might work. I know of 3 cases, and it trivially
> > > > > solves the first two. The third requires some work on our end to
> > > > > extend our userspace interfaces to include the pidfd but it seems
> > > > > doable. I'm checking with our graphics folks.
> > > > >
> > > > > 1) Direct allocation from user (e.g. app -> allocation ioctl on
> > > > > /dev/dma_heap/foo)
> > > > > No changes required to userspace. mem_accounting=1 charges the app.
> > > > >
> > > > > 2) Single hop remote allocation (e.g. app -> AHardwareBuffer_allocate
> > > > > -> gralloc)
> > > > > gralloc has the caller's pid as described in the commit message. Open
> > > > > a pidfd and pass it in the dma_heap_allocation_data.
> > > > >
> > > > > 3) Double hop remote allocation (e.g. app -> dequeueBuffer ->
> > > > > SurfaceFlinger -> gralloc)
> > > > > In this case gralloc knows SurfaceFlinger's pid, but not the app's. So
> > > > > we need to add the app's pidfd to the SurfaceFlinger -> gralloc
> > > > > interface, or transfer the memcg charge from SurfaceFlinger to the app
> > > > > after the allocation.
> > > > > It'd be nice to avoid the charge transfer option entirely, but if we
> > > > > need it that doesn't seem so bad in this case because it's a bulk
> > > > > charge for the entire dmabuf rather than per-page. So the exporter
> > > > > doesn't need to get involved (we wouldn't need a new dma_buf_op) and
> > > > > we wouldn't have to worry about looping and locking for each page.
> > > > >
> > > >
> > > > Hi T.J.,
> > > >
> > > > Your description of the three different cases sounds very interesting.
> > > > It helps me understand how difficult it can be to correctly charge
> > > > dma-buf in the current user scenarios.
> > > >
> > > > I’m wondering where I can find Android userspace code that transfers
> > > > the PID of RPC callers. Do we have any existing sample code in Android
> > > > for this?
> > >
> > > Hi Barry,
> > >
> > > In Java android.os.Binder.getCallingPid() will provide it. Here
> >
> > ... let me try again
> >
> > Here are some examples from the framework code:
> >
> > https://cs.android.com/search?q=getCallingPid%20f:ActivityManager&sq=&ss=android%2Fplatform%2Fsuperproject
> >
> > In native code we have AIBinder_getCallingPid and
> > android::IPCThreadState::self()->getCallingPid() (or
> > android::hardware::IPCThreadState::self()->getCallingPid() for HIDL)
> >
> > https://cs.android.com/search?q=getCallingPid%20l:cpp%20-f:prebuilt&ss=android%2Fplatform%2Fsuperproject
>
> Thanks very much, T.J. That is very helpful. I guess
> that would require user space to understand the RPC
> procedure, including single-hop and two-hop cases, and
> make the corresponding changes.

Yes, this is solvable by having a policy in allocator services where
the caller is implicitly charged, while also supporting cases where
the RPC includes additional explicit information about who to charge.
This needs security checks to prevent arbitrary remote charges at both
the ioctl() level (selinux charge_to from patch 4), and at the RPC
level (not sure yet but maybe a private interface between system
components and gralloc), so that only privileged components can
initiate remote charges.

> You pointed out the SurfaceFlinger cases, which are
> two hops. It seems that AI models are also using
> dma_heap, at least from what I have observed on MTK
> and Qualcomm phones. Likely, we need to understand
> those RPC relationships in userspace and make the
> corresponding changes.
> I assume AI models are a single-hop case?

It's currently a mix because AI model loading is largely controlled by
vendor code right now. Some implementations use
AHardwareBuffer_allocate, but that comes with unnecessary RPC overhead
for the AI use case. So I think we should be trending towards direct
allocations from dma-buf heaps because model loading time is
important.

^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-18 23:39 UTC (permalink / raw)
  To: Barry Song
  Cc: Albert Esteve, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Sumit Semwal, Christian König,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CAGsJ_4xwJ7SAhKPJyRtMTw6psTO7H1EcFFpDw0po1W8PX4FE8g@mail.gmail.com>

On Mon, May 18, 2026 at 3:43 PM Barry Song <baohua@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 8:16 PM Albert Esteve <aesteve@redhat.com> wrote:
> >
> > On Sat, May 16, 2026 at 9:37 AM Barry Song <baohua@kernel.org> wrote:
> > >
> > > On Tue, May 12, 2026 at 5:18 PM Albert Esteve <aesteve@redhat.com> wrote:
> > > >
> > > > On embedded platforms a central process often allocates dma-buf
> > > > memory on behalf of client applications. Without a way to
> > > > attribute the charge to the requesting client's cgroup, the
> > > > cost lands on the allocator, making per-cgroup memory limits
> > > > ineffective for the actual consumers.
> > > >
> > > > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > > > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > > > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > > > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > > > the mem_accounting module parameter enabled, the buffer is charged
> > > > to the allocator's own cgroup.
> > > >
> > > > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > > > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > > > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > > > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > > > all accounting through a single MEMCG_DMABUF path.
> > > >
> > > [...]
> > >
> > > > -               if (mem_accounting)
> > > > -                       flags |= __GFP_ACCOUNT;
> > >
> > > Hi Albert,
> > >
> > > would it be better to move this and its description to patch 1? It
> > > looks like patch 1 already introduces the double accounting changes,
> > > and patch 2 is mainly just supporting remote charging.
> >
> > Hi Barry,
> >
> > Thanks for looking into this series! Yes, in my head I was trying to
> > keep patch 1, which was taken from a previous, different series, and
> > then diverge from it starting with patch 2. This would clarify the
> > difference between the two. But I can see it just added some confusion
> > (for example, patch 1 charges on dma_buf_export() and then it is moved
> > to dma_heap_buffer_alloc() in patch 2). I will reorganize it better
> > for the next version, including your suggestion.
>
> Yep, I understand the situation now. I also understand
> that you were referring to T.J.'s patch, which caused
> some back-and-forth confusion for readers when reading
> patches 1 and 2.

Albert, please don't feel obligated to keep my patch intact if
integrating it into other patches simplifies the series.

> > > Also, mem_accounting is only used by system_heap.c; has this patchset
> > > also eliminated its need?
> >
> > No, mem_accounting is still handled in this patch for the general case
> > where no `charge_pid_fd` is used. See dma_heap_buffer_alloc() code:
> >
> > +       if (memcg)
> > +               css_get(&memcg->css);
> > +       else if (mem_accounting)
> > +               memcg = get_mem_cgroup_from_mm(current->mm);
>
> I see. What feels a bit odd to me is that mem_accounting
> could either be dropped (with unconditional charging), or
> it should cover both remote and local charge cases.
>
> I don’t have a strong opinion here—it just feels a bit
> strange, since its description is quite generic for memcg:
>
> "Enable cgroup-based memory accounting for dma-buf heap
> allocations (default=false)."
>
> Best Regards
> Barry

^ permalink raw reply

* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Paul Moore @ 2026-05-18 23:57 UTC (permalink / raw)
  To: Song Liu
  Cc: Sasha Levin, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAPhsuW5jQOzRTi1ea+=UPhx5W9bkBdivPagRE=O=nx0zf_vb8w@mail.gmail.com>

On Mon, May 18, 2026 at 7:23 PM Song Liu <song@kernel.org> wrote:
> On Mon, May 18, 2026 at 2:29 PM Paul Moore <paul@paul-moore.com> wrote:
> [...]
> > In my opinion, making killswitch an LSM is more of a procedural item
> > that deals with how we view a capability like killswitch.  I
> > personally view killswitch as somewhat similar to Lockdown, which is
> > why I made the suggestion.
> >
> > The use of kprobes, while an interesting idea, presents problems as
> > allowing any kernel symbol to be killed introduces the potential for
> > security regressions.  As a reminder, some LSMs, as well as other
> > kernel subsystems, have mechanisms in place to restrict root and/or
> > enforce one-way configuration locks; while many people equate "root"
> > with full control, in many cases today that is not strictly correct.
> >
> > Yes, kprobes have been around for some time, this is not a new
> > problem, but killswitch makes it far more convenient and accessible to
> > do dangerous things with kprobes.  If killswitch makes it past the RFC
> > stage without any significant changes to its kill mechanism, we may
> > need to start considering more liberal usage of NOKPROBE_SYMBOL()
> > which I think would be an unfortunate casualty.
>
> I don't think we can use NOKPROBE_SYMBOL(). There are functions
> that we don't want to killswitch, but still want to trace.

That was exactly my point, but we need to figure something out so
killswitch doesn't make it easier to cause a regression.

-- 
paul-moore.com

^ permalink raw reply

* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Song Liu @ 2026-05-19  0:01 UTC (permalink / raw)
  To: Paul Moore
  Cc: Sasha Levin, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAHC9VhTGDOJZDzEA31qc0S6FJpYNC4oD__HQ-65wqqwhng=V9Q@mail.gmail.com>

On Mon, May 18, 2026 at 4:57 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Mon, May 18, 2026 at 7:23 PM Song Liu <song@kernel.org> wrote:
> > On Mon, May 18, 2026 at 2:29 PM Paul Moore <paul@paul-moore.com> wrote:
> > [...]
> > > In my opinion, making killswitch an LSM is more of a procedural item
> > > that deals with how we view a capability like killswitch.  I
> > > personally view killswitch as somewhat similar to Lockdown, which is
> > > why I made the suggestion.
> > >
> > > The use of kprobes, while an interesting idea, presents problems as
> > > allowing any kernel symbol to be killed introduces the potential for
> > > security regressions.  As a reminder, some LSMs, as well as other
> > > kernel subsystems, have mechanisms in place to restrict root and/or
> > > enforce one-way configuration locks; while many people equate "root"
> > > with full control, in many cases today that is not strictly correct.
> > >
> > > Yes, kprobes have been around for some time, this is not a new
> > > problem, but killswitch makes it far more convenient and accessible to
> > > do dangerous things with kprobes.  If killswitch makes it past the RFC
> > > stage without any significant changes to its kill mechanism, we may
> > > need to start considering more liberal usage of NOKPROBE_SYMBOL()
> > > which I think would be an unfortunate casualty.
> >
> > I don't think we can use NOKPROBE_SYMBOL(). There are functions
> > that we don't want to killswitch, but still want to trace.
>
> That was exactly my point, but we need to figure something out so
> killswitch doesn't make it easier to cause a regression.

killswitch is making it easier to fix a CVE. It can surely make it easier
to cause a regression. AFAICT, the only protection here is "it is only
for root".

Thanks,
Song

^ permalink raw reply

* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Sasha Levin @ 2026-05-19  0:21 UTC (permalink / raw)
  To: Paul Moore
  Cc: Song Liu, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAHC9VhTGDOJZDzEA31qc0S6FJpYNC4oD__HQ-65wqqwhng=V9Q@mail.gmail.com>

On Mon, May 18, 2026 at 07:57:37PM -0400, Paul Moore wrote:
>On Mon, May 18, 2026 at 7:23 PM Song Liu <song@kernel.org> wrote:
>> On Mon, May 18, 2026 at 2:29 PM Paul Moore <paul@paul-moore.com> wrote:
>> [...]
>> > In my opinion, making killswitch an LSM is more of a procedural item
>> > that deals with how we view a capability like killswitch.  I
>> > personally view killswitch as somewhat similar to Lockdown, which is
>> > why I made the suggestion.
>> >
>> > The use of kprobes, while an interesting idea, presents problems as
>> > allowing any kernel symbol to be killed introduces the potential for
>> > security regressions.  As a reminder, some LSMs, as well as other
>> > kernel subsystems, have mechanisms in place to restrict root and/or
>> > enforce one-way configuration locks; while many people equate "root"
>> > with full control, in many cases today that is not strictly correct.
>> >
>> > Yes, kprobes have been around for some time, this is not a new
>> > problem, but killswitch makes it far more convenient and accessible to
>> > do dangerous things with kprobes.  If killswitch makes it past the RFC
>> > stage without any significant changes to its kill mechanism, we may
>> > need to start considering more liberal usage of NOKPROBE_SYMBOL()
>> > which I think would be an unfortunate casualty.
>>
>> I don't think we can use NOKPROBE_SYMBOL(). There are functions
>> that we don't want to killswitch, but still want to trace.
>
>That was exactly my point, but we need to figure something out so
>killswitch doesn't make it easier to cause a regression.

At this point it sounds like you're trying to stop root from shooting itself in
the foot, which we always avoided doing.

killswitch doesn't add any new functionality in that regard: rood can always
load a module, write to /dev/mem, etc which would allow it to do the same
thing. It's just a more convenient way to do it.

-- 
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Sasha Levin @ 2026-05-19  0:31 UTC (permalink / raw)
  To: Paul Moore
  Cc: Song Liu, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAHC9VhS1DJNs9gDB6gD9WKhL08giSVajBskZ+=mY0AWRCAsw7Q@mail.gmail.com>

On Mon, May 18, 2026 at 05:29:32PM -0400, Paul Moore wrote:
>From my perspective there are two different issues here: should
>killswitch be a LSM, and should killswitch leverage kprobes to be able
>to "kill" security related symbols.  After all, are we okay with
>killswitch killing capable() and friends?

killswitch doesn't do it on it's own. It may be instructed by root to do that,
at which point that is root's problem.

>In my opinion, making killswitch an LSM is more of a procedural item
>that deals with how we view a capability like killswitch.  I
>personally view killswitch as somewhat similar to Lockdown, which is
>why I made the suggestion.

Maybe I'm not all that familiar with LSMs, but we would need to be able to stop
"random" code paths from executing, and I don't think we can create LSM hooks
at that granularity, no?

>The use of kprobes, while an interesting idea, presents problems as
>allowing any kernel symbol to be killed introduces the potential for
>security regressions.  As a reminder, some LSMs, as well as other
>kernel subsystems, have mechanisms in place to restrict root and/or
>enforce one-way configuration locks; while many people equate "root"
>with full control, in many cases today that is not strictly correct.

killswitch "complies" with lockdown. Is there a different scenario which we
should be blocking?

>Yes, kprobes have been around for some time, this is not a new
>problem, but killswitch makes it far more convenient and accessible to
>do dangerous things with kprobes.  If killswitch makes it past the RFC
>stage without any significant changes to its kill mechanism, we may
>need to start considering more liberal usage of NOKPROBE_SYMBOL()
>which I think would be an unfortunate casualty.

Why? If I don't really mind the security impact, I want to be able to have a
killswitch-like interface on my systems. If an attacker is in my systems,
killswitch is the least of my concerns I think.

If you are security concious, just don't enable CONFIG_KILLSWITCH?

-- 
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Paul Moore @ 2026-05-19  2:55 UTC (permalink / raw)
  To: Song Liu
  Cc: Sasha Levin, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAPhsuW7Js0Z6tU30RphbUjWsJXETkxaGOArfGZpDjNPmZFUpuQ@mail.gmail.com>

On Mon, May 18, 2026 at 8:01 PM Song Liu <song@kernel.org> wrote:
> On Mon, May 18, 2026 at 4:57 PM Paul Moore <paul@paul-moore.com> wrote:
> > On Mon, May 18, 2026 at 7:23 PM Song Liu <song@kernel.org> wrote:
> > > On Mon, May 18, 2026 at 2:29 PM Paul Moore <paul@paul-moore.com> wrote:
> > > [...]
> > > > In my opinion, making killswitch an LSM is more of a procedural item
> > > > that deals with how we view a capability like killswitch.  I
> > > > personally view killswitch as somewhat similar to Lockdown, which is
> > > > why I made the suggestion.
> > > >
> > > > The use of kprobes, while an interesting idea, presents problems as
> > > > allowing any kernel symbol to be killed introduces the potential for
> > > > security regressions.  As a reminder, some LSMs, as well as other
> > > > kernel subsystems, have mechanisms in place to restrict root and/or
> > > > enforce one-way configuration locks; while many people equate "root"
> > > > with full control, in many cases today that is not strictly correct.
> > > >
> > > > Yes, kprobes have been around for some time, this is not a new
> > > > problem, but killswitch makes it far more convenient and accessible to
> > > > do dangerous things with kprobes.  If killswitch makes it past the RFC
> > > > stage without any significant changes to its kill mechanism, we may
> > > > need to start considering more liberal usage of NOKPROBE_SYMBOL()
> > > > which I think would be an unfortunate casualty.
> > >
> > > I don't think we can use NOKPROBE_SYMBOL(). There are functions
> > > that we don't want to killswitch, but still want to trace.
> >
> > That was exactly my point, but we need to figure something out so
> > killswitch doesn't make it easier to cause a regression.
>
> killswitch is making it easier to fix a CVE. It can surely make it easier
> to cause a regression. AFAICT, the only protection here is "it is only
> for root".

As I mentioned earlier, several LSMs have the ability to restrict root
beyond what is possible with traditional Linux accesscontrols.  For
example, with SELinux one could deny root a specific privilege while
also blocking changes to the SELinux policy; the root user would not
be able to restore that privilege without rebooting the system.

On a killswitch enabled system the ability to restrict root is lost as
root would be able to kill the enforcement of those access controls.
Presumably one could have the LSM block access to killswitch in this
particular case, but that defeats the purpose doesn't it?

The audit subsystem also has a somewhat similar one-way configuration
lock, which when set does not allow even root to unlock it, a reboot
is required.  By a bit of luck with regards to how the code is
written*, it may not be vulnerable to a killswitch regression but I do
wonder if there are other similar things in the kernel which would
have the same type of problem.

* This is probably the first time I think I've ever considered myself
lucky with respect to the audit code implementation.

-- 
paul-moore.com

^ permalink raw reply

* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Paul Moore @ 2026-05-19  3:08 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Song Liu, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <aguvV8QCxK28ZHct@laps>

On Mon, May 18, 2026 at 8:31 PM Sasha Levin <sashal@kernel.org> wrote:
> On Mon, May 18, 2026 at 05:29:32PM -0400, Paul Moore wrote:
> >From my perspective there are two different issues here: should
> >killswitch be a LSM, and should killswitch leverage kprobes to be able
> >to "kill" security related symbols.  After all, are we okay with
> >killswitch killing capable() and friends?
>
> killswitch doesn't do it on it's own. It may be instructed by root to do that,
> at which point that is root's problem.

As I mentioned previously, there are cases where we can restrict
root's privileges today, but a functional killswitch would allow that
restriction to be bypassed.  My last email to Song has an example with
SELinux.

> >In my opinion, making killswitch an LSM is more of a procedural item
> >that deals with how we view a capability like killswitch.  I
> >personally view killswitch as somewhat similar to Lockdown, which is
> >why I made the suggestion.
>
> Maybe I'm not all that familiar with LSMs, but we would need to be able to stop
> "random" code paths from executing, and I don't think we can create LSM hooks
> at that granularity, no?

I don't see any LSM hooks in this revision of killswitch, and as long
as it is based on a kprobes I can't imagine it would ever use any.  As
I mentioned above, my killswitch-as-a-LSM comment is primarily about
killswitch filling a role very similar to Lockdown.

> >The use of kprobes, while an interesting idea, presents problems as
> >allowing any kernel symbol to be killed introduces the potential for
> >security regressions.  As a reminder, some LSMs, as well as other
> >kernel subsystems, have mechanisms in place to restrict root and/or
> >enforce one-way configuration locks; while many people equate "root"
> >with full control, in many cases today that is not strictly correct.
>
> killswitch "complies" with lockdown. Is there a different scenario which we
> should be blocking?

See the SELinux example I mentioned in my email to Song.

> >Yes, kprobes have been around for some time, this is not a new
> >problem, but killswitch makes it far more convenient and accessible to
> >do dangerous things with kprobes.  If killswitch makes it past the RFC
> >stage without any significant changes to its kill mechanism, we may
> >need to start considering more liberal usage of NOKPROBE_SYMBOL()
> >which I think would be an unfortunate casualty.
>
> Why? If I don't really mind the security impact, I want to be able to have a
> killswitch-like interface on my systems. If an attacker is in my systems,
> killswitch is the least of my concerns I think.
>
> If you are security concious, just don't enable CONFIG_KILLSWITCH?

Isn't the whole point of killswitch to have it enabled everywhere
because you never know when you might want/need it?

-- 
paul-moore.com

^ permalink raw reply

* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Song Liu @ 2026-05-19  5:29 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Paul Moore, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <aguvV8QCxK28ZHct@laps>

On Mon, May 18, 2026 at 5:31 PM Sasha Levin <sashal@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 05:29:32PM -0400, Paul Moore wrote:
> >From my perspective there are two different issues here: should
> >killswitch be a LSM, and should killswitch leverage kprobes to be able
> >to "kill" security related symbols.  After all, are we okay with
> >killswitch killing capable() and friends?
>
> killswitch doesn't do it on it's own. It may be instructed by root to do that,
> at which point that is root's problem.
>
> >In my opinion, making killswitch an LSM is more of a procedural item
> >that deals with how we view a capability like killswitch.  I
> >personally view killswitch as somewhat similar to Lockdown, which is
> >why I made the suggestion.
>
> Maybe I'm not all that familiar with LSMs, but we would need to be able to stop
> "random" code paths from executing, and I don't think we can create LSM hooks
> at that granularity, no?

There are much fewer LSM hooks than ftrace-able (killswitch-able)
functions. In this sense, killswitch is more granular. However, LSM
hooks allow LSM policies to make different decisions for different
arguments. In this sense, LSM hooks are more granular than
killswitch, as killswitch can only set a fixed return value for each
engaged function.

With current LSM solutions, we can mitigate issues like Copy Fail
without breaking other features of the system. In [1], Cloudflare
shared how they mitigate Copy Fail with BPF LSM.

Thanks,
Song

[1] https://blog.cloudflare.com/copy-fail-linux-vulnerability-mitigation/

^ permalink raw reply

* Re: [Linaro-mm-sig] Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Christian König @ 2026-05-19  7:09 UTC (permalink / raw)
  To: Barry Song
  Cc: T.J. Mercier, Albert Esteve, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CAGsJ_4z121v4tK_3+j-hkD7HH0gH3w8tWD8nk0CwRhFE5T+4Og@mail.gmail.com>

On 5/19/26 01:00, Barry Song wrote:
> On Mon, May 18, 2026 at 3:34 PM Christian König
> <christian.koenig@amd.com> wrote:
>>
>> On 5/16/26 11:19, Barry Song wrote:
>>> On Thu, May 14, 2026 at 12:35 AM T.J. Mercier <tjmercier@google.com> wrote:
>>> [...]
>>>>>> I have a question about this part. Albert I guess you are interested
>>>>>> only in accounting dmabuf-heap allocations, or do you expect to add
>>>>>> __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other
>>>>>> non-dmabuf-heap exporters?
>>>>>
>>>>> We're scoping this to dma-buf heaps for now. CMA heaps and the dmem
>>>>> controller are on the radar for follow-up/parallel work (there will be
>>>>> dragons and will surely need discussion). For DRM and V4L2 the
>>>>> long-term intent is migration to heaps, which would make direct
>>>>> accounting on those paths unnecessary.
>>>>
>>>> Ah I see. GEM buffers exported to dmabufs are what I had in mind. I
>>>> guess this would only leave the odd non-DRM driver with the need to
>>>> add their own accounting calls, which I don't expect would be a big
>>>> problem.
>>>>
>>>
>>> sounds like we still have a long way to go to correctly account for
>>> various v4l2, drm, GEM, CMA, etc. In patch 1, the charging is done in
>>> dma_buf_export(), so I guess it covers all dma-buf types except
>>> dma_heap, but the problem is that it has no remote charging support at
>>> all?
>>
>> No, just the other way around
>>
>> DMA-buf heaps can be handled here because we know that it is pure system memory and nothing special so memcg always applies.
>>
>> dma_buf_export() on the other hand handles tons of different use cases, ranging from buffer accounted to dmem, over special resources which aren't even memory all the way to buffers which can migrate from dmem to memcg and back during their lifetime.
>>
> 
> Hi Christian,
> 
> Thanks very much for your explanation. So basically it seems that
> dma_buf_export() is not the proper place to charge, since it may end up
> mixing in non-system-memory accounting?

Yes, exactly that.

> My question is also about the global view for both heap and non-heap cases.
> After reading the discussion, I’ve tried to summarize it—please let me know
> if my understanding is correct.
> 
> for dma_heap, we have the ioctl DMA_HEAP_IOCTL_ALLOC, where users can pass a
> remote pidfd or similar information to indicate where the dma-buf should be
> charged, as in Albert's patchset.

Well that's the current proposal, but I think we need to come up with something more general.

> For non-dma_heap dma-bufs, we don’t have an obvious userspace entry point that
> triggers the allocation. So we likely need other approaches. We could either
> move more drivers over to dma-heap, or introduce something like
> DMA_BUF_IOCTL_XFER_CHARGE, as you are discussing, to let userspace explicitly
> declare a charge.

Yeah but that's not only for DMA-buf, we need that for file descriptors returned by memfd_create() as well.

Regards,
Christian.

> Best Regards
> Barry


^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Christian König @ 2026-05-19  7:19 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Albert Esteve, Christian Brauner, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
	Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CABdmKX3yZubjDKbVqwrjHAiKyj_ioHzOoxd0wzFbJK=PAGOqcQ@mail.gmail.com>

On 5/19/26 01:39, T.J. Mercier wrote:
> On Mon, May 18, 2026 at 7:07 AM Christian König
> <christian.koenig@amd.com> wrote:
>>
>> On 5/18/26 14:50, Albert Esteve wrote:
>>> On Mon, May 18, 2026 at 9:20 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>>
>>>> On 5/15/26 19:06, T.J. Mercier wrote:
>>>>> On Fri, May 15, 2026 at 6:53 AM Christian Brauner <brauner@kernel.org> wrote:
>>>>>>
>>>>>> On Tue, May 12, 2026 at 11:10:44AM +0200, Albert Esteve wrote:
>>>>>>> On embedded platforms a central process often allocates dma-buf
>>>>>>> memory on behalf of client applications. Without a way to
>>>>>>> attribute the charge to the requesting client's cgroup, the
>>>>>>> cost lands on the allocator, making per-cgroup memory limits
>>>>>>> ineffective for the actual consumers.
>>>>>>>
>>>>>>> Add charge_pid_fd to struct dma_heap_allocation_data. When set to
>>>>>>
>>>>>> Please be aware that pidfds come in two flavors:
>>>>>>
>>>>>> thread-group pidfds and thread-specific pidfds. Make sure that your API
>>>>>> doesn't implicitly depend on this distinction not existing.
>>>>>
>>>>> Hi Christian,
>>>>>
>>>>> Memcg is not a controller that supports "thread mode" so all threads
>>>>> in a group should belong to the same memcg.
>>>>
>>>> BTW: Exactly that is the requirement automotive has with their native context use case.
>>>>
>>>> The use case is that you have a deamon which has multiple threads were each one is acting on behalve of some other process.
>>>>
>>>> At the moment we basically say they are simply not using cgroups for that use case, but it would be really nice if we could handle that as well.
>>>>
>>>> Summarizing the requirement of that use case: You need a different cgroup for each thread of a process.
>>>
>>> Hi Christian,
>>>
>>> Thanks for sharing this atuomotive usecase. If I understand correctly,
>>> the actual requirement is attributing dma-buf charges to the right
>>> client, not putting each daemon thread in a different cgroup?
>>
>> Nope, exactly that's the difference.
>>
>> The thread acts as a filtering agent for both memory allocation and command submission for somebody else, the process on which behalve the daemon does things can even be in a client VM, completely remote over some network or even something like a microcontroller.
>>
>> Everything the thread does regarding CPU time, GPU driver memory allocation as well as resources like GPU processing and I/O time etc.. needs to be accounted to one client which can be different for each thread of the process.
>>
>> The only thing which is shared with the main process thread is CPU memory resources, e.g. malloc() because that is basically just needed for housekeeping and pretty much irrelevant for this kind of use case.
>>
>> The problem is now you can't do that with cgroups at the moment but unfortunately only the kernel has the information you need to know to do this.
>>
>> So what you end up with is to define tons of interfaces just to get the necessary information from the kernel into userspace and then essentially duplicate the same infrastructure cgroup provides in the kernel in userspace again.
>>
>>> If so,
>>> the `charge_pid_fd` approach achieves this directly by passing the
>>> client's `pid_fd`, without needing to add per-thread cgroup
>>> infrastructure.
>>
>> Well it's already a massive improvemt, we could basically stop doing the whole duplication part for the GPU driver stack and just use cgroups for this part.
>>
>> Doing that automatically for CPU and I/O time would just be nice to have additionally.
>>
>> Regards,
>> Christian.
> 
> Hopefully I'm following correctly here.... So you are duplicating the
> GPU driver stack to achieve remote accounting on a per-thread basis?

Not quite, we are duplicating the handling cgroup provides in the kernel in userspace.

For this memory usage information as well as execution times of the GPU kernel driver is exposed in fdinfo for example.

> Does this mean for GPU allocations you currently have some GFP_ACCOUNT
> magic in your driver to attribute GPU memory to the correct remote
> client?

No, we just expose what the kernel driver has allocated for itself. E.g. page tables, buffers etc...

When userspace allocates something using memfd_create() for example we just ignore that. 

> So this series would close the gap for dma-buf allocations,
> but what about private GPU driver memory allocated on behalf of a
> client?

Well we would need a cgroup which isn't associated with any process were we could charge the GPU driver allocations against.

But good point, charging against a pid wouldn't work in this use case.

Regards,
Christian.

^ permalink raw reply

* Re: [Linaro-mm-sig] Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Christian König @ 2026-05-19  7:53 UTC (permalink / raw)
  To: Albert Esteve
  Cc: Barry Song, T.J. Mercier, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CADSE00Lc42s2bzXzV5D7t1Enf56u4BVj-yXLp3Yxhm0=qMPvuw@mail.gmail.com>

On 5/18/26 14:06, Albert Esteve wrote:
>>>>> udmabufs are already
>>>>> memcg-charged, so adding a separate MEMCG_DMABUF would double count.
>>>>> Are there any other exporters you had in mind that would benefit from
>>>>> this approach?
>>
>> Well apart from DMA-buf memfd_create() is one of the things which as broken our neck in the past a couple of times.
>>
>> But thinking more about it what if instead of making this DMA-buf heaps specific what if we have a general cgroups function which allows to change accounting of a buffer referenced by a file descriptor to a different process?
>>
>> That would cover not only the DMA-buf heaps use case, but also all other DMA-buf with dmem and whatever we come up in the future as well.
> 
> I removed a draft adding an ioctl for charge transfer from the series
> before sending because I wanted to focus on the charge_pid_fd approach
> and keep things simple, deferring the recharge path to a follow-up
> depending on feedback.
> 
> The main difference between my removed draft and what you're
> describing, iiuc, is scope and layer: my draft was an explicit ioctl
> on the dma-buf fd that the consumer calls to claim the charge (see
> below), while you seem to be suggesting a more general kernel-internal
> function that could work across buffer types and cgroup controllers,
> so not necessarily userspace-initiated? A kernel-internal function
> will need a way to identify the target process, which sounds similar
> to the binder-backed approach from TJ [1]. For everything else, the
> receiver still needs to declare itself, which the ioctl accomplishes.
> 
> ```
> # When an app imports a daemon-allocated buffer, it can transfer the
> charge to itself:
> int buf_fd = receive_dmabuf_from_daemon();
> ioctl(buf_fd, DMA_BUF_IOCTL_XFER_CHARGE); /* charge now attributed to
> apps's cgroup */

Well that thinking goes into the right direction, but the requirements are still not completely covered as far as I can see.

Let me explain below a bit more.

> 
> [1] https://lore.kernel.org/cgroups/20230109213809.418135-1-tjmercier@google.com/
> 
>>
>> The only drawback I can see is that DMA-buf heap allocations would be temporarily accounted to the memory allocation daemon, but I don't think that this would be a problem.
> 
> The main reasons we moved away from TJ's transfer-based approach
> toward `charge_pid_fd` are: avoid the transient charge window on the
> daemon's cgroup; and to decouple from Binder, allowing any allocator
> to use it.

Yeah those concerns are completely correct.

The application should not volunteering says 'Charge that buffer to me.', but rather that the daemon says force charge that buffer to this application and tell me when the application is over its limit.

> 
> Technically, both approaches could coexist, though. Of the three
> scenarios TJ described:
> - Scenario 2 is directly addressed by charge_pid_fd approach without
> any transient charge on the daemon at the cost of one extra field in
> the heap ioctl uAPI struct.

Yeah extending the uAPI to pass in the pid on allocation time is not much of a problem, but you also need to modify the whole stack above it and that is a bit more trickier.

> - Scenario 3 can be handled by the charge transfer function without
> changes to SurfaceFlinger. The app or dequeueBuffer claims the charge
> for itself or the app, respectively (depending on whether we include a
> pid_fd field in the transfer ioctl). It also covers non-heap
> exporters. The con in both variants is the transient charge window on
> the daemon.

It should be trivial for the deamon to charge the buffer to an application before handing it out.

> Both approaches shift the responsibility for correct charging
> attribution to userspace: first, 'charge_pid_fd` on the allocator's
> side, and the transfer charge on the consumer's side.

Yeah that's why I said it would be better if we do that without any uAPI change, but with all the uAPI we have to transfer file descriptors (dup(), fork(), passing FDs over sockets etc...) it could be really tricky to implement that.

> Deciding on one, the other or both depends on how much we value
> avoiding transient attribution, and how much we need a non-heap
> generic solution. With the XFER_CHARGE we can cover both. Thus, the
> `charge_pid_fd` approach in this RFC can be seen as a
> performance/strictness optimisation, eliminating transient charges to
> the daemon at the cost of a permanent uAPI addition to the heap ioctl
> struct, but not strictly required for correctness.

Well all we need is a uAPI which says charge this buffer (file descriptor) to that cgroup (pidfd).

With this at hand we should be able to handle all use cases at the same time.

> On the other hand,
> if we agree on the end goal of migrating other exporters to use
> dma-buf heaps

That won't work. DMA-buf heaps is actually only a rather small and Anroid specific use case.

We have tons of other interfaces to allocate DMA-bufs which need to stay around because of HW restrictions and we do need a solution for them as well.

Regards,
Christian.

>, and scenario 3 is addressed by adding the app's pid_fd
> to SurfaceFlinger, then `charge_pid_fd` alone is a coherent/sufficient
> approach despite the uAPI change.
> 
>>
>> Regards,
>> Christian.
>>
>>>
>>> Thanks
>>> Barry
>>
> 


^ permalink raw reply

* Re: [PATCH] keys/trusted_keys: mark 'migratable' as __ro_after_init
From: Jarkko Sakkinen @ 2026-05-19  8:15 UTC (permalink / raw)
  To: Len Bao
  Cc: James Bottomley, Mimi Zohar, David Howells, Paul Moore,
	James Morris, Serge E. Hallyn, linux-integrity, keyrings,
	linux-security-module, linux-kernel
In-Reply-To: <20260516152249.41851-1-len.bao@gmx.us>

On Sat, May 16, 2026 at 03:22:47PM +0000, Len Bao wrote:
> The 'migratable' variable is initialized only during the init phase
> in the 'init_trusted' function and never changed. So, mark it as
> __ro_after_init.
> 
> Signed-off-by: Len Bao <len.bao@gmx.us>
> ---
>  security/keys/trusted-keys/trusted_core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/security/keys/trusted-keys/trusted_core.c b/security/keys/trusted-keys/trusted_core.c
> index 0b142d941..433579365 100644
> --- a/security/keys/trusted-keys/trusted_core.c
> +++ b/security/keys/trusted-keys/trusted_core.c
> @@ -59,7 +59,7 @@ DEFINE_STATIC_CALL_NULL(trusted_key_unseal,
>  DEFINE_STATIC_CALL_NULL(trusted_key_get_random,
>  			*trusted_key_sources[0].ops->get_random);
>  static void (*trusted_key_exit)(void);
> -static unsigned char migratable;
> +static unsigned char migratable __ro_after_init;
>  
>  enum {
>  	Opt_err,
> -- 
> 2.43.0
> 

Thank you.

Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>

BR, Jarkko

^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Albert Esteve @ 2026-05-19  8:25 UTC (permalink / raw)
  To: Christian König
  Cc: T.J. Mercier, Christian Brauner, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
	Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <01b6eefc-c107-4f8c-9d7c-3b86f54cabaa@amd.com>

On Tue, May 19, 2026 at 9:20 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/19/26 01:39, T.J. Mercier wrote:
> > On Mon, May 18, 2026 at 7:07 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >> On 5/18/26 14:50, Albert Esteve wrote:
> >>> On Mon, May 18, 2026 at 9:20 AM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>>
> >>>> On 5/15/26 19:06, T.J. Mercier wrote:
> >>>>> On Fri, May 15, 2026 at 6:53 AM Christian Brauner <brauner@kernel.org> wrote:
> >>>>>>
> >>>>>> On Tue, May 12, 2026 at 11:10:44AM +0200, Albert Esteve wrote:
> >>>>>>> On embedded platforms a central process often allocates dma-buf
> >>>>>>> memory on behalf of client applications. Without a way to
> >>>>>>> attribute the charge to the requesting client's cgroup, the
> >>>>>>> cost lands on the allocator, making per-cgroup memory limits
> >>>>>>> ineffective for the actual consumers.
> >>>>>>>
> >>>>>>> Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> >>>>>>
> >>>>>> Please be aware that pidfds come in two flavors:
> >>>>>>
> >>>>>> thread-group pidfds and thread-specific pidfds. Make sure that your API
> >>>>>> doesn't implicitly depend on this distinction not existing.
> >>>>>
> >>>>> Hi Christian,
> >>>>>
> >>>>> Memcg is not a controller that supports "thread mode" so all threads
> >>>>> in a group should belong to the same memcg.
> >>>>
> >>>> BTW: Exactly that is the requirement automotive has with their native context use case.
> >>>>
> >>>> The use case is that you have a deamon which has multiple threads were each one is acting on behalve of some other process.
> >>>>
> >>>> At the moment we basically say they are simply not using cgroups for that use case, but it would be really nice if we could handle that as well.
> >>>>
> >>>> Summarizing the requirement of that use case: You need a different cgroup for each thread of a process.
> >>>
> >>> Hi Christian,
> >>>
> >>> Thanks for sharing this atuomotive usecase. If I understand correctly,
> >>> the actual requirement is attributing dma-buf charges to the right
> >>> client, not putting each daemon thread in a different cgroup?
> >>
> >> Nope, exactly that's the difference.
> >>
> >> The thread acts as a filtering agent for both memory allocation and command submission for somebody else, the process on which behalve the daemon does things can even be in a client VM, completely remote over some network or even something like a microcontroller.
> >>
> >> Everything the thread does regarding CPU time, GPU driver memory allocation as well as resources like GPU processing and I/O time etc.. needs to be accounted to one client which can be different for each thread of the process.
> >>
> >> The only thing which is shared with the main process thread is CPU memory resources, e.g. malloc() because that is basically just needed for housekeeping and pretty much irrelevant for this kind of use case.
> >>
> >> The problem is now you can't do that with cgroups at the moment but unfortunately only the kernel has the information you need to know to do this.
> >>
> >> So what you end up with is to define tons of interfaces just to get the necessary information from the kernel into userspace and then essentially duplicate the same infrastructure cgroup provides in the kernel in userspace again.
> >>
> >>> If so,
> >>> the `charge_pid_fd` approach achieves this directly by passing the
> >>> client's `pid_fd`, without needing to add per-thread cgroup
> >>> infrastructure.
> >>
> >> Well it's already a massive improvemt, we could basically stop doing the whole duplication part for the GPU driver stack and just use cgroups for this part.
> >>
> >> Doing that automatically for CPU and I/O time would just be nice to have additionally.
> >>
> >> Regards,
> >> Christian.
> >
> > Hopefully I'm following correctly here.... So you are duplicating the
> > GPU driver stack to achieve remote accounting on a per-thread basis?
>
> Not quite, we are duplicating the handling cgroup provides in the kernel in userspace.
>
> For this memory usage information as well as execution times of the GPU kernel driver is exposed in fdinfo for example.
>
> > Does this mean for GPU allocations you currently have some GFP_ACCOUNT
> > magic in your driver to attribute GPU memory to the correct remote
> > client?
>
> No, we just expose what the kernel driver has allocated for itself. E.g. page tables, buffers etc...
>
> When userspace allocates something using memfd_create() for example we just ignore that.


>
> > So this series would close the gap for dma-buf allocations,
> > but what about private GPU driver memory allocated on behalf of a
> > client?
>
> Well we would need a cgroup which isn't associated with any process were we could charge the GPU driver allocations against.

I think I better understand your framing for this now. Thanks again
for taking the time to explain.

I was looking for a way to pass cgroup around to do the charge. I
found that `struct cgroup *cgroup_get_from_fd(int fd)` already exists
in cgroups available symbols to handle cgroup directories.

So here's an idea...

Rename the charge_pid_fd to charge_fd:
- If it is a pidfd (`!IS_ERR(pidfd_pid(fget(charge_fd)))`) then we do
what we're already doing here.
- If it is a cgroup_fd (`!IS_ERR(cgroup_get_from_fd(charge_fd))`) then
we charge to that cgroup.

Also we could add add an ioctl for the generic fd path similar to what
we have for dma-buf heaps. Or have a new flavour for memfd_create:
```
memfd_create2(name, flags, charge_fd);
```

The transfer ioctl could also be made generic to accept both pidfds
and cgroup_fds.

For this series we could move forward as is, and make the generic
solution a follow-up series, knowing that the field can be reused for
cgroup fds.

>
> But good point, charging against a pid wouldn't work in this use case.
>
> Regards,
> Christian.
>


^ permalink raw reply

* Re: [PATCH v5 00/13] ima: Introduce staging mechanism
From: Roberto Sassu @ 2026-05-19  8:38 UTC (permalink / raw)
  To: Lakshmi Ramasubramanian, steven chen, corbet, skhan, zohar,
	dmitry.kasatkin, eric.snowberg, paul, jmorris, serge
  Cc: linux-doc, linux-kernel, linux-integrity, linux-security-module,
	gregorylumen, Roberto Sassu
In-Reply-To: <8db443f1-d2f3-47ce-9116-18985ed0b290@linux.microsoft.com>

On Fri, 2026-05-15 at 10:37 -0700, Lakshmi Ramasubramanian wrote:
> Thanks for the response Roberto.
> 
> On 5/12/2026 1:17 AM, Roberto Sassu wrote:
> 
> > > > 
> > > > This submission proposes two ways for log trimming:
> > > > 
> > > > *Flavor 1:* Staging With Prompt
> > > > *Flavor 2:* Stage and Delete N
> > > > 
> > 
> > I'm happy to support your trimming method. Just does not fit with my
> > use case. I would like to keep both.
> > 
> 
> If "Flavor 1: Staging With Prompt" would be beneficial to the Linux 
> kernel customers, in general, we should continue to review the change 
> and merge it eventually.
> 
> My request, then, would be to split this patch set into 2 parts:
> 
> 	Part 1: Implements "Staging With Prompt"
> 
> 	Part 2: Implements "Stage and Delete N"
> 
> I think that would make it easier for reviewing the code, test\validate, 
> and merge.

No need in my opinion, it is simple enough.

Roberto


^ permalink raw reply

* Re: [Linaro-mm-sig] Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Albert Esteve @ 2026-05-19  9:17 UTC (permalink / raw)
  To: Christian König
  Cc: Barry Song, T.J. Mercier, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <9cc79977-9a42-40eb-bfa7-460881c1e10f@amd.com>

On Tue, May 19, 2026 at 9:53 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/18/26 14:06, Albert Esteve wrote:
> >>>>> udmabufs are already
> >>>>> memcg-charged, so adding a separate MEMCG_DMABUF would double count.
> >>>>> Are there any other exporters you had in mind that would benefit from
> >>>>> this approach?
> >>
> >> Well apart from DMA-buf memfd_create() is one of the things which as broken our neck in the past a couple of times.
> >>
> >> But thinking more about it what if instead of making this DMA-buf heaps specific what if we have a general cgroups function which allows to change accounting of a buffer referenced by a file descriptor to a different process?
> >>
> >> That would cover not only the DMA-buf heaps use case, but also all other DMA-buf with dmem and whatever we come up in the future as well.
> >
> > I removed a draft adding an ioctl for charge transfer from the series
> > before sending because I wanted to focus on the charge_pid_fd approach
> > and keep things simple, deferring the recharge path to a follow-up
> > depending on feedback.
> >
> > The main difference between my removed draft and what you're
> > describing, iiuc, is scope and layer: my draft was an explicit ioctl
> > on the dma-buf fd that the consumer calls to claim the charge (see
> > below), while you seem to be suggesting a more general kernel-internal
> > function that could work across buffer types and cgroup controllers,
> > so not necessarily userspace-initiated? A kernel-internal function
> > will need a way to identify the target process, which sounds similar
> > to the binder-backed approach from TJ [1]. For everything else, the
> > receiver still needs to declare itself, which the ioctl accomplishes.
> >
> > ```
> > # When an app imports a daemon-allocated buffer, it can transfer the
> > charge to itself:
> > int buf_fd = receive_dmabuf_from_daemon();
> > ioctl(buf_fd, DMA_BUF_IOCTL_XFER_CHARGE); /* charge now attributed to
> > apps's cgroup */
>
> Well that thinking goes into the right direction, but the requirements are still not completely covered as far as I can see.
>
> Let me explain below a bit more.
>
> >
> > [1] https://lore.kernel.org/cgroups/20230109213809.418135-1-tjmercier@google.com/
> >
> >>
> >> The only drawback I can see is that DMA-buf heap allocations would be temporarily accounted to the memory allocation daemon, but I don't think that this would be a problem.
> >
> > The main reasons we moved away from TJ's transfer-based approach
> > toward `charge_pid_fd` are: avoid the transient charge window on the
> > daemon's cgroup; and to decouple from Binder, allowing any allocator
> > to use it.
>
> Yeah those concerns are completely correct.
>
> The application should not volunteering says 'Charge that buffer to me.', but rather that the daemon says force charge that buffer to this application and tell me when the application is over its limit.
>
> >
> > Technically, both approaches could coexist, though. Of the three
> > scenarios TJ described:
> > - Scenario 2 is directly addressed by charge_pid_fd approach without
> > any transient charge on the daemon at the cost of one extra field in
> > the heap ioctl uAPI struct.
>
> Yeah extending the uAPI to pass in the pid on allocation time is not much of a problem, but you also need to modify the whole stack above it and that is a bit more trickier.
>
> > - Scenario 3 can be handled by the charge transfer function without
> > changes to SurfaceFlinger. The app or dequeueBuffer claims the charge
> > for itself or the app, respectively (depending on whether we include a
> > pid_fd field in the transfer ioctl). It also covers non-heap
> > exporters. The con in both variants is the transient charge window on
> > the daemon.
>
> It should be trivial for the deamon to charge the buffer to an application before handing it out.

Yeah, true.

>
> > Both approaches shift the responsibility for correct charging
> > attribution to userspace: first, 'charge_pid_fd` on the allocator's
> > side, and the transfer charge on the consumer's side.
>
> Yeah that's why I said it would be better if we do that without any uAPI change, but with all the uAPI we have to transfer file descriptors (dup(), fork(), passing FDs over sockets etc...) it could be really tricky to implement that.
>
> > Deciding on one, the other or both depends on how much we value
> > avoiding transient attribution, and how much we need a non-heap
> > generic solution. With the XFER_CHARGE we can cover both. Thus, the
> > `charge_pid_fd` approach in this RFC can be seen as a
> > performance/strictness optimisation, eliminating transient charges to
> > the daemon at the cost of a permanent uAPI addition to the heap ioctl
> > struct, but not strictly required for correctness.
>
> Well all we need is a uAPI which says charge this buffer (file descriptor) to that cgroup (pidfd).

So you favor having only the XFER_CHARGE variant. That is fine with me.
If that is fine for others also that could be the way forward. If we
extend it to accept either a pidfd or a cgroup fd (as commented
previously), we can cover all dma-buf use cases with a single
primitive:
```
ioctl(buf_fd, DMA_BUF_IOCTL_XFER_CHARGE, charge_fd);
```
With the daemon invoking this ioctl before handing out the buf_fd.

This should cover most usecases? Except for the memfd case, which
requires a separate mechanism. That would be follow-up work.

>
> With this at hand we should be able to handle all use cases at the same time.
>
> > On the other hand,
> > if we agree on the end goal of migrating other exporters to use
> > dma-buf heaps
>
> That won't work. DMA-buf heaps is actually only a rather small and Anroid specific use case.
>
> We have tons of other interfaces to allocate DMA-bufs which need to stay around because of HW restrictions and we do need a solution for them as well.
>
> Regards,
> Christian.
>
> >, and scenario 3 is addressed by adding the app's pid_fd
> > to SurfaceFlinger, then `charge_pid_fd` alone is a coherent/sufficient
> > approach despite the uAPI change.
> >
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> Thanks
> >>> Barry
> >>
> >
>


^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Albert Esteve @ 2026-05-19  9:43 UTC (permalink / raw)
  To: Barry Song
  Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CAGsJ_4xwJ7SAhKPJyRtMTw6psTO7H1EcFFpDw0po1W8PX4FE8g@mail.gmail.com>

On Tue, May 19, 2026 at 12:43 AM Barry Song <baohua@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 8:16 PM Albert Esteve <aesteve@redhat.com> wrote:
> >
> > On Sat, May 16, 2026 at 9:37 AM Barry Song <baohua@kernel.org> wrote:
> > >
> > > On Tue, May 12, 2026 at 5:18 PM Albert Esteve <aesteve@redhat.com> wrote:
> > > >
> > > > On embedded platforms a central process often allocates dma-buf
> > > > memory on behalf of client applications. Without a way to
> > > > attribute the charge to the requesting client's cgroup, the
> > > > cost lands on the allocator, making per-cgroup memory limits
> > > > ineffective for the actual consumers.
> > > >
> > > > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > > > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > > > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > > > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > > > the mem_accounting module parameter enabled, the buffer is charged
> > > > to the allocator's own cgroup.
> > > >
> > > > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > > > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > > > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > > > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > > > all accounting through a single MEMCG_DMABUF path.
> > > >
> > > [...]
> > >
> > > > -               if (mem_accounting)
> > > > -                       flags |= __GFP_ACCOUNT;
> > >
> > > Hi Albert,
> > >
> > > would it be better to move this and its description to patch 1? It
> > > looks like patch 1 already introduces the double accounting changes,
> > > and patch 2 is mainly just supporting remote charging.
> >
> > Hi Barry,
> >
> > Thanks for looking into this series! Yes, in my head I was trying to
> > keep patch 1, which was taken from a previous, different series, and
> > then diverge from it starting with patch 2. This would clarify the
> > difference between the two. But I can see it just added some confusion
> > (for example, patch 1 charges on dma_buf_export() and then it is moved
> > to dma_heap_buffer_alloc() in patch 2). I will reorganize it better
> > for the next version, including your suggestion.
>
> Yep, I understand the situation now. I also understand
> that you were referring to T.J.'s patch, which caused
> some back-and-forth confusion for readers when reading
> patches 1 and 2.
>
> >
> > >
> > > Also, mem_accounting is only used by system_heap.c; has this patchset
> > > also eliminated its need?
> >
> > No, mem_accounting is still handled in this patch for the general case
> > where no `charge_pid_fd` is used. See dma_heap_buffer_alloc() code:
> >
> > +       if (memcg)
> > +               css_get(&memcg->css);
> > +       else if (mem_accounting)
> > +               memcg = get_mem_cgroup_from_mm(current->mm);
>
> I see. What feels a bit odd to me is that mem_accounting
> could either be dropped (with unconditional charging), or
> it should cover both remote and local charge cases.

Good point. If I understand correctly, looking at patch [1] that
introduced the flag, the shared buffer caveats mentioned there are not
yet covered by this approach, so the flag should stay. I will make it
consistent and cover both remote and local charge cases.

[1] https://lore.kernel.org/all/20260116-dmabuf-heap-system-memcg-v3-1-ecc6b62cc446@redhat.com/

>
> I don’t have a strong opinion here—it just feels a bit
> strange, since its description is quite generic for memcg:
>
> "Enable cgroup-based memory accounting for dma-buf heap
> allocations (default=false)."
>
> Best Regards
> Barry
>


^ permalink raw reply

* [bug report] keys: request_key_auth payload use-after-free in keyctl_instantiate_key_common()
From: Shaomin Chen @ 2026-05-19 14:44 UTC (permalink / raw)
  To: keyrings, linux-security-module, linux-kernel
  Cc: David Howells, Jarkko Sakkinen, Paul Moore, James Morris,
	Serge E. Hallyn

Hi,

keyctl_instantiate_key_common() can use a stale request_key_auth payload after
the current request-key authorisation key has been revoked.

The relevant code pattern is:

	rka = instkey->payload.data[0];
	...
	copy_from_iter_full(payload, plen, from);   /* can fault and sleep */
	...
	get_instantiation_keyring(ringid, rka, &dest_keyring);
	key_instantiate_and_link(rka->target_key, payload, plen,
				 dest_keyring, instkey);

keyctl_instantiate_key_common() does not hold authkey->sem, an RCU read-side
critical section, or a reference to the request_key_auth payload across the
sleeping copy and later rka dereferences.

One race sequence is:

	Task A: request-key helper child        Task B: original request_key path
	-------------------------------        ---------------------------------
	assume request-key authority
	enter KEYCTL_INSTANTIATE_IOV
	rka = instkey->payload.data[0]
	block in copy_from_iter_full()
	                                      helper parent instantiates target key
	                                      helper returns to kernel
	                                      complete_request_key(authkey, 0)
	                                      key_revoke(authkey)
	                                      request_key_auth_revoke(authkey)
	                                      rcu_assign_keypointer(authkey, NULL)
	                                      call_rcu(&rka->rcu, ...)
	                                      request_key_auth_rcu_disposal()
	                                      free_request_key_auth(rka)
	resume from copy_from_iter_full()
	get_instantiation_keyring(..., rka, ...)
	key_instantiate_and_link(rka->target_key, ...)

I reproduced this on a current upstream v7.1-rc3 based tree,
HEAD ab5fce87a778c, with KASAN enabled:

	BUG: KASAN: slab-use-after-free in keyctl_instantiate_key_common+0x1dc/0x2a0
	Read of size 8
	Allocated by task:
	  request_key_auth_new+0xe0/0x4d0
	Freed by task:
	  key_revoke+0x62/0xc0
	  call_sbin_request_key+0x6cb/0x740

The reproducer uses a request-key helper that forks a second process with the
request-key authority.  The second process enters KEYCTL_INSTANTIATE_IOV and
blocks in copy_from_iter_full() on a user fault after rka has been loaded.  The
original helper then instantiates the target key and returns, which revokes the
auth key and queues the request_key_auth payload for RCU freeing.  When the
blocked instantiate path resumes, it dereferences the stale rka pointer.

I can provide the reproducer and a candidate patch.

Regards,
Shaomin Chen

^ permalink raw reply

* Re: [Linaro-mm-sig] Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Maxime Ripard @ 2026-05-19 16:32 UTC (permalink / raw)
  To: Christian König
  Cc: Albert Esteve, Barry Song, T.J. Mercier, Tejun Heo,
	Johannes Weiner, Michal Koutný, Jonathan Corbet, Shuah Khan,
	Sumit Semwal, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Benjamin Gaignard, Brian Starkey,
	John Stultz, Christian Brauner, Paul Moore, James Morris,
	Serge E. Hallyn, Stephen Smalley, Ondrej Mosnacek, Shuah Khan,
	cgroups, linux-doc, linux-kernel, linux-media, dri-,
	linaro-mm-sig, linux-mm, linux-security-module, selinux,
	linux-kselftest, echanude
In-Reply-To: <9cc79977-9a42-40eb-bfa7-460881c1e10f@amd.com>

[-- Attachment #1: Type: text/plain, Size: 6159 bytes --]

Hi Chritian,

On Tue, May 19, 2026 at 09:53:19AM +0200, Christian König wrote:
> On 5/18/26 14:06, Albert Esteve wrote:
> >>>>> udmabufs are already
> >>>>> memcg-charged, so adding a separate MEMCG_DMABUF would double count.
> >>>>> Are there any other exporters you had in mind that would benefit from
> >>>>> this approach?
> >>
> >> Well apart from DMA-buf memfd_create() is one of the things which as broken our neck in the past a couple of times.
> >>
> >> But thinking more about it what if instead of making this DMA-buf heaps specific what if we have a general cgroups function which allows to change accounting of a buffer referenced by a file descriptor to a different process?
> >>
> >> That would cover not only the DMA-buf heaps use case, but also all other DMA-buf with dmem and whatever we come up in the future as well.
> > 
> > I removed a draft adding an ioctl for charge transfer from the series
> > before sending because I wanted to focus on the charge_pid_fd approach
> > and keep things simple, deferring the recharge path to a follow-up
> > depending on feedback.
> > 
> > The main difference between my removed draft and what you're
> > describing, iiuc, is scope and layer: my draft was an explicit ioctl
> > on the dma-buf fd that the consumer calls to claim the charge (see
> > below), while you seem to be suggesting a more general kernel-internal
> > function that could work across buffer types and cgroup controllers,
> > so not necessarily userspace-initiated? A kernel-internal function
> > will need a way to identify the target process, which sounds similar
> > to the binder-backed approach from TJ [1]. For everything else, the
> > receiver still needs to declare itself, which the ioctl accomplishes.
> > 
> > ```
> > # When an app imports a daemon-allocated buffer, it can transfer the
> > charge to itself:
> > int buf_fd = receive_dmabuf_from_daemon();
> > ioctl(buf_fd, DMA_BUF_IOCTL_XFER_CHARGE); /* charge now attributed to
> > apps's cgroup */
> 
> Well that thinking goes into the right direction, but the requirements are still not completely
> covered as far as I can see.
> 
> Let me explain below a bit more.
> 
> > 
> > [1] https://lore.kernel.org/cgroups/20230109213809.418135-1-tjmercier@google.com/
> > 
> >>
> >> The only drawback I can see is that DMA-buf heap allocations would be temporarily accounted to the memory allocation daemon, but I don't think that this would be a problem.
> > 
> > The main reasons we moved away from TJ's transfer-based approach
> > toward `charge_pid_fd` are: avoid the transient charge window on the
> > daemon's cgroup; and to decouple from Binder, allowing any allocator
> > to use it.
> 
> Yeah those concerns are completely correct.
> 
> The application should not volunteering says 'Charge that buffer to
> me.', but rather that the daemon says force charge that buffer to this
> application and tell me when the application is over its limit.

I would agree, but with a caveat: how do we want to deal with malicious
applications here? The application should have expressed that it's okay
for it to be charged by a different process, otherwise it becomes
trivial for a malicious app to create arbitrary charges against another
application in the system and DoS it.

But then, that means that an application could arbitrarily charge the
daemon as well if it doesn't opt-in but asks for allocations.

So maybe we should have an opt-in for the caller, and a way for the
daemon to check if the caller has indeed opted in before performing the
allocation (and the charge transfer)?

> > Technically, both approaches could coexist, though. Of the three
> > scenarios TJ described:
> > - Scenario 2 is directly addressed by charge_pid_fd approach without
> > any transient charge on the daemon at the cost of one extra field in
> > the heap ioctl uAPI struct.
> 
> Yeah extending the uAPI to pass in the pid on allocation time is not
> much of a problem, but you also need to modify the whole stack above
> it and that is a bit more trickier.
> 
> > - Scenario 3 can be handled by the charge transfer function without
> > changes to SurfaceFlinger. The app or dequeueBuffer claims the charge
> > for itself or the app, respectively (depending on whether we include a
> > pid_fd field in the transfer ioctl). It also covers non-heap
> > exporters. The con in both variants is the transient charge window on
> > the daemon.
> 
> It should be trivial for the deamon to charge the buffer to an
> application before handing it out.
> 
> > Both approaches shift the responsibility for correct charging
> > attribution to userspace: first, 'charge_pid_fd` on the allocator's
> > side, and the transfer charge on the consumer's side.
> 
> Yeah that's why I said it would be better if we do that without any
> uAPI change, but with all the uAPI we have to transfer file
> descriptors (dup(), fork(), passing FDs over sockets etc...) it could
> be really tricky to implement that.
> 
> > Deciding on one, the other or both depends on how much we value
> > avoiding transient attribution, and how much we need a non-heap
> > generic solution. With the XFER_CHARGE we can cover both. Thus, the
> > `charge_pid_fd` approach in this RFC can be seen as a
> > performance/strictness optimisation, eliminating transient charges to
> > the daemon at the cost of a permanent uAPI addition to the heap ioctl
> > struct, but not strictly required for correctness.
> 
> Well all we need is a uAPI which says charge this buffer (file
> descriptor) to that cgroup (pidfd).
> 
> With this at hand we should be able to handle all use cases at the
> same time.
> 
> > On the other hand, if we agree on the end goal of migrating other
> > exporters to use dma-buf heaps
> 
> That won't work. DMA-buf heaps is actually only a rather small and
> Anroid specific use case.

I don't think that's true anymore. heaps are used in lots of different
use cases now in the embedded space, including in regular, generic,
components not specifically used for embedded systems.

Maxime

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 273 bytes --]

^ permalink raw reply

* Re: [RFC] TID v2.0: kernel module for cache-line zeroization against Flush+Reload (CLFLUSHOPT + LFENCE + REP STOSQ)
From: Jann Horn @ 2026-05-19 16:47 UTC (permalink / raw)
  To: Ahmed Hassan
  Cc: linux-kernel, linux-security-module, linux-hardening,
	kernel-hardening, linux-crypto, linux-mm, linux-api,
	linux-kselftest
In-Reply-To: <F78521DA-08DC-424E-BBE1-231BC900CEE0@gmail.com>

On Mon, May 18, 2026 at 11:47 PM Ahmed Hassan
<ahmaaaaadbntaaaaa@gmail.com> wrote:
>
> Hi kernel developers,
>
> I am sharing TID (The Instant Destroyer) v2.0, a Linux kernel module
> written in C that addresses a specific gap in existing security
> libraries: none of them (libsodium, OpenSSL, glibc memzero_explicit)
> flush CPU cache lines after memory zeroization.
>
>
> == Problem ==
>
> Standard zeroization functions (explicit_bzero, sodium_memzero,
> OPENSSL_cleanse) prevent the compiler from eliding the wipe, but do
> not evict CPU cache lines (L1/L2/L3). This leaves residual key
> material measurable via Flush+Reload (Yarom & Falkner, 2014) after
> data use ends.

The thing you're talking about isn't really related to the
Flush+Reload side channel attack, right? You're just talking about
flushing cache lines.

In what threat model would this be an issue? Normally, the goal of
memory zeroing is to ensure that sensitive data is wiped before an
attacker has a chance to physically pull out the RAM from a machine
and plug it into another device that can reveal RAM contents, or
before an attacker gains physical control of a locked device and can
connect malicious peripherals to it, or such.

So for this to be an actual security problem, the device would have to
keep running in a sufficiently high power state that data caches are
not discarded, and at the same time not perform enough memory accesses
to cause this memory to be discarded...

Assuming that this is an actual problem, why are you using a kernel
module for this? At least on x86, CLFLUSH is unprivileged, so crypto
libraries should be able to just use that directly. (There is the
caveat of what happens when the kernel migrates pages or kills a
process, but that's a larger problem.)

^ permalink raw reply

* Re: [Linaro-mm-sig] Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-19 18:06 UTC (permalink / raw)
  To: Christian König
  Cc: Barry Song, Albert Esteve, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <8a13b1ad-f1be-4ef4-905e-0d9828ae8cb5@amd.com>

On Tue, May 19, 2026 at 12:10 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/19/26 01:00, Barry Song wrote:
> > On Mon, May 18, 2026 at 3:34 PM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >> On 5/16/26 11:19, Barry Song wrote:
> >>> On Thu, May 14, 2026 at 12:35 AM T.J. Mercier <tjmercier@google.com> wrote:
> >>> [...]
> >>>>>> I have a question about this part. Albert I guess you are interested
> >>>>>> only in accounting dmabuf-heap allocations, or do you expect to add
> >>>>>> __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other
> >>>>>> non-dmabuf-heap exporters?
> >>>>>
> >>>>> We're scoping this to dma-buf heaps for now. CMA heaps and the dmem
> >>>>> controller are on the radar for follow-up/parallel work (there will be
> >>>>> dragons and will surely need discussion). For DRM and V4L2 the
> >>>>> long-term intent is migration to heaps, which would make direct
> >>>>> accounting on those paths unnecessary.
> >>>>
> >>>> Ah I see. GEM buffers exported to dmabufs are what I had in mind. I
> >>>> guess this would only leave the odd non-DRM driver with the need to
> >>>> add their own accounting calls, which I don't expect would be a big
> >>>> problem.
> >>>>
> >>>
> >>> sounds like we still have a long way to go to correctly account for
> >>> various v4l2, drm, GEM, CMA, etc. In patch 1, the charging is done in
> >>> dma_buf_export(), so I guess it covers all dma-buf types except
> >>> dma_heap, but the problem is that it has no remote charging support at
> >>> all?
> >>
> >> No, just the other way around
> >>
> >> DMA-buf heaps can be handled here because we know that it is pure system memory and nothing special so memcg always applies.
> >>
> >> dma_buf_export() on the other hand handles tons of different use cases, ranging from buffer accounted to dmem, over special resources which aren't even memory all the way to buffers which can migrate from dmem to memcg and back during their lifetime.
> >>
> >
> > Hi Christian,
> >
> > Thanks very much for your explanation. So basically it seems that
> > dma_buf_export() is not the proper place to charge, since it may end up
> > mixing in non-system-memory accounting?
>
> Yes, exactly that.
>
> > My question is also about the global view for both heap and non-heap cases.
> > After reading the discussion, I’ve tried to summarize it—please let me know
> > if my understanding is correct.
> >
> > for dma_heap, we have the ioctl DMA_HEAP_IOCTL_ALLOC, where users can pass a
> > remote pidfd or similar information to indicate where the dma-buf should be
> > charged, as in Albert's patchset.
>
> Well that's the current proposal, but I think we need to come up with something more general.
>
> > For non-dma_heap dma-bufs, we don’t have an obvious userspace entry point that
> > triggers the allocation. So we likely need other approaches. We could either
> > move more drivers over to dma-heap, or introduce something like
> > DMA_BUF_IOCTL_XFER_CHARGE, as you are discussing, to let userspace explicitly
> > declare a charge.
>
> Yeah but that's not only for DMA-buf, we need that for file descriptors returned by memfd_create() as well.

memfds get charged on fault, so an allocator shouldn't currently be
charged just for creating the fd. Unlike system/CMA heap buffers, the
shmem backing a memfd / udmabuf is LRU memory, and swapping the memcg
owner of those pages is a more-involved process which is not supported
by memcg v2. There used to be some support in memcg v1, but it was
removed. Commit e548ad4a7cbf ("mm: memcg: move charge migration code
to memcontrol-v1.c ") said, "It's a fairly large and complicated code
which created a number of problems in the past." So I'm not sure how
much appetite there would be to support it in v2 for this.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox