Realtime threads delayed due to kcompactd0

linux-mm.kvack.org archive mirror

* Realtime threads delayed due to kcompactd0
@ 2025-07-25  5:30 Alexander Krabler
  2025-07-31 18:34 ` Frank van der Linden
  0 siblings, 1 reply; 15+ messages in thread
From: Alexander Krabler @ 2025-07-25  5:30 UTC (permalink / raw)
  To: linux-rt-users@vger.kernel.org, linux-mm@kvack.org
  Cc: Dennis Schimmel, Daniel Braunwarth

Hi all,

Some of our realtime tasks are delayed from time to time by kcompactd0 activity.
Out of nowhere, realtime tasks go into uninterruptible sleep for some time.
This delay can be as much as 1.1 ms, which is not acceptable for us.

Our hardware is an aarch64-based SOC with 8 A72 cores, kernel is 6.12.17 with PREEMPT_RT.
We have CONFIG_COMPACTION and CONFIG_MIGRATION enabled.

Here are some snippets from ftrace:
             kcompactd0-88    [001] 13112.100041: mm_compaction_begin:  zone_start=0x80000 migrate_pfn=0x80000 free_pfn=0xffe00 zone_end=0x100000, mode=sync
...      
             kcompactd0-88    [001] 13112.159782: mm_compaction_isolate_migratepages: range=(0x85800 ~ 0x85841) nr_scanned=65 nr_taken=32
             kcompactd0-88    [001] 13112.159810: mm_compaction_isolate_freepages: range=(0xddc40 ~ 0xddc48) nr_scanned=8 nr_taken=8
             kcompactd0-88    [001] 13112.160002: irq_handler_entry:    irq=11 name=arch_timer
             kcompactd0-88    [001] 13112.160012: irq_handler_exit:     irq=11 ret=handled
             kcompactd0-88    [001] 13112.160121: mm_compaction_migratepages: nr_migrated=32 nr_failed=0
             kcompactd0-88    [001] 13112.160122: mm_compaction_finished: node=0 zone=DMA      order=-1 ret=continue
             kcompactd0-88    [001] 13112.160185: mm_compaction_isolate_migratepages: range=(0x85841 ~ 0x85a00) nr_scanned=447 nr_taken=166
             kcompactd0-88    [001] 13112.160204: mm_compaction_isolate_freepages: range=(0xddc48 ~ 0xddd80) nr_scanned=312 nr_taken=196
              tRealtime-16499 [004] 13112.160511: sched_switch:         tRealtime:16499 [25] D ==> tKRC:16479 [39]
              tRealtime-16499 [004] 13112.160512: kernel_stack:         <stack trace >
=> __schedule (ffffcde843022d6c)
=> schedule (ffffcde843023464)
=> io_schedule (ffffcde8430235ec)
=> migration_entry_wait_on_locked (ffffcde8424a1ad8)
=> migration_entry_wait (ffffcde84254c400)
=> do_swap_page (ffffcde8424f7fac)
=> __handle_mm_fault (ffffcde8424f8b64)
=> handle_mm_fault (ffffcde8424f9bc0)
=> do_page_fault (ffffcde843030380)
=> do_translation_fault (ffffcde84303072c)
=> do_mem_abort (ffffcde84222f674)
=> el0_ia (ffffcde84301eb20)
=> el0t_64_sync_handler (ffffcde84301f020)
=> el0t_64_sync (ffffcde842211514)
             kcompactd0-88    [001] 13112.160557: sched_pi_setprio:     comm=kcompactd0 pid=88 oldprio=39 newprio=120
             kcompactd0-88    [001] 13112.160569: sched_waking:         comm=tKRC pid=16479 prio=39 target_cpu=004
             kcompactd0-88    [001] 13112.160986: sched_waking:         comm=tKRC pid=16479 prio=39 target_cpu=004
             kcompactd0-88    [001] 13112.161412: sched_waking:         comm=tOther pid=16520 prio=40 target_cpu=004
             kcompactd0-88    [001] 13112.161457: sched_pi_setprio:     comm=kcompactd0 pid=88 oldprio=40 newprio=120
             kcompactd0-88    [001] 13112.161465: sched_waking:         comm=tOther pid=16520 prio=40 target_cpu=004
             kcompactd0-88    [001] 13112.161654: sched_waking:         comm=tRealtime pid=16499 prio=25 target_cpu=004            
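
The blocked interval can be read off the trace: tRealtime is switched out in state D at 13112.160511 and the sched_waking event for it fires at 13112.161654. A minimal sketch of that arithmetic, assuming ftrace's seconds.microseconds timestamps with exactly six fractional digits (the helper names are made up for illustration):

```c
/* Convert an ftrace timestamp such as "13112.160511" to microseconds.
 * Assumes exactly six fractional digits, as in the trace above.
 * Returns -1 if the string does not have that shape. */
long long trace_ts_to_us(const char *ts)
{
    long long sec = 0, usec = 0;
    int digits = 0;

    if (*ts < '0' || *ts > '9')
        return -1;
    while (*ts >= '0' && *ts <= '9')
        sec = sec * 10 + (*ts++ - '0');
    if (*ts++ != '.')
        return -1;
    while (*ts >= '0' && *ts <= '9' && digits < 6) {
        usec = usec * 10 + (*ts++ - '0');
        digits++;
    }
    if (digits != 6)
        return -1;
    return sec * 1000000LL + usec;
}

/* Time between being switched out and being woken, in microseconds.
 * Returns -1 on malformed input or if the wakeup precedes the switch. */
long long blocked_us(const char *switched_out, const char *woken)
{
    long long a = trace_ts_to_us(switched_out);
    long long b = trace_ts_to_us(woken);

    if (a < 0 || b < 0 || b < a)
        return -1;
    return b - a;
}
```

With the two timestamps above, blocked_us("13112.160511", "13112.161654") gives 1143, i.e. roughly the 1.1 ms mentioned at the top.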

In our setup kcompactd0 gets enough CPU time (on core 1). However, it seems strange that it does not inherit the priority of the realtime tasks it blocks.
(It does for short periods, which seems to be due to the locks inside migration_entry_wait_on_locked.)
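
For reading the prio= fields in the trace: these are kernel-internal priorities, where a lower number means more important. Values 0..99 are realtime and correspond to a userspace realtime priority of 99 - prio; values 100..139 are nice levels, with 120 being nice 0. So tRealtime at prio=25 runs at realtime priority 74, and each sched_pi_setprio line with newprio=120 shows kcompactd0 dropping back to its normal priority after a brief boost. A small sketch of that mapping (the names are illustrative, not kernel API):

```c
#include <stdbool.h>

#define KPRIO_MAX_RT   100  /* kernel prio 0..99 are realtime */
#define KPRIO_NICE_0   120  /* kernel prio of a nice-0 task */

/* True if a kernel-internal prio value denotes a realtime task. */
bool prio_is_realtime(int kprio)
{
    return kprio >= 0 && kprio < KPRIO_MAX_RT;
}

/* Userspace realtime priority (1..99) for a realtime kernel prio, else 0. */
int prio_to_rtprio(int kprio)
{
    return prio_is_realtime(kprio) ? (KPRIO_MAX_RT - 1) - kprio : 0;
}

/* Nice value (-20..19) for a non-realtime kernel prio. */
int prio_to_nice(int kprio)
{
    return kprio - KPRIO_NICE_0;
}
```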

Is there anything we can do here?

Thanks,
Alexander


* Re: Realtime threads delayed due to kcompactd0
  2025-07-25  5:30 Realtime threads delayed due to kcompactd0 Alexander Krabler
@ 2025-07-31 18:34 ` Frank van der Linden
  2025-07-31 18:41   ` Vlastimil Babka
  0 siblings, 1 reply; 15+ messages in thread
From: Frank van der Linden @ 2025-07-31 18:34 UTC (permalink / raw)
  To: Alexander Krabler
  Cc: linux-rt-users@vger.kernel.org, linux-mm@kvack.org,
	Dennis Schimmel, Daniel Braunwarth

On Thu, Jul 24, 2025 at 10:30 PM Alexander Krabler
<Alexander.Krabler@kuka.com> wrote:
>
> Hi all,
>
> some of our realtime tasks get delayed from time to time due to activity of kcompactd0.
> Out of nowhere, realtime tasks go into uninterruptible sleep for some time.
> This delay can be as much as 1.1ms, which is not acceptable for us.
>
> Our hardware is an aarch64-based SOC with 8 A72 cores, kernel is 6.12.17 with PREEMPT_RT.
> We have CONFIG_COMPACTION and CONFIG_MIGRATION enabled.
>
> Here are some snippets from ftrace:
>              kcompactd0-88    [001] 13112.100041: mm_compaction_begin:  zone_start=0x80000 migrate_pfn=0x80000 free_pfn=0xffe00 zone_end=0x100000, mode=sync
> ...
>              kcompactd0-88    [001] 13112.159782: mm_compaction_isolate_migratepages: range=(0x85800 ~ 0x85841) nr_scanned=65 nr_taken=32
>              kcompactd0-88    [001] 13112.159810: mm_compaction_isolate_freepages: range=(0xddc40 ~ 0xddc48) nr_scanned=8 nr_taken=8
>              kcompactd0-88    [001] 13112.160002: irq_handler_entry:    irq=11 name=arch_timer
>              kcompactd0-88    [001] 13112.160012: irq_handler_exit:     irq=11 ret=handled
>              kcompactd0-88    [001] 13112.160121: mm_compaction_migratepages: nr_migrated=32 nr_failed=0
>              kcompactd0-88    [001] 13112.160122: mm_compaction_finished: node=0 zone=DMA      order=-1 ret=continue
>              kcompactd0-88    [001] 13112.160185: mm_compaction_isolate_migratepages: range=(0x85841 ~ 0x85a00) nr_scanned=447 nr_taken=166
>              kcompactd0-88    [001] 13112.160204: mm_compaction_isolate_freepages: range=(0xddc48 ~ 0xddd80) nr_scanned=312 nr_taken=196
>               tRealtime-16499 [004] 13112.160511: sched_switch:         tRealtime:16499 [25] D ==> tKRC:16479 [39]
>               tRealtime-16499 [004] 13112.160512: kernel_stack:         <stack trace >
> => __schedule (ffffcde843022d6c)
> => schedule (ffffcde843023464)
> => io_schedule (ffffcde8430235ec)
> => migration_entry_wait_on_locked (ffffcde8424a1ad8)
> => migration_entry_wait (ffffcde84254c400)
> => do_swap_page (ffffcde8424f7fac)
> => __handle_mm_fault (ffffcde8424f8b64)
> => handle_mm_fault (ffffcde8424f9bc0)
> => do_page_fault (ffffcde843030380)
> => do_translation_fault (ffffcde84303072c)
> => do_mem_abort (ffffcde84222f674)
> => el0_ia (ffffcde84301eb20)
> => el0t_64_sync_handler (ffffcde84301f020)
> => el0t_64_sync (ffffcde842211514)
>              kcompactd0-88    [001] 13112.160557: sched_pi_setprio:     comm=kcompactd0 pid=88 oldprio=39 newprio=120
>              kcompactd0-88    [001] 13112.160569: sched_waking:         comm=tKRC pid=16479 prio=39 target_cpu=004
>              kcompactd0-88    [001] 13112.160986: sched_waking:         comm=tKRC pid=16479 prio=39 target_cpu=004
>              kcompactd0-88    [001] 13112.161412: sched_waking:         comm=tOther pid=16520 prio=40 target_cpu=004
>              kcompactd0-88    [001] 13112.161457: sched_pi_setprio:     comm=kcompactd0 pid=88 oldprio=40 newprio=120
>              kcompactd0-88    [001] 13112.161465: sched_waking:         comm=tOther pid=16520 prio=40 target_cpu=004
>              kcompactd0-88    [001] 13112.161654: sched_waking:         comm=tRealtime pid=16499 prio=25 target_cpu=004
>
> In our setup kcompactd0 gets enough CPU time (on core 1), however, it seems strange that it doesn't get the priority inherited from blocked realtime tasks.
> (It does for short amounts of time, which seems to be due to the locks inside migration_entry_wait_on_locked.)
>
> Is there anything we can do here?
>
> Thanks,
> Alexander

Yes, we have (likely) seen this issue too, in a !CONFIG_PREEMPT setting.

The basic problem is that the calling thread (kcompactd, or any thread
that goes into direct compaction) creates a resource that
needs to be waited for until it's done, in the form of the migration
PTEs. Since a migration PTE is not a lock that is held by the thread
doing the migration, there is no priority inheritance in the realtime
case, and priority inversion can happen.

This issue has always been there, but it has been made more prominent
with batch migration. With batch migration, all migration PTEs are set
up in the first step, followed by a TLB flush, and then the copy / new
map setup is done. So, the migration PTEs stick around for longer, and
the chance that other threads block on them is higher. For the
!CONFIG_PREEMPT case, the cond_resched() in the loop can also cause
the thread creating the migration PTEs to be descheduled while a
number of migration PTEs are in place, so there is a similar priority
inversion chance.

Not sure what the right thing to do would be. Either explicitly boost
the priority of a thread temporarily during migrate_pages_batch, or
mitigate the issue by dealing with 'busy' pages more quickly in
migrate_pages_batch.

- Frank



* Re: Realtime threads delayed due to kcompactd0
  2025-07-31 18:34 ` Frank van der Linden
@ 2025-07-31 18:41   ` Vlastimil Babka
  2025-08-01  2:46     ` Mike Galbraith
  0 siblings, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2025-07-31 18:41 UTC (permalink / raw)
  To: Frank van der Linden, Alexander Krabler
  Cc: linux-rt-users@vger.kernel.org, linux-mm@kvack.org,
	Dennis Schimmel, Daniel Braunwarth

On 7/31/25 20:34, Frank van der Linden wrote:
> Not sure what the right thing to do would be. Either explicitly boost
> the priority of a thread temporarily during migrate_pages_batch, or
> mitigate the issue by dealing with 'busy' pages more quickly in
> migrate_pages_batch.

There's a workaround for realtime tasks. If you mlock[all]() their memory,
setting sysctl vm.compact_unevictable_allowed to 0 should exclude these
pages from migration by compaction.
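
A minimal sketch of the two halves of this workaround as seen from the application: locking the process's memory with mlockall() and reading back the sysctl through /proc/sys/vm/compact_unevictable_allowed. The helper names are made up for illustration, and error handling is kept to a bare minimum:

```c
#include <stdio.h>
#include <sys/mman.h>

/* Lock all current and future mappings of the calling process into RAM.
 * Returns 0 on success, -1 on failure (errno is set by mlockall()). */
int lock_process_memory(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}

/* Read a single integer from a procfs/sysctl file.
 * Returns 0 and stores the value on success, -1 on failure. */
int read_int_file(const char *path, int *value)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%d", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Returns 1 if compaction may migrate unevictable pages, 0 if it may not,
 * and -1 if the sysctl cannot be read (e.g. kernel without compaction). */
int compact_unevictable_allowed(void)
{
    int v;

    if (read_int_file("/proc/sys/vm/compact_unevictable_allowed", &v) != 0)
        return -1;
    return v != 0;
}
```

A realtime application would typically call lock_process_memory() once during startup, before creating its time-critical threads, and could log the result of compact_unevictable_allowed() alongside it.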

Vlastimil
> - Frank
> 




* Re: Realtime threads delayed due to kcompactd0
  2025-07-31 18:41   ` Vlastimil Babka
@ 2025-08-01  2:46     ` Mike Galbraith
  2025-08-01  9:58       ` Vlastimil Babka
  0 siblings, 1 reply; 15+ messages in thread
From: Mike Galbraith @ 2025-08-01  2:46 UTC (permalink / raw)
  To: Vlastimil Babka, Frank van der Linden, Alexander Krabler
  Cc: linux-rt-users@vger.kernel.org, linux-mm@kvack.org,
	Dennis Schimmel, Daniel Braunwarth

On Thu, 2025-07-31 at 20:41 +0200, Vlastimil Babka wrote:
> On 7/31/25 20:34, Frank van der Linden wrote:
> > Not sure what the right thing to do would be. Either explicitly boost
> > the priority of a thread temporarily during migrate_pages_batch, or
> > mitigate the issue by dealing with 'busy' pages more quickly in
> > migrate_pages_batch.
> 
> There's a workaround for realtime tasks. If you mlock[all]() their memory,
> setting sysctl vm.compact_unevictable_allowed to 0 should exclude these
> pages from migration by compaction.

Hm, per documentation that's done automatically for PREEMPT_RT...

On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
to compaction, which would block the task from becoming active until the fault
is resolved.

...but rummaging, seems other stuff can step on it (contiguous alloc?).

	-Mike



* Re: Realtime threads delayed due to kcompactd0
  2025-08-01  2:46     ` Mike Galbraith
@ 2025-08-01  9:58       ` Vlastimil Babka
  2025-08-01 11:23         ` Alexander Krabler
  2025-08-01 19:27         ` Frank van der Linden
  0 siblings, 2 replies; 15+ messages in thread
From: Vlastimil Babka @ 2025-08-01  9:58 UTC (permalink / raw)
  To: Mike Galbraith, Frank van der Linden, Alexander Krabler
  Cc: linux-rt-users@vger.kernel.org, linux-mm@kvack.org,
	Dennis Schimmel, Daniel Braunwarth

On 8/1/25 04:46, Mike Galbraith wrote:
> On Thu, 2025-07-31 at 20:41 +0200, Vlastimil Babka wrote:
>> On 7/31/25 20:34, Frank van der Linden wrote:
>> > Not sure what the right thing to do would be. Either explicitly boost
>> > the priority of a thread temporarily during migrate_pages_batch, or
>> > mitigate the issue by dealing with 'busy' pages more quickly in
>> > migrate_pages_batch.
>> 
>> There's a workaround for realtime tasks. If you mlock[all]() their memory,
>> setting sysctl vm.compact_unevictable_allowed to 0 should exclude these
>> pages from migration by compaction.
> 
> Hm, per documentation that's done automatically for PREEMPT_RT...

Oh I see.

> On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
> to compaction, which would block the task from becoming active until the fault
> is resolved.

So it's probably the mlock() part missing since that should otherwise apply
to kcompactd.

> ...but rummaging, seems other stuff can step on it (contiguous alloc?).

Yeah, there was a time when CMA was just something for mobile phone hardware. As
usage increases beyond that, maybe we'll have to tackle it. Ideally by not
having mlock'd pages in CMA areas at all. And if contiguous alloc is
attempted outside of CMA areas, respect the sysctl there too.

There are also things like mbind() migrating pages for NUMA locality but I
assume people just wouldn't try to do that with realtime workloads.


> 	-Mike




* Re: Realtime threads delayed due to kcompactd0
  2025-08-01  9:58       ` Vlastimil Babka
@ 2025-08-01 11:23         ` Alexander Krabler
  2025-08-01 12:57           ` Vlastimil Babka
  2025-08-01 19:27         ` Frank van der Linden
  1 sibling, 1 reply; 15+ messages in thread
From: Alexander Krabler @ 2025-08-01 11:23 UTC (permalink / raw)
  To: Vlastimil Babka, Mike Galbraith, Frank van der Linden
  Cc: linux-rt-users@vger.kernel.org, linux-mm@kvack.org,
	Dennis Schimmel, Daniel Braunwarth

On 7/31/25 20:35, Frank van der Linden wrote:
> Not sure what the right thing to do would be. Either explicitly boost
> the priority of a thread temporarily during migrate_pages_batch, or
> mitigate the issue by dealing with 'busy' pages more quickly in
> migrate_pages_batch.
>
> - Frank

I think it would help if priority inheritance would kick in as soon as
another thread is waiting due to the migration PTE.


On 8/1/25 11:58, Vlastimil Babka wrote:
> On 8/1/25 04:46, Mike Galbraith wrote:
> > On Thu, 2025-07-31 at 20:41 +0200, Vlastimil Babka wrote:
> >> On 7/31/25 20:34, Frank van der Linden wrote:
> >> > Not sure what the right thing to do would be. Either explicitly boost
> >> > the priority of a thread temporarily during migrate_pages_batch, or
> >> > mitigate the issue by dealing with 'busy' pages more quickly in
> >> > migrate_pages_batch.
> >>
> >> There's a workaround for realtime tasks. If you mlock[all]() their memory,
> >> setting sysctl vm.compact_unevictable_allowed to 0 should exclude these
> >> pages from migration by compaction.
> >
> > Hm, per documentation that's done automatically for PREEMPT_RT...
>
> Oh I see.

Yes, vm.compact_unevictable_allowed is set to 0 in our setup
(default as we have CONFIG_PREEMPT_RT).

> > On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
> > to compaction, which would block the task from becoming active until the fault
> > is resolved.
>
> So it's probably the mlock() part missing since that should otherwise apply
> to kcompactd.

We use mlockall() and I verified the pages have the lo flag set.
That means we have exactly the issue this sysctl is supposed to prevent.

>
> > ...but rummaging, seems other stuff can step on it (contiguous alloc?).
>
> Yeah, there was time CMA was just something for mobile phone hardware. As
> usage increases beyond that maybe we'll have to tackle it. Ideally by not
> having mlock'd pages in CMA areas at all. And if contiguous alloc is
> attempted outside of CMA areas, respect the sysctl there too.

Yeah, we already thought CMA might somehow influence our issue.
We have 256 MiB of CMA, which our hardware probably uses.
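
On kernels built with CONFIG_CMA, the size of the CMA area and how much of it is currently free show up as the CmaTotal and CmaFree lines in /proc/meminfo. A minimal sketch of pulling one such field out of meminfo-style text (the function name is made up for illustration):

```c
#include <stdio.h>
#include <string.h>

/* Find a "<key>: <n> kB" line in meminfo-style text.
 * Returns n in kB, or -1 if the key is missing or malformed. */
long long meminfo_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long long kb;

            if (sscanf(line + klen + 1, "%lld", &kb) == 1)
                return kb;
            return -1;
        }
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

A 256 MiB area would appear as CmaTotal: 262144 kB.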

Is there a way to tell a process to not allocate pages inside CMA area?

>
>
> > -Mike
>

Alexander

KUKA Deutschland GmbH   Board of Directors: Michael Jürgens (Chairman), Dirk Busch, Johan Naten, Hui Zhang   Registered Office: Augsburg HRB 14914

This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorized copying, disclosure or distribution of contents of this e-mail is strictly forbidden.

Please consider the environment before printing this e-mail.



* Re: Realtime threads delayed due to kcompactd0
  2025-08-01 11:23         ` Alexander Krabler
@ 2025-08-01 12:57           ` Vlastimil Babka
  2025-08-01 13:40             ` Alexander Krabler
  0 siblings, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2025-08-01 12:57 UTC (permalink / raw)
  To: Alexander Krabler, Mike Galbraith, Frank van der Linden
  Cc: linux-rt-users@vger.kernel.org, linux-mm@kvack.org,
	Dennis Schimmel, Daniel Braunwarth

On 8/1/25 13:23, Alexander Krabler wrote:
> On 7/31/25 20:35, Frank van der Linden wrote:
>> Not sure what the right thing to do would be. Either explicitly boost
>> the priority of a thread temporarily during migrate_pages_batch, or
>> mitigate the issue by dealing with 'busy' pages more quickly in
>> migrate_pages_batch.
>>
>> - Frank
> 
> I think it would help if priority inheritance would kick in as soon as
> another thread is waiting due to the migration PTE.
> 
> 
> On 8/1/25 11:58, Vlastimil Babka wrote:
>> On 8/1/25 04:46, Mike Galbraith wrote:
>> > On Thu, 2025-07-31 at 20:41 +0200, Vlastimil Babka wrote:
>> >> On 7/31/25 20:34, Frank van der Linden wrote:
>> >> > Not sure what the right thing to do would be. Either explicitly boost
>> >> > the priority of a thread temporarily during migrate_pages_batch, or
>> >> > mitigate the issue by dealing with 'busy' pages more quickly in
>> >> > migrate_pages_batch.
>> >>
>> >> There's a workaround for realtime tasks. If you mlock[all]() their memory,
>> >> setting sysctl vm.compact_unevictable_allowed to 0 should exclude these
>> >> pages from migration by compaction.
>> >
>> > Hm, per documentation that's done automatically for PREEMPT_RT...
>>
>> Oh I see.
> 
> Yes, vm.compact_unevictable_allowed is set to 0 in our setup
> (default as we have CONFIG_PREEMPT_RT).
> 
>> > On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
>> > to compaction, which would block the task from becoming active until the fault
>> > is resolved.
>>
>> So it's probably the mlock() part missing since that should otherwise apply
>> to kcompactd.
> 
> We use mlockall() and I verified the pages have the lo flag set.
> Which means, that we have exactly the issue this sysctl flag should prevent us from.

Hm, that means something isn't working as intended. Do the pages also have
the unevictable flag?
>>
>> > ...but rummaging, seems other stuff can step on it (contiguous alloc?).
>>
>> Yeah, there was time CMA was just something for mobile phone hardware. As
>> usage increases beyond that maybe we'll have to tackle it. Ideally by not
>> having mlock'd pages in CMA areas at all. And if contiguous alloc is
>> attempted outside of CMA areas, respect the sysctl there too.
> 
> Yeah, we already thought CMA might somehow influence our issue.
> We have 256 MiB of CMA, which our hardware probably uses.

If the problem is kcompactd, it should not be CMA related.
> Is there a way to tell a process to not allocate pages inside CMA area?

I think not at the moment.




* Re: Realtime threads delayed due to kcompactd0
  2025-08-01 12:57           ` Vlastimil Babka
@ 2025-08-01 13:40             ` Alexander Krabler
  2025-08-07 10:48               ` Vlastimil Babka
  0 siblings, 1 reply; 15+ messages in thread
From: Alexander Krabler @ 2025-08-01 13:40 UTC (permalink / raw)
  To: Vlastimil Babka, Frank van der Linden, Mike Galbraith
  Cc: linux-rt-users@vger.kernel.org, linux-mm@kvack.org,
	Dennis Schimmel, Daniel Braunwarth

On 8/1/25 14:57, Vlastimil Babka wrote:
> Hm that means something isn't working as intended. Do the pages have also
> the unevictable flag?

I have checked /proc/<pid>/smaps.
Neither the manpage (https://www.man7.org/linux/man-pages/man5/proc_pid_smaps.5.html)
nor kernel source code (fs/proc/task_mmu.c show_smap_vma_flags) tells me about unevictable flag.

What exactly do you mean?

Alexander

* Re: Realtime threads delayed due to kcompactd0
  2025-08-01 13:40             ` Alexander Krabler
@ 2025-08-07 10:48               ` Vlastimil Babka
  2025-08-07 12:21                 ` Hugh Dickins
  0 siblings, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2025-08-07 10:48 UTC (permalink / raw)
  To: Alexander Krabler, Frank van der Linden, Mike Galbraith,
	Hugh Dickins
  Cc: linux-rt-users@vger.kernel.org, linux-mm@kvack.org,
	Dennis Schimmel, Daniel Braunwarth

On 8/1/25 15:40, Alexander Krabler wrote:
> On 8/1/25 14:57, Vlastimil Babka wrote:
>> Hm that means something isn't working as intended. Do the pages have also
>> the unevictable flag?
>
> I have checked /proc/<pid>/smaps.
> Neither the manpage (https://www.man7.org/linux/man-pages/man5/proc_pid_smaps.5.html)
> nor kernel source code (fs/proc/task_mmu.c show_smap_vma_flags) tells me about unevictable flag.

Ah ok, you checked /proc/pid/smaps. Since you said "pages" I thought you
meant individual pages, which would be from /proc/kpageflags.

But IIRC getting a page in a state where it has an mlock flag but (not yet)
the unevictable flag should be rare so I would be surprised if that was the
problem here.
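
For checking individual pages: /proc/kpageflags holds one 64-bit flag word per physical page frame, indexed by PFN, and the PFN behind a virtual address comes from /proc/<pid>/pagemap (bits 0-54 of the entry when bit 63, "present", is set; PFNs read as zero without CAP_SYS_ADMIN). A minimal sketch of the decoding, assuming bit 18 for KPF_UNEVICTABLE as listed in the kernel's pagemap documentation, and bit 33 for KPF_MLOCKED, which is a kernel-internal value that should be verified against the sources of the running kernel. The helper names are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define PM_PRESENT           (1ULL << 63)        /* pagemap: page is in RAM */
#define PM_PFN_MASK          ((1ULL << 55) - 1)  /* pagemap: bits 0-54 */
#define KPF_UNEVICTABLE_BIT  18
#define KPF_MLOCKED_BIT      33  /* assumption: kernel-internal, verify */

/* Byte offset of the pagemap entry for a virtual address. */
uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
    return (vaddr / page_size) * sizeof(uint64_t);
}

/* Extract the PFN from a pagemap entry; false if the page is not present. */
bool pagemap_pfn(uint64_t entry, uint64_t *pfn)
{
    if (!(entry & PM_PRESENT))
        return false;
    *pfn = entry & PM_PFN_MASK;
    return true;
}

/* Byte offset of the flag word for a PFN in /proc/kpageflags. */
uint64_t kpageflags_offset(uint64_t pfn)
{
    return pfn * sizeof(uint64_t);
}

bool kpf_unevictable(uint64_t flags)
{
    return (flags >> KPF_UNEVICTABLE_BIT) & 1;
}

bool kpf_mlocked(uint64_t flags)
{
    return (flags >> KPF_MLOCKED_BIT) & 1;
}
```

The two offset helpers give the positions to pread() eight bytes from in each file; a page that is mlocked but not unevictable would be the rare state discussed here.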

Hm or maybe it's actually that the error can happen only in the other
direction - page is unevictable (on unevictable list) while not mlocked, at
least the counters such as UNEVICTABLE_PGRESCUED suggest that.
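
Counters of this kind are exported in /proc/vmstat, for example as unevictable_pgs_rescued, unevictable_pgs_culled and unevictable_pgs_mlocked. A small sketch of comparing one counter between two snapshots of that file, e.g. taken before and after a latency spike (the function names are made up for illustration):

```c
#include <stdio.h>
#include <string.h>

/* Look up a "name value" line in vmstat-style text.
 * Returns the value, or -1 if the name is missing or malformed. */
long long vmstat_value(const char *text, const char *name)
{
    size_t nlen = strlen(name);
    const char *line = text;

    while (line && *line) {
        if (strncmp(line, name, nlen) == 0 && line[nlen] == ' ') {
            long long v;

            return sscanf(line + nlen, "%lld", &v) == 1 ? v : -1;
        }
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/* Increase of a counter between two snapshots, or -1 if it is missing. */
long long vmstat_delta(const char *before, const char *after, const char *name)
{
    long long a = vmstat_value(before, name);
    long long b = vmstat_value(after, name);

    return (a < 0 || b < 0) ? -1 : b - a;
}
```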

Hugh knows this code the best. Do you think we should e.g. test the mlocked
flag instead/in addition to unevictable flag in compaction to avoid
violating vm.compact_unevictable_allowed = 0?


* Re: Realtime threads delayed due to kcompactd0
  2025-08-07 10:48               ` Vlastimil Babka
@ 2025-08-07 12:21                 ` Hugh Dickins
  2025-08-07 15:49                   ` Alexander Krabler
  0 siblings, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2025-08-07 12:21 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alexander Krabler, Frank van der Linden, Mike Galbraith,
	Hugh Dickins, linux-rt-users@vger.kernel.org, linux-mm@kvack.org,
	Dennis Schimmel, Daniel Braunwarth

On Thu, 7 Aug 2025, Vlastimil Babka wrote:
> On 8/1/25 15:40, Alexander Krabler wrote:
> > On 8/1/25 14:57, Vlastimil Babka wrote:
> >> Hm that means something isn't working as intended. Do the pages have also
> >> the unevictable flag?
> >
> > I have checked /proc/<pid>/smaps.
> > Neither the manpage (https://www.man7.org/linux/man-pages/man5/proc_pid_smaps.5.html)
> > nor kernel source code (fs/proc/task_mmu.c show_smap_vma_flags) tells me about unevictable flag.
>
> Ah ok, you checked /proc/pid/smaps. Since you said "pages" I thought you
> meant individual pages which would be from /proc/kpageflags
> 
> But IIRC getting a page in a state where it has an mlock flag but (not yet)
> the unevictable flag should be rare so I would be surprised if that was the
> problem here.

Agreed.

> 
> Hm or maybe it's actually that the error can happen only in the other
> direction - page is unevictable (on unevictable list) while not mlocked, at
> least the counters such as UNEVICTABLE_PGRESCUED suggest that.
> 
> Hugh knows this code the best. Do you think we should e.g. test the mlocked
> flag instead/in addition to unevictable flag in compaction to avoid
> violating vm.compact_unevictable_allowed = 0?

No, checking for unevictable should be sufficient.

But another idea has just this moment occurred to me: anon THP splitting
is another user of migration entries.  Is it possible that kcompactd is
not actually the cause of Alexander's issue, but that his RT tasks have
(bad news! shouldn't be allowed) got anon THPs in them?

Back to bed,
Hugh



* Re: Realtime threads delayed due to kcompactd0
  2025-08-07 12:21                 ` Hugh Dickins
@ 2025-08-07 15:49                   ` Alexander Krabler
  2025-08-08  7:37                     ` Vlastimil Babka
  0 siblings, 1 reply; 15+ messages in thread
From: Alexander Krabler @ 2025-08-07 15:49 UTC (permalink / raw)
  To: Hugh Dickins, Vlastimil Babka
  Cc: Frank van der Linden, Mike Galbraith,
	linux-rt-users@vger.kernel.org, linux-mm@kvack.org,
	Dennis Schimmel, Daniel Braunwarth

On Thu, 7 Aug 2025 14:21, Hugh Dickins wrote:
> But another idea has just this moment occurred to me: anon THP splitting
> is another user of migration entries.  Is it possible that kcompactd is
> not actually the cause of Alexander's issue, but that his RT tasks have
> (bad news! shouldn't be allowed) got anon THPs in them?

No, we don't have transparent hugepages enabled.
(From Kconfig, this seems not even possible together with PREEMPT_RT.)
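
For reference, on kernels that are built with THP support the active mode is the bracketed word in /sys/kernel/mm/transparent_hugepage/enabled, e.g. "always madvise [never]"; on a kernel built without CONFIG_TRANSPARENT_HUGEPAGE that file does not exist. A small sketch of extracting the bracketed mode from such a line (the function name is made up for illustration):

```c
#include <stddef.h>
#include <string.h>

/* Copy the bracketed word from e.g. "always madvise [never]\n" into out.
 * Returns 0 on success, -1 if there is no bracketed word or it does not fit. */
int thp_selected_mode(const char *text, char *out, size_t outlen)
{
    const char *open = strchr(text, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    size_t len;

    if (!open || !close)
        return -1;
    len = (size_t)(close - open - 1);
    if (len + 1 > outlen)  /* need room for the terminating NUL */
        return -1;
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}
```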

From ftrace events (first message in this thread), we know that kcompactd0
finally wakes our realtime thread(s).

Given the information I got from here, the comment on the code [1]
and an older commit message [2], I suspect CMA somehow influences our problem.

If there is anything we should enable for better diagnosis (e.g. more ftrace events),
we would be willing to try that out.

[1] https://elixir.bootlin.com/linux/v6.16/source/mm/compaction.c#L1136
[2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=e46a28790e594c0876d1a84270926abf75460f61

Thanks,
Alexander


* Re: Realtime threads delayed due to kcompactd0
  2025-08-07 15:49                   ` Alexander Krabler
@ 2025-08-08  7:37                     ` Vlastimil Babka
  2025-08-20 14:29                       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2025-08-08  7:37 UTC (permalink / raw)
  To: Alexander Krabler, Hugh Dickins
  Cc: Frank van der Linden, Mike Galbraith,
	linux-rt-users@vger.kernel.org, linux-mm@kvack.org,
	Dennis Schimmel, Daniel Braunwarth

On 8/7/25 17:49, Alexander Krabler wrote:
> On Thu, 7 Aug 2025 14:21, Hugh Dickins wrote:
>> But another idea has just this moment occurred to me: anon THP splitting
>> is another user of migration entries.  Is it possible that kcompactd is
>> not actually the cause of Alexander's issue, but that his RT tasks have
>> (bad news! shouldn't be allowed) got anon THPs in them?
> 
> No, we don't have transparent hugepages enabled.
> (From Kconfig, this seems not even possible together with PREEMPT_RT.)
> 
> From ftrace events (first message in this thread), we know that kcompactd0
> finally wakes our realtime thread(s).
> 
> Given the information I got from here, the comment on the code [1]
> and an older commit message [2], I suspect CMA somehow influences our problem.

However, kcompactd doesn't perform CMA allocations, only compaction, in a
mode that does not include ISOLATE_UNEVICTABLE. So this is weird.

> If there is anything we should enable for better diagnosis (e.g. more ftrace events),
> we would be willing to try that out.
> 
> [1] https://elixir.bootlin.com/linux/v6.16/source/mm/compaction.c#L1136
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=e46a28790e594c0876d1a84270926abf75460f61
> 
> Thanks,
> Alexander
> 




* Re: Realtime threads delayed due to kcompactd0
  2025-08-08  7:37                     ` Vlastimil Babka
@ 2025-08-20 14:29                       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 15+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-08-20 14:29 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alexander Krabler, Hugh Dickins, Frank van der Linden,
	Mike Galbraith, linux-rt-users@vger.kernel.org,
	linux-mm@kvack.org, Dennis Schimmel, Daniel Braunwarth

On 2025-08-08 09:37:26 [+0200], Vlastimil Babka wrote:
> > Given the information I got from here, the comment on the code [1]
> > and an older commit message [2], I suspect CMA somehow influences our problem.
> 
> However, kcompactd doesn't perform CMA allocations, only compaction, in a
> mode that does not include ISOLATE_UNEVICTABLE. So this is weird.

As per smaps, the RT task should have all VMAs listed as "lo". If you use
mlock(), then something like an accidental fork() would remove it.
Otherwise it should be there.
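
Checking this amounts to scanning every "VmFlags:" line in /proc/<pid>/smaps for the two-letter code lo. A minimal sketch of the per-line test (the function name is made up for illustration):

```c
#include <stdbool.h>
#include <string.h>

/* Return true if a smaps "VmFlags:" line contains the given flag code,
 * e.g. "lo" for a locked VMA. Lines of any other kind return false. */
bool vmflags_has(const char *line, const char *flag)
{
    static const char prefix[] = "VmFlags:";
    size_t flen = strlen(flag);
    const char *p;

    if (flen == 0 || strncmp(line, prefix, sizeof(prefix) - 1) != 0)
        return false;

    p = line + sizeof(prefix) - 1;
    while (*p) {
        const char *start;

        /* Skip separators, then measure the next token. */
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;
        start = p;
        while (*p && *p != ' ' && *p != '\t' && *p != '\n')
            p++;
        if ((size_t)(p - start) == flen && strncmp(start, flag, flen) == 0)
            return true;
    }
    return false;
}
```

A VMA whose VmFlags line lacks lo in a process that called mlockall() would be the first thing to look at.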

At the time of the fault you could add something like

| diff --git a/mm/memory.c b/mm/memory.c
| --- a/mm/memory.c
| +++ b/mm/memory.c
| @@ -4476,6 +4476,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
|         entry = pte_to_swp_entry(vmf->orig_pte);
|         if (unlikely(non_swap_entry(entry))) {
|                 if (is_migration_entry(entry)) {
| +
| +                       if (!strcmp("tRealtime", current->comm)) {
| +                               trace_printk("Migrated: 0x%lx VMA flags: %lx\n",
| +                                            vmf->address, vma->vm_flags);
| +                       }
| +
|                         migration_entry_wait(vma->vm_mm, vmf->pmd,
|                                              vmf->address);
|                 } else if (is_device_exclusive_entry(entry)) {

to see which address is gone. Not sure if the PTE flags are of any help here.
Is it easily possible on the other side (isolate_migratepages(), right?)
to figure out which task a certain address space / page belongs to? That
would show whether a "bad" page is being considered for migration.

Sebastian



* Re: Realtime threads delayed due to kcompactd0
  2025-08-01  9:58       ` Vlastimil Babka
  2025-08-01 11:23         ` Alexander Krabler
@ 2025-08-01 19:27         ` Frank van der Linden
  2025-08-05 14:11           ` Alexander Krabler
  1 sibling, 1 reply; 15+ messages in thread
From: Frank van der Linden @ 2025-08-01 19:27 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mike Galbraith, Alexander Krabler, linux-rt-users@vger.kernel.org,
	linux-mm@kvack.org, Dennis Schimmel, Daniel Braunwarth,
	Hugh Dickins

On Fri, Aug 1, 2025 at 2:58 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 8/1/25 04:46, Mike Galbraith wrote:
> > On Thu, 2025-07-31 at 20:41 +0200, Vlastimil Babka wrote:
> >> On 7/31/25 20:34, Frank van der Linden wrote:
> >> > Not sure what the right thing to do would be. Either explicitly boost
> >> > the priority of a thread temporarily during migrate_pages_batch, or
> >> > mitigate the issue by dealing with 'busy' pages more quickly in
> >> > migrate_pages_batch.
> >>
> >> There's a workaround for realtime tasks. If you mlock[all]() their memory,
> >> setting sysctl vm.compact_unevictable_allowed to 0 should exclude these
> >> pages from migration by compaction.
> >
> > Hm, per documentation that's done automatically for PREEMPT_RT...
>
> Oh I see.
>
> > On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
> > to compaction, which would block the task from becoming active until the fault
> > is resolved.
>
> So it's probably the mlock() part missing since that should otherwise apply
> to kcompactd.
>
> > ...but rummaging, seems other stuff can step on it (contiguous alloc?).
>
> Yeah, there was time CMA was just something for mobile phone hardware. As
> usage increases beyond that maybe we'll have to tackle it. Ideally by not
> having mlock'd pages in CMA areas at all. And if contiguous alloc is
> attempted outside of CMA areas, respect the sysctl there too.
>
> There are also things like mbind() migrating pages for NUMA locality but I
> assume people just wouldn't try to do that with realtime workloads.
>
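
The workaround quoted above (mlockall() plus
vm.compact_unevictable_allowed = 0) could be checked from the realtime
process roughly as follows. This is only a sketch: the helper names are
hypothetical, and as noted in the quoted discussion it does not cover other
migration sources such as CMA / contiguous allocation.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Parse the text of an integer sysctl file; 0 on success, -1 if malformed. */
int parse_sysctl_int(const char *text, long *out)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(text, &end, 10);
    if (errno != 0 || end == text)
        return -1;
    *out = v;
    return 0;
}

/* Read an integer sysctl from `path`; returns its value, or -1 on error. */
long read_sysctl(const char *path)
{
    char buf[32];
    long v;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_sysctl_int(buf, &v) == 0 ? v : -1;
}

/*
 * Lock current and future mappings, then verify that compaction is told
 * to leave unevictable (mlocked) pages alone.
 * Returns 0 if both hold, -1 if mlockall() failed, -2 if the sysctl is
 * not 0 or could not be read.
 */
int rt_memory_setup_ok(void)
{
    /* May fail if RLIMIT_MEMLOCK is too low and CAP_IPC_LOCK is missing. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return -1;
    if (read_sysctl("/proc/sys/vm/compact_unevictable_allowed") != 0)
        return -2;
    return 0;
}
```

rt_memory_setup_ok() would be called once during startup, before the
realtime threads are created.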

Another idea is to minimize the time that a migration PTE is in place
for an mlocked page; Hugh (cc-ed) mentioned this in an offline
discussion. E.g. skip any mlocked pages in the first pass, and just
add them to a list. Then, do that list separately, but do them one by
one. There is somewhat similar logic in migrate_pages_sync for pages
that might need extra work / locking.
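
To make the two-pass idea concrete, here is a toy user-space model, not
kernel code. It assumes that a batched migration keeps every page of the
batch behind a migration entry until the whole batch has been copied, while
a page handled on its own is blocked for one copy only. All names
(toy_page, blocked_units, ...) are invented for this sketch.

```c
#include <stddef.h>

#define TOY_MAX_PAGES 64

struct toy_page {
    int mlocked;                /* 1 if the page sits in an mlock()ed VMA */
    unsigned int blocked_units; /* copy units a fault would have to wait */
};

/* Model of one big batch: every page waits for all n copies. */
void toy_migrate_single_pass(struct toy_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pages[i].blocked_units = (unsigned int)n;
}

/*
 * Pass 1 batches the non-mlocked pages and only records the mlocked ones.
 * Pass 2 walks the recorded list and handles each page individually.
 * Returns 0 on success, -1 if n exceeds the toy limit.
 */
int toy_migrate_two_pass(struct toy_page *pages, size_t n)
{
    size_t deferred[TOY_MAX_PAGES];
    size_t nr_deferred = 0, nr_batched = 0;

    if (n > TOY_MAX_PAGES)
        return -1;

    for (size_t i = 0; i < n; i++) {
        if (pages[i].mlocked)
            deferred[nr_deferred++] = i;  /* skip for now */
        else
            nr_batched++;
    }
    for (size_t i = 0; i < n; i++)
        if (!pages[i].mlocked)
            pages[i].blocked_units = (unsigned int)nr_batched;

    /* One by one: each mlocked page is blocked for a single copy. */
    for (size_t i = 0; i < nr_deferred; i++)
        pages[deferred[i]].blocked_units = 1;
    return 0;
}

/* Longest wait a fault on any mlocked page would see in this model. */
unsigned int toy_worst_mlocked_block(const struct toy_page *pages, size_t n)
{
    unsigned int worst = 0;

    for (size_t i = 0; i < n; i++)
        if (pages[i].mlocked && pages[i].blocked_units > worst)
            worst = pages[i].blocked_units;
    return worst;
}
```

In this model the worst case for an mlocked page drops from the batch size
to one copy unit, at the cost of losing the batching benefit for those
pages.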

Not sure if avoiding mlocked pages in CMA would work out. I mean, it's
not hard to implement, as it would be pretty much the same as for
pin_user_pages: just move them out of CMA on mlock. I'm just a bit
worried about scenarios where the kernel might run out of space for
unmovable allocations if you have a larger amount of CMA, which would
be made worse by moving more allocations out of CMA. Then again, the
amount of mlocked memory is probably generally small.

- Frank


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Realtime threads delayed due to kcompactd0
  2025-08-01 19:27         ` Frank van der Linden
@ 2025-08-05 14:11           ` Alexander Krabler
  0 siblings, 0 replies; 15+ messages in thread
From: Alexander Krabler @ 2025-08-05 14:11 UTC (permalink / raw)
  To: Frank van der Linden, Vlastimil Babka
  Cc: Mike Galbraith, linux-rt-users@vger.kernel.org,
	linux-mm@kvack.org, Dennis Schimmel, Daniel Braunwarth,
	Hugh Dickins

On Fri, Aug 1, 2025 at 9:27 PM Frank van der Linden <fvdl@google.com> wrote:
> Another idea is to minimize the time that a migration PTE is in place
> for an mlocked page; Hugh (cc-ed) mentioned this in an offline
> discussion. E.g. skip any mlocked pages in the first pass, and just
> add them to a list. Then, do that list separately, but do them one by
> one. There is somewhat similar logic in migrate_pages_sync for pages
> that might need extra work / locking.

Another idea might be that the blocked task itself migrates the page it is
blocked on, removes the migration PTE, and continues. It would do the work
itself instead of waiting for another thread to do it.
Would that be possible?

This would solve the priority inversion problem and should be way more
deterministic than the current situation.

Alexander

KUKA Deutschland GmbH   Board of Directors: Michael Jürgens (Chairman), Dirk Busch, Johan Naten, Hui Zhang   Registered Office: Augsburg HRB 14914

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-08-20 14:29 UTC | newest]

Thread overview: 15+ messages
2025-07-25  5:30 Realtime threads delayed due to kcompactd0 Alexander Krabler
2025-07-31 18:34 ` Frank van der Linden
2025-07-31 18:41   ` Vlastimil Babka
2025-08-01  2:46     ` Mike Galbraith
2025-08-01  9:58       ` Vlastimil Babka
2025-08-01 11:23         ` Alexander Krabler
2025-08-01 12:57           ` Vlastimil Babka
2025-08-01 13:40             ` Alexander Krabler
2025-08-07 10:48               ` Vlastimil Babka
2025-08-07 12:21                 ` Hugh Dickins
2025-08-07 15:49                   ` Alexander Krabler
2025-08-08  7:37                     ` Vlastimil Babka
2025-08-20 14:29                       ` Sebastian Andrzej Siewior
2025-08-01 19:27         ` Frank van der Linden
2025-08-05 14:11           ` Alexander Krabler
