* Realtime threads delayed due to kcompactd0
From: Alexander Krabler @ 2025-07-25 5:30 UTC
To: linux-rt-users@vger.kernel.org, linux-mm@kvack.org
Cc: Dennis Schimmel, Daniel Braunwarth

Hi all,

some of our realtime tasks get delayed from time to time due to activity of kcompactd0.
Out of nowhere, realtime tasks go into uninterruptible sleep for some time.
This delay can be as much as 1.1 ms, which is not acceptable for us.

Our hardware is an aarch64-based SoC with 8 A72 cores; the kernel is 6.12.17 with PREEMPT_RT.
We have CONFIG_COMPACTION and CONFIG_MIGRATION enabled.

Here are some snippets from ftrace:
kcompactd0-88 [001] 13112.100041: mm_compaction_begin: zone_start=0x80000 migrate_pfn=0x80000 free_pfn=0xffe00 zone_end=0x100000, mode=sync
...
kcompactd0-88 [001] 13112.159782: mm_compaction_isolate_migratepages: range=(0x85800 ~ 0x85841) nr_scanned=65 nr_taken=32
kcompactd0-88 [001] 13112.159810: mm_compaction_isolate_freepages: range=(0xddc40 ~ 0xddc48) nr_scanned=8 nr_taken=8
kcompactd0-88 [001] 13112.160002: irq_handler_entry: irq=11 name=arch_timer
kcompactd0-88 [001] 13112.160012: irq_handler_exit: irq=11 ret=handled
kcompactd0-88 [001] 13112.160121: mm_compaction_migratepages: nr_migrated=32 nr_failed=0
kcompactd0-88 [001] 13112.160122: mm_compaction_finished: node=0 zone=DMA order=-1 ret=continue
kcompactd0-88 [001] 13112.160185: mm_compaction_isolate_migratepages: range=(0x85841 ~ 0x85a00) nr_scanned=447 nr_taken=166
kcompactd0-88 [001] 13112.160204: mm_compaction_isolate_freepages: range=(0xddc48 ~ 0xddd80) nr_scanned=312 nr_taken=196
tRealtime-16499 [004] 13112.160511: sched_switch: tRealtime:16499 [25] D ==> tKRC:16479 [39]
tRealtime-16499 [004] 13112.160512: kernel_stack: <stack trace >
=> __schedule (ffffcde843022d6c)
=> schedule (ffffcde843023464)
=> io_schedule (ffffcde8430235ec)
=> migration_entry_wait_on_locked (ffffcde8424a1ad8)
=> migration_entry_wait (ffffcde84254c400)
=> do_swap_page (ffffcde8424f7fac)
=> __handle_mm_fault (ffffcde8424f8b64)
=> handle_mm_fault (ffffcde8424f9bc0)
=> do_page_fault (ffffcde843030380)
=> do_translation_fault (ffffcde84303072c)
=> do_mem_abort (ffffcde84222f674)
=> el0_ia (ffffcde84301eb20)
=> el0t_64_sync_handler (ffffcde84301f020)
=> el0t_64_sync (ffffcde842211514)
kcompactd0-88 [001] 13112.160557: sched_pi_setprio: comm=kcompactd0 pid=88 oldprio=39 newprio=120
kcompactd0-88 [001] 13112.160569: sched_waking: comm=tKRC pid=16479 prio=39 target_cpu=004
kcompactd0-88 [001] 13112.160986: sched_waking: comm=tKRC pid=16479 prio=39 target_cpu=004
kcompactd0-88 [001] 13112.161412: sched_waking: comm=tOther pid=16520 prio=40 target_cpu=004
kcompactd0-88 [001] 13112.161457: sched_pi_setprio: comm=kcompactd0 pid=88 oldprio=40 newprio=120
kcompactd0-88 [001] 13112.161465: sched_waking: comm=tOther pid=16520 prio=40 target_cpu=004
kcompactd0-88 [001] 13112.161654: sched_waking: comm=tRealtime pid=16499 prio=25 target_cpu=004

In our setup kcompactd0 gets enough CPU time (on core 1). However, it seems strange that it doesn't inherit the priority of the blocked realtime tasks.
(It does for short amounts of time, which seems to be due to the locks inside migration_entry_wait_on_locked.)

Is there anything we can do here?

Thanks,
Alexander

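For reference, the delay quoted in the message can be cross-checked against the trace: tRealtime is switched out in state D at 13112.160511 and woken again at 13112.161654. A minimal C sketch of that arithmetic follows, assuming ftrace timestamps in seconds.microseconds form. The helpers trace_ts_to_us() and blocked_us() are hypothetical names made up for this illustration, not part of any tracing tool.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: convert an ftrace timestamp such as "13112.160511"
 * (seconds.microseconds, six fractional digits) to microseconds.
 * Returns -1 if the string cannot be parsed.
 */
int64_t trace_ts_to_us(const char *ts)
{
	long long sec;
	int usec;

	assert(ts != NULL);
	if (sscanf(ts, "%lld.%6d", &sec, &usec) != 2)
		return -1;
	return (int64_t)sec * 1000000 + usec;
}

/* Time between the sched_switch (sleep) and sched_waking (wakeup) events. */
int64_t blocked_us(const char *switch_ts, const char *waking_ts)
{
	return trace_ts_to_us(waking_ts) - trace_ts_to_us(switch_ts);
}
```

With the two timestamps above, blocked_us("13112.160511", "13112.161654") gives 1143 microseconds, which matches the reported 1.1 ms.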
* Re: Realtime threads delayed due to kcompactd0
From: Frank van der Linden @ 2025-07-31 18:34 UTC
To: Alexander Krabler
Cc: linux-rt-users@vger.kernel.org, linux-mm@kvack.org, Dennis Schimmel, Daniel Braunwarth

On Thu, Jul 24, 2025 at 10:30 PM Alexander Krabler
<Alexander.Krabler@kuka.com> wrote:
>
> Hi all,
>
> some of our realtime tasks get delayed from time to time due to activity of kcompactd0.
> Out of nowhere, realtime tasks go into uninterruptible sleep for some time.
> This delay can be as much as 1.1 ms, which is not acceptable for us.
>
> Our hardware is an aarch64-based SoC with 8 A72 cores; the kernel is 6.12.17 with PREEMPT_RT.
> We have CONFIG_COMPACTION and CONFIG_MIGRATION enabled.
>
> Here are some snippets from ftrace:
> [...]
>
> In our setup kcompactd0 gets enough CPU time (on core 1). However, it seems strange that it doesn't inherit the priority of the blocked realtime tasks.
> (It does for short amounts of time, which seems to be due to the locks inside migration_entry_wait_on_locked.)
>
> Is there anything we can do here?
>
> Thanks,
> Alexander

Yes, we have (likely) seen this issue too, in a !CONFIG_PREEMPT setting.

The basic problem is that the calling thread (kcompactd, or any thread
that goes into direct compaction) creates a resource that others have to
wait on until the work is done, in the form of the migration PTEs. Since
a migration PTE is not a lock held by the thread doing the migration,
there is no priority inheritance in the realtime case, and priority
inversion can happen.

This issue has always been there, but it has been made more prominent
by batch migration. With batch migration, all migration PTEs are set up
in the first step, followed by a TLB flush, and then the copy / new map
setup is done. So the migration PTEs stick around for longer, and the
chance that other threads block on them is higher.

For the !CONFIG_PREEMPT case, the cond_resched() in the loop can also
cause the thread creating the migration PTEs to be descheduled while a
number of migration PTEs are in place, so there is a similar chance of
priority inversion.

Not sure what the right thing to do would be. Either explicitly boost
the priority of a thread temporarily during migrate_pages_batch, or
mitigate the issue by dealing with 'busy' pages more quickly in
migrate_pages_batch.

- Frank

* Re: Realtime threads delayed due to kcompactd0
From: Vlastimil Babka @ 2025-07-31 18:41 UTC
To: Frank van der Linden, Alexander Krabler
Cc: linux-rt-users@vger.kernel.org, linux-mm@kvack.org, Dennis Schimmel, Daniel Braunwarth

On 7/31/25 20:34, Frank van der Linden wrote:
> Not sure what the right thing to do would be. Either explicitly boost
> the priority of a thread temporarily during migrate_pages_batch, or
> mitigate the issue by dealing with 'busy' pages more quickly in
> migrate_pages_batch.

There's a workaround for realtime tasks. If you mlock[all]() their memory,
setting sysctl vm.compact_unevictable_allowed to 0 should exclude these
pages from migration by compaction.

Vlastimil

> - Frank

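The workaround described above could be checked from the application's startup code roughly as follows. This is only a sketch under stated assumptions: read_int_sysctl() and rt_memory_setup() are hypothetical helper names, and the sysctl is assumed to be exposed at /proc/sys/vm/compact_unevictable_allowed. mlockall() with MCL_CURRENT | MCL_FUTURE is the standard call, and it may fail if RLIMIT_MEMLOCK is too low or the process lacks CAP_IPC_LOCK.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>

/* Read an integer sysctl from its /proc/sys path; returns -1 on error. */
int read_int_sysctl(const char *path)
{
	FILE *f;
	int val = -1;

	assert(path != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

/*
 * Lock all current and future mappings, then check that compaction is
 * configured to leave unevictable (mlocked) pages alone.
 * Returns 0 if both conditions hold, -1 otherwise.
 */
int rt_memory_setup(const char *sysctl_path)
{
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
		perror("mlockall");
		return -1;
	}
	if (read_int_sysctl(sysctl_path) != 0) {
		fprintf(stderr, "%s is not 0\n", sysctl_path);
		return -1;
	}
	return 0;
}
```

A caller would typically run rt_memory_setup("/proc/sys/vm/compact_unevictable_allowed") once, before creating the realtime threads, and treat a failure as a configuration error.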
* Re: Realtime threads delayed due to kcompactd0
From: Mike Galbraith @ 2025-08-01 2:46 UTC
To: Vlastimil Babka, Frank van der Linden, Alexander Krabler
Cc: linux-rt-users@vger.kernel.org, linux-mm@kvack.org, Dennis Schimmel, Daniel Braunwarth

On Thu, 2025-07-31 at 20:41 +0200, Vlastimil Babka wrote:
> On 7/31/25 20:34, Frank van der Linden wrote:
> > Not sure what the right thing to do would be. Either explicitly boost
> > the priority of a thread temporarily during migrate_pages_batch, or
> > mitigate the issue by dealing with 'busy' pages more quickly in
> > migrate_pages_batch.
>
> There's a workaround for realtime tasks. If you mlock[all]() their memory,
> setting sysctl vm.compact_unevictable_allowed to 0 should exclude these
> pages from migration by compaction.

Hm, per documentation that's done automatically for PREEMPT_RT...

  On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
  to compaction, which would block the task from becoming active until the fault
  is resolved.

...but rummaging, seems other stuff can step on it (contiguous alloc?).

-Mike

* Re: Realtime threads delayed due to kcompactd0
From: Vlastimil Babka @ 2025-08-01 9:58 UTC
To: Mike Galbraith, Frank van der Linden, Alexander Krabler
Cc: linux-rt-users@vger.kernel.org, linux-mm@kvack.org, Dennis Schimmel, Daniel Braunwarth

On 8/1/25 04:46, Mike Galbraith wrote:
> On Thu, 2025-07-31 at 20:41 +0200, Vlastimil Babka wrote:
>> On 7/31/25 20:34, Frank van der Linden wrote:
>> > Not sure what the right thing to do would be. Either explicitly boost
>> > the priority of a thread temporarily during migrate_pages_batch, or
>> > mitigate the issue by dealing with 'busy' pages more quickly in
>> > migrate_pages_batch.
>>
>> There's a workaround for realtime tasks. If you mlock[all]() their memory,
>> setting sysctl vm.compact_unevictable_allowed to 0 should exclude these
>> pages from migration by compaction.
>
> Hm, per documentation that's done automatically for PREEMPT_RT...

Oh I see.

> On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
> to compaction, which would block the task from becoming active until the fault
> is resolved.

So it's probably the mlock() part that is missing, since the sysctl should
otherwise apply to kcompactd.

> ...but rummaging, seems other stuff can step on it (contiguous alloc?).

Yeah, there was a time when CMA was just something for mobile phone hardware.
As usage increases beyond that, maybe we'll have to tackle it. Ideally by not
having mlock'd pages in CMA areas at all. And if contiguous alloc is
attempted outside of CMA areas, respect the sysctl there too.

There are also things like mbind() migrating pages for NUMA locality, but I
assume people just wouldn't try to do that with realtime workloads.

> -Mike

* Re: Realtime threads delayed due to kcompactd0
From: Alexander Krabler @ 2025-08-01 11:23 UTC
To: Vlastimil Babka, Mike Galbraith, Frank van der Linden
Cc: linux-rt-users@vger.kernel.org, linux-mm@kvack.org, Dennis Schimmel, Daniel Braunwarth

On 7/31/25 20:35, Frank van der Linden wrote:
> Not sure what the right thing to do would be. Either explicitly boost
> the priority of a thread temporarily during migrate_pages_batch, or
> mitigate the issue by dealing with 'busy' pages more quickly in
> migrate_pages_batch.
>
> - Frank

I think it would help if priority inheritance kicked in as soon as
another thread is waiting due to the migration PTE.


On 8/1/25 11:58, Vlastimil Babka wrote:
> On 8/1/25 04:46, Mike Galbraith wrote:
> > On Thu, 2025-07-31 at 20:41 +0200, Vlastimil Babka wrote:
> >> On 7/31/25 20:34, Frank van der Linden wrote:
> >> > Not sure what the right thing to do would be. Either explicitly boost
> >> > the priority of a thread temporarily during migrate_pages_batch, or
> >> > mitigate the issue by dealing with 'busy' pages more quickly in
> >> > migrate_pages_batch.
> >>
> >> There's a workaround for realtime tasks. If you mlock[all]() their memory,
> >> setting sysctl vm.compact_unevictable_allowed to 0 should exclude these
> >> pages from migration by compaction.
> >
> > Hm, per documentation that's done automatically for PREEMPT_RT...
>
> Oh I see.

Yes, vm.compact_unevictable_allowed is set to 0 in our setup
(the default, as we have CONFIG_PREEMPT_RT).

> > On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
> > to compaction, which would block the task from becoming active until the fault
> > is resolved.
>
> So it's probably the mlock() part missing since that should otherwise apply
> to kcompactd.

We use mlockall() and I verified the pages have the lo flag set.
Which means that we have exactly the issue this sysctl should prevent.

>
> > ...but rummaging, seems other stuff can step on it (contiguous alloc?).
>
> Yeah, there was time CMA was just something for mobile phone hardware. As
> usage increases beyond that maybe we'll have to tackle it. Ideally by not
> having mlock'd pages in CMA areas at all. And if contiguous alloc is
> attempted outside of CMA areas, respect the sysctl there too.

Yeah, we already thought CMA might somehow influence our issue.
We have 256 MiB of CMA, which our hardware probably uses.
Is there a way to tell a process not to allocate pages inside the CMA area?

>
>
> > -Mike
>

Alexander

KUKA Deutschland GmbH Board of Directors: Michael Jürgens (Chairman), Dirk Busch, Johan Naten, Hui Zhang Registered Office: Augsburg HRB 14914

This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorized copying, disclosure or distribution of contents of this e-mail is strictly forbidden.

Please consider the environment before printing this e-mail.

* Re: Realtime threads delayed due to kcompactd0
From: Vlastimil Babka @ 2025-08-01 12:57 UTC
To: Alexander Krabler, Mike Galbraith, Frank van der Linden
Cc: linux-rt-users@vger.kernel.org, linux-mm@kvack.org, Dennis Schimmel, Daniel Braunwarth

On 8/1/25 13:23, Alexander Krabler wrote:
> On 7/31/25 20:35, Frank van der Linden wrote:
>> Not sure what the right thing to do would be. Either explicitly boost
>> the priority of a thread temporarily during migrate_pages_batch, or
>> mitigate the issue by dealing with 'busy' pages more quickly in
>> migrate_pages_batch.
>>
>> - Frank
>
> I think it would help if priority inheritance would kick in as soon as
> another thread is waiting due to the migration PTE.
>
>
> On 8/1/25 11:58, Vlastimil Babka wrote:
>> On 8/1/25 04:46, Mike Galbraith wrote:
>> > On Thu, 2025-07-31 at 20:41 +0200, Vlastimil Babka wrote:
>> >> On 7/31/25 20:34, Frank van der Linden wrote:
>> >> > Not sure what the right thing to do would be. Either explicitly boost
>> >> > the priority of a thread temporarily during migrate_pages_batch, or
>> >> > mitigate the issue by dealing with 'busy' pages more quickly in
>> >> > migrate_pages_batch.
>> >>
>> >> There's a workaround for realtime tasks. If you mlock[all]() their memory,
>> >> setting sysctl vm.compact_unevictable_allowed to 0 should exclude these
>> >> pages from migration by compaction.
>> >
>> > Hm, per documentation that's done automatically for PREEMPT_RT...
>>
>> Oh I see.
>
> Yes, vm.compact_unevictable_allowed is set to 0 in our setup
> (default as we have CONFIG_PREEMPT_RT).
>
>> > On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
>> > to compaction, which would block the task from becoming active until the fault
>> > is resolved.
>>
>> So it's probably the mlock() part missing since that should otherwise apply
>> to kcompactd.
>
> We use mlockall() and I verified the pages have the lo flag set.
> Which means, that we have exactly the issue this sysctl flag should prevent us from.

Hm that means something isn't working as intended. Do the pages have also
the unevictable flag?

>>
>> > ...but rummaging, seems other stuff can step on it (contiguous alloc?).
>>
>> Yeah, there was time CMA was just something for mobile phone hardware. As
>> usage increases beyond that maybe we'll have to tackle it. Ideally by not
>> having mlock'd pages in CMA areas at all. And if contiguous alloc is
>> attempted outside of CMA areas, respect the sysctl there too.
>
> Yeah, we already thought CMA might somehow influence our issue.
> We have 256 MiB of CMA, which our hardware probably uses.

If the problem is kcompactd, it should not be CMA related.

> Is there a way to tell a process to not allocate pages inside CMA area?

I think not at the moment.

* Re: Realtime threads delayed due to kcompactd0
From: Alexander Krabler @ 2025-08-01 13:40 UTC
To: Vlastimil Babka, Frank van der Linden, Mike Galbraith
Cc: linux-rt-users@vger.kernel.org, linux-mm@kvack.org, Dennis Schimmel, Daniel Braunwarth

On 8/1/25 14:57, Vlastimil Babka wrote:
> Hm that means something isn't working as intended. Do the pages have also
> the unevictable flag?

I have checked /proc/<pid>/smaps.
Neither the manpage (https://www.man7.org/linux/man-pages/man5/proc_pid_smaps.5.html)
nor the kernel source code (fs/proc/task_mmu.c, show_smap_vma_flags) tells me about an unevictable flag.

What exactly do you mean?

Alexander

* Re: Realtime threads delayed due to kcompactd0
From: Vlastimil Babka @ 2025-08-07 10:48 UTC
To: Alexander Krabler, Frank van der Linden, Mike Galbraith, Hugh Dickins
Cc: linux-rt-users@vger.kernel.org, linux-mm@kvack.org, Dennis Schimmel, Daniel Braunwarth

On 8/1/25 15:40, Alexander Krabler wrote:
> On 8/1/25 14:57, Vlastimil Babka wrote:
>> Hm that means something isn't working as intended. Do the pages have also
>> the unevictable flag?
>
> I have checked /proc/<pid>/smaps.
> Neither the manpage (https://www.man7.org/linux/man-pages/man5/proc_pid_smaps.5.html)
> nor kernel source code (fs/proc/task_mmu.c show_smap_vma_flags) tells me about unevictable flag.

Ah ok, you checked /proc/pid/smaps. Since you said "pages" I thought you
meant individual pages, which would be from /proc/kpageflags.

But IIRC getting a page in a state where it has an mlock flag but (not yet)
the unevictable flag should be rare, so I would be surprised if that was the
problem here.

Hm or maybe it's actually that the error can happen only in the other
direction - page is unevictable (on unevictable list) while not mlocked, at
least the counters such as UNEVICTABLE_PGRESCUED suggest that.

Hugh knows this code the best. Do you think we should e.g. test the mlocked
flag instead of, or in addition to, the unevictable flag in compaction to avoid
violating vm.compact_unevictable_allowed = 0?

* Re: Realtime threads delayed due to kcompactd0
From: Hugh Dickins @ 2025-08-07 12:21 UTC
To: Vlastimil Babka
Cc: Alexander Krabler, Frank van der Linden, Mike Galbraith, Hugh Dickins, linux-rt-users@vger.kernel.org, linux-mm@kvack.org, Dennis Schimmel, Daniel Braunwarth

On Thu, 7 Aug 2025, Vlastimil Babka wrote:
> On 8/1/25 15:40, Alexander Krabler wrote:
> > On 8/1/25 14:57, Vlastimil Babka wrote:
> >> Hm that means something isn't working as intended. Do the pages have also
> >> the unevictable flag?
> >
> > I have checked /proc/<pid>/smaps.
> > Neither the manpage (https://www.man7.org/linux/man-pages/man5/proc_pid_smaps.5.html)
> > nor kernel source code (fs/proc/task_mmu.c show_smap_vma_flags) tells me about unevictable flag.
> Ah ok, you checked /proc/pid/smaps. Since you said "pages" I thought you
> meant individual pages which would be from /proc/kpageflags
>
> But IIRC getting a page in a state where it has an mlock flag but (not yet)
> the unevictable flag should be rare so I would be surprised if that was the
> problem here.

Agreed.

>
> Hm or maybe it's actually that the error can happen only in the other
> direction - page is unevictable (on unevictable list) while not mlocked, at
> least the counters such as UNEVICTABLE_PGRESCUED suggest that.
>
> Hugh knows this code the best. Do you think we should e.g. test the mlocked
> flag instead/in addition to unevictable flag in compaction to avoid
> violating vm.compact_unevictable_allowed = 0?

No, checking for unevictable should be sufficient.

But another idea has just this moment occurred to me: anon THP splitting
is another user of migration entries. Is it possible that kcompactd is
not actually the cause of Alexander's issue, but that his RT tasks have
(bad news! shouldn't be allowed) got anon THPs in them?

Back to bed,
Hugh

* Re: Realtime threads delayed due to kcompactd0
From: Alexander Krabler @ 2025-08-07 15:49 UTC
To: Hugh Dickins, Vlastimil Babka
Cc: Frank van der Linden, Mike Galbraith, linux-rt-users@vger.kernel.org, linux-mm@kvack.org, Dennis Schimmel, Daniel Braunwarth

On Thu, 7 Aug 2025 14:21, Hugh Dickins wrote:
> But another idea has just this moment occurred to me: anon THP splitting
> is another user of migration entries. Is it possible that kcompactd is
> not actually the cause of Alexander's issue, but that his RT tasks have
> (bad news! shouldn't be allowed) got anon THPs in them?

No, we don't have transparent hugepages enabled.
(From Kconfig, this seems not even possible together with PREEMPT_RT.)

From the ftrace events (first message in this thread), we know that kcompactd0
finally wakes our realtime thread(s).

Given the information I got from here, the comment on the code [1]
and an older commit message [2], I suspect CMA somehow influences our problem.

If there is anything we should enable for better diagnosis (e.g. more ftrace events),
we would be willing to try that out.

[1] https://elixir.bootlin.com/linux/v6.16/source/mm/compaction.c#L1136
[2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=e46a28790e594c0876d1a84270926abf75460f61

Thanks,
Alexander

* Re: Realtime threads delayed due to kcompactd0
From: Vlastimil Babka @ 2025-08-08 7:37 UTC
To: Alexander Krabler, Hugh Dickins
Cc: Frank van der Linden, Mike Galbraith, linux-rt-users@vger.kernel.org, linux-mm@kvack.org, Dennis Schimmel, Daniel Braunwarth

On 8/7/25 17:49, Alexander Krabler wrote:
> On Thu, 7 Aug 2025 14:21, Hugh Dickins wrote:
>> But another idea has just this moment occurred to me: anon THP splitting
>> is another user of migration entries. Is it possible that kcompactd is
>> not actually the cause of Alexander's issue, but that his RT tasks have
>> (bad news! shouldn't be allowed) got anon THPs in them?
>
> No, we don't have transparent hugepages enabled.
> (From Kconfig, this seems not even possible together with PREEMPT_RT.)
>
> From ftrace events (first message in this thread), we know that kcompactd0
> finally wakes our realtime thread(s).
>
> Given the information I got from here, the comment on the code [1]
> and an older commit message [2], I suspect CMA somehow influences our problem.

However, kcompactd doesn't perform CMA allocations, only compaction, in a
mode that does not include ISOLATE_UNEVICTABLE. So this is weird.

> If there is anything we should enable for better diagnosis (e.g. more ftrace events),
> we would be willing to try that out.
>
> [1] https://elixir.bootlin.com/linux/v6.16/source/mm/compaction.c#L1136
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=e46a28790e594c0876d1a84270926abf75460f61
>
> Thanks,
> Alexander

* Re: Realtime threads delayed due to kcompactd0
From: Sebastian Andrzej Siewior @ 2025-08-20 14:29 UTC
To: Vlastimil Babka
Cc: Alexander Krabler, Hugh Dickins, Frank van der Linden, Mike Galbraith, linux-rt-users@vger.kernel.org, linux-mm@kvack.org, Dennis Schimmel, Daniel Braunwarth

On 2025-08-08 09:37:26 [+0200], Vlastimil Babka wrote:
> > Given the information I got from here, the comment on the code [1]
> > and an older commit message [2], I suspect CMA somehow influences our problem.
>
> However, kcompactd doesn't perform CMA allocations, only compaction, in a
> mode that does not include ISOLATE_UNEVICTABLE. So this is weird.

As per smaps, the RT task should have all VMAs listed as "lo". If you use
mlock(), then something like an accidental fork() would remove it. Otherwise
it should be there.
At the time of the fault you could add something like

| diff --git a/mm/memory.c b/mm/memory.c
| --- a/mm/memory.c
| +++ b/mm/memory.c
| @@ -4476,6 +4476,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
|  	entry = pte_to_swp_entry(vmf->orig_pte);
|  	if (unlikely(non_swap_entry(entry))) {
|  		if (is_migration_entry(entry)) {
| +
| +			if (!strcmp("tRealtime", current->comm)) {
| +				trace_printk("Migrated: 0x%lx VMA flags: %lx\n",
| +					     vmf->address, vma->vm_flags);
| +			}
| +
|  			migration_entry_wait(vma->vm_mm, vmf->pmd,
|  					     vmf->address);
|  		} else if (is_device_exclusive_entry(entry)) {

to see which address is gone. Not sure if the PTE flags are of any help here.
Is it easily possible on the other side (isolate_migratepages(), right?)
to figure out which task a certain address space / page belongs to? That way
we would see if a "bad" page is considered for migration.

Sebastian

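The smaps check mentioned above (every VMA of the RT task carrying "lo") can be automated by scanning the "VmFlags:" lines of /proc/<pid>/smaps. A minimal C sketch of the per-line test follows; vmflags_has() is a hypothetical helper written for this illustration, and it assumes the usual format of space-separated two-letter codes after the "VmFlags:" prefix.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if a "VmFlags:" line from /proc/<pid>/smaps contains `flag`
 * (e.g. "lo" for locked mappings) as a whole token.
 * Lines that do not start with "VmFlags:" yield false.
 */
bool vmflags_has(const char *line, const char *flag)
{
	static const char prefix[] = "VmFlags:";
	const char *p;
	size_t flen;

	assert(line != NULL && flag != NULL);
	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return false;

	p = line + sizeof(prefix) - 1;
	flen = strlen(flag);
	while (*p) {
		size_t len;

		p += strspn(p, " \t\n");	/* skip separators */
		len = strcspn(p, " \t\n");	/* length of next token */
		if (len > 0 && len == flen && strncmp(p, flag, len) == 0)
			return true;
		p += len;
	}
	return false;
}
```

A checker would read smaps line by line (for example with fgets()) and report any mapping whose VmFlags line makes vmflags_has(line, "lo") return false.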
* Re: Realtime threads delayed due to kcompactd0
From: Frank van der Linden @ 2025-08-01 19:27 UTC
To: Vlastimil Babka
Cc: Mike Galbraith, Alexander Krabler, linux-rt-users@vger.kernel.org, linux-mm@kvack.org, Dennis Schimmel, Daniel Braunwarth, Hugh Dickins

On Fri, Aug 1, 2025 at 2:58 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 8/1/25 04:46, Mike Galbraith wrote:
> > On Thu, 2025-07-31 at 20:41 +0200, Vlastimil Babka wrote:
> >> On 7/31/25 20:34, Frank van der Linden wrote:
> >> > Not sure what the right thing to do would be. Either explicitly boost
> >> > the priority of a thread temporarily during migrate_pages_batch, or
> >> > mitigate the issue by dealing with 'busy' pages more quickly in
> >> > migrate_pages_batch.
> >>
> >> There's a workaround for realtime tasks. If you mlock[all]() their memory,
> >> setting sysctl vm.compact_unevictable_allowed to 0 should exclude these
> >> pages from migration by compaction.
> >
> > Hm, per documentation that's done automatically for PREEMPT_RT...
>
> Oh I see.
>
> > On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
> > to compaction, which would block the task from becoming active until the fault
> > is resolved.
>
> So it's probably the mlock() part missing since that should otherwise apply
> to kcompactd.
>
> > ...but rummaging, seems other stuff can step on it (contiguous alloc?).
>
> Yeah, there was time CMA was just something for mobile phone hardware. As
> usage increases beyond that maybe we'll have to tackle it. Ideally by not
> having mlock'd pages in CMA areas at all. And if contiguous alloc is
> attempted outside of CMA areas, respect the sysctl there too.
>
> There are also things like mbind() migrating pages for NUMA locality but I
> assume people just wouldn't try to do that with realtime workloads.
>

Another idea is to minimize the time that a migration PTE is in place
for an mlocked page; Hugh (cc-ed) mentioned this in an offline
discussion. E.g. skip any mlocked pages in the first pass, and just
add them to a list. Then, do that list separately, but do them one by
one. There is somewhat similar logic in migrate_pages_sync for pages
that might need extra work / locking.

Not sure if avoiding mlocked pages in CMA would work out. I mean, it's
not hard to implement, as it would be pretty much the same as for
pin_user_pages: just move them out of CMA on mlock. I'm just a bit
worried about scenarios where the kernel might run out of space for
unmovable allocations if you have a larger amount of CMA, which would
be made worse by moving more allocations out of CMA. Then again, the
amount of mlocked memory is probably generally small.

- Frank

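The two-pass idea described above can be illustrated with a userspace toy model. This is not kernel code and not a proposed patch: struct toy_page and toy_split_batch() are invented for illustration, and the real decision would live in the migration code. The sketch only shows the bookkeeping, with non-mlocked pages going into the batch and mlocked pages deferred to a list that would then be migrated one page at a time, so that each of their migration entries exists only for the duration of a single page's unmap, copy and remap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page; not a kernel structure. */
struct toy_page {
	unsigned long pfn;
	bool mlocked;
};

/*
 * First pass: sort page indices into a batch (migrated together) and a
 * deferred list (mlocked pages, to be handled one by one afterwards).
 * `batch` and `deferred` must each have room for n entries.
 * Returns the number of batched pages; *nr_deferred gets the other count.
 */
size_t toy_split_batch(const struct toy_page *pages, size_t n,
		       size_t *batch, size_t *deferred, size_t *nr_deferred)
{
	size_t nr_batch = 0;

	assert(pages != NULL && batch != NULL && deferred != NULL &&
	       nr_deferred != NULL);
	*nr_deferred = 0;
	for (size_t i = 0; i < n; i++) {
		if (pages[i].mlocked)
			deferred[(*nr_deferred)++] = i;
		else
			batch[nr_batch++] = i;
	}
	return nr_batch;
}
```

The trade-off in this scheme is that deferred pages lose the benefit of the batched TLB flush, in exchange for a shorter window in which a faulting task can block on their migration entries.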
* Re: Realtime threads delayed due to kcompactd0
From: Alexander Krabler @ 2025-08-05 14:11 UTC
To: Frank van der Linden, Vlastimil Babka
Cc: Mike Galbraith, linux-rt-users@vger.kernel.org, linux-mm@kvack.org, Dennis Schimmel, Daniel Braunwarth, Hugh Dickins

On Fri, Aug 1, 2025 at 9:27 PM Frank van der Linden <fvdl@google.com> wrote:
> Another idea is to minimize the time that a migration PTE is in place
> for an mlocked page, Hugh (cc-ed) mentioned this in an offline
> discussion. E.g. skip any mlocked pages in the first pass, and just
> add them to a list. Then, do that list separately, but do them one by
> one. There is somewhat similar logic in migrate_pages_sync for pages
> that might need extra work / locking.

Another idea might be that the blocked task itself migrates the page it is
blocked on, removes the migration PTE and continues, doing the work itself
instead of waiting for another thread to do it. Would that be possible?

This would solve the priority inversion problem and should be far more
deterministic than the current situation.

Alexander
