[PATCH v4] x86/sgx: Fix RCU Tasks stall in EPC sanitization loop

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* [PATCH v4] x86/sgx: Fix RCU Tasks stall in EPC sanitization loop
@ 2026-06-24 14:20 Jun Miao
  2026-06-24 22:23 ` Huang, Kai
  0 siblings, 1 reply; 3+ messages in thread
From: Jun Miao @ 2026-06-24 14:20 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang
  Cc: challvy.tee, fan.du, linux-kernel, linux-sgx, qiang.zhang,
	jun.miao

The kernel resets all EPC pages to a clean state in a loop before using them
for enclaves.  The number of EPC pages could be large (e.g., GBs) thus
resetting them could take a fair amount of time.  Because of that, during
early boot, the kernel resets EPC pages through a kernel thread ksgxd() and
there's a cond_resched() after resetting each EPC page.

This is fine in most cases, but becomes a problem when there's other kernel
code waiting for RCU-Tasks grace period but the cond_resched() in ksgxd()
never triggers rescheduling.  Because cond_resched() doesn't report quiescent
state when it doesn't trigger rescheduling, the thread that is waiting for
RCU-Tasks grace period will need to wait until all EPC pages are reset.

For instance, BPF LSM subsystem can invoke synchronize_rcu_tasks() at kernel
boot time.  A VM with a large EPC assigned and have BPF LSM enabled can take
a long time to boot, with a call trace triggered:

    rcu_tasks_wait_gp: rcu_tasks grace period number 1 (since boot) is 130631 jiffies old.
    INFO: task systemd:1 blocked for more than 122 seconds.
    ...
    task:systemd  state:D stack:0  pid:1  tpid:1  ppid:0  flags:0x00000002
    Call Trace:
    ...
    schedule_timeout+0x157/0x170
    wait_for_completion+0x88/0x150
    __wait_rcu_gp+0x17e/0x190
    synchronize_rcu_tasks_generic+0x64/0x60
    ...
    synchronize_rcu_tasks+0x15/0x20
    register_ftrace_direct+0x31f/0x350
    ...
    bpf_trampoline_link_prog+0x33/0x60
    bpf_tracing_prog_attach+0x3c5/0x5f0

Replace cond_resched() with cond_resched_tasks_rcu_qs() which explicitly report quiescent
regardless whether actual rescheduling is triggered.  Resetting all EPC pages in ksgxd()
isn't performance critical so the extra cost of cond_resched_tasks_rcu_qs() isn't a problem.

Tests showed this reduced the VM kernel boot time from ~50s to ~700ms.

Reported-by: Challvy Tee <challvy.tee@gmail.com>
Link: https://github.com/systemd/systemd/issues/40423
Fixes: e7e0545299d8 ("x86/sgx: Initialize metadata for Enclave Page Cache (EPC) sections")
Tested-by: Challvy Tee <challvy.tee@gmail.com>
Suggested-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Fan Du <fan.du@intel.com>
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Jun Miao <jun.miao@intel.com>

---
v1 -> v2:
 - Clarify the RCU Tasks stall root cause.
 - Use cond_resched_rcu_qs() following the Kai`s suggestion.

v2 -> v3:
 - cee439398933 ("rcu: Rename cond_resched_rcu_qs() to cond_resched_tasks_rcu_qs()")

v3 ->v4:
 - Trim down/rewrite changelog following Kai`s suggestion.

---
 arch/x86/kernel/cpu/sgx/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 4505f808af5e..7ba3d0a5a05d 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -106,7 +106,7 @@ static unsigned long __sgx_sanitize_pages(struct list_head *dirty_page_list)
 			left_dirty++;
 		}

-		cond_resched();
+		cond_resched_tasks_rcu_qs();
 	}

 	list_splice(&dirty, dirty_page_list);
-- 
2.32.0

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH v4] x86/sgx: Fix RCU Tasks stall in EPC sanitization loop
  2026-06-24 14:20 [PATCH v4] x86/sgx: Fix RCU Tasks stall in EPC sanitization loop Jun Miao
@ 2026-06-24 22:23 ` Huang, Kai
  2026-06-25  1:42   ` Miao, Jun
  0 siblings, 1 reply; 3+ messages in thread
From: Huang, Kai @ 2026-06-24 22:23 UTC (permalink / raw)
  To: Miao, Jun, jarkko@kernel.org, dave.hansen@linux.intel.com
  Cc: linux-sgx@vger.kernel.org, Du, Fan, challvy.tee@gmail.com,
	linux-kernel@vger.kernel.org, qiang.zhang@linux.dev

Some nits in spell/grammar below.  Sorry I didn't go through any spell/grammar
check last night since it was already very late for me.

Btw, the v4 is quicker than I expected anyway.  We are still in merge window so
maintainers are not expected to review this.  For more info, see "Merge window"
section in Documentation/process/maintainer-tip.rst.

On Wed, 2026-06-24 at 22:20 +0800, Jun Miao wrote:
> The kernel resets all EPC pages to a clean state in a loop before using them
> for enclaves.  The number of EPC pages could be large (e.g., GBs) thus
> resetting them could take a fair amount of time.  Because of that, during

s/fair/significant

> early boot, the kernel resets EPC pages through a kernel thread ksgxd() and
> there's a cond_resched() after resetting each EPC page.
> 
> This is fine in most cases, but becomes a problem when there's other kernel
> code waiting for RCU-Tasks grace period but the cond_resched() in ksgxd()
> never triggers rescheduling.  Because cond_resched() doesn't report quiescent
> state when it doesn't trigger rescheduling, the thread that is waiting for
> RCU-Tasks grace period will need to wait until all EPC pages are reset.
> 
> For instance, BPF LSM subsystem can invoke synchronize_rcu_tasks() at kernel
> boot time.  A VM with a large EPC assigned and have BPF LSM enabled can take
						 ^
Remove the 'have'.

> a long time to boot, with a call trace triggered:
> 
>     rcu_tasks_wait_gp: rcu_tasks grace period number 1 (since boot) is 130631 jiffies old.
>     INFO: task systemd:1 blocked for more than 122 seconds.
>     ...
>     task:systemd  state:D stack:0  pid:1  tpid:1  ppid:0  flags:0x00000002
>     Call Trace:
>     ...
>     schedule_timeout+0x157/0x170
>     wait_for_completion+0x88/0x150
>     __wait_rcu_gp+0x17e/0x190
>     synchronize_rcu_tasks_generic+0x64/0x60
>     ...
>     synchronize_rcu_tasks+0x15/0x20
>     register_ftrace_direct+0x31f/0x350
>     ...
>     bpf_trampoline_link_prog+0x33/0x60
>     bpf_tracing_prog_attach+0x3c5/0x5f0
> 
> Replace cond_resched() with cond_resched_tasks_rcu_qs() which explicitly report quiescent

s/report quiescent/reports quiescent state

> regardless whether actual rescheduling is triggered.  Resetting all EPC pages in ksgxd()

s/regardless/regardless of

> isn't performance critical so the extra cost of cond_resched_tasks_rcu_qs() isn't a problem.
> 
> Tests showed this reduced the VM kernel boot time from ~50s to ~700ms.
> 
> Reported-by: Challvy Tee <challvy.tee@gmail.com>
> Link: https://github.com/systemd/systemd/issues/40423
> Fixes: e7e0545299d8 ("x86/sgx: Initialize metadata for Enclave Page Cache (EPC) sections")
> Tested-by: Challvy Tee <challvy.tee@gmail.com>
> Suggested-by: Kai Huang <kai.huang@intel.com>
> Co-developed-by: Fan Du <fan.du@intel.com>
> Signed-off-by: Fan Du <fan.du@intel.com>
> Signed-off-by: Jun Miao <jun.miao@intel.com>


Btw, English isn't my first language, so feel free to use AI to improve the
above changelog if you want (I just did but not sure the changes are all better
though).

Anyway, feel free to add:

Reviewed-by: Kai Huang <kai.huang@intel.com>

^ permalink raw reply	[flat|nested] 3+ messages in thread

* RE: [PATCH v4] x86/sgx: Fix RCU Tasks stall in EPC sanitization loop
  2026-06-24 22:23 ` Huang, Kai
@ 2026-06-25  1:42   ` Miao, Jun
  0 siblings, 0 replies; 3+ messages in thread
From: Miao, Jun @ 2026-06-25  1:42 UTC (permalink / raw)
  To: Huang, Kai, jarkko@kernel.org, dave.hansen@linux.intel.com
  Cc: linux-sgx@vger.kernel.org, Du, Fan, challvy.tee@gmail.com,
	linux-kernel@vger.kernel.org, qiang.zhang@linux.dev

Hi Kai,
>Some nits in spell/grammar below.  Sorry I didn't go through any spell/grammar
>check last night since it was already very late for me.
>
>Btw, the v4 is quicker than I expected anyway.  We are still in merge window so
>maintainers are not expected to review this.  For more info, see "Merge window"
>section in Documentation/process/maintainer-tip.rst.
>
>On Wed, 2026-06-24 at 22:20 +0800, Jun Miao wrote:
>> The kernel resets all EPC pages to a clean state in a loop before
>> using them for enclaves.  The number of EPC pages could be large
>> (e.g., GBs) thus resetting them could take a fair amount of time.
>> Because of that, during
>
>s/fair/significant
>
>> early boot, the kernel resets EPC pages through a kernel thread
>> ksgxd() and there's a cond_resched() after resetting each EPC page.
>>
>> This is fine in most cases, but becomes a problem when there's other
>> kernel code waiting for RCU-Tasks grace period but the cond_resched()
>> in ksgxd() never triggers rescheduling.  Because cond_resched()
>> doesn't report quiescent state when it doesn't trigger rescheduling,
>> the thread that is waiting for RCU-Tasks grace period will need to wait until all
>EPC pages are reset.
>>
>> For instance, BPF LSM subsystem can invoke synchronize_rcu_tasks() at
>> kernel boot time.  A VM with a large EPC assigned and have BPF LSM
>> enabled can take
>						 ^
>Remove the 'have'.
>
>> a long time to boot, with a call trace triggered:
>>
>>     rcu_tasks_wait_gp: rcu_tasks grace period number 1 (since boot) is 130631
>jiffies old.
>>     INFO: task systemd:1 blocked for more than 122 seconds.
>>     ...
>>     task:systemd  state:D stack:0  pid:1  tpid:1  ppid:0  flags:0x00000002
>>     Call Trace:
>>     ...
>>     schedule_timeout+0x157/0x170
>>     wait_for_completion+0x88/0x150
>>     __wait_rcu_gp+0x17e/0x190
>>     synchronize_rcu_tasks_generic+0x64/0x60
>>     ...
>>     synchronize_rcu_tasks+0x15/0x20
>>     register_ftrace_direct+0x31f/0x350
>>     ...
>>     bpf_trampoline_link_prog+0x33/0x60
>>     bpf_tracing_prog_attach+0x3c5/0x5f0
>>
>> Replace cond_resched() with cond_resched_tasks_rcu_qs() which
>> explicitly report quiescent
>
>s/report quiescent/reports quiescent state
>
>> regardless whether actual rescheduling is triggered.  Resetting all
>> EPC pages in ksgxd()
>
>s/regardless/regardless of
>
>> isn't performance critical so the extra cost of cond_resched_tasks_rcu_qs() isn't
>a problem.
>>
>> Tests showed this reduced the VM kernel boot time from ~50s to ~700ms.
>>
>> Reported-by: Challvy Tee <challvy.tee@gmail.com>
>> Link: https://github.com/systemd/systemd/issues/40423
>> Fixes: e7e0545299d8 ("x86/sgx: Initialize metadata for Enclave Page
>> Cache (EPC) sections")
>> Tested-by: Challvy Tee <challvy.tee@gmail.com>
>> Suggested-by: Kai Huang <kai.huang@intel.com>
>> Co-developed-by: Fan Du <fan.du@intel.com>
>> Signed-off-by: Fan Du <fan.du@intel.com>
>> Signed-off-by: Jun Miao <jun.miao@intel.com>
>
>
>Btw, English isn't my first language, so feel free to use AI to improve the above
>changelog if you want (I just did but not sure the changes are all better though).

Thank you for your detailed guidance. I was still a bit careless.
Got it all your review, I will review commit log through AI again before push V5 to make it more perfect.

>Anyway, feel free to add:
>
>Reviewed-by: Kai Huang <kai.huang@intel.com>

It is my pleasure :-)

---
Warm regards
Jun Miao


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-06-25  1:42 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-24 14:20 [PATCH v4] x86/sgx: Fix RCU Tasks stall in EPC sanitization loop Jun Miao
2026-06-24 22:23 ` Huang, Kai
2026-06-25  1:42   ` Miao, Jun

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox