public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/2] x86: Allow variable-sized event frame
@ 2025-05-12  6:43 Xin Li (Intel)
  2025-05-12  6:43 ` [PATCH v4 1/2] x86/fred: " Xin Li (Intel)
  2025-05-12  6:43 ` [PATCH v4 2/2] x86: Remove the padding space at top of the init stack Xin Li (Intel)
  0 siblings, 2 replies; 6+ messages in thread
From: Xin Li (Intel) @ 2025-05-12  6:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: luto, tglx, mingo, bp, dave.hansen, x86, hpa, peterz, brgerst

This was initially posted as part of the FRED patch series and turned
down due to its unacceptable quality:
  https://lore.kernel.org/lkml/20230410081438.1750-31-xin3.li@intel.com/

Now comes another attempt to meet the bar.


A FRED event frame could contain different amount of information for
different event types, e.g., #MCE could push extra bytes of information,
or perhaps even for different instances of the same event type. Thus
the size of an event frame pushed by a FRED CPU is not fixed and the
address of a pt_regs structure that is used to save the user level
context of current task is not at a fixed offset from top of current
task kernel stack.

This patch set adds a new field named 'user_pt_regs' in the thread_info
structure to save the address of user level context pt_regs structure,
thus to eliminate the need of any advance information of event frame
size and allow a FRED CPU to push variable-sized event frame.

With the above change, we can remove the padding space at top of the
init stack because there is no user level context for init task.


Link to v3: https://lore.kernel.org/lkml/20250321053735.2479875-1-xin@zytor.com/

Change in v4:
* Drop the patch that zaps TOP_OF_KERNEL_STACK_PADDING on x86_64
  (Brian Gerst, hpa).

Change in v3:
* Replace "(struct pt_regs *)TOP_OF_INIT_STACK - 1" with
  (struct pt_regs *)__top_init_kernel_stack (Brian Gerst).
* Add declaration for __top_init_kernel_stack[] (Intel lkp).

Change in v2:
* Rebase on latest tip/master.


Xin Li (Intel) (2):
  x86/fred: Allow variable-sized event frame
  x86: Remove the padding space at top of the init stack

 arch/x86/entry/entry_fred.c        | 10 ++++++++++
 arch/x86/include/asm/processor.h   | 28 ++++++++++++++++++++--------
 arch/x86/include/asm/thread_info.h | 11 ++++++++---
 arch/x86/kernel/process.c          | 22 ++++++++++++++++++++++
 arch/x86/kernel/vmlinux.lds.S      |  7 +++++--
 include/linux/thread_info.h        |  1 +
 kernel/fork.c                      |  6 ++++++
 7 files changed, 72 insertions(+), 13 deletions(-)


base-commit: 8e3f385164626dc6bbf000decf04aa98e943e07e
-- 
2.49.0


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v4 1/2] x86/fred: Allow variable-sized event frame
  2025-05-12  6:43 [PATCH v4 0/2] x86: Allow variable-sized event frame Xin Li (Intel)
@ 2025-05-12  6:43 ` Xin Li (Intel)
  2025-05-12  9:30   ` Peter Zijlstra
  2025-05-12  6:43 ` [PATCH v4 2/2] x86: Remove the padding space at top of the init stack Xin Li (Intel)
  1 sibling, 1 reply; 6+ messages in thread
From: Xin Li (Intel) @ 2025-05-12  6:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: luto, tglx, mingo, bp, dave.hansen, x86, hpa, peterz, brgerst

A FRED event frame could contain different amount of information for
different event types, or perhaps even for different instances of the
same event type. Thus the size of an event frame pushed by a FRED CPU
is not fixed and the address of the pt_regs structure that is used to
save a user level context of current task is not at a fixed offset
from top of current task kernel stack.

Add a new field named 'user_pt_regs' in the thread_info structure to
save the address of user level context pt_regs structure, thus to
eliminate the need of any advance information of event frame size
and allow a FRED CPU to push variable-sized event frame.

For IDT user level event delivery, a pt_regs structure is pushed by
hardware and software _always_ at a fixed offset from top of current
task kernel stack, so simply initialize user_pt_regs to point to the
pt_regs structure no matter whether one is pushed or not.

While for FRED user level event delivery, user_pt_regs is updated with
a pt_regs structure pointer generated in asm_fred_entrypoint_user().

Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
---

Change in v3A:
* Add declaration for __top_init_kernel_stack[] (Intel lkp).

Change in v3:
* Replace "(struct pt_regs *)TOP_OF_INIT_STACK - 1" with
  (struct pt_regs *)__top_init_kernel_stack (Brian Gerst).
---
 arch/x86/entry/entry_fred.c        | 10 ++++++++++
 arch/x86/include/asm/processor.h   | 12 ++++++------
 arch/x86/include/asm/thread_info.h | 11 ++++++++---
 arch/x86/kernel/process.c          | 22 ++++++++++++++++++++++
 include/linux/thread_info.h        |  1 +
 kernel/fork.c                      |  6 ++++++
 6 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
index f004a4dc74c2..a5f5bdd16ad8 100644
--- a/arch/x86/entry/entry_fred.c
+++ b/arch/x86/entry/entry_fred.c
@@ -228,6 +228,16 @@ __visible noinstr void fred_entry_from_user(struct pt_regs *regs)
 	/* Invalidate orig_ax so that syscall_get_nr() works correctly */
 	regs->orig_ax = -1;
 
+	/*
+	 * A FRED event frame could contain different amount of information
+	 * for different event types, or perhaps even for different instances
+	 * of the same event type. Thus the size of an event frame pushed by
+	 * a FRED CPU is not fixed and the address of the pt_regs structure
+	 * that is used to save a user level context of current task is not
+	 * at a fixed offset from top of current task stack.
+	 */
+	current->thread_info.user_pt_regs = regs;
+
 	switch (regs->fred_ss.type) {
 	case EVENT_TYPE_EXTINT:
 		return fred_extint(regs);
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 50d34698036d..42e5a6a41403 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -645,12 +645,12 @@ static __always_inline void prefetchw(const void *x)
 
 #define task_top_of_stack(task) ((unsigned long)(task_pt_regs(task) + 1))
 
-#define task_pt_regs(task) \
-({									\
-	unsigned long __ptr = (unsigned long)task_stack_page(task);	\
-	__ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;		\
-	((struct pt_regs *)__ptr) - 1;					\
-})
+/*
+ * Note, this can't be converted to an inline function as this header
+ * file defines 'struct thread_struct' which is used in the task_struct
+ * structure definition.
+ */
+#define task_pt_regs(task) ((task)->thread_info.user_pt_regs)
 
 #ifdef CONFIG_X86_32
 #define INIT_THREAD  {							  \
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 9282465eea21..07c6a6a92c65 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -56,6 +56,7 @@
  */
 #ifndef __ASSEMBLER__
 struct task_struct;
+struct pt_regs;
 #include <asm/cpufeature.h>
 #include <linux/atomic.h>
 
@@ -66,11 +67,15 @@ struct thread_info {
 #ifdef CONFIG_SMP
 	u32			cpu;		/* current CPU */
 #endif
+	struct pt_regs		*user_pt_regs;
 };
 
-#define INIT_THREAD_INFO(tsk)			\
-{						\
-	.flags		= 0,			\
+extern unsigned long __top_init_kernel_stack[];
+
+#define INIT_THREAD_INFO(tsk)						\
+{									\
+	.flags		= 0,						\
+	.user_pt_regs	= (struct pt_regs *)__top_init_kernel_stack,	\
 }
 
 #else /* !__ASSEMBLER__ */
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 4b668bc683c4..9823880ceaa0 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -104,6 +104,28 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 	return 0;
 }
 
+/*
+ * Initialize thread_info.user_pt_regs for IDT event delivery.
+ *
+ * For IDT user level event delivery, a pt_regs structure is pushed by both
+ * hardware and software and always resides at a fixed offset from top of
+ * current task kernel stack, thus thread_info.user_pt_regs is a per-task
+ * constant and NEVER changes after initialization.
+ *
+ * While for FRED user level event delivery, user_pt_regs is updated in
+ * fred_entry_from_user() immediately after user level event delivery.
+ *
+ * Note: thread_info.user_pt_regs of the init task is initialized at build
+ * time.
+ */
+void arch_init_user_pt_regs(struct task_struct *tsk)
+{
+	unsigned long top_of_stack = (unsigned long)task_stack_page(tsk) + THREAD_SIZE;
+
+	top_of_stack -= TOP_OF_KERNEL_STACK_PADDING;
+	tsk->thread_info.user_pt_regs = (struct pt_regs *)top_of_stack - 1;
+}
+
 #ifdef CONFIG_X86_64
 void arch_release_task_struct(struct task_struct *tsk)
 {
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index dd925d84fa46..f95d38ed04d7 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -225,6 +225,7 @@ void arch_task_cache_init(void); /* for CONFIG_SH */
 void arch_release_task_struct(struct task_struct *tsk);
 int arch_dup_task_struct(struct task_struct *dst,
 				struct task_struct *src);
+void arch_init_user_pt_regs(struct task_struct *tsk);
 
 #endif	/* __KERNEL__ */
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 1f5d8083eeb2..5b806e74f766 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1110,6 +1110,10 @@ int __weak arch_dup_task_struct(struct task_struct *dst,
 	return 0;
 }
 
+void __weak arch_init_user_pt_regs(struct task_struct *tsk)
+{
+}
+
 void set_task_stack_end_magic(struct task_struct *tsk)
 {
 	unsigned long *stackend;
@@ -1137,6 +1141,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (err)
 		goto free_tsk;
 
+	arch_init_user_pt_regs(tsk);
+
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	refcount_set(&tsk->stack_refcount, 1);
 #endif
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v4 2/2] x86: Remove the padding space at top of the init stack
  2025-05-12  6:43 [PATCH v4 0/2] x86: Allow variable-sized event frame Xin Li (Intel)
  2025-05-12  6:43 ` [PATCH v4 1/2] x86/fred: " Xin Li (Intel)
@ 2025-05-12  6:43 ` Xin Li (Intel)
  1 sibling, 0 replies; 6+ messages in thread
From: Xin Li (Intel) @ 2025-05-12  6:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: luto, tglx, mingo, bp, dave.hansen, x86, hpa, peterz, brgerst

Because the owner of the init stack, init task, doesn't have any user
level context, there will NEVER be an actual pt_regs structure pushed
at top of the init stack.

However a zeroed pt_regs structure is created at build time and kept
at top of the init stack for task_pt_regs() to function properly with
the init task in the same manner as a normal task with user level
context.

Besides, task_pt_regs() no longer converts a fixed offset from top of
a task kernel stack to a pt_regs structure pointer, but rather returns
whatever in the thread_info.user_pt_regs field, which is initialized
at build time to '(struct pt_regs *)TOP_OF_INIT_STACK - 1' for the
init task.

As a result, there is no point to reserve any padding space at top of
the init stack, so remove the padding space.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
---
 arch/x86/include/asm/processor.h | 16 ++++++++++++++--
 arch/x86/kernel/vmlinux.lds.S    |  7 +++++--
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 42e5a6a41403..5c5378232bd4 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -640,8 +640,20 @@ static __always_inline void prefetchw(const void *x)
 			  "m" (*(const char *)x));
 }
 
-#define TOP_OF_INIT_STACK ((unsigned long)&init_stack + sizeof(init_stack) - \
-			   TOP_OF_KERNEL_STACK_PADDING)
+extern unsigned long __end_init_stack[];
+
+/*
+ * No need to reserve extra padding space above the pt_regs structure
+ * at top of the init stack, because its owner init task doesn't have
+ * any user level context, thus there will NEVER be an actual pt_regs
+ * structure pushed at top of the init stack.
+ *
+ * However a zeroed pt_regs structure is created at build time and kept
+ * at top of the init stack for task_pt_regs() to function properly with
+ * the init task in the same manner as a normal task with user level
+ * context.
+ */
+#define TOP_OF_INIT_STACK ((unsigned long)&__end_init_stack)
 
 #define task_top_of_stack(task) ((unsigned long)(task_pt_regs(task) + 1))
 
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index d813f64a89d6..6266e500f7fb 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -181,8 +181,11 @@ SECTIONS
 		/* init_task */
 		INIT_TASK_DATA(THREAD_SIZE)
 
-		/* equivalent to task_pt_regs(&init_task) */
-		__top_init_kernel_stack = __end_init_stack - TOP_OF_KERNEL_STACK_PADDING - PTREGS_SIZE;
+		/*
+		 * task_pt_regs(&init_task) is:
+		 *	'(struct pt_regs *)&__end_init_stack - 1'
+		 */
+		__top_init_kernel_stack = __end_init_stack - PTREGS_SIZE;
 
 #ifdef CONFIG_X86_32
 		/* 32 bit has nosave before _edata */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v4 1/2] x86/fred: Allow variable-sized event frame
  2025-05-12  6:43 ` [PATCH v4 1/2] x86/fred: " Xin Li (Intel)
@ 2025-05-12  9:30   ` Peter Zijlstra
  2025-05-12 12:10     ` Brian Gerst
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2025-05-12  9:30 UTC (permalink / raw)
  To: Xin Li (Intel)
  Cc: linux-kernel, luto, tglx, mingo, bp, dave.hansen, x86, hpa,
	brgerst

On Sun, May 11, 2025 at 11:43:52PM -0700, Xin Li (Intel) wrote:


> -#define INIT_THREAD_INFO(tsk)			\
> -{						\
> -	.flags		= 0,			\
> +extern unsigned long __top_init_kernel_stack[];
> +
> +#define INIT_THREAD_INFO(tsk)						\
> +{									\
> +	.flags		= 0,						\
> +	.user_pt_regs	= (struct pt_regs *)__top_init_kernel_stack,	\

Should that not have a -1 on or something?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v4 1/2] x86/fred: Allow variable-sized event frame
  2025-05-12  9:30   ` Peter Zijlstra
@ 2025-05-12 12:10     ` Brian Gerst
  2025-05-13  5:02       ` Xin Li
  0 siblings, 1 reply; 6+ messages in thread
From: Brian Gerst @ 2025-05-12 12:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Xin Li (Intel), linux-kernel, luto, tglx, mingo, bp, dave.hansen,
	x86, hpa

On Mon, May 12, 2025 at 5:30 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, May 11, 2025 at 11:43:52PM -0700, Xin Li (Intel) wrote:
>
>
> > -#define INIT_THREAD_INFO(tsk)                        \
> > -{                                            \
> > -     .flags          = 0,                    \
> > +extern unsigned long __top_init_kernel_stack[];
> > +
> > +#define INIT_THREAD_INFO(tsk)                                                \
> > +{                                                                    \
> > +     .flags          = 0,                                            \
> > +     .user_pt_regs   = (struct pt_regs *)__top_init_kernel_stack,    \
>
> Should that not have a -1 on or something?

No, that symbol already accounts for subtracting the size of pt_regs.


Brian Gerst

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v4 1/2] x86/fred: Allow variable-sized event frame
  2025-05-12 12:10     ` Brian Gerst
@ 2025-05-13  5:02       ` Xin Li
  0 siblings, 0 replies; 6+ messages in thread
From: Xin Li @ 2025-05-13  5:02 UTC (permalink / raw)
  To: Brian Gerst, Peter Zijlstra
  Cc: linux-kernel, luto, tglx, mingo, bp, dave.hansen, x86, hpa

On 5/12/2025 5:10 AM, Brian Gerst wrote:
> On Mon, May 12, 2025 at 5:30 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Sun, May 11, 2025 at 11:43:52PM -0700, Xin Li (Intel) wrote:
>>
>>
>>> -#define INIT_THREAD_INFO(tsk)                        \
>>> -{                                            \
>>> -     .flags          = 0,                    \
>>> +extern unsigned long __top_init_kernel_stack[];
>>> +
>>> +#define INIT_THREAD_INFO(tsk)                                                \
>>> +{                                                                    \
>>> +     .flags          = 0,                                            \
>>> +     .user_pt_regs   = (struct pt_regs *)__top_init_kernel_stack,    \
>>
>> Should that not have a -1 on or something?
> 
> No, that symbol already accounts for subtracting the size of pt_regs.

Thanks for your answer :)


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-05-13  5:03 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-12  6:43 [PATCH v4 0/2] x86: Allow variable-sized event frame Xin Li (Intel)
2025-05-12  6:43 ` [PATCH v4 1/2] x86/fred: " Xin Li (Intel)
2025-05-12  9:30   ` Peter Zijlstra
2025-05-12 12:10     ` Brian Gerst
2025-05-13  5:02       ` Xin Li
2025-05-12  6:43 ` [PATCH v4 2/2] x86: Remove the padding space at top of the init stack Xin Li (Intel)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox