* [PATCH RFC 0/6] Implement per-processor data areas for i386.
@ 2006-08-27 8:44 Jeremy Fitzhardinge
2006-08-27 8:44 ` [PATCH RFC 1/6] Basic definitions for i386-pda Jeremy Fitzhardinge
` (8 more replies)
0 siblings, 9 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 8:44 UTC (permalink / raw)
To: linux-kernel
Cc: Chuck Ebbert, Zachary Amsden, Jan Beulich, Andi Kleen,
Andrew Morton
This patch implements per-processor data areas by using %gs as the
base segment of the per-processor memory. This has two principle
advantages:
- It allows very simple direct access to per-processor data by
effectively using an effective address of the form %gs:offset, where
offset is the offset into struct i386_pda. These sequences are faster
and smaller than the current mechanism using current_thread_info().
- It also allows per-CPU data to be allocated as each CPU is brought
up, rather than statically allocating it based on the maximum number
of CPUs which could be brought up.
I haven't measured performance yet, but when using the PDA for "current"
and "smp_processor_id", I see a 5715 byte reduction in .text segment
size for my kernel.
Unfortunately, these patches don't actually work yet. I'm not sure why;
I'm hoping review will turn something up.
Some background for people unfamiliar with x86 segmentation:
This uses the x86 segmentation stuff in a way similar to NPTL's way of
implementing Thread-Local Storage. It relies on the fact that each CPU
has its own Global Descriptor Table (GDT), which is basically an array
of base-length pairs (with some extra stuff). When a segment register
is loaded with a descriptor (approximately, an index in the GDT), and
you use that segment register for memory access, the address has the
base added to it, and the resulting address is used.
In other words, if you imagine the GDT containing an entry:
Index Offset
123: 0xc0211000 (allocated PDA)
and you load %gs with this selector:
mov $123, %gs
and then use GS later on:
mov %gs:4, %eax
This has the effect of
mov 0xc0211004, %eax
and because the GDT is per-CPU, the offset (= 0xc0211000 = memory
allocated for this CPU's PDA) can be a CPU-specific value while leaving
everything else constant.
This means that something like "current" or "smp_processor_id()" can
collapse to a single instruction:
mov %gs:PDA_current, %reg
TODO:
- Make it work. It works UP on a test QEMU machine, but it doesn't
yet work on real hardware, or SMP (though not working SMP on QEMU is
more likely to be a QEMU problem). Not sure what the problem is yet;
I'm hoping review will reveal something.
- Measure performance impact. The patch adds a segment register
save/restore on entry/exit to the kernel. This expense should be
offset by savings in using the PDA while in the kernel, but I haven't
measured this yet. Space savings are already appealing though.
- Modify more things to use the PDA. The more that uses it, the more
the cost of the %gs save/restore is amortized. smp_processor_id and
current are the obvious first choices, which are implemented in this
series.
- Make it a config option? UP systems don't need to do any of this,
other than having a single pre-allocated PDA. Unfortunately, it gets
a bit messy to do this given the changes needed in handling %gs.
--
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH RFC 1/6] Basic definitions for i386-pda
2006-08-27 8:44 [PATCH RFC 0/6] Implement per-processor data areas for i386 Jeremy Fitzhardinge
@ 2006-08-27 8:44 ` Jeremy Fitzhardinge
2006-08-27 8:44 ` [PATCH RFC 2/6] Initialize the per-CPU data area Jeremy Fitzhardinge
` (7 subsequent siblings)
8 siblings, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 8:44 UTC (permalink / raw)
To: linux-kernel
Cc: Chuck Ebbert, Zachary Amsden, Jan Beulich, Andi Kleen,
Andrew Morton
[-- Attachment #1: i386-pda-definitions.patch --]
[-- Type: text/plain, Size: 4401 bytes --]
This patch has the basic definitions of struct i386_pda, and the
segment selector in the GDT.
asm-i386/pda.h is more or less a direct copy of asm-x86_64/pda.h. The
most interesting difference is the use of _proxy_pda, which is used to
give gcc a model for the actual memory operations on the real pda
structure. No actual reference is ever made to _proxy_pda, so it is
never defined.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@suse.de>
---
arch/i386/kernel/asm-offsets.c | 4 ++
arch/i386/kernel/head.S | 2 -
include/asm-i386/pda.h | 67 ++++++++++++++++++++++++++++++++++++++++
include/asm-i386/segment.h | 5 ++
4 files changed, 76 insertions(+), 2 deletions(-)
===================================================================
--- a/arch/i386/kernel/asm-offsets.c
+++ b/arch/i386/kernel/asm-offsets.c
@@ -15,6 +15,7 @@
#include <asm/processor.h>
#include <asm/thread_info.h>
#include <asm/elf.h>
+#include <asm/pda.h>
#define DEFINE(sym, val) \
asm volatile("\n->" #sym " %0 " #val : : "i" (val))
@@ -74,4 +75,7 @@ void foo(void)
DEFINE(VDSO_PRELINK, VDSO_PRELINK);
OFFSET(crypto_tfm_ctx_offset, crypto_tfm, __crt_ctx);
+
+ BLANK();
+ OFFSET(PDA_pcurrent, i386_pda, pcurrent);
}
===================================================================
--- a/arch/i386/kernel/head.S
+++ b/arch/i386/kernel/head.S
@@ -591,7 +591,7 @@ ENTRY(cpu_gdt_table)
.quad 0x004092000000ffff /* 0xc8 APM DS data */
.quad 0x0000920000000000 /* 0xd0 - ESPFIX 16-bit SS */
- .quad 0x0000000000000000 /* 0xd8 - unused */
+ .quad 0x0000000000000000 /* 0xd8 - PDA */
.quad 0x0000000000000000 /* 0xe0 - unused */
.quad 0x0000000000000000 /* 0xe8 - unused */
.quad 0x0000000000000000 /* 0xf0 - unused */
===================================================================
--- a/include/asm-i386/segment.h
+++ b/include/asm-i386/segment.h
@@ -39,7 +39,7 @@
* 25 - APM BIOS support
*
* 26 - ESPFIX small SS
- * 27 - unused
+ * 27 - PDA [ per-cpu private data area ]
* 28 - unused
* 29 - unused
* 30 - unused
@@ -73,6 +73,9 @@
#define GDT_ENTRY_ESPFIX_SS (GDT_ENTRY_KERNEL_BASE + 14)
#define __ESPFIX_SS (GDT_ENTRY_ESPFIX_SS * 8)
+
+#define GDT_ENTRY_PDA (GDT_ENTRY_KERNEL_BASE + 15)
+#define __KERNEL_PDA (GDT_ENTRY_PDA * 8)
#define GDT_ENTRY_DOUBLEFAULT_TSS 31
===================================================================
--- /dev/null
+++ b/include/asm-i386/pda.h
@@ -0,0 +1,67 @@
+#ifndef _I386_PDA_H
+#define _I386_PDA_H
+
+struct i386_pda
+{
+ struct task_struct *pcurrent; /* current process */
+ int cpu_number;
+};
+
+extern struct i386_pda *_cpu_pda[];
+
+#define cpu_pda(i) (_cpu_pda[i])
+
+#define pda_offset(field) offsetof(struct i386_pda, field)
+
+extern void __bad_pda_field(void);
+
+extern struct i386_pda _proxy_pda;
+
+#define pda_to_op(op,field,val) \
+ do { \
+ typedef typeof(_proxy_pda.field) T__; \
+ switch (sizeof(_proxy_pda.field)) { \
+ case 2: \
+ asm(op "w %1,%%gs:%P2" \
+ : "=m" (_proxy_pda.field) \
+ :"ri" ((T__)val), \
+ "i"(pda_offset(field))); \
+ break; \
+ case 4: \
+ asm(op "l %1,%%gs:%P2" \
+ : "=m" (_proxy_pda.field) \
+ :"ri" ((T__)val), \
+ "i"(pda_offset(field))); \
+ break; \
+ default: __bad_pda_field(); \
+ } \
+ } while (0)
+
+#define pda_from_op(op,field) \
+ ({ \
+ typeof(_proxy_pda.field) ret__; \
+ switch (sizeof(_proxy_pda.field)) { \
+ case 2: \
+ asm(op "w %%gs:%P1,%0" \
+ : "=r" (ret__) \
+ : "i" (pda_offset(field)), \
+ "m" (_proxy_pda.field)); \
+ break; \
+ case 4: \
+ asm(op "l %%gs:%P1,%0" \
+ : "=r" (ret__) \
+ : "i" (pda_offset(field)), \
+ "m" (_proxy_pda.field)); \
+ break; \
+ default: __bad_pda_field(); \
+ } \
+ ret__; })
+
+
+#define read_pda(field) pda_from_op("mov",field)
+#define write_pda(field,val) pda_to_op("mov",field,val)
+#define add_pda(field,val) pda_to_op("add",field,val)
+#define sub_pda(field,val) pda_to_op("sub",field,val)
+#define or_pda(field,val) pda_to_op("or",field,val)
+
+#endif /* _I386_PDA_H */
--
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH RFC 2/6] Initialize the per-CPU data area.
2006-08-27 8:44 [PATCH RFC 0/6] Implement per-processor data areas for i386 Jeremy Fitzhardinge
2006-08-27 8:44 ` [PATCH RFC 1/6] Basic definitions for i386-pda Jeremy Fitzhardinge
@ 2006-08-27 8:44 ` Jeremy Fitzhardinge
2006-08-27 8:44 ` [PATCH RFC 3/6] Use %gs as the PDA base-segment in the kernel Jeremy Fitzhardinge
` (6 subsequent siblings)
8 siblings, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 8:44 UTC (permalink / raw)
To: linux-kernel
Cc: Chuck Ebbert, Zachary Amsden, Jan Beulich, Andi Kleen,
Andrew Morton
[-- Attachment #1: i386-pda-init.patch --]
[-- Type: text/plain, Size: 3578 bytes --]
When a CPU is brought up, a PDA and GDT are allocated for it. The
GDT's __KERNEL_PDA entry is pointed to the allocated PDA memory, so
that all references using this segment selector will refer to the PDA.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@suse.de>
---
arch/i386/kernel/cpu/common.c | 49 ++++++++++++++++++++++++++++++++++++++---
1 file changed, 46 insertions(+), 3 deletions(-)
===================================================================
--- a/arch/i386/kernel/cpu/common.c
+++ b/arch/i386/kernel/cpu/common.c
@@ -18,6 +18,7 @@
#include <asm/apic.h>
#include <mach_apic.h>
#endif
+#include <asm/pda.h>
#include "cpu.h"
@@ -26,6 +27,9 @@ EXPORT_PER_CPU_SYMBOL(cpu_gdt_descr);
DEFINE_PER_CPU(unsigned char, cpu_16bit_stack[CPU_16BIT_STACK_SIZE]);
EXPORT_PER_CPU_SYMBOL(cpu_16bit_stack);
+
+struct i386_pda *_cpu_pda[NR_CPUS] __read_mostly;
+EXPORT_SYMBOL(_cpu_pda);
static int cachesize_override __cpuinitdata = -1;
static int disable_x86_fxsr __cpuinitdata;
@@ -582,6 +586,26 @@ void __init early_cpu_init(void)
disable_pse = 1;
#endif
}
+
+static __cpuinit void pda_init(int cpu)
+{
+ struct i386_pda *pda = cpu_pda(cpu);
+
+ memset(pda, 0, sizeof(*pda));
+
+ pda->cpu_number = cpu;
+ pda->pcurrent = current;
+
+ printk("cpu %d current %p\n", cpu, current);
+}
+
+struct pt_regs * __devinit idle_regs(struct pt_regs *regs)
+{
+ memset(regs, 0, sizeof(struct pt_regs));
+ regs->xgs = __KERNEL_PDA;
+ return regs;
+}
+
/*
* cpu_init() initializes state that is per-CPU. Some data is already
* initialized (naturally) in the bootstrap process, such as the GDT
@@ -594,6 +618,7 @@ void __cpuinit cpu_init(void)
struct tss_struct * t = &per_cpu(init_tss, cpu);
struct thread_struct *thread = ¤t->thread;
struct desc_struct *gdt;
+ struct i386_pda *pda;
__u32 stk16_off = (__u32)&per_cpu(cpu_16bit_stack, cpu);
struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
@@ -615,6 +640,10 @@ void __cpuinit cpu_init(void)
/* The CPU hotplug case */
if (cpu_gdt_descr->address) {
gdt = (struct desc_struct *)cpu_gdt_descr->address;
+ pda = cpu_pda(cpu);
+
+ BUG_ON(pda == NULL);
+
memset(gdt, 0, PAGE_SIZE);
goto old_gdt;
}
@@ -625,13 +654,17 @@ void __cpuinit cpu_init(void)
* CPUs, when bootmem will have gone away
*/
if (NODE_DATA(0)->bdata->node_bootmem_map) {
- gdt = (struct desc_struct *)alloc_bootmem_pages(PAGE_SIZE);
+ gdt = alloc_bootmem_pages(PAGE_SIZE);
+ pda = alloc_bootmem(sizeof(*pda));
+
/* alloc_bootmem_pages panics on failure, so no check */
memset(gdt, 0, PAGE_SIZE);
} else {
gdt = (struct desc_struct *)get_zeroed_page(GFP_KERNEL);
- if (unlikely(!gdt)) {
- printk(KERN_CRIT "CPU%d failed to allocate GDT\n", cpu);
+ pda = kmalloc_node(sizeof(*pda), GFP_KERNEL, cpu_to_node(cpu));
+
+ if (unlikely(!gdt || !pda)) {
+ printk(KERN_CRIT "CPU%d failed to allocate GDT or PDA\n", cpu);
for (;;)
local_irq_enable();
}
@@ -651,6 +684,16 @@ old_gdt:
cpu_gdt_descr->size = GDT_SIZE - 1;
cpu_gdt_descr->address = (unsigned long)gdt;
+
+ /* Set up the PDA in this CPU's GDT */
+ cpu_pda(cpu) = pda;
+
+ pda_init(cpu);
+
+ pack_descriptor((u32 *)&gdt[GDT_ENTRY_PDA].a,
+ (u32 *)&gdt[GDT_ENTRY_PDA].b,
+ (unsigned long)pda, sizeof(*pda) - 1,
+ 0x80 | DESCTYPE_S | 0x2, 0); /* present read-write data segment */
load_gdt(cpu_gdt_descr);
load_idt(&idt_descr);
--
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH RFC 3/6] Use %gs as the PDA base-segment in the kernel.
2006-08-27 8:44 [PATCH RFC 0/6] Implement per-processor data areas for i386 Jeremy Fitzhardinge
2006-08-27 8:44 ` [PATCH RFC 1/6] Basic definitions for i386-pda Jeremy Fitzhardinge
2006-08-27 8:44 ` [PATCH RFC 2/6] Initialize the per-CPU data area Jeremy Fitzhardinge
@ 2006-08-27 8:44 ` Jeremy Fitzhardinge
2006-08-27 9:49 ` Keith Owens
2006-08-27 15:57 ` Andi Kleen
2006-08-27 8:44 ` [PATCH RFC 4/6] Fix places where using %gs changes the usermode ABI Jeremy Fitzhardinge
` (5 subsequent siblings)
8 siblings, 2 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 8:44 UTC (permalink / raw)
To: linux-kernel
Cc: Chuck Ebbert, Zachary Amsden, Jan Beulich, Andi Kleen,
Andrew Morton
[-- Attachment #1: i386-pda-use-gs.patch --]
[-- Type: text/plain, Size: 11727 bytes --]
This patch is the meat of the PDA change. This patch makes several
related changes:
1: Most significantly, %gs is now used in the kernel. This means that on
entry, the old value of %gs is saved away, and it is reloaded with
__KERNEL_PDA.
2: entry.S constructs the stack in the shape of struct pt_regs, and this
is passed around the kernel so that the process's saved register
state can be accessed.
Unfortunately struct pt_regs doesn't currently have space for %gs
(or %fs). This patch extends pt_regs to add space for fs and gs,
though only gs is actually saved/restored from it (reserving space
for fs makes simplifies the changes to ptrace).
3: Because %gs is now saved on the stack like %ds, %es and the integer
registers, there are a number of places where it no longer needs to
be handled specially.
4: And since kernel threads run in kernel space and call normal kernel
code, they need to be created with their %gs == __KERNEL_PDA.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@suse.de>
---
arch/i386/kernel/cpu/common.c | 7 ++-
arch/i386/kernel/entry.S | 92 +++++++++++++++++++++++++++-------------
arch/i386/kernel/process.c | 27 +++++------
arch/i386/kernel/signal.c | 6 --
arch/i386/kernel/vm86.c | 4 -
include/asm-i386/mmu_context.h | 4 -
include/asm-i386/pda.h | 2
include/asm-i386/processor.h | 4 +
include/asm-i386/ptrace-abi.h | 2
9 files changed, 93 insertions(+), 55 deletions(-)
===================================================================
--- a/arch/i386/kernel/cpu/common.c
+++ b/arch/i386/kernel/cpu/common.c
@@ -717,8 +717,11 @@ old_gdt:
__set_tss_desc(cpu, GDT_ENTRY_DOUBLEFAULT_TSS, &doublefault_tss);
#endif
- /* Clear %fs and %gs. */
- asm volatile ("movl %0, %%fs; movl %0, %%gs" : : "r" (0));
+ /* Clear %fs. */
+ asm volatile ("mov %0, %%fs" : : "r" (0));
+
+ /* Set %gs for this CPU's PDA */
+ asm volatile ("mov %0, %%gs" : : "r" (__KERNEL_PDA));
/* Clear all 6 debug registers: */
set_debugreg(0, 0);
===================================================================
--- a/arch/i386/kernel/entry.S
+++ b/arch/i386/kernel/entry.S
@@ -30,12 +30,14 @@
* 18(%esp) - %eax
* 1C(%esp) - %ds
* 20(%esp) - %es
- * 24(%esp) - orig_eax
- * 28(%esp) - %eip
- * 2C(%esp) - %cs
- * 30(%esp) - %eflags
- * 34(%esp) - %oldesp
- * 38(%esp) - %oldss
+ * 24(%esp) - %fs (not saved/restored)
+ * 28(%esp) - %gs
+ * 2C(%esp) - orig_eax
+ * 30(%esp) - %eip
+ * 34(%esp) - %cs
+ * 38(%esp) - %eflags
+ * 3C(%esp) - %oldesp
+ * 40(%esp) - %oldss
*
* "current" is in register %ebx during any slow entries.
*/
@@ -62,12 +64,14 @@ EAX = 0x18
EAX = 0x18
DS = 0x1C
ES = 0x20
-ORIG_EAX = 0x24
-EIP = 0x28
-CS = 0x2C
-EFLAGS = 0x30
-OLDESP = 0x34
-OLDSS = 0x38
+FS = 0x24
+GS = 0x28
+ORIG_EAX = 0x2C
+EIP = 0x30
+CS = 0x34
+EFLAGS = 0x38
+OLDESP = 0x3C
+OLDSS = 0x40
CF_MASK = 0x00000001
TF_MASK = 0x00000100
@@ -107,6 +111,11 @@ 1:
#define SAVE_ALL \
cld; \
+ pushl %gs; \
+ CFI_ADJUST_CFA_OFFSET 4;\
+ /*CFI_REL_OFFSET gs, 0;*/\
+ subl $4, %esp; /* %fs */\
+ CFI_ADJUST_CFA_OFFSET 4;\
pushl %es; \
CFI_ADJUST_CFA_OFFSET 4;\
/*CFI_REL_OFFSET es, 0;*/\
@@ -136,8 +145,10 @@ 1:
CFI_REL_OFFSET ebx, 0;\
movl $(__USER_DS), %edx; \
movl %edx, %ds; \
- movl %edx, %es;
-
+ movl %edx, %es; \
+ movl $(__KERNEL_PDA), %edx; \
+ movl %edx, %gs
+
#define RESTORE_INT_REGS \
popl %ebx; \
CFI_ADJUST_CFA_OFFSET -4;\
@@ -169,17 +180,24 @@ 2: popl %es; \
2: popl %es; \
CFI_ADJUST_CFA_OFFSET -4;\
/*CFI_RESTORE es;*/\
-.section .fixup,"ax"; \
-3: movl $0,(%esp); \
+ addl $4,%esp; \
+ CFI_ADJUST_CFA_OFFSET -4;\
+3: popl %gs; \
+ CFI_ADJUST_CFA_OFFSET -4;\
+ /*CFI_RESTORE gs;*/\
+.pushsection .fixup,"ax"; \
+4: movl $0,(%esp); \
jmp 1b; \
-4: movl $0,(%esp); \
+5: movl $0,(%esp); \
jmp 2b; \
-.previous; \
+6: movl $0,(%esp); \
+ jmp 3b; \
.section __ex_table,"a";\
.align 4; \
- .long 1b,3b; \
- .long 2b,4b; \
-.previous
+ .long 1b,4b; \
+ .long 2b,5b; \
+ .long 3b,6b; \
+.popsection
#define RING0_INT_FRAME \
CFI_STARTPROC simple;\
@@ -239,6 +257,7 @@ check_userspace:
andl $(VM_MASK | SEGMENT_RPL_MASK), %eax
cmpl $USER_RPL, %eax
jb resume_kernel # not returning to v8086 or userspace
+
ENTRY(resume_userspace)
DISABLE_INTERRUPTS # make sure we don't miss an interrupt
# setting need_resched or sigpending
@@ -332,11 +351,18 @@ 1: movl (%ebp),%ebp
/* if something modifies registers it must also disable sysexit */
movl EIP(%esp), %edx
movl OLDESP(%esp), %ecx
+1: movw GS(%esp), %gs
xorl %ebp,%ebp
TRACE_IRQS_ON
ENABLE_INTERRUPTS_SYSEXIT
CFI_ENDPROC
-
+.pushsection .fixup,"ax"; \
+2: movl $0,GS(%esp); \
+ jmp 1b; \
+.section __ex_table,"a";\
+ .align 4; \
+ .long 1b,2b; \
+.popsection
# system call handler stub
ENTRY(system_call)
@@ -600,6 +626,10 @@ KPROBE_ENTRY(page_fault)
CFI_ADJUST_CFA_OFFSET 4
ALIGN
error_code:
+ subl $4, %esp
+ CFI_ADJUST_CFA_OFFSET 4
+ pushl %es
+ CFI_ADJUST_CFA_OFFSET 4
pushl %ds
CFI_ADJUST_CFA_OFFSET 4
/*CFI_REL_OFFSET ds, 0*/
@@ -627,21 +657,23 @@ error_code:
CFI_ADJUST_CFA_OFFSET 4
CFI_REL_OFFSET ebx, 0
cld
- pushl %es
- CFI_ADJUST_CFA_OFFSET 4
- /*CFI_REL_OFFSET es, 0*/
+ pushl %gs
+ CFI_ADJUST_CFA_OFFSET 4
+ /*CFI_REL_OFFSET gs, 0*/
UNWIND_ESPFIX_STACK
popl %ecx
CFI_ADJUST_CFA_OFFSET -4
- /*CFI_REGISTER es, ecx*/
- movl ES(%esp), %edi # get the function address
+ /*CFI_REGISTER gs, ecx*/
+ movl GS(%esp), %edi # get the function address
movl ORIG_EAX(%esp), %edx # get the error code
movl %eax, ORIG_EAX(%esp)
- movl %ecx, ES(%esp)
- /*CFI_REL_OFFSET es, ES*/
+ movl %ecx, GS(%esp)
+ /*CFI_REL_OFFSET gs, GS*/
movl $(__USER_DS), %ecx
movl %ecx, %ds
movl %ecx, %es
+ movl $(__KERNEL_PDA), %ecx
+ movl %ecx, %gs
movl %esp,%eax # pt_regs pointer
call *%edi
jmp ret_from_exception
@@ -938,6 +970,8 @@ ENTRY(arch_unwind_init_running)
movl %ebx, EAX(%edx)
movl $__USER_DS, DS(%edx)
movl $__USER_DS, ES(%edx)
+ movl $0, FS(%edx)
+ movl $0, GS(%edx)
movl %ebx, ORIG_EAX(%edx)
movl %ecx, EIP(%edx)
movl 12(%esp), %ecx
===================================================================
--- a/arch/i386/kernel/process.c
+++ b/arch/i386/kernel/process.c
@@ -56,6 +56,7 @@
#include <asm/tlbflush.h>
#include <asm/cpu.h>
+#include <asm/pda.h>
asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
@@ -336,6 +337,7 @@ int kernel_thread(int (*fn)(void *), voi
regs.xds = __USER_DS;
regs.xes = __USER_DS;
+ regs.xgs = __KERNEL_PDA;
regs.orig_eax = -1;
regs.eip = (unsigned long) kernel_thread_helper;
regs.xcs = __KERNEL_CS | get_kernel_rpl();
@@ -421,7 +423,6 @@ int copy_thread(int nr, unsigned long cl
p->thread.eip = (unsigned long) ret_from_fork;
savesegment(fs,p->thread.fs);
- savesegment(gs,p->thread.gs);
tsk = current;
if (unlikely(test_tsk_thread_flag(tsk, TIF_IO_BITMAP))) {
@@ -652,16 +653,16 @@ struct task_struct fastcall * __switch_t
load_esp0(tss, next);
/*
- * Save away %fs and %gs. No need to save %es and %ds, as
- * those are always kernel segments while inside the kernel.
- * Doing this before setting the new TLS descriptors avoids
- * the situation where we temporarily have non-reloadable
- * segments in %fs and %gs. This could be an issue if the
- * NMI handler ever used %fs or %gs (it does not today), or
- * if the kernel is running inside of a hypervisor layer.
+ * Save away %fs. No need to save %gs, as it was saved on the
+ * stack on entry. No need to save %es and %ds, as those are
+ * always kernel segments while inside the kernel. Doing this
+ * before setting the new TLS descriptors avoids the situation
+ * where we temporarily have non-reloadable segments in %fs
+ * and %gs. This could be an issue if the NMI handler ever
+ * used %fs or %gs (it does not today), or if the kernel is
+ * running inside of a hypervisor layer.
*/
savesegment(fs, prev->fs);
- savesegment(gs, prev->gs);
/*
* Load the per-thread Thread-Local Storage descriptor.
@@ -669,16 +670,14 @@ struct task_struct fastcall * __switch_t
load_TLS(next, cpu);
/*
- * Restore %fs and %gs if needed.
+ * Restore %fs if needed.
*
- * Glibc normally makes %fs be zero, and %gs is one of
- * the TLS segments.
+ * Glibc normally makes %fs be zero.
*/
if (unlikely(prev->fs | next->fs))
loadsegment(fs, next->fs);
- if (prev->gs | next->gs)
- loadsegment(gs, next->gs);
+ write_pda(pcurrent, next_p);
/*
* Restore IOPL if needed.
===================================================================
--- a/arch/i386/kernel/signal.c
+++ b/arch/i386/kernel/signal.c
@@ -128,7 +128,7 @@ restore_sigcontext(struct pt_regs *regs,
X86_EFLAGS_TF | X86_EFLAGS_SF | X86_EFLAGS_ZF | \
X86_EFLAGS_AF | X86_EFLAGS_PF | X86_EFLAGS_CF)
- GET_SEG(gs);
+ COPY_SEG(gs);
GET_SEG(fs);
COPY_SEG(es);
COPY_SEG(ds);
@@ -244,9 +244,7 @@ setup_sigcontext(struct sigcontext __use
{
int tmp, err = 0;
- tmp = 0;
- savesegment(gs, tmp);
- err |= __put_user(tmp, (unsigned int __user *)&sc->gs);
+ err |= __put_user(regs->xgs, (unsigned int __user *)&sc->gs);
savesegment(fs, tmp);
err |= __put_user(tmp, (unsigned int __user *)&sc->fs);
===================================================================
--- a/arch/i386/kernel/vm86.c
+++ b/arch/i386/kernel/vm86.c
@@ -294,7 +294,7 @@ static void do_sys_vm86(struct kernel_vm
info->regs32->eax = 0;
tsk->thread.saved_esp0 = tsk->thread.esp0;
savesegment(fs, tsk->thread.saved_fs);
- savesegment(gs, tsk->thread.saved_gs);
+ tsk->thread.saved_gs = info->regs32->xgs;
tss = &per_cpu(init_tss, get_cpu());
tsk->thread.esp0 = (unsigned long) &info->VM86_TSS_ESP0;
@@ -306,7 +306,7 @@ static void do_sys_vm86(struct kernel_vm
tsk->thread.screen_bitmap = info->screen_bitmap;
if (info->flags & VM86_SCREEN_BITMAP)
mark_screen_rdonly(tsk->mm);
- __asm__ __volatile__("xorl %eax,%eax; movl %eax,%fs; movl %eax,%gs\n\t");
+ __asm__ __volatile__("movl %0,%%fs\n\t" : : "r" (0));
__asm__ __volatile__("movl %%eax, %0\n" :"=r"(eax));
/*call audit_syscall_exit since we do not exit via the normal paths */
===================================================================
--- a/include/asm-i386/mmu_context.h
+++ b/include/asm-i386/mmu_context.h
@@ -62,8 +62,8 @@ static inline void switch_mm(struct mm_s
#endif
}
-#define deactivate_mm(tsk, mm) \
- asm("movl %0,%%fs ; movl %0,%%gs": :"r" (0))
+#define deactivate_mm(tsk, mm) \
+ asm("movl %0,%%fs": :"r" (0));
#define activate_mm(prev, next) \
switch_mm((prev),(next),NULL)
===================================================================
--- a/include/asm-i386/processor.h
+++ b/include/asm-i386/processor.h
@@ -471,6 +471,7 @@ struct thread_struct {
.vm86_info = NULL, \
.sysenter_cs = __KERNEL_CS, \
.io_bitmap_ptr = NULL, \
+ .gs = __KERNEL_PDA, \
}
/*
@@ -498,7 +499,8 @@ static inline void load_esp0(struct tss_
}
#define start_thread(regs, new_eip, new_esp) do { \
- __asm__("movl %0,%%fs ; movl %0,%%gs": :"r" (0)); \
+ __asm__("movl %0,%%fs": :"r" (0)); \
+ regs->xgs = 0; \
set_fs(USER_DS); \
regs->xds = __USER_DS; \
regs->xes = __USER_DS; \
===================================================================
--- a/include/asm-i386/ptrace-abi.h
+++ b/include/asm-i386/ptrace-abi.h
@@ -33,6 +33,8 @@ struct pt_regs {
long eax;
int xds;
int xes;
+ int xfs;
+ int xgs;
long orig_eax;
long eip;
int xcs;
--
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH RFC 4/6] Fix places where using %gs changes the usermode ABI.
2006-08-27 8:44 [PATCH RFC 0/6] Implement per-processor data areas for i386 Jeremy Fitzhardinge
` (2 preceding siblings ...)
2006-08-27 8:44 ` [PATCH RFC 3/6] Use %gs as the PDA base-segment in the kernel Jeremy Fitzhardinge
@ 2006-08-27 8:44 ` Jeremy Fitzhardinge
2006-08-27 15:59 ` Andi Kleen
2006-08-27 8:44 ` [PATCH RFC 5/6] Implement smp_processor_id() with the PDA Jeremy Fitzhardinge
` (4 subsequent siblings)
8 siblings, 1 reply; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 8:44 UTC (permalink / raw)
To: linux-kernel
Cc: Chuck Ebbert, Zachary Amsden, Jan Beulich, Andi Kleen,
Andrew Morton
[-- Attachment #1: i386-pda-fix-abi.patch --]
[-- Type: text/plain, Size: 2990 bytes --]
There are a few places where the change in struct pt_regs and the use
of %gs affect the userspace ABI. These are primarily debugging
interfaces where thread state can be inspected or extracted.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@suse.de>
---
arch/i386/kernel/process.c | 6 +++---
arch/i386/kernel/ptrace.c | 8 +-------
include/asm-i386/elf.h | 2 +-
include/asm-i386/unwind.h | 1 +
4 files changed, 6 insertions(+), 11 deletions(-)
===================================================================
--- a/arch/i386/kernel/process.c
+++ b/arch/i386/kernel/process.c
@@ -305,8 +305,8 @@ void show_regs(struct pt_regs * regs)
regs->eax,regs->ebx,regs->ecx,regs->edx);
printk("ESI: %08lx EDI: %08lx EBP: %08lx",
regs->esi, regs->edi, regs->ebp);
- printk(" DS: %04x ES: %04x\n",
- 0xffff & regs->xds,0xffff & regs->xes);
+ printk(" DS: %04x ES: %04x GS: %04x\n",
+ 0xffff & regs->xds,0xffff & regs->xes, 0xffff & regs->xgs);
cr0 = read_cr0();
cr2 = read_cr2();
@@ -500,7 +500,7 @@ void dump_thread(struct pt_regs * regs,
dump->regs.ds = regs->xds;
dump->regs.es = regs->xes;
savesegment(fs,dump->regs.fs);
- savesegment(gs,dump->regs.gs);
+ dump->regs.gs = regs->xgs;
dump->regs.orig_eax = regs->orig_eax;
dump->regs.eip = regs->eip;
dump->regs.cs = regs->xcs;
===================================================================
--- a/arch/i386/kernel/ptrace.c
+++ b/arch/i386/kernel/ptrace.c
@@ -94,13 +94,9 @@ static int putreg(struct task_struct *ch
return -EIO;
child->thread.fs = value;
return 0;
- case GS:
- if (value && (value & 3) != 3)
- return -EIO;
- child->thread.gs = value;
- return 0;
case DS:
case ES:
+ case GS:
if (value && (value & 3) != 3)
return -EIO;
value &= 0xffff;
@@ -132,8 +128,6 @@ static unsigned long getreg(struct task_
retval = child->thread.fs;
break;
case GS:
- retval = child->thread.gs;
- break;
case DS:
case ES:
case SS:
===================================================================
--- a/include/asm-i386/elf.h
+++ b/include/asm-i386/elf.h
@@ -88,7 +88,7 @@ typedef struct user_fxsr_struct elf_fpxr
pr_reg[7] = regs->xds; \
pr_reg[8] = regs->xes; \
savesegment(fs,pr_reg[9]); \
- savesegment(gs,pr_reg[10]); \
+ pr_reg[10] = regs->xgs; \
pr_reg[11] = regs->orig_eax; \
pr_reg[12] = regs->eip; \
pr_reg[13] = regs->xcs; \
===================================================================
--- a/include/asm-i386/unwind.h
+++ b/include/asm-i386/unwind.h
@@ -64,6 +64,7 @@ static inline void arch_unw_init_blocked
info->regs.xss = __KERNEL_DS;
info->regs.xds = __USER_DS;
info->regs.xes = __USER_DS;
+ info->regs.xgs = __KERNEL_PDA;
}
extern asmlinkage int arch_unwind_init_running(struct unwind_frame_info *,
--
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH RFC 5/6] Implement smp_processor_id() with the PDA.
2006-08-27 8:44 [PATCH RFC 0/6] Implement per-processor data areas for i386 Jeremy Fitzhardinge
` (3 preceding siblings ...)
2006-08-27 8:44 ` [PATCH RFC 4/6] Fix places where using %gs changes the usermode ABI Jeremy Fitzhardinge
@ 2006-08-27 8:44 ` Jeremy Fitzhardinge
2006-08-27 8:44 ` [PATCH RFC 6/6] Implement "current" " Jeremy Fitzhardinge
` (3 subsequent siblings)
8 siblings, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 8:44 UTC (permalink / raw)
To: linux-kernel
Cc: Chuck Ebbert, Zachary Amsden, Jan Beulich, Andi Kleen,
Andrew Morton
[-- Attachment #1: i386-pda-smp_processor_id.patch --]
[-- Type: text/plain, Size: 3565 bytes --]
Use the cpu_number in the PDA to implement raw_smp_processor_id. This
is a little simpler than using thread_info, though the cpu field in
thread_info cannot be removed since it is used for things other than
getting the current CPU in common code.
The slightly subtle part of this patch is dealing with very early uses
of smp_processor_id(). This is handled on the boot CPU by setting up
a very early PDA, which is later replaced when cpu_init() is called on
the boot CPU. For other CPUs, it uses the thread_info cpu field until
the PDA has been set up.
This is more or less an example of using the PDA, and to give it a
proper exercising.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@suse.de>
---
arch/i386/kernel/asm-offsets.c | 2 +-
arch/i386/kernel/cpu/common.c | 16 +++++++++++++++-
include/asm-i386/smp.h | 3 ++-
3 files changed, 18 insertions(+), 3 deletions(-)
===================================================================
--- a/arch/i386/kernel/asm-offsets.c
+++ b/arch/i386/kernel/asm-offsets.c
@@ -52,7 +52,6 @@ void foo(void)
OFFSET(TI_exec_domain, thread_info, exec_domain);
OFFSET(TI_flags, thread_info, flags);
OFFSET(TI_status, thread_info, status);
- OFFSET(TI_cpu, thread_info, cpu);
OFFSET(TI_preempt_count, thread_info, preempt_count);
OFFSET(TI_addr_limit, thread_info, addr_limit);
OFFSET(TI_restart_block, thread_info, restart_block);
@@ -78,4 +77,5 @@ void foo(void)
BLANK();
OFFSET(PDA_pcurrent, i386_pda, pcurrent);
+ OFFSET(PDA_cpu, i386_pda, cpu_number);
}
===================================================================
--- a/arch/i386/kernel/cpu/common.c
+++ b/arch/i386/kernel/cpu/common.c
@@ -606,6 +606,20 @@ struct pt_regs * __devinit idle_regs(str
return regs;
}
+/* Set up a very early PDA for the boot CPU so that smp_processor_id will work */
+void __init smp_setup_processor_id(void)
+{
+ static const __initdata struct i386_pda boot_pda;
+
+ pack_descriptor((u32 *)&cpu_gdt_table[GDT_ENTRY_PDA].a,
+ (u32 *)&cpu_gdt_table[GDT_ENTRY_PDA].b,
+ (unsigned long)&boot_pda, sizeof(struct i386_pda) - 1,
+ 0x80 | DESCTYPE_S | 0x2, 0); /* present read-write data segment */
+
+ /* Set %gs for this CPU's PDA */
+ asm volatile ("mov %0, %%gs" : : "r" (__KERNEL_PDA));
+}
+
/*
* cpu_init() initializes state that is per-CPU. Some data is already
* initialized (naturally) in the bootstrap process, such as the GDT
@@ -614,7 +628,7 @@ struct pt_regs * __devinit idle_regs(str
*/
void __cpuinit cpu_init(void)
{
- int cpu = smp_processor_id();
+ int cpu = current_thread_info()->cpu; /* need to use this until the PDA is set up */
struct tss_struct * t = &per_cpu(init_tss, cpu);
struct thread_struct *thread = ¤t->thread;
struct desc_struct *gdt;
===================================================================
--- a/include/asm-i386/smp.h
+++ b/include/asm-i386/smp.h
@@ -8,6 +8,7 @@
#include <linux/kernel.h>
#include <linux/threads.h>
#include <linux/cpumask.h>
+#include <asm/pda.h>
#endif
#ifdef CONFIG_X86_LOCAL_APIC
@@ -58,7 +59,7 @@ extern void cpu_uninit(void);
* from the initial startup. We map APIC_BASE very early in page_setup(),
* so this is correct in the x86 case.
*/
-#define raw_smp_processor_id() (current_thread_info()->cpu)
+#define raw_smp_processor_id() (read_pda(cpu_number))
extern cpumask_t cpu_callout_map;
extern cpumask_t cpu_callin_map;
--
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH RFC 6/6] Implement "current" with the PDA.
2006-08-27 8:44 [PATCH RFC 0/6] Implement per-processor data areas for i386 Jeremy Fitzhardinge
` (4 preceding siblings ...)
2006-08-27 8:44 ` [PATCH RFC 5/6] Implement smp_processor_id() with the PDA Jeremy Fitzhardinge
@ 2006-08-27 8:44 ` Jeremy Fitzhardinge
2006-08-27 16:01 ` Andi Kleen
2006-08-27 9:47 ` [PATCH RFC 0/6] Implement per-processor data areas for i386 Arjan van de Ven
` (2 subsequent siblings)
8 siblings, 1 reply; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 8:44 UTC (permalink / raw)
To: linux-kernel
Cc: Chuck Ebbert, Zachary Amsden, Jan Beulich, Andi Kleen,
Andrew Morton
[-- Attachment #1: i386-pda-current.patch --]
[-- Type: text/plain, Size: 3774 bytes --]
Use the pcurrent field in the PDA to implement the "current" macro.
This ends up compiling down to a single instruction to get the current
task.
The slightly tricky part about this patch is that cpu_init() uses
current very early, before the PDA has been set up. In this instance,
it uses current_thread_info()->task to get the current task.
Also, the very early PDA set up for the boot cpu contains the initial
task pointer so current works from a very early stage.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@suse.de>
--
arch/i386/kernel/cpu/common.c | 27 ++++++++++++++++-----------
include/asm-i386/current.h | 11 ++---------
2 files changed, 18 insertions(+), 20 deletions(-)
===================================================================
--- a/include/asm-i386/current.h
+++ b/include/asm-i386/current.h
@@ -1,15 +1,8 @@
#ifndef _I386_CURRENT_H
#define _I386_CURRENT_H
-#include <linux/thread_info.h>
+#include <asm/pda.h>
-struct task_struct;
-
-static __always_inline struct task_struct * get_current(void)
-{
- return current_thread_info()->task;
-}
-
-#define current get_current()
+#define current (read_pda(pcurrent))
#endif /* !(_I386_CURRENT_H) */
===================================================================
--- a/arch/i386/kernel/cpu/common.c
+++ b/arch/i386/kernel/cpu/common.c
@@ -587,16 +587,16 @@ void __init early_cpu_init(void)
#endif
}
-static __cpuinit void pda_init(int cpu)
+static __cpuinit void pda_init(int cpu, struct task_struct *curr)
{
struct i386_pda *pda = cpu_pda(cpu);
memset(pda, 0, sizeof(*pda));
pda->cpu_number = cpu;
- pda->pcurrent = current;
-
- printk("cpu %d current %p\n", cpu, current);
+ pda->pcurrent = curr;
+
+ printk("cpu %d current %p\n", cpu, curr);
}
struct pt_regs * __devinit idle_regs(struct pt_regs *regs)
@@ -609,7 +609,7 @@ struct pt_regs * __devinit idle_regs(str
/* Set up a very early PDA for the boot CPU so that smp_processor_id will work */
void __init smp_setup_processor_id(void)
{
- static const __initdata struct i386_pda boot_pda;
+ static __initdata struct i386_pda boot_pda;
pack_descriptor((u32 *)&cpu_gdt_table[GDT_ENTRY_PDA].a,
(u32 *)&cpu_gdt_table[GDT_ENTRY_PDA].b,
@@ -618,6 +618,8 @@ void __init smp_setup_processor_id(void)
/* Set %gs for this CPU's PDA */
asm volatile ("mov %0, %%gs" : : "r" (__KERNEL_PDA));
+
+ boot_pda.pcurrent = current_thread_info()->task;
}
/*
@@ -628,9 +630,12 @@ void __init smp_setup_processor_id(void)
*/
void __cpuinit cpu_init(void)
{
- int cpu = current_thread_info()->cpu; /* need to use this until the PDA is set up */
+ /* Need to use thread_info for cpu and current until the PDA is set up */
+ int cpu = current_thread_info()->cpu;
+ struct task_struct *curr = current_thread_info()->task;
+
struct tss_struct * t = &per_cpu(init_tss, cpu);
- struct thread_struct *thread = ¤t->thread;
+ struct thread_struct *thread = &curr->thread;
struct desc_struct *gdt;
struct i386_pda *pda;
__u32 stk16_off = (__u32)&per_cpu(cpu_16bit_stack, cpu);
@@ -702,7 +707,7 @@ old_gdt:
/* Set up the PDA in this CPU's GDT */
cpu_pda(cpu) = pda;
- pda_init(cpu);
+ pda_init(cpu, curr);
pack_descriptor((u32 *)&gdt[GDT_ENTRY_PDA].a,
(u32 *)&gdt[GDT_ENTRY_PDA].b,
@@ -716,10 +721,10 @@ old_gdt:
* Set up and load the per-CPU TSS and LDT
*/
atomic_inc(&init_mm.mm_count);
- current->active_mm = &init_mm;
- if (current->mm)
+ curr->active_mm = &init_mm;
+ if (curr->mm)
BUG();
- enter_lazy_tlb(&init_mm, current);
+ enter_lazy_tlb(&init_mm, curr);
load_esp0(t, thread);
set_tss_desc(cpu,t);
--
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 8:44 [PATCH RFC 0/6] Implement per-processor data areas for i386 Jeremy Fitzhardinge
` (5 preceding siblings ...)
2006-08-27 8:44 ` [PATCH RFC 6/6] Implement "current" " Jeremy Fitzhardinge
@ 2006-08-27 9:47 ` Arjan van de Ven
2006-08-27 16:46 ` Jeremy Fitzhardinge
2006-08-27 16:01 ` Andi Kleen
2006-08-27 17:21 ` Andreas Mohr
8 siblings, 1 reply; 35+ messages in thread
From: Arjan van de Ven @ 2006-08-27 9:47 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andi Kleen, Andrew Morton
> - Measure performance impact. The patch adds a segment register
> save/restore on entry/exit to the kernel. This expense should be
> offset by savings in using the PDA while in the kernel, but I haven't
> measured this yet. Space savings are already appealing though.
this will be interesting; x86-64 has a nice instruction to help with
this; 32 bit does not... so far conventional wisdom has been that
without the instruction it's not going to be worth it.
When you're benchmarking this please use multiple CPU generations from
different vendors; I suspect this is one of those things that vary
greatly between models
> - Make it a config option? UP systems don't need to do any of this,
> other than having a single pre-allocated PDA. Unfortunately, it gets
> a bit messy to do this given the changes needed in handling %gs.
A config option for this is a mistake imo. Not every patch is worth a
config option! It's good or it's not, if it's good it should be there
always, if it's not.... Something this fundamental to the core doesn't
really have a "but it's optional" argument going for it, unlike
individual drivers or subsystems...
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 3/6] Use %gs as the PDA base-segment in the kernel.
2006-08-27 8:44 ` [PATCH RFC 3/6] Use %gs as the PDA base-segment in the kernel Jeremy Fitzhardinge
@ 2006-08-27 9:49 ` Keith Owens
2006-08-27 10:01 ` Jeremy Fitzhardinge
2006-08-27 15:57 ` Andi Kleen
1 sibling, 1 reply; 35+ messages in thread
From: Keith Owens @ 2006-08-27 9:49 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andi Kleen, Andrew Morton
Jeremy Fitzhardinge (on Sun, 27 Aug 2006 01:44:20 -0700) wrote:
>2: entry.S constructs the stack in the shape of struct pt_regs, and this
> is passed around the kernel so that the process's saved register
> state can be accessed.
>--- a/arch/i386/kernel/entry.S
>+++ b/arch/i386/kernel/entry.S
>@@ -30,12 +30,14 @@
> * 18(%esp) - %eax
> * 1C(%esp) - %ds
> * 20(%esp) - %es
>+ * 24(%esp) - %fs (not saved/restored)
>+ * 28(%esp) - %gs
>+ * 2C(%esp) - orig_eax
>+ * 30(%esp) - %eip
>+ * 34(%esp) - %cs
>+ * 38(%esp) - %eflags
>+ * 3C(%esp) - %oldesp
>+ * 40(%esp) - %oldss
...
>+FS = 0x24
>+GS = 0x28
>+ORIG_EAX = 0x2C
>+EIP = 0x30
>+CS = 0x34
>+EFLAGS = 0x38
>+OLDESP = 0x3C
>+OLDSS = 0x40
Now would be a good time to get rid of those hard coded numbers and use
asm-offsets instead.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 3/6] Use %gs as the PDA base-segment in the kernel.
2006-08-27 9:49 ` Keith Owens
@ 2006-08-27 10:01 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 10:01 UTC (permalink / raw)
To: Keith Owens
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andi Kleen, Andrew Morton
Keith Owens wrote:
> Now would be a good time to get rid of those hard coded numbers and use
> asm-offsets instead.
Yeah, I was considering that. I'll do a patch.
J
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 3/6] Use %gs as the PDA base-segment in the kernel.
2006-08-27 8:44 ` [PATCH RFC 3/6] Use %gs as the PDA base-segment in the kernel Jeremy Fitzhardinge
2006-08-27 9:49 ` Keith Owens
@ 2006-08-27 15:57 ` Andi Kleen
2006-08-27 16:36 ` Jeremy Fitzhardinge
2006-08-27 17:20 ` Jeremy Fitzhardinge
1 sibling, 2 replies; 35+ messages in thread
From: Andi Kleen @ 2006-08-27 15:57 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andrew Morton
>
> - /* Clear %fs and %gs. */
> - asm volatile ("movl %0, %%fs; movl %0, %%gs" : : "r" (0));
> + /* Clear %fs. */
> + asm volatile ("mov %0, %%fs" : : "r" (0));
> +
> + /* Set %gs for this CPU's PDA */
> + asm volatile ("mov %0, %%gs" : : "r" (__KERNEL_PDA));
I would add memory clobbers here to make sure the dependency on read/write pda
is right.
> +1: movw GS(%esp), %gs
movl is recommended in 32bit mode
> --- a/arch/i386/kernel/signal.c
> +++ b/arch/i386/kernel/signal.c
> @@ -128,7 +128,7 @@ restore_sigcontext(struct pt_regs *regs,
> X86_EFLAGS_TF | X86_EFLAGS_SF | X86_EFLAGS_ZF | \
> X86_EFLAGS_AF | X86_EFLAGS_PF | X86_EFLAGS_CF)
>
> - GET_SEG(gs);
> + COPY_SEG(gs);
> GET_SEG(fs);
> COPY_SEG(es);
> COPY_SEG(ds);
> @@ -244,9 +244,7 @@ setup_sigcontext(struct sigcontext __use
> {
> int tmp, err = 0;
>
> - tmp = 0;
> - savesegment(gs, tmp);
> - err |= __put_user(tmp, (unsigned int __user *)&sc->gs);
> + err |= __put_user(regs->xgs, (unsigned int __user *)&sc->gs);
> savesegment(fs, tmp);
> err |= __put_user(tmp, (unsigned int __user *)&sc->fs);
Hmm, changing it for the sc looks a bit bogus. If everything
is right nothing should change for user space, but this changes something.
> @@ -306,7 +306,7 @@ static void do_sys_vm86(struct kernel_vm
> tsk->thread.screen_bitmap = info->screen_bitmap;
> if (info->flags & VM86_SCREEN_BITMAP)
> mark_screen_rdonly(tsk->mm);
> - __asm__ __volatile__("xorl %eax,%eax; movl %eax,%fs; movl %eax,%gs\n\t");
> + __asm__ __volatile__("movl %0,%%fs\n\t" : : "r" (0));
This is actually a useful bug fix on its own.
-Andi
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 4/6] Fix places where using %gs changes the usermode ABI.
2006-08-27 8:44 ` [PATCH RFC 4/6] Fix places where using %gs changes the usermode ABI Jeremy Fitzhardinge
@ 2006-08-27 15:59 ` Andi Kleen
2006-08-27 16:37 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 35+ messages in thread
From: Andi Kleen @ 2006-08-27 15:59 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andrew Morton
> ===================================================================
> --- a/arch/i386/kernel/ptrace.c
> +++ b/arch/i386/kernel/ptrace.c
> @@ -94,13 +94,9 @@ static int putreg(struct task_struct *ch
> return -EIO;
> child->thread.fs = value;
> return 0;
> - case GS:
> - if (value && (value & 3) != 3)
> - return -EIO;
> - child->thread.gs = value;
> - return 0;
> case DS:
> case ES:
> + case GS:
> if (value && (value & 3) != 3)
> return -EIO;
> value &= 0xffff;
> @@ -132,8 +128,6 @@ static unsigned long getreg(struct task_
> retval = child->thread.fs;
> break;
> case GS:
> - retval = child->thread.gs;
> - break;
Didn't the eip/esp offsets change too?
-Andi
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 6/6] Implement "current" with the PDA.
2006-08-27 8:44 ` [PATCH RFC 6/6] Implement "current" " Jeremy Fitzhardinge
@ 2006-08-27 16:01 ` Andi Kleen
2006-08-27 16:38 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 35+ messages in thread
From: Andi Kleen @ 2006-08-27 16:01 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andrew Morton
On Sunday 27 August 2006 10:44, Jeremy Fitzhardinge wrote:
> Use the pcurrent field in the PDA to implement the "current" macro.
> This ends up compiling down to a single instruction to get the current
> task.
>
> The slightly tricky part about this patch is that cpu_init() uses
> current very early, before the PDA has been set up. In this instance,
> it uses current_thread_info()->task to get the current task.
>
> Also, the very early PDA set up for the boot cpu contains the initial
> task pointer so current works from a very early stage.
With that you could remove the code in do_IRQ now that setups up a fake thread_info
for interrupt stacks.
-Andi
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 8:44 [PATCH RFC 0/6] Implement per-processor data areas for i386 Jeremy Fitzhardinge
` (6 preceding siblings ...)
2006-08-27 9:47 ` [PATCH RFC 0/6] Implement per-processor data areas for i386 Arjan van de Ven
@ 2006-08-27 16:01 ` Andi Kleen
2006-08-27 16:41 ` Jeremy Fitzhardinge
2006-08-27 17:21 ` Andreas Mohr
8 siblings, 1 reply; 35+ messages in thread
From: Andi Kleen @ 2006-08-27 16:01 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andrew Morton
Very cool.
> - Make it work. It works UP on a test QEMU machine, but it doesn't
> yet work on real hardware, or SMP (though not working SMP on QEMU is
> more likely to be a QEMU problem). Not sure what the problem is yet;
> I'm hoping review will reveal something.
I bet qemu doesn't have a real descriptor cache unlike real CPUs.
So likely it is some disconnect between changing the backing GDT
and referencing the register. Reload %gs more aggressively?
Comparing with SimNow! (which should behave more like a real CPU)
might be also interesting.
> - Measure performance impact. The patch adds a segment register
> save/restore on entry/exit to the kernel. This expense should be
> offset by savings in using the PDA while in the kernel, but I haven't
> measured this yet. Space savings are already appealing though.
> - Modify more things to use the PDA. The more that uses it, the more
> the cost of the %gs save/restore is amortized. smp_processor_id and
> current are the obvious first choices, which are implemented in this
> series.
per cpu data would be the prime candidate. It is pretty simple.
> - Make it a config option? UP systems don't need to do any of this,
> other than having a single pre-allocated PDA. Unfortunately, it gets
> a bit messy to do this given the changes needed in handling %gs.
Please don't.
(weak point:)
- The stack protector code might work one day on i386 too.
-Andi
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 3/6] Use %gs as the PDA base-segment in the kernel.
2006-08-27 15:57 ` Andi Kleen
@ 2006-08-27 16:36 ` Jeremy Fitzhardinge
2006-08-27 17:20 ` Jeremy Fitzhardinge
1 sibling, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 16:36 UTC (permalink / raw)
To: Andi Kleen
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andrew Morton
Andi Kleen wrote:
>>
>> - /* Clear %fs and %gs. */
>> - asm volatile ("movl %0, %%fs; movl %0, %%gs" : : "r" (0));
>> + /* Clear %fs. */
>> + asm volatile ("mov %0, %%fs" : : "r" (0));
>> +
>> + /* Set %gs for this CPU's PDA */
>> + asm volatile ("mov %0, %%gs" : : "r" (__KERNEL_PDA));
>>
>
> I would add memory clobbers here to make sure the dependency on read/write pda
> is right.
>
Yep. And the "m" args in the pda asm isn't quite right for rmw PDA ops
(not that there are any at the moment).
>> +1: movw GS(%esp), %gs
>>
>
> movl is recommended in 32bit mode
>
OK. I thought the assembler objected to me about it.
>> --- a/arch/i386/kernel/signal.c
>> +++ b/arch/i386/kernel/signal.c
>> @@ -128,7 +128,7 @@ restore_sigcontext(struct pt_regs *regs,
>> X86_EFLAGS_TF | X86_EFLAGS_SF | X86_EFLAGS_ZF | \
>> X86_EFLAGS_AF | X86_EFLAGS_PF | X86_EFLAGS_CF)
>>
>> - GET_SEG(gs);
>> + COPY_SEG(gs);
>> GET_SEG(fs);
>> COPY_SEG(es);
>> COPY_SEG(ds);
>> @@ -244,9 +244,7 @@ setup_sigcontext(struct sigcontext __use
>> {
>> int tmp, err = 0;
>>
>> - tmp = 0;
>> - savesegment(gs, tmp);
>> - err |= __put_user(tmp, (unsigned int __user *)&sc->gs);
>> + err |= __put_user(regs->xgs, (unsigned int __user *)&sc->gs);
>> savesegment(fs, tmp);
>> err |= __put_user(tmp, (unsigned int __user *)&sc->fs);
>>
>
> Hmm, changing it for the sc looks a bit bogus. If everything
> is right nothing should change for user space, but this changes something.
>
The sigcontext contains the userspace register state at the time of the
signal. Since userspace %gs is stored in the on-stack pt_regs, that
should be where it fetches it from to fill out the sigcontext, rather
than the kernel's internal value of %gs - in other words, it should be
the same as ds and es. Or am I missing something?
>> @@ -306,7 +306,7 @@ static void do_sys_vm86(struct kernel_vm
>> tsk->thread.screen_bitmap = info->screen_bitmap;
>> if (info->flags & VM86_SCREEN_BITMAP)
>> mark_screen_rdonly(tsk->mm);
>> - __asm__ __volatile__("xorl %eax,%eax; movl %eax,%fs; movl %eax,%gs\n\t");
>> + __asm__ __volatile__("movl %0,%%fs\n\t" : : "r" (0));
>>
>
> This is actually a useful bug fix on its own.
>
Yep. But there seems to be some other very dubious code in there as
well (the asm("mov %%eax,%0" : "=r" (eax)) sequence). I was wondering
about what it all does...
J
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 4/6] Fix places where using %gs changes the usermode ABI.
2006-08-27 15:59 ` Andi Kleen
@ 2006-08-27 16:37 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 16:37 UTC (permalink / raw)
To: Andi Kleen
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andrew Morton
Andi Kleen wrote:
> Didn't the eip/esp offsets change too?
Yup. I'll recheck.
J
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 6/6] Implement "current" with the PDA.
2006-08-27 16:01 ` Andi Kleen
@ 2006-08-27 16:38 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 16:38 UTC (permalink / raw)
To: Andi Kleen
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andrew Morton
Andi Kleen wrote:
> With that you could remove the code in do_IRQ now that setups up a fake thread_info
> for interrupt stacks.
Oh, OK. I was wondering what that was for.
J
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 16:01 ` Andi Kleen
@ 2006-08-27 16:41 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 16:41 UTC (permalink / raw)
To: Andi Kleen
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andrew Morton
Andi Kleen wrote:
> I bet qemu doesn't have a real descriptor cache unlike real CPUs.
> So likely it is some disconnect between changing the backing GDT
> and referencing the register. Reload %gs more aggressively?
>
The GDT only gets touched once in cpu_init(), and %gs is reloaded on
every kernel entry, so I don't think that's it. I seems to have
interrupt issues with SMP.
And either way, it still doesn't work on real hardware...
> Comparing with SimNow! (which should behave more like a real CPU)
> might be also interesting.
>
Yeah, I'll have to try that out.
>> - Measure performance impact. The patch adds a segment register
>> save/restore on entry/exit to the kernel. This expense should be
>> offset by savings in using the PDA while in the kernel, but I haven't
>> measured this yet. Space savings are already appealing though.
>> - Modify more things to use the PDA. The more that uses it, the more
>> the cost of the %gs save/restore is amortized. smp_processor_id and
>> current are the obvious first choices, which are implemented in this
>> series.
>>
>
> per cpu data would be the prime candidate. It is pretty simple.
>
Well, it has to be arch-specific per-cpu data, since the PDA is arch
specific. But there should be various pieces of interrupt state that
adapt well to it.
>> - Make it a config option? UP systems don't need to do any of this,
>> other than having a single pre-allocated PDA. Unfortunately, it gets
>> a bit messy to do this given the changes needed in handling %gs.
>>
>
> Please don't.
>
Yeah, that wasn't really a serious thought...
J
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 9:47 ` [PATCH RFC 0/6] Implement per-processor data areas for i386 Arjan van de Ven
@ 2006-08-27 16:46 ` Jeremy Fitzhardinge
2006-08-27 17:44 ` Arjan van de Ven
0 siblings, 1 reply; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 16:46 UTC (permalink / raw)
To: Arjan van de Ven
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andi Kleen, Andrew Morton
Arjan van de Ven wrote:
> this will be interesting; x86-64 has a nice instruction to help with
> this; 32 bit does not... so far conventional wisdom has been that
> without the instruction it's not going to be worth it.
>
Hm, swapgs may be quick, but it isn't very easy to use since it doesn't
stack, and so requires careful handling for recursive kernel entries,
which involves extra tests and conditional jumps. I tried doing
something similar with my earlier patches, but it got all too messy.
Stacking %gs like the other registers turns out pretty cleanly.
> When you're benchmarking this please use multiple CPU generations from
> different vendors; I suspect this is one of those things that vary
> greatly between models
>
Hm, it seems to me that unless the existing %ds/%es register
save/restores are a significant part of the existing cost of going
through entry.S, adding %gs to the set shouldn't make too much
difference. And I'm not sure about the relative cost of using a %gs
override vs. the normal current_task_info() masking, but I'm assuming
they're at worst equal, with the %gs override having a code-size advantage.
But yes, it definitely needs measurement.
J
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 3/6] Use %gs as the PDA base-segment in the kernel.
2006-08-27 15:57 ` Andi Kleen
2006-08-27 16:36 ` Jeremy Fitzhardinge
@ 2006-08-27 17:20 ` Jeremy Fitzhardinge
2006-08-27 18:19 ` Andi Kleen
1 sibling, 1 reply; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 17:20 UTC (permalink / raw)
To: Andi Kleen
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andrew Morton
Andi Kleen wrote:
>> +1: movw GS(%esp), %gs
>>
>
> movl is recommended in 32bit mode
>
arch/i386/kernel/entry.S: Assembler messages:
arch/i386/kernel/entry.S:334: Error: suffix or operands invalid for `mov'
J
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 8:44 [PATCH RFC 0/6] Implement per-processor data areas for i386 Jeremy Fitzhardinge
` (7 preceding siblings ...)
2006-08-27 16:01 ` Andi Kleen
@ 2006-08-27 17:21 ` Andreas Mohr
2006-08-27 17:34 ` Jeremy Fitzhardinge
2006-08-27 18:04 ` Andi Kleen
8 siblings, 2 replies; 35+ messages in thread
From: Andreas Mohr @ 2006-08-27 17:21 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andi Kleen, Andrew Morton
Hi,
On Sun, Aug 27, 2006 at 01:44:17AM -0700, Jeremy Fitzhardinge wrote:
> This patch implements per-processor data areas by using %gs as the
> base segment of the per-processor memory. This has two principle
> advantages:
>
> - It allows very simple direct access to per-processor data by
> effectively using an effective address of the form %gs:offset, where
> offset is the offset into struct i386_pda. These sequences are faster
> and smaller than the current mechanism using current_thread_info().
Yess!!
Something like that had to be done eventually about the inefficient
current_thread_info() mechanism, but I wasn't sure what exactly.
> I haven't measured performance yet, but when using the PDA for "current"
> and "smp_processor_id", I see a 5715 byte reduction in .text segment
> size for my kernel.
This is interesting, since even by doing a non-elegant
current->... --> struct task_struct *tsk = current;
replacement for excessive uses of current, I was able to gain almost 10kB
within a single file already!
I guess it's due to having tried that on an older installation with gcc 3.2,
which probably does less efficient opcode merging of current_thread_info()
requests compared to a current gcc version.
IOW, .text segment reduction could be quite a bit higher for older gcc:s.
> This uses the x86 segmentation stuff in a way similar to NPTL's way of
> implementing Thread-Local Storage. It relies on the fact that each CPU
> has its own Global Descriptor Table (GDT), which is basically an array
> of base-length pairs (with some extra stuff). When a segment register
> is loaded with a descriptor (approximately, an index in the GDT), and
> you use that segment register for memory access, the address has the
> base added to it, and the resulting address is used.
Not a problem for more daring user-space apps (i.e. Wine), I hope?
Andreas Mohr
--
No programming skills!? Why not help translate many Linux applications!
https://launchpad.ubuntu.com/rosetta
(or alternatively buy nicely packaged Linux distros/OSS software to help
support Linux developers creating shiny new things for you?)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 17:21 ` Andreas Mohr
@ 2006-08-27 17:34 ` Jeremy Fitzhardinge
2006-08-27 18:23 ` Andreas Mohr
2006-08-27 18:04 ` Andi Kleen
1 sibling, 1 reply; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 17:34 UTC (permalink / raw)
To: Andreas Mohr
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andi Kleen, Andrew Morton
Andreas Mohr wrote:
> This is interesting, since even by doing a non-elegant
> current->... --> struct task_struct *tsk = current;
> replacement for excessive uses of current, I was able to gain almost 10kB
> within a single file already!
> I guess it's due to having tried that on an older installation with gcc 3.2,
> which probably does less efficient opcode merging of current_thread_info()
> requests compared to a current gcc version.
> IOW, .text segment reduction could be quite a bit higher for older gcc:s.
>
That doesn't sound likely. current_thread_info() is only about 2
instructions. Are you looking at the .text segment size, or the actual
file size? The latter can be very misleading in the presence of debug info.
>> This uses the x86 segmentation stuff in a way similar to NPTL's way of
>> implementing Thread-Local Storage. It relies on the fact that each CPU
>> has its own Global Descriptor Table (GDT), which is basically an array
>> of base-length pairs (with some extra stuff). When a segment register
>> is loaded with a descriptor (approximately, an index in the GDT), and
>> you use that segment register for memory access, the address has the
>> base added to it, and the resulting address is used.
>>
>
> Not a problem for more daring user-space apps (i.e. Wine), I hope?
>
No. The LDT is still available for userspace use.
J
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 16:46 ` Jeremy Fitzhardinge
@ 2006-08-27 17:44 ` Arjan van de Ven
2006-08-27 18:07 ` Andi Kleen
0 siblings, 1 reply; 35+ messages in thread
From: Arjan van de Ven @ 2006-08-27 17:44 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andi Kleen, Andrew Morton
On Sun, 2006-08-27 at 09:46 -0700, Jeremy Fitzhardinge wrote:
> Arjan van de Ven wrote:
> > this will be interesting; x86-64 has a nice instruction to help with
> > this; 32 bit does not... so far conventional wisdom has been that
> > without the instruction it's not going to be worth it.
> >
>
> Hm, swapgs may be quick, but it isn't very easy to use since it doesn't
> stack, and so requires careful handling for recursive kernel entries,
> which involves extra tests and conditional jumps. I tried doing
> something similar with my earlier patches, but it got all too messy.
> Stacking %gs like the other registers turns out pretty cleanly.
>
> > When you're benchmarking this please use multiple CPU generations from
> > different vendors; I suspect this is one of those things that vary
> > greatly between models
> >
>
> Hm, it seems to me that unless the existing %ds/%es register
> save/restores are a significant part of the existing cost of going
> through entry.S,
iirc the %fs one is at least. But it has been a while since I've looked
at this part of the kernel via performance traces.
> adding %gs to the set shouldn't make too much
> difference. And I'm not sure about the relative cost of using a %gs
> override vs. the normal current_task_info() masking, but I'm assuming
> they're at worst equal, with the %gs override having a code-size advantage.
your worst case scenario would be if the segment override would make it
a "complex" instruction, so not parallel decodable. That'd mean it would
basically cost you 6 or 7 instruction slots that can't be filled...
while an and and such at least run nicely in parallel with other stuff.
I don't know which if any processors actually do this, but it's rare/new
enough that I'd not be surprised if there are some.
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 17:21 ` Andreas Mohr
2006-08-27 17:34 ` Jeremy Fitzhardinge
@ 2006-08-27 18:04 ` Andi Kleen
2006-08-27 18:27 ` Andreas Mohr
1 sibling, 1 reply; 35+ messages in thread
From: Andi Kleen @ 2006-08-27 18:04 UTC (permalink / raw)
To: Andreas Mohr
Cc: Jeremy Fitzhardinge, linux-kernel, Chuck Ebbert, Zachary Amsden,
Jan Beulich, Andrew Morton
> Something like that had to be done eventually about the inefficient
> current_thread_info() mechanism,
Inefficient? It's two fast instructions. I won't call that inefficient.
> I guess it's due to having tried that on an older installation with gcc 3.2,
> which probably does less efficient opcode merging of current_thread_info()
> requests compared to a current gcc version.
gcc normally doesn't merge inline assembly at all.
-Andi
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 17:44 ` Arjan van de Ven
@ 2006-08-27 18:07 ` Andi Kleen
2006-08-27 18:27 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 35+ messages in thread
From: Andi Kleen @ 2006-08-27 18:07 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Jeremy Fitzhardinge, linux-kernel, Chuck Ebbert, Zachary Amsden,
Jan Beulich, Andrew Morton
> your worst case scenario would be if the segment override would make it
> a "complex" instruction, so not parallel decodable. That'd mean it would
> basically cost you 6 or 7 instruction slots that can't be filled...
> while an and and such at least run nicely in parallel with other stuff.
> I don't know which if any processors actually do this, but it's rare/new
> enough that I'd not be surprised if there are some.
On AMD K7/K8 a segment register prefix is a single cycle penalty.
I couldn't find anything in the Intel optimization manuals on it, but I assume
it's also not dramatic.
-Andi
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 3/6] Use %gs as the PDA base-segment in the kernel.
2006-08-27 17:20 ` Jeremy Fitzhardinge
@ 2006-08-27 18:19 ` Andi Kleen
2006-08-27 20:03 ` Jan Engelhardt
` (3 more replies)
0 siblings, 4 replies; 35+ messages in thread
From: Andi Kleen @ 2006-08-27 18:19 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andrew Morton
On Sunday 27 August 2006 19:20, Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> >> +1: movw GS(%esp), %gs
> >>
> >
> > movl is recommended in 32bit mode
> >
>
> arch/i386/kernel/entry.S: Assembler messages:
> arch/i386/kernel/entry.S:334: Error: suffix or operands invalid for `mov'
Looks like a gas bug to me.
-Andi
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 17:34 ` Jeremy Fitzhardinge
@ 2006-08-27 18:23 ` Andreas Mohr
0 siblings, 0 replies; 35+ messages in thread
From: Andreas Mohr @ 2006-08-27 18:23 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andi Kleen, Andrew Morton
Hi,
On Sun, Aug 27, 2006 at 10:34:46AM -0700, Jeremy Fitzhardinge wrote:
> Andreas Mohr wrote:
> >This is interesting, since even by doing a non-elegant
> >current->... --> struct task_struct *tsk = current;
> >replacement for excessive uses of current, I was able to gain almost 10kB
> >within a single file already!
> >I guess it's due to having tried that on an older installation with gcc
> >3.2,
> >which probably does less efficient opcode merging of current_thread_info()
> >requests compared to a current gcc version.
> >IOW, .text segment reduction could be quite a bit higher for older gcc:s.
> >
>
> That doesn't sound likely. current_thread_info() is only about 2
> instructions. Are you looking at the .text segment size, or the actual
> file size? The latter can be very misleading in the presence of debug info.
Got me. File size only. And that was 10kB of around 170kB, BTW.
Andreas Mohr
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 18:04 ` Andi Kleen
@ 2006-08-27 18:27 ` Andreas Mohr
2006-08-27 18:35 ` Andi Kleen
0 siblings, 1 reply; 35+ messages in thread
From: Andreas Mohr @ 2006-08-27 18:27 UTC (permalink / raw)
To: Andi Kleen
Cc: Jeremy Fitzhardinge, linux-kernel, Chuck Ebbert, Zachary Amsden,
Jan Beulich, Andrew Morton
Hi,
On Sun, Aug 27, 2006 at 08:04:38PM +0200, Andi Kleen wrote:
>
> > Something like that had to be done eventually about the inefficient
> > current_thread_info() mechanism,
>
> Inefficient? It's two fast instructions. I won't call that inefficient.
And that AGI stall?
> > I guess it's due to having tried that on an older installation with gcc 3.2,
> > which probably does less efficient opcode merging of current_thread_info()
> > requests compared to a current gcc version.
>
> gcc normally doesn't merge inline assembly at all.
Depends on use of volatile, right?
OK, so probably there was no merging of separate requests,
but opcode intermingling could have played a role.
Andreas Mohr
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 18:07 ` Andi Kleen
@ 2006-08-27 18:27 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 18:27 UTC (permalink / raw)
To: Andi Kleen
Cc: Arjan van de Ven, linux-kernel, Chuck Ebbert, Zachary Amsden,
Jan Beulich, Andrew Morton
Andi Kleen wrote:
> On AMD K7/K8 a segment register prefix is a single cycle penalty.
>
> I couldn't find anything in the Intel optimization manuals on it, but I assume
> it's also not dramatic.
>
All I could find was:
* avoid multiple prefixes (which was the least important guideline
in instruction selection)
* avoid using multiple segment registers (the pentium M only has one
level of segment register renaming)
* avoid prefixes which take the instruction length over 7 bytes
None of these apply to the use of %gs to access PDA.
Most of the discussion about prefixes is in avoiding the 0x66 16-bit prefix.
J
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 18:27 ` Andreas Mohr
@ 2006-08-27 18:35 ` Andi Kleen
0 siblings, 0 replies; 35+ messages in thread
From: Andi Kleen @ 2006-08-27 18:35 UTC (permalink / raw)
To: Andreas Mohr
Cc: Jeremy Fitzhardinge, linux-kernel, Chuck Ebbert, Zachary Amsden,
Jan Beulich, Andrew Morton
On Sunday 27 August 2006 20:27, Andreas Mohr wrote:
> Hi,
>
> On Sun, Aug 27, 2006 at 08:04:38PM +0200, Andi Kleen wrote:
> >
> > > Something like that had to be done eventually about the inefficient
> > > current_thread_info() mechanism,
> >
> > Inefficient? It's two fast instructions. I won't call that inefficient.
>
> And that AGI stall?
What AGI stall?
[btw AGI stall is an outdated concept on modern x86 CPUs]
> > > I guess it's due to having tried that on an older installation with gcc 3.2,
> > > which probably does less efficient opcode merging of current_thread_info()
> > > requests compared to a current gcc version.
> >
> > gcc normally doesn't merge inline assembly at all.
>
> Depends on use of volatile, right?
No. It can only merge statements it knows anything about, and it doesn't
about inline assembly.
> OK, so probably there was no merging of separate requests,
> but opcode intermingling could have played a role.
It seems to make some difference if it's able to move asm around
and if they don't have memory clobbers. memory clobbers really seem
to cause much worse code in the whole function.
But current_thread_info didn't have that.
-Andi
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 3/6] Use %gs as the PDA base-segment in the kernel.
2006-08-27 18:19 ` Andi Kleen
@ 2006-08-27 20:03 ` Jan Engelhardt
2006-08-27 23:38 ` Jeremy Fitzhardinge
` (2 subsequent siblings)
3 siblings, 0 replies; 35+ messages in thread
From: Jan Engelhardt @ 2006-08-27 20:03 UTC (permalink / raw)
To: Andi Kleen
Cc: Jeremy Fitzhardinge, linux-kernel, Chuck Ebbert, Zachary Amsden,
Jan Beulich, Andrew Morton
>> >> +1: movw GS(%esp), %gs
>> >
>> > movl is recommended in 32bit mode
>>
>> arch/i386/kernel/entry.S: Assembler messages:
>> arch/i386/kernel/entry.S:334: Error: suffix or operands invalid for `mov'
>
>Looks like a gas bug to me.
So the suffixing (MOVW vs MOVL) does not depend on the source operand?
%gs is only 16 bit wide so MOVW seems reasonable. If the destination
operand was a 32 bit location, something like MOVZX would be needed, would
not it?
Jan Engelhardt
--
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 3/6] Use %gs as the PDA base-segment in the kernel.
2006-08-27 18:19 ` Andi Kleen
2006-08-27 20:03 ` Jan Engelhardt
@ 2006-08-27 23:38 ` Jeremy Fitzhardinge
2006-08-28 9:51 ` Jan Beulich
2006-08-28 17:24 ` H. Peter Anvin
3 siblings, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 23:38 UTC (permalink / raw)
To: Andi Kleen
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andrew Morton
Andi Kleen wrote:
>>> movl is recommended in 32bit mode
>>>
>>>
>> arch/i386/kernel/entry.S: Assembler messages:
>> arch/i386/kernel/entry.S:334: Error: suffix or operands invalid for `mov'
>>
>
> Looks like a gas bug to me.
>
Well, it generates the 32-bit form of the instruction (no 0x66 prefix)
anyway.
J
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 3/6] Use %gs as the PDA base-segment in the kernel.
2006-08-27 18:19 ` Andi Kleen
2006-08-27 20:03 ` Jan Engelhardt
2006-08-27 23:38 ` Jeremy Fitzhardinge
@ 2006-08-28 9:51 ` Jan Beulich
2006-08-28 14:54 ` H. J. Lu
2006-08-28 17:24 ` H. Peter Anvin
3 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2006-08-28 9:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge, Andi Kleen
Cc: Chuck Ebbert, H. J. Lu, Andrew Morton, linux-kernel,
Zachary Amsden
>>> Andi Kleen <ak@suse.de> 27.08.06 20:19 >>>
>On Sunday 27 August 2006 19:20, Jeremy Fitzhardinge wrote:
>> Andi Kleen wrote:
>> >> +1: movw GS(%esp), %gs
>> >>
>> >
>> > movl is recommended in 32bit mode
>> >
>>
>> arch/i386/kernel/entry.S: Assembler messages:
>> arch/i386/kernel/entry.S:334: Error: suffix or operands invalid for `mov'
>
>Looks like a gas bug to me.
This was an intentional change (by H.J. if I recall right) as using movl
with segment registers gives the incorrect impression that one gets a
32-bit memory access (especially for stores this is important, since
there's really nothing stored to the upper 16 bits). One should always
use suffix-less 'mov' for segment register accesses.
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 3/6] Use %gs as the PDA base-segment in the kernel.
2006-08-28 9:51 ` Jan Beulich
@ 2006-08-28 14:54 ` H. J. Lu
0 siblings, 0 replies; 35+ messages in thread
From: H. J. Lu @ 2006-08-28 14:54 UTC (permalink / raw)
To: Jan Beulich
Cc: Jeremy Fitzhardinge, Andi Kleen, Chuck Ebbert, Andrew Morton,
linux-kernel, Zachary Amsden
On Mon, Aug 28, 2006 at 10:51:00AM +0100, Jan Beulich wrote:
> >>> Andi Kleen <ak@suse.de> 27.08.06 20:19 >>>
> >On Sunday 27 August 2006 19:20, Jeremy Fitzhardinge wrote:
> >> Andi Kleen wrote:
> >> >> +1: movw GS(%esp), %gs
> >> >>
> >> >
> >> > movl is recommended in 32bit mode
> >> >
> >>
> >> arch/i386/kernel/entry.S: Assembler messages:
> >> arch/i386/kernel/entry.S:334: Error: suffix or operands invalid for `mov'
> >
> >Looks like a gas bug to me.
>
> This was an intentional change (by H.J. if I recall right) as using movl
> with segment registers gives the incorrect impression that one gets a
> 32-bit memory access (especially for stores this is important, since
> there's really nothing stored to the upper 16 bits). One should always
> use suffix-less 'mov' for segment register accesses.
mov will generate the optimal opcode with old and new assemblers.
H.J.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RFC 3/6] Use %gs as the PDA base-segment in the kernel.
2006-08-27 18:19 ` Andi Kleen
` (2 preceding siblings ...)
2006-08-28 9:51 ` Jan Beulich
@ 2006-08-28 17:24 ` H. Peter Anvin
3 siblings, 0 replies; 35+ messages in thread
From: H. Peter Anvin @ 2006-08-28 17:24 UTC (permalink / raw)
To: Andi Kleen
Cc: Jeremy Fitzhardinge, linux-kernel, Chuck Ebbert, Zachary Amsden,
Jan Beulich, Andrew Morton
Andi Kleen wrote:
> On Sunday 27 August 2006 19:20, Jeremy Fitzhardinge wrote:
>> Andi Kleen wrote:
>>>> +1: movw GS(%esp), %gs
>>>>
>>> movl is recommended in 32bit mode
>>>
>> arch/i386/kernel/entry.S: Assembler messages:
>> arch/i386/kernel/entry.S:334: Error: suffix or operands invalid for `mov'
>
> Looks like a gas bug to me.
>
Not so much a bug as a misfeature. They changed the meaning of "movw"
and "movl" on segment registers between versions.
Either way, just "mov" should work.
-hpa
^ permalink raw reply [flat|nested] 35+ messages in thread
end of thread, other threads:[~2006-08-28 17:25 UTC | newest]
Thread overview: 35+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-08-27 8:44 [PATCH RFC 0/6] Implement per-processor data areas for i386 Jeremy Fitzhardinge
2006-08-27 8:44 ` [PATCH RFC 1/6] Basic definitions for i386-pda Jeremy Fitzhardinge
2006-08-27 8:44 ` [PATCH RFC 2/6] Initialize the per-CPU data area Jeremy Fitzhardinge
2006-08-27 8:44 ` [PATCH RFC 3/6] Use %gs as the PDA base-segment in the kernel Jeremy Fitzhardinge
2006-08-27 9:49 ` Keith Owens
2006-08-27 10:01 ` Jeremy Fitzhardinge
2006-08-27 15:57 ` Andi Kleen
2006-08-27 16:36 ` Jeremy Fitzhardinge
2006-08-27 17:20 ` Jeremy Fitzhardinge
2006-08-27 18:19 ` Andi Kleen
2006-08-27 20:03 ` Jan Engelhardt
2006-08-27 23:38 ` Jeremy Fitzhardinge
2006-08-28 9:51 ` Jan Beulich
2006-08-28 14:54 ` H. J. Lu
2006-08-28 17:24 ` H. Peter Anvin
2006-08-27 8:44 ` [PATCH RFC 4/6] Fix places where using %gs changes the usermode ABI Jeremy Fitzhardinge
2006-08-27 15:59 ` Andi Kleen
2006-08-27 16:37 ` Jeremy Fitzhardinge
2006-08-27 8:44 ` [PATCH RFC 5/6] Implement smp_processor_id() with the PDA Jeremy Fitzhardinge
2006-08-27 8:44 ` [PATCH RFC 6/6] Implement "current" " Jeremy Fitzhardinge
2006-08-27 16:01 ` Andi Kleen
2006-08-27 16:38 ` Jeremy Fitzhardinge
2006-08-27 9:47 ` [PATCH RFC 0/6] Implement per-processor data areas for i386 Arjan van de Ven
2006-08-27 16:46 ` Jeremy Fitzhardinge
2006-08-27 17:44 ` Arjan van de Ven
2006-08-27 18:07 ` Andi Kleen
2006-08-27 18:27 ` Jeremy Fitzhardinge
2006-08-27 16:01 ` Andi Kleen
2006-08-27 16:41 ` Jeremy Fitzhardinge
2006-08-27 17:21 ` Andreas Mohr
2006-08-27 17:34 ` Jeremy Fitzhardinge
2006-08-27 18:23 ` Andreas Mohr
2006-08-27 18:04 ` Andi Kleen
2006-08-27 18:27 ` Andreas Mohr
2006-08-27 18:35 ` Andi Kleen
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox