* [PATCH RFC 0/6] Implement per-processor data areas for i386.
@ 2006-08-27 8:44 Jeremy Fitzhardinge
2006-08-27 9:47 ` Arjan van de Ven
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 8:44 UTC (permalink / raw)
To: linux-kernel
Cc: Chuck Ebbert, Zachary Amsden, Jan Beulich, Andi Kleen,
Andrew Morton
This patch implements per-processor data areas by using %gs as the
base segment of the per-processor memory. This has two principal
advantages:
- It allows very simple direct access to per-processor data, using an
effective address of the form %gs:offset, where offset is the offset
into struct i386_pda. These sequences are faster and smaller than the
current mechanism using current_thread_info().
- It also allows per-CPU data to be allocated as each CPU is brought
up, rather than statically allocating it based on the maximum number
of CPUs which could be brought up.
I haven't measured performance yet, but when using the PDA for "current"
and "smp_processor_id", I see a 5715 byte reduction in .text segment
size for my kernel.
Unfortunately, these patches don't actually work yet. I'm not sure why;
I'm hoping review will turn something up.
Some background for people unfamiliar with x86 segmentation:
This uses the x86 segmentation stuff in a way similar to NPTL's way of
implementing Thread-Local Storage. It relies on the fact that each CPU
has its own Global Descriptor Table (GDT), which is basically an array
of base-length pairs (with some extra stuff). When a segment register
is loaded with a descriptor (approximately, an index in the GDT), and
you use that segment register for memory access, the address has the
base added to it, and the resulting address is used.
In other words, if you imagine the GDT containing an entry:

    Index    Base
    123:     0xc0211000   (allocated PDA)

and you load %gs with the corresponding selector (via a general
register, since segment registers can't be loaded with an immediate):

    mov $123, %eax
    mov %eax, %gs

and then use %gs later on:

    mov %gs:4, %eax

this has the effect of:

    mov 0xc0211004, %eax
and because the GDT is per-CPU, the base (= 0xc0211000 = the memory
allocated for this CPU's PDA) can be a CPU-specific value while leaving
everything else constant.
This means that something like "current" or "smp_processor_id()" can
collapse to a single instruction:
mov %gs:PDA_current, %reg
TODO:
- Make it work. It works UP on a test QEMU machine, but it doesn't
yet work on real hardware, or SMP (though not working SMP on QEMU is
more likely to be a QEMU problem). Not sure what the problem is yet;
I'm hoping review will reveal something.
- Measure performance impact. The patch adds a segment register
save/restore on entry/exit to the kernel. This expense should be
offset by savings in using the PDA while in the kernel, but I haven't
measured this yet. Space savings are already appealing though.
- Modify more things to use the PDA. The more that uses it, the more
the cost of the %gs save/restore is amortized. smp_processor_id and
current are the obvious first choices, which are implemented in this
series.
- Make it a config option? UP systems don't need to do any of this,
other than having a single pre-allocated PDA. Unfortunately, it gets
a bit messy to do this given the changes needed in handling %gs.
--
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 8:44 Jeremy Fitzhardinge
@ 2006-08-27 9:47 ` Arjan van de Ven
2006-08-27 16:46 ` Jeremy Fitzhardinge
2006-08-27 16:01 ` Andi Kleen
2006-08-27 17:21 ` Andreas Mohr
2 siblings, 1 reply; 23+ messages in thread
From: Arjan van de Ven @ 2006-08-27 9:47 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andi Kleen, Andrew Morton
> - Measure performance impact. The patch adds a segment register
> save/restore on entry/exit to the kernel. This expense should be
> offset by savings in using the PDA while in the kernel, but I haven't
> measured this yet. Space savings are already appealing though.
this will be interesting; x86-64 has a nice instruction to help with
this; 32 bit does not... so far conventional wisdom has been that
without the instruction it's not going to be worth it.
When you're benchmarking this please use multiple CPU generations from
different vendors; I suspect this is one of those things that vary
greatly between models.
> - Make it a config option? UP systems don't need to do any of this,
> other than having a single pre-allocated PDA. Unfortunately, it gets
> a bit messy to do this given the changes needed in handling %gs.
A config option for this is a mistake imo. Not every patch is worth a
config option! Either it's good or it's not; if it's good it should be
there always, and if it's not, it shouldn't be there at all. Something
this fundamental to the core doesn't really have a "but it's optional"
argument going for it, unlike individual drivers or subsystems...
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 8:44 Jeremy Fitzhardinge
2006-08-27 9:47 ` Arjan van de Ven
@ 2006-08-27 16:01 ` Andi Kleen
2006-08-27 16:41 ` Jeremy Fitzhardinge
2006-08-27 17:21 ` Andreas Mohr
2 siblings, 1 reply; 23+ messages in thread
From: Andi Kleen @ 2006-08-27 16:01 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andrew Morton
Very cool.
> - Make it work. It works UP on a test QEMU machine, but it doesn't
> yet work on real hardware, or SMP (though not working SMP on QEMU is
> more likely to be a QEMU problem). Not sure what the problem is yet;
> I'm hoping review will reveal something.
I bet qemu doesn't have a real descriptor cache unlike real CPUs.
So likely it is some disconnect between changing the backing GDT
and referencing the register. Reload %gs more aggressively?
Comparing with SimNow! (which should behave more like a real CPU)
might also be interesting.
> - Measure performance impact. The patch adds a segment register
> save/restore on entry/exit to the kernel. This expense should be
> offset by savings in using the PDA while in the kernel, but I haven't
> measured this yet. Space savings are already appealing though.
> - Modify more things to use the PDA. The more that uses it, the more
> the cost of the %gs save/restore is amortized. smp_processor_id and
> current are the obvious first choices, which are implemented in this
> series.
per cpu data would be the prime candidate. It is pretty simple.
> - Make it a config option? UP systems don't need to do any of this,
> other than having a single pre-allocated PDA. Unfortunately, it gets
> a bit messy to do this given the changes needed in handling %gs.
Please don't.
(weak point:)
- The stack protector code might work one day on i386 too.
-Andi
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 16:01 ` Andi Kleen
@ 2006-08-27 16:41 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 23+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 16:41 UTC (permalink / raw)
To: Andi Kleen
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andrew Morton
Andi Kleen wrote:
> I bet qemu doesn't have a real descriptor cache unlike real CPUs.
> So likely it is some disconnect between changing the backing GDT
> and referencing the register. Reload %gs more aggressively?
>
The GDT only gets touched once in cpu_init(), and %gs is reloaded on
every kernel entry, so I don't think that's it. It seems to have
interrupt issues with SMP.
And either way, it still doesn't work on real hardware...
> Comparing with SimNow! (which should behave more like a real CPU)
> might also be interesting.
>
Yeah, I'll have to try that out.
>> - Measure performance impact. The patch adds a segment register
>> save/restore on entry/exit to the kernel. This expense should be
>> offset by savings in using the PDA while in the kernel, but I haven't
>> measured this yet. Space savings are already appealing though.
>> - Modify more things to use the PDA. The more that uses it, the more
>> the cost of the %gs save/restore is amortized. smp_processor_id and
>> current are the obvious first choices, which are implemented in this
>> series.
>>
>
> per cpu data would be the prime candidate. It is pretty simple.
>
Well, it has to be arch-specific per-cpu data, since the PDA is arch
specific. But there should be various pieces of interrupt state that
adapt well to it.
>> - Make it a config option? UP systems don't need to do any of this,
>> other than having a single pre-allocated PDA. Unfortunately, it gets
>> a bit messy to do this given the changes needed in handling %gs.
>>
>
> Please don't.
>
Yeah, that wasn't really a serious thought...
J
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 9:47 ` Arjan van de Ven
@ 2006-08-27 16:46 ` Jeremy Fitzhardinge
2006-08-27 17:44 ` Arjan van de Ven
0 siblings, 1 reply; 23+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 16:46 UTC (permalink / raw)
To: Arjan van de Ven
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andi Kleen, Andrew Morton
Arjan van de Ven wrote:
> this will be interesting; x86-64 has a nice instruction to help with
> this; 32 bit does not... so far conventional wisdom has been that
> without the instruction it's not going to be worth it.
>
Hm, swapgs may be quick, but it isn't very easy to use since it doesn't
stack, and so requires careful handling for recursive kernel entries,
which involves extra tests and conditional jumps. I tried doing
something similar with my earlier patches, but it got all too messy.
Stacking %gs like the other registers works out pretty cleanly.
> When you're benchmarking this please use multiple CPU generations from
> different vendors; I suspect this is one of those things that vary
> greatly between models
>
Hm, it seems to me that unless the existing %ds/%es register
save/restores are a significant part of the existing cost of going
through entry.S, adding %gs to the set shouldn't make too much
difference. And I'm not sure about the relative cost of using a %gs
override vs. the normal current_thread_info() masking, but I'm assuming
they're at worst equal, with the %gs override having a code-size advantage.
But yes, it definitely needs measurement.
J
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 8:44 Jeremy Fitzhardinge
2006-08-27 9:47 ` Arjan van de Ven
2006-08-27 16:01 ` Andi Kleen
@ 2006-08-27 17:21 ` Andreas Mohr
2006-08-27 17:34 ` Jeremy Fitzhardinge
2006-08-27 18:04 ` Andi Kleen
2 siblings, 2 replies; 23+ messages in thread
From: Andreas Mohr @ 2006-08-27 17:21 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andi Kleen, Andrew Morton
Hi,
On Sun, Aug 27, 2006 at 01:44:17AM -0700, Jeremy Fitzhardinge wrote:
> This patch implements per-processor data areas by using %gs as the
> base segment of the per-processor memory. This has two principle
> advantages:
>
> - It allows very simple direct access to per-processor data by
> effectively using an effective address of the form %gs:offset, where
> offset is the offset into struct i386_pda. These sequences are faster
> and smaller than the current mechanism using current_thread_info().
Yess!!
Something like that had to be done eventually about the inefficient
current_thread_info() mechanism, but I wasn't sure what exactly.
> I haven't measured performance yet, but when using the PDA for "current"
> and "smp_processor_id", I see a 5715 byte reduction in .text segment
> size for my kernel.
This is interesting, since even by doing an inelegant
current->... --> struct task_struct *tsk = current;
replacement for excessive uses of current, I was able to gain almost 10kB
within a single file already!
I guess it's due to having tried that on an older installation with gcc 3.2,
which probably does less efficient opcode merging of current_thread_info()
requests compared to a current gcc version.
IOW, the .text segment reduction could be quite a bit higher for older gccs.
> This uses the x86 segmentation stuff in a way similar to NPTL's way of
> implementing Thread-Local Storage. It relies on the fact that each CPU
> has its own Global Descriptor Table (GDT), which is basically an array
> of base-length pairs (with some extra stuff). When a segment register
> is loaded with a descriptor (approximately, an index in the GDT), and
> you use that segment register for memory access, the address has the
> base added to it, and the resulting address is used.
Not a problem for more daring user-space apps (e.g. Wine), I hope?
Andreas Mohr
--
No programming skills!? Why not help translate many Linux applications!
https://launchpad.ubuntu.com/rosetta
(or alternatively buy nicely packaged Linux distros/OSS software to help
support Linux developers creating shiny new things for you?)
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 17:21 ` Andreas Mohr
@ 2006-08-27 17:34 ` Jeremy Fitzhardinge
2006-08-27 18:23 ` Andreas Mohr
2006-08-27 18:04 ` Andi Kleen
1 sibling, 1 reply; 23+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 17:34 UTC (permalink / raw)
To: Andreas Mohr
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andi Kleen, Andrew Morton
Andreas Mohr wrote:
> This is interesting, since even by doing a non-elegant
> current->... --> struct task_struct *tsk = current;
> replacement for excessive uses of current, I was able to gain almost 10kB
> within a single file already!
> I guess it's due to having tried that on an older installation with gcc 3.2,
> which probably does less efficient opcode merging of current_thread_info()
> requests compared to a current gcc version.
> IOW, .text segment reduction could be quite a bit higher for older gcc:s.
>
That doesn't sound likely. current_thread_info() is only about 2
instructions. Are you looking at the .text segment size, or the actual
file size? The latter can be very misleading in the presence of debug info.
>> This uses the x86 segmentation stuff in a way similar to NPTL's way of
>> implementing Thread-Local Storage. It relies on the fact that each CPU
>> has its own Global Descriptor Table (GDT), which is basically an array
>> of base-length pairs (with some extra stuff). When a segment register
>> is loaded with a descriptor (approximately, an index in the GDT), and
>> you use that segment register for memory access, the address has the
>> base added to it, and the resulting address is used.
>>
>
> Not a problem for more daring user-space apps (i.e. Wine), I hope?
>
No. The LDT is still available for userspace use.
J
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 16:46 ` Jeremy Fitzhardinge
@ 2006-08-27 17:44 ` Arjan van de Ven
2006-08-27 18:07 ` Andi Kleen
0 siblings, 1 reply; 23+ messages in thread
From: Arjan van de Ven @ 2006-08-27 17:44 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andi Kleen, Andrew Morton
On Sun, 2006-08-27 at 09:46 -0700, Jeremy Fitzhardinge wrote:
> Arjan van de Ven wrote:
> > this will be interesting; x86-64 has a nice instruction to help with
> > this; 32 bit does not... so far conventional wisdom has been that
> > without the instruction it's not going to be worth it.
> >
>
> Hm, swapgs may be quick, but it isn't very easy to use since it doesn't
> stack, and so requires careful handling for recursive kernel entries,
> which involves extra tests and conditional jumps. I tried doing
> something similar with my earlier patches, but it got all too messy.
> Stacking %gs like the other registers turns out pretty cleanly.
>
> > When you're benchmarking this please use multiple CPU generations from
> > different vendors; I suspect this is one of those things that vary
> > greatly between models
> >
>
> Hm, it seems to me that unless the existing %ds/%es register
> save/restores are a significant part of the existing cost of going
> through entry.S,
iirc the %fs one is at least. But it has been a while since I've looked
at this part of the kernel via performance traces.
> adding %gs to the set shouldn't make too much
> difference. And I'm not sure about the relative cost of using a %gs
> override vs. the normal current_thread_info() masking, but I'm assuming
> they're at worst equal, with the %gs override having a code-size advantage.
Your worst-case scenario would be if the segment override made it a
"complex" instruction, i.e. not parallel-decodable. That would mean it
basically costs you 6 or 7 instruction slots that can't be filled...
while an "and" and the like at least run nicely in parallel with other
stuff. I don't know which processors, if any, actually do this, but this
usage is rare/new enough that I'd not be surprised if there are some.
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 17:21 ` Andreas Mohr
2006-08-27 17:34 ` Jeremy Fitzhardinge
@ 2006-08-27 18:04 ` Andi Kleen
2006-08-27 18:27 ` Andreas Mohr
1 sibling, 1 reply; 23+ messages in thread
From: Andi Kleen @ 2006-08-27 18:04 UTC (permalink / raw)
To: Andreas Mohr
Cc: Jeremy Fitzhardinge, linux-kernel, Chuck Ebbert, Zachary Amsden,
Jan Beulich, Andrew Morton
> Something like that had to be done eventually about the inefficient
> current_thread_info() mechanism,
Inefficient? It's two fast instructions. I wouldn't call that inefficient.
> I guess it's due to having tried that on an older installation with gcc 3.2,
> which probably does less efficient opcode merging of current_thread_info()
> requests compared to a current gcc version.
gcc normally doesn't merge inline assembly at all.
-Andi
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 17:44 ` Arjan van de Ven
@ 2006-08-27 18:07 ` Andi Kleen
2006-08-27 18:27 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 23+ messages in thread
From: Andi Kleen @ 2006-08-27 18:07 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Jeremy Fitzhardinge, linux-kernel, Chuck Ebbert, Zachary Amsden,
Jan Beulich, Andrew Morton
> your worst case scenario would be if the segment override would make it
> a "complex" instruction, so not parallel decodable. That'd mean it would
> basically cost you 6 or 7 instruction slots that can't be filled...
> while an and and such at least run nicely in parallel with other stuff.
> I don't know which if any processors actually do this, but it's rare/new
> enough that I'd not be surprised if there are some.
On AMD K7/K8 a segment register prefix is a single cycle penalty.
I couldn't find anything in the Intel optimization manuals on it, but I assume
it's also not dramatic.
-Andi
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 17:34 ` Jeremy Fitzhardinge
@ 2006-08-27 18:23 ` Andreas Mohr
0 siblings, 0 replies; 23+ messages in thread
From: Andreas Mohr @ 2006-08-27 18:23 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: linux-kernel, Chuck Ebbert, Zachary Amsden, Jan Beulich,
Andi Kleen, Andrew Morton
Hi,
On Sun, Aug 27, 2006 at 10:34:46AM -0700, Jeremy Fitzhardinge wrote:
> Andreas Mohr wrote:
> >This is interesting, since even by doing a non-elegant
> >current->... --> struct task_struct *tsk = current;
> >replacement for excessive uses of current, I was able to gain almost 10kB
> >within a single file already!
> >I guess it's due to having tried that on an older installation with gcc
> >3.2,
> >which probably does less efficient opcode merging of current_thread_info()
> >requests compared to a current gcc version.
> >IOW, .text segment reduction could be quite a bit higher for older gcc:s.
> >
>
> That doesn't sound likely. current_thread_info() is only about 2
> instructions. Are you looking at the .text segment size, or the actual
> file size? The latter can be very misleading in the presence of debug info.
Got me. File size only. And that was 10kB of around 170kB, BTW.
Andreas Mohr
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 18:04 ` Andi Kleen
@ 2006-08-27 18:27 ` Andreas Mohr
2006-08-27 18:35 ` Andi Kleen
0 siblings, 1 reply; 23+ messages in thread
From: Andreas Mohr @ 2006-08-27 18:27 UTC (permalink / raw)
To: Andi Kleen
Cc: Jeremy Fitzhardinge, linux-kernel, Chuck Ebbert, Zachary Amsden,
Jan Beulich, Andrew Morton
Hi,
On Sun, Aug 27, 2006 at 08:04:38PM +0200, Andi Kleen wrote:
>
> > Something like that had to be done eventually about the inefficient
> > current_thread_info() mechanism,
>
> Inefficient? It's two fast instructions. I won't call that inefficient.
And that AGI stall?
> > I guess it's due to having tried that on an older installation with gcc 3.2,
> > which probably does less efficient opcode merging of current_thread_info()
> > requests compared to a current gcc version.
>
> gcc normally doesn't merge inline assembly at all.
Depends on use of volatile, right?
OK, so probably there was no merging of separate requests,
but opcode intermingling could have played a role.
Andreas Mohr
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 18:07 ` Andi Kleen
@ 2006-08-27 18:27 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 23+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-27 18:27 UTC (permalink / raw)
To: Andi Kleen
Cc: Arjan van de Ven, linux-kernel, Chuck Ebbert, Zachary Amsden,
Jan Beulich, Andrew Morton
Andi Kleen wrote:
> On AMD K7/K8 a segment register prefix is a single cycle penalty.
>
> I couldn't find anything in the Intel optimization manuals on it, but I assume
> it's also not dramatic.
>
All I could find was:
* avoid multiple prefixes (which was the least important guideline
in instruction selection)
* avoid using multiple segment registers (the pentium M only has one
level of segment register renaming)
* avoid prefixes which take the instruction length over 7 bytes
None of these apply to the use of %gs to access PDA.
Most of the discussion about prefixes is in avoiding the 0x66 16-bit prefix.
J
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-27 18:27 ` Andreas Mohr
@ 2006-08-27 18:35 ` Andi Kleen
0 siblings, 0 replies; 23+ messages in thread
From: Andi Kleen @ 2006-08-27 18:35 UTC (permalink / raw)
To: Andreas Mohr
Cc: Jeremy Fitzhardinge, linux-kernel, Chuck Ebbert, Zachary Amsden,
Jan Beulich, Andrew Morton
On Sunday 27 August 2006 20:27, Andreas Mohr wrote:
> Hi,
>
> On Sun, Aug 27, 2006 at 08:04:38PM +0200, Andi Kleen wrote:
> >
> > > Something like that had to be done eventually about the inefficient
> > > current_thread_info() mechanism,
> >
> > Inefficient? It's two fast instructions. I won't call that inefficient.
>
> And that AGI stall?
What AGI stall?
[btw AGI stall is an outdated concept on modern x86 CPUs]
> > > I guess it's due to having tried that on an older installation with gcc 3.2,
> > > which probably does less efficient opcode merging of current_thread_info()
> > > requests compared to a current gcc version.
> >
> > gcc normally doesn't merge inline assembly at all.
>
> Depends on use of volatile, right?
No. It can only merge statements it knows something about, and it
doesn't know anything about inline assembly.
> OK, so probably there was no merging of separate requests,
> but opcode intermingling could have played a role.
It seems to make some difference if it's able to move asms around
and if they don't have memory clobbers. Memory clobbers really seem
to cause much worse code in the whole function.
But current_thread_info() didn't have that.
-Andi
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
@ 2006-08-28 9:06 Chuck Ebbert
0 siblings, 0 replies; 23+ messages in thread
From: Chuck Ebbert @ 2006-08-28 9:06 UTC (permalink / raw)
To: Andreas Mohr
Cc: Andrew Morton, Andi Kleen, Jan Beulich, Zachary Amsden,
linux-kernel
In-Reply-To: <20060827172155.GA21724@rhlx01.fht-esslingen.de>
On Sun, 27 Aug 2006 19:21:55 +0200, Andreas Mohr wrote:
> Something like that had to be done eventually about the inefficient
> current_thread_info() mechanism, but I wasn't sure what exactly.
In 2.6.18 it's done in C and the optimizer does a pretty good job
with it in recent compilers.
--
Chuck
* [PATCH RFC 0/6] Implement per-processor data areas for i386.
@ 2006-08-30 9:00 Chuck Ebbert
2006-08-30 9:17 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 23+ messages in thread
From: Chuck Ebbert @ 2006-08-30 9:00 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Andrew Morton, Andi Kleen, Jan Beulich, Zachary Amsden,
linux-kernel
In-Reply-To: <20060827084417.918992193@goop.org>
On Sun, 27 Aug 2006 01:44:17 -0700, Jeremy Fitzhardinge wrote:
> This patch implements per-processor data areas by using %gs as the
> base segment of the per-processor memory.
This changes the ABI for signals and ptrace() and that seems like
a bad idea to me.
And the way things are done now is so ingrained into the i386
kernel that I'm not sure it can be done. E.g. I found two
open-coded implementations of current, one in kernel_fpu_begin()
and one in math_state_restore().
> - It also allows per-CPU data to be allocated as each CPU is brought
> up, rather than statically allocating it based on the maximum number
> of CPUs which could be brought up.
Can you describe what it is about the way things work now that
prevents dynamic allocation?
--
Chuck
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-30 9:00 Chuck Ebbert
@ 2006-08-30 9:17 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 23+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-30 9:17 UTC (permalink / raw)
To: Chuck Ebbert
Cc: Andrew Morton, Andi Kleen, Jan Beulich, Zachary Amsden,
linux-kernel
Chuck Ebbert wrote:
>> This patch implements per-processor data areas by using %gs as the
>> base segment of the per-processor memory.
>>
> This changes the ABI for signals and ptrace() and that seems like
> a bad idea to me.
>
I don't believe it does; it certainly shouldn't change the usermode
ABI. How do you see it changing?
> And the way things are done now is so ingrained into the i386
> kernel that I'm not sure it can be done. E.g. I found two
> open-coded implementations of current, one in kernel_fpu_begin()
> and one in math_state_restore().
>
That's OK. The current task will still be available in thread_info;
this is needed for very early CPU bringup anyway, before the PDA has
been set up. The vast majority of "get current task" operations can be
swept up by changing "current" however.
> Can you describe what it is about the way things work now that
> prevents dynamic allocation?
>
To be honest, I haven't looked at percpu.h in great detail. I was
making assumptions about how it works, but it looks like they were wrong.
J
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
@ 2006-08-30 12:33 Chuck Ebbert
2006-08-30 12:54 ` Andi Kleen
2006-08-30 16:39 ` Jeremy Fitzhardinge
0 siblings, 2 replies; 23+ messages in thread
From: Chuck Ebbert @ 2006-08-30 12:33 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: linux-kernel, Zachary Amsden, Jan Beulich, Andi Kleen,
Andrew Morton
In-Reply-To: <44F557A8.1030605@goop.org>
On Wed, 30 Aug 2006 02:17:28 -0700, Jeremy Fitzhardinge wrote:
> > This changes the ABI for signals and ptrace() and that seems like
> > a bad idea to me.
> >
>
> I don't believe it does; it certainly shouldn't change the usermode
> ABI. How do you see it changing?
Never mind. I thought that because you changed struct pt_regs in
ptrace_abi.h it meant a user ABI change.
> > And the way things are done now is so ingrained into the i386
> > kernel that I'm not sure it can be done. E.g. I found two
> > open-coded implementations of current, one in kernel_fpu_begin()
> > and one in math_state_restore().
> >
>
> That's OK. The current task will still be available in thread_info;
But they can get out of sync, e.g. when switch_to() restores the new
task's esp, the PDA still contains the old pcurrent and they don't get
synchronized until the write_pda() in __switch_to().
> To be honest, I haven't looked at percpu.h in great detail. I was
> making assumptions about how it works, but it looks like they were wrong.
Would it make any sense to replace the 'cpu' field in thread_info with
a pointer to a PDA-like structure? We could even embed the static per_cpu
data directly into that struct instead of chasing pointers...
--
Chuck
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-30 12:33 [PATCH RFC 0/6] Implement per-processor data areas for i386 Chuck Ebbert
@ 2006-08-30 12:54 ` Andi Kleen
2006-08-30 16:39 ` Jeremy Fitzhardinge
1 sibling, 0 replies; 23+ messages in thread
From: Andi Kleen @ 2006-08-30 12:54 UTC (permalink / raw)
To: Chuck Ebbert
Cc: Jeremy Fitzhardinge, linux-kernel, Zachary Amsden, Jan Beulich,
Andrew Morton
On Wednesday 30 August 2006 14:33, Chuck Ebbert wrote:
> In-Reply-To: <44F557A8.1030605@goop.org>
>
> On Wed, 30 Aug 2006 02:17:28 -0700, Jeremy Fitzhardinge wrote:
>
> > > This changes the ABI for signals and ptrace() and that seems like
> > > a bad idea to me.
> > >
> >
> > I don't believe it does; it certainly shouldn't change the usermode
> > ABI. How do you see it changing?
>
> Nevermind. I thought because you changed struct pt_regs in ptrace_abi.h
> it meant a user ABI change.
I think he did actually break the ptrace ABI in the first patch, but only
by mistake, and he promised to fix it :)
>
> > > And the way things are done now is so ingrained into the i386
> > > kernel that I'm not sure it can be done. E.g. I found two
> > > open-coded implementations of current, one in kernel_fpu_begin()
> > > and one in math_state_restore().
Perhaps those should be fixed? Is there a reason they are open coded?
> > >
> >
> > That's OK. The current task will still be available in thread_info;
>
> But they can get out of sync, e.g. when switch_to() restores the new
> task's esp, the PDA still contains the old pcurrent and they don't get
> synchronized until the write_pda() in __switch_to().
But there is neither kernel_fpu_begin nor math_state_restore in between.
And I think interrupts are off too.
>
> > To be honest, I haven't looked at percpu.h in great detail. I was
> > making assumptions about how it works, but it looks like they were wrong.
>
> Would it make any sense to replace the 'cpu' field in thread_info with
> a pointer to a PDA-like structure? We could even embed the static per_cpu
> data directly into that struct instead of chasing pointers...
I don't see what advantage it would have. %gs is clearly faster and shorter.
-Andi
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-30 12:33 [PATCH RFC 0/6] Implement per-processor data areas for i386 Chuck Ebbert
2006-08-30 12:54 ` Andi Kleen
@ 2006-08-30 16:39 ` Jeremy Fitzhardinge
2006-08-30 16:48 ` Andi Kleen
1 sibling, 1 reply; 23+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-30 16:39 UTC (permalink / raw)
To: Chuck Ebbert
Cc: linux-kernel, Zachary Amsden, Jan Beulich, Andi Kleen,
Andrew Morton
Chuck Ebbert wrote:
> Never mind. I thought that because you changed struct pt_regs in ptrace_abi.h,
> it meant a user ABI change.
>
Yes, that header seems pretty badly misnamed. The user-visible
structure is user_pt_regs.
> But they can get out of sync, e.g. when switch_to() restores the new
> task's esp, the PDA still contains the old pcurrent and they don't get
> synchronized until the write_pda() in __switch_to().
>
Yes, that's true, but the window is fairly small. More importantly, the
question of "what task is currently running" is fundamentally
ill-defined while you're in the middle of a context switch, so I don't
think this is a big issue. __switch_to runs on the new task's stack, so
that's effectively where current_thread_info()->task changes value
(aside from the few instructions in switch_to() between the %esp update
and the jmp to __switch_to). I could put the write_pda() earlier in
__switch_to so the window is smaller. But again, for it to make a
difference, someone would want to be using current *within* __switch_to,
which is just silly.
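[For readers following along, the window being discussed can be sketched like
this. This is pseudocode paraphrasing the shape of the i386 switch path, not
the actual patch code:]

```
switch_to(prev, next):
    save prev->esp, prev->eip
    load next->esp          # the stack now belongs to "next", so
                            # current_thread_info() already answers "next"
    jmp __switch_to         # ...but %gs:pcurrent still names "prev"

__switch_to(prev_p, next_p):
    write_pda(pcurrent, next_p)   # PDA catches up here; per Andi,
    ...                           # interrupts are off for the whole window
```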
>> To be honest, I haven't looked at percpu.h in great detail. I was
>> making assumptions about how it works, but it looks like they were wrong.
>>
>
> Would it make any sense to replace the 'cpu' field in thread_info with
> a pointer to a PDA-like structure? We could even embed the static per_cpu
> data directly into that struct instead of chasing pointers...
>
I don't think so. The whole point is to make the pda easily accessible
with simple addressing modes based on %gs:. I have been wondering if
we can modify the percpu mechanism to get the linker to construct the
layout of the pda, so that all the existing percpu stuff can be
transparently moved into the pda and accessed efficiently. I think it
would be pretty tricky to get it all working though...
But the PDA isn't really intended for *all* per-cpu data; it's mostly for
stuff which is accessed often, or which needs to be quickly and easily
accessible. My specific motivation is to use the PDA for easy access to
Xen per-cpu data, which needs to be accessible with short instructions
which can be inlined.
J
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-30 16:39 ` Jeremy Fitzhardinge
@ 2006-08-30 16:48 ` Andi Kleen
2006-08-30 17:13 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 23+ messages in thread
From: Andi Kleen @ 2006-08-30 16:48 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Chuck Ebbert, linux-kernel, Zachary Amsden, Jan Beulich,
Andrew Morton
>
> I don't think so. The whole point is to make the pda easily accessible
> with simple addressing modes based on %gs:. I have been wondering if
> we can modify the percpu mechanism to get the linker to construct the
> layout of the pda, so that all the existing percpu stuff can be
> transparently moved into the pda and accessed efficiently. I think it
> would be pretty tricky to get it all working though...
I tried that once on x86-64, but it wasn't possible because the linker
is missing the right relocations. It has something that looks similar at
first glance, for __thread data in user space, but it wasn't usable for
the kernel.
Even with a single indirection it is still far more efficient than an
array lookup.
-Andi
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-30 16:48 ` Andi Kleen
@ 2006-08-30 17:13 ` Jeremy Fitzhardinge
2006-08-30 17:32 ` Andi Kleen
0 siblings, 1 reply; 23+ messages in thread
From: Jeremy Fitzhardinge @ 2006-08-30 17:13 UTC (permalink / raw)
To: Andi Kleen
Cc: Chuck Ebbert, linux-kernel, Zachary Amsden, Jan Beulich,
Andrew Morton
Andi Kleen wrote:
> I tried that once on x86-64, but it wasn't possible because the linker
> is missing the right relocations. It has something that looks similar at
> first glance, for __thread data in user space, but it wasn't usable for
> the kernel.
>
The other difficulty is that you can't take the addresses of things in
the pda and pass them around, which happens a lot. It would be another
matter if you could add an attribute to a pointer which told gcc "always
use a %gs override for indirecting through this pointer", though that
would still require some type qualifier on any pointer which holds the
value.
(Hm, it would be interesting to see if we could possibly use the code
generated by gcc for TLS variables...)
> Even with a single indirection it is still far more efficient than a
> array lookup.
Yes. Even in the Xen case it will be useful to have a pointer to the
vcpu structure, at least until Xen can be convinced to put the vcpu
structure in the PDA itself.
J
* Re: [PATCH RFC 0/6] Implement per-processor data areas for i386.
2006-08-30 17:13 ` Jeremy Fitzhardinge
@ 2006-08-30 17:32 ` Andi Kleen
0 siblings, 0 replies; 23+ messages in thread
From: Andi Kleen @ 2006-08-30 17:32 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Chuck Ebbert, linux-kernel, Zachary Amsden, Jan Beulich,
Andrew Morton
On Wednesday 30 August 2006 19:13, Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> > I tried that once on x86-64, but it wasn't possible because the linker
> > is missing the right relocations. It has something that looks similar at
> > first glance, for __thread data in user space, but it wasn't usable for
> > the kernel.
> >
> The other difficulty is that you can't take the addresses of things in
> the pda and pass them around, which happens a lot.
User-space __thread works around this by always storing the address at
offset 0. The kernel does something similar, except it's not at offset 0.
> (Hm, it would be interesting to see if we could possibly use the code
> generated by gcc for TLS variables...)
I tried that once on 64-bit and it wasn't possible, for various reasons.
-Andi
end of thread, other threads: [~2006-08-30 17:32 UTC | newest]
Thread overview: 23+ messages
-- links below jump to the message on this page --
2006-08-30 12:33 [PATCH RFC 0/6] Implement per-processor data areas for i386 Chuck Ebbert
2006-08-30 12:54 ` Andi Kleen
2006-08-30 16:39 ` Jeremy Fitzhardinge
2006-08-30 16:48 ` Andi Kleen
2006-08-30 17:13 ` Jeremy Fitzhardinge
2006-08-30 17:32 ` Andi Kleen
-- strict thread matches above, loose matches on Subject: below --
2006-08-30 9:00 Chuck Ebbert
2006-08-30 9:17 ` Jeremy Fitzhardinge
2006-08-28 9:06 Chuck Ebbert
2006-08-27 8:44 Jeremy Fitzhardinge
2006-08-27 9:47 ` Arjan van de Ven
2006-08-27 16:46 ` Jeremy Fitzhardinge
2006-08-27 17:44 ` Arjan van de Ven
2006-08-27 18:07 ` Andi Kleen
2006-08-27 18:27 ` Jeremy Fitzhardinge
2006-08-27 16:01 ` Andi Kleen
2006-08-27 16:41 ` Jeremy Fitzhardinge
2006-08-27 17:21 ` Andreas Mohr
2006-08-27 17:34 ` Jeremy Fitzhardinge
2006-08-27 18:23 ` Andreas Mohr
2006-08-27 18:04 ` Andi Kleen
2006-08-27 18:27 ` Andreas Mohr
2006-08-27 18:35 ` Andi Kleen