[PATCH] turn off writable page tables

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] turn off writable page tables
@ 2006-07-25 22:14 Andrew Theurer
  2006-07-25 22:43 ` Nivedita Singhvi
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Theurer @ 2006-07-25 22:14 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 1498 bytes --]

At OLS I gave a talk on some of the Xen scalability inhibitors, and one 
of these was writable page tables.  We went over why the feature does 
not scale, but just as important, we found that the uniprocessor case 
does not provide any advantage either.  These tests were done on x86_64, 
so I wanted to run the 1-way test on 32 bit to show the same problem.  
So, I have run with writable PTs and with emulation forced on for 
several benchmarks:

on Xeon MP processor, uniprocessor dom0 kernel, pae=y:

benchmark                c/s 10729 force_emulate
------------------------ --------- -------------
lmbench fork+exit:       469.5833  470.3913   usec, lower is better
lmbench fork+execve:     1241.0000 1225.7778  usec, lower is better
lmbench fork+/sbin/bash: 12190.000 12119.000  usec, lower is better
dbench 3.03              186.354   191.278    MB/sec
reaim_aim9               1890.01   2055.97    jobs/min
reaim_compute            2538.75   2522.90    jobs/min
reaim_dbase              3852.14   3739.38    jobs/min
reaim_fserver            4437.93   4389.71    jobs/min
reaim_shared             2365.85   2362.97    jobs/min
SPEC SDET                4315.91   4312.02    scripts/hr

These are all within the noise level (some slightly better, some 
slightly worse for emulate).  There really isn't much of difference 
here.  I'd like to propose turning on the emulate path all the time in 
xen. 

-Andrew Theurer

Applies to c/s 10729
Signed-off-by: Andrew Theurer <habanero@us.ibm.com>


[-- Attachment #2: force-emulate.patch --]
[-- Type: text/x-patch, Size: 493 bytes --]

diff -Naurp xen-unstable.hg-10729/xen/arch/x86/mm.c xen-unstable.hg-10729-emulate/xen/arch/x86/mm.c
--- xen-unstable.hg-10729/xen/arch/x86/mm.c	2006-07-25 17:05:33.000000000 -0500
+++ xen-unstable.hg-10729-emulate/xen/arch/x86/mm.c	2006-07-25 17:03:40.000000000 -0500
@@ -3582,7 +3582,7 @@ int ptwr_do_page_fault(struct domain *d,
         return 0;
     }
 
-#if 0 /* Leave this in as useful for debugging */ 
+#if 1 /* Leave this in as useful for debugging */ 
     goto emulate; 
 #endif
 

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH] turn off writable page tables
@ 2006-07-25 22:41 Ian Pratt
  2006-07-26  2:25 ` Andrew Theurer
  2006-07-26  8:18 ` Gerd Hoffmann
  0 siblings, 2 replies; 28+ messages in thread
From: Ian Pratt @ 2006-07-25 22:41 UTC (permalink / raw)
  To: Andrew Theurer, xen-devel

> on Xeon MP processor, uniprocessor dom0 kernel, pae=y:
> 
> benchmark                c/s 10729 force_emulate
> ------------------------ --------- -------------
> lmbench fork+exit:       469.5833  470.3913   usec, lower is better
> lmbench fork+execve:     1241.0000 1225.7778  usec, lower is better
> lmbench fork+/sbin/bash: 12190.000 12119.000  usec, lower is better

It's kinda weird that these scores are so close -- I guess its just
coincidence that we must be getting something like an average of 10-20
pte's updated per pagetable page and the cost of doing multiple emulates
perfectly balances the cost of unhooking/rehooking.

I would like to make sure we fully understand what's going on, though.

I'd like to make sure there's no 'dumb stuff' happening, and the
writeable pagetables isn't being used erroneously where we don't expect
it (hence crippling the scores), and that its actually functioning as
intended i.e. that we get one fault to unhook, and then a fault causing
a rehook once we move to the next page in the fork.

If you write a little test program that dirties a large chunk of memory
just before the fork, we should see writeable pagetables winning easily.

It would also be good to use some of the trace buffer stuff to find out
exactly what the sequence of faults and flushes is.

I have no problem with enabling force emulation, I'd just like to fully
understand the tradeoff. I suspect the answer is that typically only a
handful of PTEs are dirty, and hence there are relatively few updates to
the parent process's page tables. It's worth understanding this as it
also has implications for shadow pagetables.

Thanks,
Ian

> dbench 3.03              186.354   191.278    MB/sec
> reaim_aim9               1890.01   2055.97    jobs/min
> reaim_compute            2538.75   2522.90    jobs/min
> reaim_dbase              3852.14   3739.38    jobs/min
> reaim_fserver            4437.93   4389.71    jobs/min
> reaim_shared             2365.85   2362.97    jobs/min
> SPEC SDET                4315.91   4312.02    scripts/hr
> 
> These are all within the noise level (some slightly better, some
> slightly worse for emulate).  There really isn't much of difference
> here.  I'd like to propose turning on the emulate path all the time in
> xen.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-25 22:14 [PATCH] turn off writable page tables Andrew Theurer
@ 2006-07-25 22:43 ` Nivedita Singhvi
  2006-07-25 23:19   ` Andrew Theurer
  0 siblings, 1 reply; 28+ messages in thread
From: Nivedita Singhvi @ 2006-07-25 22:43 UTC (permalink / raw)
  To: Andrew Theurer; +Cc: xen-devel

Andrew Theurer wrote:

> 
> These are all within the noise level (some slightly better, some 
> slightly worse for emulate).  There really isn't much of difference 
> here.  I'd like to propose turning on the emulate path all the time in xen.
> -Andrew Theurer
> 
> Applies to c/s 10729
> Signed-off-by: Andrew Theurer <habanero@us.ibm.com>

Andrew, is this something for 3.0.3/Fedora6/RHEL5 consideration,
or post-3.0.3?

thanks,
Nivedita

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-25 22:43 ` Nivedita Singhvi
@ 2006-07-25 23:19   ` Andrew Theurer
  0 siblings, 0 replies; 28+ messages in thread
From: Andrew Theurer @ 2006-07-25 23:19 UTC (permalink / raw)
  To: Nivedita Singhvi; +Cc: xen-devel

Nivedita Singhvi wrote:
> Andrew Theurer wrote:
>
>>
>> These are all within the noise level (some slightly better, some 
>> slightly worse for emulate).  There really isn't much of difference 
>> here.  I'd like to propose turning on the emulate path all the time 
>> in xen.
>> -Andrew Theurer
>>
>> Applies to c/s 10729
>> Signed-off-by: Andrew Theurer <habanero@us.ibm.com>
>
> Andrew, is this something for 3.0.3/Fedora6/RHEL5 consideration,
> or post-3.0.3?
>
>
I think it could go either way.  There should be no "risk" using emulate 
over writable PT when it comes to stability, etc.

-Andrew

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-25 22:41 Ian Pratt
@ 2006-07-26  2:25 ` Andrew Theurer
  2006-07-26  5:31   ` Jacob Gorm Hansen
  2006-07-26  8:18 ` Gerd Hoffmann
  1 sibling, 1 reply; 28+ messages in thread
From: Andrew Theurer @ 2006-07-26  2:25 UTC (permalink / raw)
  To: Ian Pratt; +Cc: xen-devel

Ian Pratt wrote:
>> on Xeon MP processor, uniprocessor dom0 kernel, pae=y:
>>
>> benchmark                c/s 10729 force_emulate
>> ------------------------ --------- -------------
>> lmbench fork+exit:       469.5833  470.3913   usec, lower is better
>> lmbench fork+execve:     1241.0000 1225.7778  usec, lower is better
>> lmbench fork+/sbin/bash: 12190.000 12119.000  usec, lower is better
>>     
>
> It's kinda weird that these scores are so close -- I guess its just
> coincidence that we must be getting something like an average of 10-20
> pte's updated per pagetable page and the cost of doing multiple emulates
> perfectly balances the cost of unhooking/rehooking.
>   
Ian, I'll try a small program which dirties a large chunk of memory.  
I'll also try the trace tool and see what we get.

Thanks,

Andrew

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-26  2:25 ` Andrew Theurer
@ 2006-07-26  5:31   ` Jacob Gorm Hansen
  0 siblings, 0 replies; 28+ messages in thread
From: Jacob Gorm Hansen @ 2006-07-26  5:31 UTC (permalink / raw)
  To: Andrew Theurer; +Cc: Ian Pratt, xen-devel

On 7/25/06, Andrew Theurer <habanero@us.ibm.com> wrote:
> Ian Pratt wrote:
> >> on Xeon MP processor, uniprocessor dom0 kernel, pae=y:
> >>
> >> benchmark                c/s 10729 force_emulate
> >> ------------------------ --------- -------------
> >> lmbench fork+exit:       469.5833  470.3913   usec, lower is better
> >> lmbench fork+execve:     1241.0000 1225.7778  usec, lower is better
> >> lmbench fork+/sbin/bash: 12190.000 12119.000  usec, lower is better
> >>
> >
> > It's kinda weird that these scores are so close -- I guess its just
> > coincidence that we must be getting something like an average of 10-20
> > pte's updated per pagetable page and the cost of doing multiple emulates
> > perfectly balances the cost of unhooking/rehooking.

Just a silly question; is the old batched update mechanism totally out
of the picture here? Is it the cost of taking additional faults that
makes writable ptes as slow as emulation (which I suppose just means
shadow p.t.s)? Is there tension between shadow pt cache size inside
Xen and runtime performance?

Regards,
Jacob

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-25 22:41 Ian Pratt
  2006-07-26  2:25 ` Andrew Theurer
@ 2006-07-26  8:18 ` Gerd Hoffmann
  2006-07-26  8:40   ` Keir Fraser
  1 sibling, 1 reply; 28+ messages in thread
From: Gerd Hoffmann @ 2006-07-26  8:18 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Andrew Theurer, xen-devel

  Hi,

> I'd like to make sure there's no 'dumb stuff' happening, and the
> writeable pagetables isn't being used erroneously where we don't expect
> it (hence crippling the scores), and that its actually functioning as
> intended i.e. that we get one fault to unhook, and then a fault causing
> a rehook once we move to the next page in the fork.
>    
> If you write a little test program that dirties a large chunk of memory
> just before the fork, we should see writeable pagetables winning easily.

Just an idea:  Any chance mm_pin() and mm_unpin() cause this?  The bulk
page table updates for the new process created by fork() are not seen by
xen anyway I think.  The first schedule of the new process triggers
pinning, i.e. r/o mapping and verification ...

cheers,

  Gerd

-- 
Gerd Hoffmann <kraxel@suse.de>
http://www.suse.de/~kraxel/julika-dora.jpeg

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-26  8:18 ` Gerd Hoffmann
@ 2006-07-26  8:40   ` Keir Fraser
  2006-07-26 21:10     ` Andrew Theurer
  0 siblings, 1 reply; 28+ messages in thread
From: Keir Fraser @ 2006-07-26  8:40 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: Ian Pratt, Andrew Theurer, xen-devel


On 26 Jul 2006, at 09:18, Gerd Hoffmann wrote:

>> I'd like to make sure there's no 'dumb stuff' happening, and the
>> writeable pagetables isn't being used erroneously where we don't 
>> expect
>> it (hence crippling the scores), and that its actually functioning as
>> intended i.e. that we get one fault to unhook, and then a fault 
>> causing
>> a rehook once we move to the next page in the fork.
>>
>> If you write a little test program that dirties a large chunk of 
>> memory
>> just before the fork, we should see writeable pagetables winning 
>> easily.
>
> Just an idea:  Any chance mm_pin() and mm_unpin() cause this?  The bulk
> page table updates for the new process created by fork() are not seen 
> by
> xen anyway I think.  The first schedule of the new process triggers
> pinning, i.e. r/o mapping and verification ...

The batching should still benefit the write-protecting of the parent 
pagetables, which are visible to Xen during fork() (since the fork() 
runs on them!).

Hence the suggestion of dirtying pages before the fork -- that will 
ensure that lots of PTEs are definitely writable, and so they will have 
to be updated to make them read-only.

  -- Keir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-26  8:40   ` Keir Fraser
@ 2006-07-26 21:10     ` Andrew Theurer
  0 siblings, 0 replies; 28+ messages in thread
From: Andrew Theurer @ 2006-07-26 21:10 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Ian Pratt, Gerd Hoffmann, xen-devel

Keir Fraser wrote:
>
> On 26 Jul 2006, at 09:18, Gerd Hoffmann wrote:
>
>>> I'd like to make sure there's no 'dumb stuff' happening, and the
>>> writeable pagetables isn't being used erroneously where we don't expect
>>> it (hence crippling the scores), and that its actually functioning as
>>> intended i.e. that we get one fault to unhook, and then a fault causing
>>> a rehook once we move to the next page in the fork.
>>>
>>> If you write a little test program that dirties a large chunk of memory
>>> just before the fork, we should see writeable pagetables winning 
>>> easily.
>>
>> Just an idea:  Any chance mm_pin() and mm_unpin() cause this?  The bulk
>> page table updates for the new process created by fork() are not seen by
>> xen anyway I think.  The first schedule of the new process triggers
>> pinning, i.e. r/o mapping and verification ...
>
> The batching should still benefit the write-protecting of the parent 
> pagetables, which are visible to Xen during fork() (since the fork() 
> runs on them!).
>
> Hence the suggestion of dirtying pages before the fork -- that will 
> ensure that lots of PTEs are definitely writable, and so they will 
> have to be updated to make them read-only.
>
And it does make a difference in this case.  I now have a test program 
which dirties a number of virtually contiguous pages then forks (it also 
resets xen perf counters before fork and collects perf counters right 
after fork), then records the elapsed time for the fork.  The difference 
is quite amazing in this case.  For both writable and emulate, I ran 
with a range of dirty pages, from 1280 to 128000.  The elapsed times for 
fork a quite linear from small number to large number of dirty pages. 
Below are the min and max:

         1280 pages    128000 pages
wtpt:     813 usec      37552 usec 
emulate: 3279 usec     283879 usec

The perf counters showed just about every writable page had all entries 
modified (for 128000 pages below):

writable pt updates: total: 253  all entries updated: 250

So, in a -perfect-world- this works great.  Problem is most workloads 
don't appear to have a vast percentage of entries that need to be 
updated.   I'll go ahead and  expand this test to find out what the 
threshold is to break even.  I'll also see if we can implement a batched 
call in fork to update the parent -I hope this will show just as good 
performance even when most entries need modification and even better 
performance over wtpt with a low number of entries modified.

-Andrew

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH] turn off writable page tables
@ 2006-07-26 21:38 Ian Pratt
  2006-07-27 14:43 ` Andrew Theurer
  0 siblings, 1 reply; 28+ messages in thread
From: Ian Pratt @ 2006-07-26 21:38 UTC (permalink / raw)
  To: Andrew Theurer, Keir Fraser; +Cc: Ian Pratt, Gerd Hoffmann, xen-devel

> And it does make a difference in this case.  I now have a test program
> which dirties a number of virtually contiguous pages then forks (it
also
> resets xen perf counters before fork and collects perf counters right
> after fork), then records the elapsed time for the fork.  The
difference
> is quite amazing in this case.  For both writable and emulate, I ran
> with a range of dirty pages, from 1280 to 128000.  The elapsed times
for
> fork a quite linear from small number to large number of dirty pages.
> Below are the min and max:
> 
>          1280 pages    128000 pages
> wtpt:     813 usec      37552 usec
> emulate: 3279 usec     283879 usec

Good, at least that suggests that the code works for the usage it was
intended for. 

> So, in a -perfect-world- this works great.  Problem is most workloads
> don't appear to have a vast percentage of entries that need to be
> updated.   I'll go ahead and  expand this test to find out what the
> threshold is to break even.  I'll also see if we can implement a
batched
> call in fork to update the parent -I hope this will show just as good
> performance even when most entries need modification and even better
> performance over wtpt with a low number of entries modified.

With license to make more invasive changes to core Linux mm it certainly
should be possible to optimize this specific case with a batched update
fairly easily. You could even go further an implement a 'make all PTEs
in pagetable RO' hypercall, possibly including a copy to the child. This
could potentially work better than current 'late pin', at least the
validation would be incremental rather than in one big hit at the end. 

Ian

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH] turn off writable page tables
       [not found] <E1G5sBV-0005eg-At@host-192-168-0-1-bcn-london>
@ 2006-07-26 23:38 ` Joe Bonasera
  0 siblings, 0 replies; 28+ messages in thread
From: Joe Bonasera @ 2006-07-26 23:38 UTC (permalink / raw)
  To: xen-devel

xen-devel-request@lists.xensource.com wrote:

> 
> Message: 1
> Date: Wed, 26 Jul 2006 22:38:32 +0100
> From: "Ian Pratt" <m+Ian.Pratt@cl.cam.ac.uk>
> Subject: RE: [Xen-devel] [PATCH] turn off writable page tables
> To: "Andrew Theurer" <habanero@us.ibm.com>,	"Keir Fraser"
> 	<Keir.Fraser@cl.cam.ac.uk>
> Cc: Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk>, Gerd Hoffmann
> 	<kraxel@suse.de>,	xen-devel@lists.xensource.com
> Message-ID:
> 	<A95E2296287EAD4EB592B5DEEFCE0E9D572247@liverpoolst.ad.cl.cam.ac.uk>
> Content-Type: text/plain;	charset="us-ascii"
> 
>> And it does make a difference in this case.  I now have a test program
>> which dirties a number of virtually contiguous pages then forks (it
> also
>> resets xen perf counters before fork and collects perf counters right
>> after fork), then records the elapsed time for the fork.  The
> difference
>> is quite amazing in this case.  For both writable and emulate, I ran
>> with a range of dirty pages, from 1280 to 128000.  The elapsed times
> for
>> fork a quite linear from small number to large number of dirty pages.
>> Below are the min and max:
>>
>>          1280 pages    128000 pages
>> wtpt:     813 usec      37552 usec
>> emulate: 3279 usec     283879 usec
> 
> Good, at least that suggests that the code works for the usage it was
> intended for. 
> 
>> So, in a -perfect-world- this works great.  Problem is most workloads
>> don't appear to have a vast percentage of entries that need to be
>> updated.   I'll go ahead and  expand this test to find out what the
>> threshold is to break even.  I'll also see if we can implement a
> batched
>> call in fork to update the parent -I hope this will show just as good
>> performance even when most entries need modification and even better
>> performance over wtpt with a low number of entries modified.
> 
> With license to make more invasive changes to core Linux mm it certainly
> should be possible to optimize this specific case with a batched update
> fairly easily. You could even go further an implement a 'make all PTEs
> in pagetable RO' hypercall, possibly including a copy to the child. This
> could potentially work better than current 'late pin', at least the
> validation would be incremental rather than in one big hit at the end. 
> 
> Ian

OpenSolaris could easily use the "make all PTEs in pagetable RO" hypercall.
But we don't copy in bulk to the child, so if you go down that path
please make the copy to child part optional.

Joe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-26 21:38 Ian Pratt
@ 2006-07-27 14:43 ` Andrew Theurer
  2006-07-27 15:30   ` Keir Fraser
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Theurer @ 2006-07-27 14:43 UTC (permalink / raw)
  To: Ian Pratt; +Cc: xen-devel, Gerd Hoffmann

>> fork a quite linear from small number to large number of dirty pages.
>> Below are the min and max:
>>
>>          1280 pages    128000 pages
>> wtpt:     813 usec      37552 usec
>> emulate: 3279 usec     283879 usec
>>     
>
> Good, at least that suggests that the code works for the usage it was
> intended for. 
>
>   
>> So, in a -perfect-world- this works great.  Problem is most workloads
>> don't appear to have a vast percentage of entries that need to be
>> updated.   I'll go ahead and  expand this test to find out what the
>> threshold is to break even.  I'll also see if we can implement a
>>     
> batched
>   
>> call in fork to update the parent -I hope this will show just as good
>> performance even when most entries need modification and even better
>> performance over wtpt with a low number of entries modified.
>>     
>
> With license to make more invasive changes to core Linux mm it certainly
> should be possible to optimize this specific case with a batched update
> fairly easily. You could even go further an implement a 'make all PTEs
> in pagetable RO' hypercall, possibly including a copy to the child. This
> could potentially work better than current 'late pin', at least the
> validation would be incremental rather than in one big hit at the end. 
>
> Ian
>   
FWIW, I found the threshold for emulate vs wtpt.  I ran the fork test 
with a set number of pages dirtied such that we had x number of PTEs per 
pte_page.

writable-pt
-----------
#pte usec
002 5242
004 5251
006 5373
008 5519
010 5873

emulate
--------
#pte usec
002 4922
004 5265
006 6074
008 6991
010 7806
012 5988

So, the threshold appears to be around 4 PTEs/page.  I was a little 
shocked at first how low this number is, but considering the near 
identical performance with the various workloads, this make sense.  All 
of the workloads had the vast majority of writable pages flushed with 
just 2 PTEs/page changed and a handful with more PTEs/page changed.  It 
would not surprise me if the overall average was around 4 PTEs/page.

I am having a hard time finding any "enterprise" workloads which have a 
lot of PTEs/page right before fork.  If anyone can point me to some, 
that would be great.

I will look into batching next, but I am curious if simply using a 
hypercall in stead of write fault + emulate will make any difference at 
all.  I'll try that first, then implement the batched update. 

Eventually a hypercall which does more would be nice, but I guess we'll 
have to convince the Linux maintainers it's a good idea.

-Andrew

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-27 14:43 ` Andrew Theurer
@ 2006-07-27 15:30   ` Keir Fraser
  0 siblings, 0 replies; 28+ messages in thread
From: Keir Fraser @ 2006-07-27 15:30 UTC (permalink / raw)
  To: Andrew Theurer; +Cc: Ian Pratt, Gerd Hoffmann, xen-devel


On 27 Jul 2006, at 15:43, Andrew Theurer wrote:

> So, the threshold appears to be around 4 PTEs/page.  I was a little 
> shocked at first how low this number is, but considering the near 
> identical performance with the various workloads, this make sense.  
> All of the workloads had the vast majority of writable pages flushed 
> with just 2 PTEs/page changed and a handful with more PTEs/page 
> changed.  It would not surprise me if the overall average was around 4 
> PTEs/page.
>
> I am having a hard time finding any "enterprise" workloads which have 
> a lot of PTEs/page right before fork.  If anyone can point me to some, 
> that would be great.
>
> I will look into batching next, but I am curious if simply using a 
> hypercall in stead of write fault + emulate will make any difference 
> at all.  I'll try that first, then implement the batched update.
> Eventually a hypercall which does more would be nice, but I guess 
> we'll have to convince the Linux maintainers it's a good idea.

The obvious thing to do is emulate the first 4 updates to a particular 
page, and only then switch to batched mode. Slows down the batched path 
a bit, but stops it firing in many cases where it is no help.

  -- Keir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH] turn off writable page tables
@ 2006-07-27 17:31 Ian Pratt
  2006-07-28  8:55 ` Keir Fraser
  0 siblings, 1 reply; 28+ messages in thread
From: Ian Pratt @ 2006-07-27 17:31 UTC (permalink / raw)
  To: Keir Fraser, Andrew Theurer; +Cc: Ian Pratt, Gerd Hoffmann, xen-devel


> > I am having a hard time finding any "enterprise" workloads which
have
> > a lot of PTEs/page right before fork.  If anyone can point me to
some,
> > that would be great.
> >
> > I will look into batching next, but I am curious if simply using a
> > hypercall in stead of write fault + emulate will make any difference
> > at all.  I'll try that first, then implement the batched update.
> > Eventually a hypercall which does more would be nice, but I guess
> > we'll have to convince the Linux maintainers it's a good idea.
> 
> The obvious thing to do is emulate the first 4 updates to a particular
> page, and only then switch to batched mode. Slows down the batched
path
> a bit, but stops it firing in many cases where it is no help.

Why? There should be no overhead to just building batches on the stack
(or a per vcpu area) and flushing at the end of the page. Certainly if
we were to keep wrpt it would make sense to take a few emulations faults
first on a page before engaging wrpt, but for explicit batches we don't
need any smarts. 

[Although the batching strategy would (currently) work for Linux, we do
have to bare in mind that some OSes (possibly NetBSD) won't rely on a
lock to protect updates to pagetables and will use individual atomic
ops.]

Ian

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-27 17:31 Ian Pratt
@ 2006-07-28  8:55 ` Keir Fraser
  2006-07-28 15:21   ` Andrew Theurer
  0 siblings, 1 reply; 28+ messages in thread
From: Keir Fraser @ 2006-07-28  8:55 UTC (permalink / raw)
  To: Ian Pratt; +Cc: xen-devel, Gerd Hoffmann, Andrew Theurer


On 27 Jul 2006, at 18:31, Ian Pratt wrote:

>> The obvious thing to do is emulate the first 4 updates to a particular
>> page, and only then switch to batched mode. Slows down the batched
> path
>> a bit, but stops it firing in many cases where it is no help.
>
> Why? There should be no overhead to just building batches on the stack
> (or a per vcpu area) and flushing at the end of the page. Certainly if
> we were to keep wrpt it would make sense to take a few emulations 
> faults
> first on a page before engaging wrpt, but for explicit batches we don't
> need any smarts.
>
> [Although the batching strategy would (currently) work for Linux, we do
> have to bare in mind that some OSes (possibly NetBSD) won't rely on a
> lock to protect updates to pagetables and will use individual atomic
> ops.]

It wasn't clear to me there was a batching strategy that would 
integrate nicely with Linux generic mm code and be useful to any other 
OSes. We don't particularly want to accumulate OS-specific hacks unless 
it's a significant win (which we have no evidence it would be here).

  -- Keir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-28  8:55 ` Keir Fraser
@ 2006-07-28 15:21   ` Andrew Theurer
  0 siblings, 0 replies; 28+ messages in thread
From: Andrew Theurer @ 2006-07-28 15:21 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Ian Pratt, Gerd Hoffmann, xen-devel

Keir Fraser wrote:
>
> On 27 Jul 2006, at 18:31, Ian Pratt wrote:
>
>>> The obvious thing to do is emulate the first 4 updates to a particular
>>> page, and only then switch to batched mode. Slows down the batched
>> path
>>> a bit, but stops it firing in many cases where it is no help.
>>
>> Why? There should be no overhead to just building batches on the stack
>> (or a per vcpu area) and flushing at the end of the page. Certainly if
>> we were to keep wrpt it would make sense to take a few emulations faults
>> first on a page before engaging wrpt, but for explicit batches we don't
>> need any smarts.
>>
>> [Although the batching strategy would (currently) work for Linux, we do
>> have to bare in mind that some OSes (possibly NetBSD) won't rely on a
>> lock to protect updates to pagetables and will use individual atomic
>> ops.]
>
> It wasn't clear to me there was a batching strategy that would 
> integrate nicely with Linux generic mm code and be useful to any other 
> OSes. We don't particularly want to accumulate OS-specific hacks 
> unless it's a significant win (which we have no evidence it would be 
> here). 
I think there are only a couple of spots where batching is obvious (fork 
parent being one).  However, I don't think we'll see any significant 
improvement, as we don't see any right now on typical workloads with 
writable pages either.  And I think that's the point I want to make -we 
are not seeing an advantage for writable pages unless you have a 
workload with a lot of dirty PTE's/page and it forks a lot, which 
apparently does not seem to be that common (please, if anyone has one, 
let me know, I would like to test it) 

Now, if this were the only consequence, then I would not even bother 
trying to remove writable page tables.  However, the writable pages do 
not scale with SMP guests, partly because of the single page available 
(not counting the inactive page, since it's never used anymore), but 
also because the tlb flush of all active cpus in that guest post page 
detachment.  Keeping writable page tables will probably also make 
implementing a fine grain locking in the mm.c hypercall functions quite 
difficult.

One other point: For those OSes which use cmpxchg on PTEs, I believe 
keeping the emulate path will preserve the cmpxchg, so I don't think we 
need wtpt for that.  Alternatively, we could add a set_pte_cmpxchg call 
if needed.

So, in summary, we know writable page tables are not broken, they just 
don't help on typical workloads because the PTEs/page are so low.  
However, they do hurt SMP guest performance.  If we are not seeing a 
benefit today, should we turn it off?  Should we make it a compile time 
option, with the default off?

Thanks,

-Andrew

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH] turn off writable page tables
@ 2006-07-28 15:51 Ian Pratt
  2006-07-28 16:31 ` Keir Fraser
  0 siblings, 1 reply; 28+ messages in thread
From: Ian Pratt @ 2006-07-28 15:51 UTC (permalink / raw)
  To: Andrew Theurer, Keir Fraser; +Cc: Gerd Hoffmann, xen-devel

> So, in summary, we know writable page tables are not broken, they just
> don't help on typical workloads because the PTEs/page are so low.
> However, they do hurt SMP guest performance.  If we are not seeing a
> benefit today, should we turn it off?  Should we make it a compile
time
> option, with the default off?

I wouldn't mind seeing wrpt removed altogether, or at least emulation
made the compile time default for the moment. There's bound to be some
workload that bites us in the future which is why batching updates on
the fork path mightn't be a bad thing if it can be done without too much
gratuitous hacking of linux core code.

Ian

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-28 15:51 Ian Pratt
@ 2006-07-28 16:31 ` Keir Fraser
  2006-07-28 21:36   ` Zachary Amsden
  0 siblings, 1 reply; 28+ messages in thread
From: Keir Fraser @ 2006-07-28 16:31 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Gerd Hoffmann, xen-devel, Andrew Theurer

On 28 Jul 2006, at 16:51, Ian Pratt wrote:

>> So, in summary, we know writable page tables are not broken, they just
>> don't help on typical workloads because the PTEs/page are so low.
>> However, they do hurt SMP guest performance.  If we are not seeing a
>> benefit today, should we turn it off?  Should we make it a compile
> time
>> option, with the default off?
>
> I wouldn't mind seeing wrpt removed altogether, or at least emulation
> made the compile time default for the moment. There's bound to be some
> workload that bites us in the future which is why batching updates on
> the fork path mightn't be a bad thing if it can be done without too 
> much
> gratuitous hacking of linux core code.

My only fear is that batched wrpt has some guest-visible effects. For 
example, the guest has to be able to cope with seeing page directory 
entries with the present bit cleared. Also, on SMP, it has to be able 
to cope with spurious page faults anywhere in its address space (e.g., 
faults on a unhooked page table which some other VCPU has rehooked by 
the time the Xen pagefault handler runs, hence the fault is bounced 
back to the guest even though there is no work to be done). If we turn 
off batched wrpt then guests will not be tested against it and we are 
likely to hit problems if we ever want to turn it back on again -- 
we'll find that some guests are not able to correctly handle the weird 
side effects.

On the other hand, perhaps we can find a neater more explicit 
alternative to batched wrpt in future.

  -- Keir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-28 16:31 ` Keir Fraser
@ 2006-07-28 21:36   ` Zachary Amsden
  2006-07-28 23:05     ` Andi Kleen
  0 siblings, 1 reply; 28+ messages in thread
From: Zachary Amsden @ 2006-07-28 21:36 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Ian Pratt, Gerd Hoffmann, xen-devel, Andrew Theurer,
	Virtualization Mailing List

[-- Attachment #1: Type: text/plain, Size: 2982 bytes --]

Keir Fraser wrote:
>
> On 28 Jul 2006, at 16:51, Ian Pratt wrote:
>
>>> So, in summary, we know writable page tables are not broken, they just
>>> don't help on typical workloads because the PTEs/page are so low.
>>> However, they do hurt SMP guest performance.  If we are not seeing a
>>> benefit today, should we turn it off?  Should we make it a compile
>> time
>>> option, with the default off?
>>
>> I wouldn't mind seeing wrpt removed altogether, or at least emulation
>> made the compile time default for the moment. There's bound to be some
>> workload that bites us in the future which is why batching updates on
>> the fork path mightn't be a bad thing if it can be done without too much
>> gratuitous hacking of linux core code.
>
> My only fear is that batched wrpt has some guest-visible effects. For 
> example, the guest has to be able to cope with seeing page directory 
> entries with the present bit cleared. Also, on SMP, it has to be able 
> to cope with spurious page faults anywhere in its address space (e.g., 
> faults on a unhooked page table which some other VCPU has rehooked by 
> the time the Xen pagefault handler runs, hence the fault is bounced 
> back to the guest even though there is no work to be done). If we turn 
> off batched wrpt then guests will not be tested against it and we are 
> likely to hit problems if we ever want to turn it back on again -- 
> we'll find that some guests are not able to correctly handle the weird 
> side effects.
>
> On the other hand, perhaps we can find a neater more explicit 
> alternative to batched wrpt in future.

This is a very nice win for shadow page tables on SMP.  Basically, we 
use the lazy state information to defer all the MMU hypercalls into a 
single flush, which happens when leaving lazy MMU mode.

At the PT level, this can be done without gratuitous hacking of linux 
core code.  However, this can not be extended safely to also encompass 
the set of the parent page directory entry for SMP.  It is a little 
unclear exactly how this would work under a direct page table hypervisor 
- would you still take the faults, or would you re-type and reprotect 
the pages first?  In the fork case, there can be two page tables being 
updated because of COW, but re-typing both pages changes the crossover 
point for when the batching will be a win.  But if the same hooks can be 
used for direct mode, it makes sense to figure that out now so we don't 
have to add 4 different sets of hooks to Linux (UP / SMP want slightly 
different batching models, as might also shadow/direct).

The PDE p-bit going missing is still a problem, and Linux can be changed 
to deal with that - but it is messy.

One remaining issue for deferring direct page table updates is the read 
hazard potential.  I believe there is only one read hazard in the Linux 
mm code that has the potential to be exposed here - the explicit, rather 
than implicit batching makes it quite a bit easier to reason about that.

Zach

[-- Attachment #2: lazy-mmu-batching --]
[-- Type: text/plain, Size: 5883 bytes --]

Implement lazy MMU update hooks which are SMP safe for both direct and
shadow page tables.  The idea is that PTE updates and page invalidations
while in lazy mode can be batched into a single hypercall.  We use this
in VMI for shadow page table synchronization, and it is a win.

For SMP, the enter / leave must happen under protection of the page table
locks for page tables which are being modified.  This is because otherwise,
you end up with stale state in the batched hypercall, which other CPUs can
race ahead of.  Doing this under the protection of the locks guarantees
the synchronization is correct, and also means that spurious faults which
are generated during this window by remote CPUs are properly handled, as
the page fault handler must re-check the PTE under protection of the same
lock.

Signed-off-by: Zachary Amsden <zach@vmware.com>

Index: linux-2.6.18-rc2/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.18-rc2.orig/include/asm-generic/pgtable.h	2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/include/asm-generic/pgtable.h	2006-07-28 14:18:03.000000000 -0700
@@ -163,6 +163,11 @@ static inline void ptep_set_wrprotect(st
 #define move_pte(pte, prot, old_addr, new_addr)	(pte)
 #endif
 
+#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
+#define arch_enter_lazy_mmu_mode()	do {} while (0)
+#define arch_leave_lazy_mmu_mode()	do {} while (0)
+#endif
+
 /*
  * When walking page tables, get the address of the next boundary,
  * or the end address of the range if that comes earlier.  Although no
Index: linux-2.6.18-rc2/mm/memory.c
===================================================================
--- linux-2.6.18-rc2.orig/mm/memory.c	2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/memory.c	2006-07-28 14:18:44.000000000 -0700
@@ -506,6 +506,7 @@ again:
 	src_ptl = pte_lockptr(src_mm, src_pmd);
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 
+	arch_enter_lazy_mmu_mode();
 	do {
 		/*
 		 * We are holding two locks at this point - either of them
@@ -525,6 +526,7 @@ again:
 		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+	arch_leave_lazy_mmu_mode();
 
 	spin_unlock(src_ptl);
 	pte_unmap_nested(src_pte - 1);
@@ -627,6 +629,7 @@ static unsigned long zap_pte_range(struc
 	int anon_rss = 0;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
 	do {
 		pte_t ptent = *pte;
 		if (pte_none(ptent)) {
@@ -692,6 +695,7 @@ static unsigned long zap_pte_range(struc
 		pte_clear_full(mm, addr, pte, tlb->fullmm);
 	} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));
 
+	arch_leave_lazy_mmu_mode();
 	add_mm_rss(mm, file_rss, anon_rss);
 	pte_unmap_unlock(pte - 1, ptl);
 
@@ -1108,6 +1112,7 @@ static int zeromap_pte_range(struct mm_s
 	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
 		return -ENOMEM;
+	arch_enter_lazy_mmu_mode();
 	do {
 		struct page *page = ZERO_PAGE(addr);
 		pte_t zero_pte = pte_wrprotect(mk_pte(page, prot));
@@ -1117,6 +1122,7 @@ static int zeromap_pte_range(struct mm_s
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, addr, pte, zero_pte);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 	return 0;
 }
@@ -1269,11 +1275,13 @@ static int remap_pte_range(struct mm_str
 	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
 		return -ENOMEM;
+	arch_enter_lazy_mmu_mode();
 	do {
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, addr, pte, pfn_pte(pfn, prot));
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 	return 0;
 }
Index: linux-2.6.18-rc2/mm/mprotect.c
===================================================================
--- linux-2.6.18-rc2.orig/mm/mprotect.c	2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/mprotect.c	2006-07-28 14:17:25.000000000 -0700
@@ -33,6 +33,7 @@ static void change_pte_range(struct mm_s
 	spinlock_t *ptl;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
 	do {
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
@@ -62,6 +63,7 @@ static void change_pte_range(struct mm_s
 		}
 
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 }
 
Index: linux-2.6.18-rc2/mm/mremap.c
===================================================================
--- linux-2.6.18-rc2.orig/mm/mremap.c	2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/mremap.c	2006-07-28 14:17:25.000000000 -0700
@@ -99,6 +99,7 @@ static void move_ptes(struct vm_area_str
 	if (new_ptl != old_ptl)
 		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
 
+	arch_enter_lazy_mmu_mode();
 	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
 				   new_pte++, new_addr += PAGE_SIZE) {
 		if (pte_none(*old_pte))
@@ -108,6 +109,7 @@ static void move_ptes(struct vm_area_str
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		set_pte_at(mm, new_addr, new_pte, pte);
 	}
+	arch_leave_lazy_mmu_mode();
 
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
Index: linux-2.6.18-rc2/mm/msync.c
===================================================================
--- linux-2.6.18-rc2.orig/mm/msync.c	2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/msync.c	2006-07-28 14:17:25.000000000 -0700
@@ -30,6 +30,7 @@ static unsigned long msync_pte_range(str
 
 again:
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
 	do {
 		struct page *page;
 
@@ -51,6 +52,7 @@ again:
 			ret += set_page_dirty(page);
 		progress += 3;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 	cond_resched();
 	if (addr != end)

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-28 21:36   ` Zachary Amsden
@ 2006-07-28 23:05     ` Andi Kleen
  2006-07-28 23:10       ` Zachary Amsden
  0 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2006-07-28 23:05 UTC (permalink / raw)
  To: virtualization; +Cc: Zachary Amsden, xen-devel, Andrew Theurer

On Friday 28 July 2006 23:36, Zachary Amsden wrote:
> For SMP, the enter / leave must happen under protection of the page table
> locks for page tables which are being modified.

This should be in a comment in the code

-andi

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-28 23:05     ` Andi Kleen
@ 2006-07-28 23:10       ` Zachary Amsden
  2006-07-31  9:14         ` Keir Fraser
  0 siblings, 1 reply; 28+ messages in thread
From: Zachary Amsden @ 2006-07-28 23:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, xen-devel, Andrew Theurer

Andi Kleen wrote:
> On Friday 28 July 2006 23:36, Zachary Amsden wrote:
>   
>> For SMP, the enter / leave must happen under protection of the page table
>> locks for page tables which are being modified.
>>     
>
> This should be in a comment in the code
>   

Most definitely agreed!  I think it deserves an entire paragraph 
explaining how to use it properly.  But I wanted to get input from the 
Xen folks to see if this can be useful for them in some way first.

Zach

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-28 23:10       ` Zachary Amsden
@ 2006-07-31  9:14         ` Keir Fraser
  2006-07-31  9:32           ` Zachary Amsden
  0 siblings, 1 reply; 28+ messages in thread
From: Keir Fraser @ 2006-07-31  9:14 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: virtualization, Andrew Theurer, xen-devel, Andi Kleen

On 29 Jul 2006, at 00:10, Zachary Amsden wrote:

>>> For SMP, the enter / leave must happen under protection of the page 
>>> table
>>> locks for page tables which are being modified.
>>>
>>
>> This should be in a comment in the code
>>
>
> Most definitely agreed!  I think it deserves an entire paragraph 
> explaining how to use it properly.  But I wanted to get input from the 
> Xen folks to see if this can be useful for them in some way first.

It would allow set_pte() to switch between explicit queuing and 
'direct' writing. We moved away from the former a few years back as 
doing it everywhere made a mess of the generic Linux mm code and it was 
hard to reason whether our patches were correct. I guess doing it for 
the most important subset of mm routines is not so bad. It's a shame 
that, although many set_pte() call sites could determine statically 
whether or not they will batch, we'd end up with a dynamic run-time 
test everywhere (unless I'm mistaken) -- I wonder if that has a 
measurable cost?

  -- Keir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-31  9:14         ` Keir Fraser
@ 2006-07-31  9:32           ` Zachary Amsden
  2006-07-31  9:53             ` Keir Fraser
  0 siblings, 1 reply; 28+ messages in thread
From: Zachary Amsden @ 2006-07-31  9:32 UTC (permalink / raw)
  To: Keir Fraser; +Cc: virtualization, Andrew Theurer, xen-devel, Andi Kleen

Keir Fraser wrote:
> It would allow set_pte() to switch between explicit queuing and 
> 'direct' writing. We moved away from the former a few years back as 
> doing it everywhere made a mess of the generic Linux mm code and it 
> was hard to reason whether our patches were correct. I guess doing it 
> for the most important subset of mm routines is not so bad. It's a 
> shame that, although many set_pte() call sites could determine 
> statically whether or not they will batch, we'd end up with a dynamic 
> run-time test everywhere (unless I'm mistaken) -- I wonder if that has 
> a measurable cost?
>

We've actually seen a benefit for this, despite the cost of the 
non-static parameters, for both VMI Linux with shadow pagetables on ESX 
and VMI Linux with direct pagetables on Xen.  Turns out that as long as 
the call EIP is predictable, the parameters do not necessarily need to 
be so, and modern processors are getting much better at branch prediction.

Doing explicit batching exactly where it counts, under protection of 
locks, so that SMP safety is guaranteed turns out to be really easy, as 
well as a nice win.

Zach

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-31  9:32           ` Zachary Amsden
@ 2006-07-31  9:53             ` Keir Fraser
  2006-07-31 19:56               ` Zachary Amsden
  0 siblings, 1 reply; 28+ messages in thread
From: Keir Fraser @ 2006-07-31  9:53 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: virtualization, Andrew Theurer, xen-devel, Andi Kleen


On 31 Jul 2006, at 10:32, Zachary Amsden wrote:

>> It would allow set_pte() to switch between explicit queuing and 
>> 'direct' writing. We moved away from the former a few years back as 
>> doing it everywhere made a mess of the generic Linux mm code and it 
>> was hard to reason whether our patches were correct. I guess doing it 
>> for the most important subset of mm routines is not so bad. It's a 
>> shame that, although many set_pte() call sites could determine 
>> statically whether or not they will batch, we'd end up with a dynamic 
>> run-time test everywhere (unless I'm mistaken) -- I wonder if that 
>> has a measurable cost?
>>
>
> We've actually seen a benefit for this, despite the cost of the 
> non-static parameters, for both VMI Linux with shadow pagetables on 
> ESX and VMI Linux with direct pagetables on Xen.  Turns out that as 
> long as the call EIP is predictable, the parameters do not necessarily 
> need to be so, and modern processors are getting much better at branch 
> prediction.

You mean that the benefit of batching outweighs the cost of an extra 
test-and-branch in the middle of a loop, or that the extra 
test-and-branch simply has unmeasurable overhead? The former is to be 
expected, but I'd be worried about other call sites where batching does 
not happen, and an effect on those.

> Doing explicit batching exactly where it counts, under protection of 
> locks, so that SMP safety is guaranteed turns out to be really easy, 
> as well as a nice win.

If the run-time check cost really isn't an issue (I'd like to see 
numbers), we'd likely use this new interface in preference to 
implicitly batched writable pagetables and would support its inclusion 
in the kernel.

  -- Keir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-31  9:53             ` Keir Fraser
@ 2006-07-31 19:56               ` Zachary Amsden
  2006-07-31 22:07                 ` Keir Fraser
  0 siblings, 1 reply; 28+ messages in thread
From: Zachary Amsden @ 2006-07-31 19:56 UTC (permalink / raw)
  To: Keir Fraser; +Cc: virtualization, Andrew Theurer, xen-devel, Andi Kleen

Keir Fraser wrote:
>
> On 31 Jul 2006, at 10:32, Zachary Amsden wrote:
>
>>> It would allow set_pte() to switch between explicit queuing and 
>>> 'direct' writing. We moved away from the former a few years back as 
>>> doing it everywhere made a mess of the generic Linux mm code and it 
>>> was hard to reason whether our patches were correct. I guess doing 
>>> it for the most important subset of mm routines is not so bad. It's 
>>> a shame that, although many set_pte() call sites could determine 
>>> statically whether or not they will batch, we'd end up with a 
>>> dynamic run-time test everywhere (unless I'm mistaken) -- I wonder 
>>> if that has a measurable cost?
>>>
>>
>> We've actually seen a benefit for this, despite the cost of the 
>> non-static parameters, for both VMI Linux with shadow pagetables on 
>> ESX and VMI Linux with direct pagetables on Xen.  Turns out that as 
>> long as the call EIP is predictable, the parameters do not 
>> necessarily need to be so, and modern processors are getting much 
>> better at branch prediction.
>
> You mean that the benefit of batching outweighs the cost of an extra 
> test-and-branch in the middle of a loop, or that the extra 
> test-and-branch simply has unmeasurable overhead? The former is to be 
> expected, but I'd be worried about other call sites where batching 
> does not happen, and an effect on those.

The extra test-and-branch has unmeasurable overhead.  In the 
implementation we had chosen, there was already a branch requirement on 
the set_pte call anyway, to potentially delay the pte update so that it 
can piggyback onto a page invalidation with just one hypercall.  
Combining the two branches into one is trivial, and the cost of one 
extra branch here seems to be invisible.  We were getting better numbers 
for MMU related workloads with VMI-Linux than XenoLinux was.  I don't 
have hard numbers on this, and even if I did, it would take some time to 
get them approved for public distribution.  For that I must apologize.  
But avoiding the changes that would otherwise be required - a full set 
of pte and tlb functions which could be delayed, as well as combining 
the pte update and invlpg into a single call - seemed worth a single 
branch.  I'm not even convinced these changes can be done in a way that 
would be safe for all architectures.  Of course, I may be wrong on that 
point - but there is no simple way I see to do it that affords the 
strong reasoning about correctness that the enter / leave semantic does.

>
>> Doing explicit batching exactly where it counts, under protection of 
>> locks, so that SMP safety is guaranteed turns out to be really easy, 
>> as well as a nice win.
>
> If the run-time check cost really isn't an issue (I'd like to see 
> numbers), we'd likely use this new interface in preference to 
> implicitly batched writable pagetables and would support its inclusion 
> in the kernel.

Sorry about not having numbers.  My biggest question is - do you need 
any other information than simply a single state variable to use 
explicit batching?  I thought, and Jeremy and Chris both pointed out as 
well, that Xen could potentially use the information about which PT to 
unhook to take advantage of writable pagetables.  But, if that is not 
the direction you are going, then it seems this information is not so 
relevant for the explicit batching.  The explicit batching does have one 
disadvantage without writable page tables, which is a potential long 
term maintenance / correctness issue - you must remove read hazards from 
these encapsulated paths.  That is not so hard to do, and not a large 
general problem, because the batching is explicit rather than implicit, 
so you can pick paths to batch that are small, compact, and easy to 
reason about.  But nevertheless, a point I would like to make sure you 
are comfortable with before we all decide these hooks will work for 
everyone.

Zach

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-31 19:56               ` Zachary Amsden
@ 2006-07-31 22:07                 ` Keir Fraser
  2006-07-31 22:40                   ` Zachary Amsden
  0 siblings, 1 reply; 28+ messages in thread
From: Keir Fraser @ 2006-07-31 22:07 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: virtualization, Andrew Theurer, xen-devel, Andi Kleen


On 31 Jul 2006, at 20:56, Zachary Amsden wrote:

>> If the run-time check cost really isn't an issue (I'd like to see 
>> numbers), we'd likely use this new interface in preference to 
>> implicitly batched writable pagetables and would support its 
>> inclusion in the kernel.
>
> Sorry about not having numbers.  My biggest question is - do you need 
> any other information than simply a single state variable to use 
> explicit batching?  I thought, and Jeremy and Chris both pointed out 
> as well, that Xen could potentially use the information about which PT 
> to unhook to take advantage of writable pagetables.  But, if that is 
> not the direction you are going, then it seems this information is not 
> so relevant for the explicit batching.

If the guest gives explicit hints, and the extra branch on set_pte does 
not hurt, then I think it makes sense to do straightforward explicit 
batching. Providing a PT page hint sounds like it could be ambiguous in 
some contexts too (e.g., the fork loop modifies two PT pages at a 
time).

>   The explicit batching does have one disadvantage without writable 
> page tables, which is a potential long term maintenance / correctness 
> issue - you must remove read hazards from these encapsulated paths.  
> That is not so hard to do, and not a large general problem, because 
> the batching is explicit rather than implicit, so you can pick paths 
> to batch that are small, compact, and easy to reason about.  But 
> nevertheless, a point I would like to make sure you are comfortable 
> with before we all decide these hooks will work for everyone.

Yes, that's why we moved away from this approach before. But previously 
we did it for *all* pagetable updates, which was a pain. Doing it just 
for a few important cases, and having the hooks maintained in upstream 
Linux, makes this rather less of a headache.

  -- Keir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-31 22:07                 ` Keir Fraser
@ 2006-07-31 22:40                   ` Zachary Amsden
  2006-08-02  9:21                     ` Keir Fraser
  0 siblings, 1 reply; 28+ messages in thread
From: Zachary Amsden @ 2006-07-31 22:40 UTC (permalink / raw)
  To: Keir Fraser; +Cc: virtualization, Andrew Theurer, xen-devel, Andi Kleen

Keir Fraser wrote:
> If the guest gives explicit hints, and the extra branch on set_pte 
> does not hurt, then I think it makes sense to do straightforward 
> explicit batching. Providing a PT page hint sounds like it could be 
> ambiguous in some contexts too (e.g., the fork loop modifies two PT 
> pages at a time).

Yes, that is why I shyed away from passing PT hints - then you need to 
deal with the double hint case, and I think unhooking and revalidating 
two page tables probably does not make sense based on the UP writeable 
page table numbers.

>
>>   The explicit batching does have one disadvantage without writable 
>> page tables, which is a potential long term maintenance / correctness 
>> issue - you must remove read hazards from these encapsulated paths.  
>> That is not so hard to do, and not a large general problem, because 
>> the batching is explicit rather than implicit, so you can pick paths 
>> to batch that are small, compact, and easy to reason about.  But 
>> nevertheless, a point I would like to make sure you are comfortable 
>> with before we all decide these hooks will work for everyone.
>
> Yes, that's why we moved away from this approach before. But 
> previously we did it for *all* pagetable updates, which was a pain. 
> Doing it just for a few important cases, and having the hooks 
> maintained in upstream Linux, makes this rather less of a headache.

Cool.  It sounds like the lazy mode hooks are exactly what you want then?

Cheers,

Zach

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] turn off writable page tables
  2006-07-31 22:40                   ` Zachary Amsden
@ 2006-08-02  9:21                     ` Keir Fraser
  0 siblings, 0 replies; 28+ messages in thread
From: Keir Fraser @ 2006-08-02  9:21 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: virtualization, Andrew Theurer, xen-devel, Andi Kleen


On 31 Jul 2006, at 23:40, Zachary Amsden wrote:

>> Yes, that's why we moved away from this approach before. But 
>> previously we did it for *all* pagetable updates, which was a pain. 
>> Doing it just for a few important cases, and having the hooks 
>> maintained in upstream Linux, makes this rather less of a headache.
>
> Cool.  It sounds like the lazy mode hooks are exactly what you want 
> then?

I wonder how it will interact with our late-pin/early-unpin model where 
we can directly write to pagetables before they are first used and also 
while they are being finally destroyed. It may be that if we use 
explicit batching and turn off that pinning logic we will not affect 
performance much, but we'll need to do some performance analysis.

  -- Keir

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2006-08-02  9:21 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-07-25 22:14 [PATCH] turn off writable page tables Andrew Theurer
2006-07-25 22:43 ` Nivedita Singhvi
2006-07-25 23:19   ` Andrew Theurer
  -- strict thread matches above, loose matches on Subject: below --
2006-07-25 22:41 Ian Pratt
2006-07-26  2:25 ` Andrew Theurer
2006-07-26  5:31   ` Jacob Gorm Hansen
2006-07-26  8:18 ` Gerd Hoffmann
2006-07-26  8:40   ` Keir Fraser
2006-07-26 21:10     ` Andrew Theurer
2006-07-26 21:38 Ian Pratt
2006-07-27 14:43 ` Andrew Theurer
2006-07-27 15:30   ` Keir Fraser
     [not found] <E1G5sBV-0005eg-At@host-192-168-0-1-bcn-london>
2006-07-26 23:38 ` Joe Bonasera
2006-07-27 17:31 Ian Pratt
2006-07-28  8:55 ` Keir Fraser
2006-07-28 15:21   ` Andrew Theurer
2006-07-28 15:51 Ian Pratt
2006-07-28 16:31 ` Keir Fraser
2006-07-28 21:36   ` Zachary Amsden
2006-07-28 23:05     ` Andi Kleen
2006-07-28 23:10       ` Zachary Amsden
2006-07-31  9:14         ` Keir Fraser
2006-07-31  9:32           ` Zachary Amsden
2006-07-31  9:53             ` Keir Fraser
2006-07-31 19:56               ` Zachary Amsden
2006-07-31 22:07                 ` Keir Fraser
2006-07-31 22:40                   ` Zachary Amsden
2006-08-02  9:21                     ` Keir Fraser

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.