* RE: [PATCH] turn off writable page tables
@ 2006-07-25 22:41 Ian Pratt
  2006-07-26  2:25 ` Andrew Theurer
  2006-07-26  8:18 ` Gerd Hoffmann
  0 siblings, 2 replies; 28+ messages in thread
From: Ian Pratt @ 2006-07-25 22:41 UTC (permalink / raw)
  To: Andrew Theurer, xen-devel

> on Xeon MP processor, uniprocessor dom0 kernel, pae=y:
> 
> benchmark                c/s 10729 force_emulate
> ------------------------ --------- -------------
> lmbench fork+exit:       469.5833  470.3913   usec, lower is better
> lmbench fork+execve:     1241.0000 1225.7778  usec, lower is better
> lmbench fork+/sbin/bash: 12190.000 12119.000  usec, lower is better

It's kinda weird that these scores are so close. I guess it's just
coincidence: we must be getting something like an average of 10-20
PTEs updated per pagetable page, and at that rate the cost of doing
multiple emulations happens to balance the cost of unhooking/rehooking.
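A rough cost model makes that balance point explicit. The notation here is an assumption for illustration, not something from the thread: emulating one PTE write costs $c_e$, a direct write under writable pagetables costs $c_w$, and unhooking plus rehooking one pagetable page costs a fixed $C_u + C_r$. For $n$ updates to a single pagetable page:

```latex
\begin{aligned}
T_{\text{emulate}}(n) &= n\,c_e, \\
T_{\text{wrpt}}(n)    &= C_u + C_r + n\,c_w, \\
T_{\text{emulate}}(n) = T_{\text{wrpt}}(n) \;&\Longrightarrow\; n^{*} = \frac{C_u + C_r}{c_e - c_w}.
\end{aligned}
```

Below $n^{*}$ emulation is cheaper, above it wrpt is cheaper; near-identical scores fit a workload whose average $n$ sits close to $n^{*}$.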

I would like to make sure we fully understand what's going on, though.

In particular, I'd like to rule out any 'dumb stuff': writeable
pagetables being used erroneously where we don't expect them (hence
crippling the scores). I'd also like to confirm they're actually
functioning as intended, i.e. that we get one fault to unhook, and then
a fault causing a rehook once we move to the next page in the fork.
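To make the expected fault sequence concrete, here is a toy model in C. It is a sketch under simplifying assumptions: the counters and both model functions are hypothetical, not the actual ptwr code in xen/arch/x86/mm.c, and each element of `pt_page` simply names the pagetable page that one PTE write lands in.

```c
#include <stddef.h>

/* Hypothetical counters for a simplified model, not Xen's real ptwr code. */
struct pt_stats {
    unsigned long faults;   /* traps into the hypervisor */
    unsigned long unhooks;  /* pagetable page made writable and unhooked */
    unsigned long rehooks;  /* pagetable page revalidated and rehooked */
    unsigned long emulated; /* single PTE writes emulated */
};

/* Emulation: every PTE write to a read-only pagetable page traps. */
struct pt_stats model_emulate(const unsigned long *pt_page, size_t nwrites)
{
    struct pt_stats s = {0, 0, 0, 0};
    (void)pt_page;          /* which page is written does not matter here */
    s.faults = nwrites;
    s.emulated = nwrites;
    return s;
}

/* Writable pagetables (simplified): the first write to a pagetable page
 * faults and unhooks it; further writes to that page are direct; a write
 * to a different page faults again and rehooks the previous page first. */
struct pt_stats model_wrpt(const unsigned long *pt_page, size_t nwrites)
{
    struct pt_stats s = {0, 0, 0, 0};
    int have_cur = 0;
    unsigned long cur_page = 0;

    for (size_t i = 0; i < nwrites; i++) {
        if (have_cur && pt_page[i] == cur_page)
            continue;               /* direct write, no trap */
        s.faults++;
        if (have_cur)
            s.rehooks++;            /* flush the previously unhooked page */
        s.unhooks++;
        cur_page = pt_page[i];
        have_cur = 1;
    }
    if (have_cur)
        s.rehooks++;                /* final flush, e.g. at the end of fork */
    return s;
}
```

If trace output showed many more wrpt faults than distinct pagetable pages touched, that would point at wrpt engaging where it isn't expected.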
   
If you write a little test program that dirties a large chunk of memory
just before the fork, we should see writeable pagetables winning easily.
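A minimal sketch of such a test program, assuming a POSIX userspace in the guest. The function name is hypothetical, and unlike the program described later in the thread it does not touch Xen's perf counters; it only dirties memory and times one fork() from the parent's point of view.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Dirty `npages` pages of private memory, then time a single fork() as
 * observed by the parent. Returns elapsed microseconds, or -1.0 on error. */
double time_fork_after_dirtying(size_t npages)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    if (pagesz <= 0)
        return -1.0;

    void *mem = NULL;
    size_t len = npages * (size_t)pagesz;
    if (len == 0 || posix_memalign(&mem, (size_t)pagesz, len) != 0)
        return -1.0;

    /* Write one byte per page so each page has a present, writable PTE
     * that fork() must write-protect in the parent for copy-on-write. */
    volatile unsigned char *buf = mem;
    for (size_t i = 0; i < npages; i++)
        buf[i * (size_t)pagesz] = (unsigned char)i;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pid_t pid = fork();
    if (pid == 0)
        _exit(0);                           /* child: exit immediately */
    clock_gettime(CLOCK_MONOTONIC, &t1);    /* parent: fork() returned */

    if (pid < 0) {
        free(mem);
        return -1.0;
    }
    waitpid(pid, NULL, 0);
    free(mem);

    return (double)(t1.tv_sec - t0.tv_sec) * 1e6 +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e3;
}
```

Running it for a small and a large page count on each hypervisor build, and taking the minimum of several runs, should give a fair comparison of the two paths.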

It would also be good to use some of the trace buffer stuff to find out
exactly what the sequence of faults and flushes is.

I have no problem with enabling force emulation, I'd just like to fully
understand the tradeoff. I suspect the answer is that typically only a
handful of PTEs are dirty, and hence there are relatively few updates to
the parent process's page tables. It's worth understanding this as it
also has implications for shadow pagetables.


Thanks,
Ian

> dbench 3.03              186.354   191.278    MB/sec
> reaim_aim9               1890.01   2055.97    jobs/min
> reaim_compute            2538.75   2522.90    jobs/min
> reaim_dbase              3852.14   3739.38    jobs/min
> reaim_fserver            4437.93   4389.71    jobs/min
> reaim_shared             2365.85   2362.97    jobs/min
> SPEC SDET                4315.91   4312.02    scripts/hr
> 
> These are all within the noise level (some slightly better, some
> slightly worse for emulate).  There really isn't much of a difference
> here.  I'd like to propose turning on the emulate path all the time in
> Xen.

* RE: [PATCH] turn off writable page tables
@ 2006-07-28 15:51 Ian Pratt
  2006-07-28 16:31 ` Keir Fraser
  0 siblings, 1 reply; 28+ messages in thread
From: Ian Pratt @ 2006-07-28 15:51 UTC (permalink / raw)
  To: Andrew Theurer, Keir Fraser; +Cc: Gerd Hoffmann, xen-devel

> So, in summary, we know writable page tables are not broken, they just
> don't help on typical workloads because the PTEs/page are so low.
> However, they do hurt SMP guest performance.  If we are not seeing a
> benefit today, should we turn it off?  Should we make it a compile-time
> option, with the default off?

I wouldn't mind seeing wrpt removed altogether, or at least emulation
made the compile-time default for the moment. There's bound to be some
workload that bites us in the future, which is why batching updates on
the fork path mightn't be a bad thing if it can be done without too much
gratuitous hacking of Linux core code.
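Compared with the `#if 0`/`#if 1` toggle in the force-emulate patch, a compile-time option with emulation as the default might look like the sketch below. `PTWR_ENABLE_WRITABLE`, `ptwr_choose_path` and the enum are hypothetical names, not existing Xen identifiers, and `wrpt_eligible` stands in for whatever checks the real fault handler performs.

```c
#include <stdbool.h>

/* Hypothetical build knob: 0 (default) always emulates PTE writes,
 * 1 enables the writable-pagetable path. Not an existing Xen option. */
#ifndef PTWR_ENABLE_WRITABLE
#define PTWR_ENABLE_WRITABLE 0
#endif

enum ptwr_path { PTWR_PATH_EMULATE, PTWR_PATH_WRITABLE };

/* Decide how to handle a write fault on a read-only L1 pagetable page. */
enum ptwr_path ptwr_choose_path(bool wrpt_eligible)
{
#if PTWR_ENABLE_WRITABLE
    if (wrpt_eligible)
        return PTWR_PATH_WRITABLE;   /* unhook the page and let writes through */
#else
    (void)wrpt_eligible;             /* writable path compiled out */
#endif
    return PTWR_PATH_EMULATE;        /* emulate the single faulting write */
}
```

Building with `-DPTWR_ENABLE_WRITABLE=1` would bring the writable path back without editing the source.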

Ian

* RE: [PATCH] turn off writable page tables
@ 2006-07-27 17:31 Ian Pratt
  2006-07-28  8:55 ` Keir Fraser
  0 siblings, 1 reply; 28+ messages in thread
From: Ian Pratt @ 2006-07-27 17:31 UTC (permalink / raw)
  To: Keir Fraser, Andrew Theurer; +Cc: Ian Pratt, Gerd Hoffmann, xen-devel


> > I am having a hard time finding any "enterprise" workloads which have
> > a lot of PTEs/page right before fork.  If anyone can point me to some,
> > that would be great.
> >
> > I will look into batching next, but I am curious if simply using a
> > hypercall instead of write fault + emulate will make any difference
> > at all.  I'll try that first, then implement the batched update.
> > Eventually a hypercall which does more would be nice, but I guess
> > we'll have to convince the Linux maintainers it's a good idea.
> 
> The obvious thing to do is emulate the first 4 updates to a particular
> page, and only then switch to batched mode. Slows down the batched path
> a bit, but stops it firing in many cases where it is no help.

Why? There should be no overhead to just building batches on the stack
(or a per-vcpu area) and flushing at the end of the page. Certainly if
we were to keep wrpt it would make sense to take a few emulation faults
first on a page before engaging wrpt, but for explicit batches we don't
need any smarts.
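A sketch of that kind of explicit batch, under stated assumptions: 4 KiB pagetable pages holding 8-byte PTEs, and hypothetical names (`struct pte_batch`, `batch_queue`, `batch_flush`). The (address, value) pairs mirror the shape of Xen's `mmu_update` requests, but the flush here only counts instead of issuing a real hypercall.

```c
#include <stddef.h>
#include <stdint.h>

/* One queued PTE update: machine address of the PTE and its new value. */
struct pte_update {
    uint64_t ptr;
    uint64_t val;
};

#define PTES_PER_PAGE 512    /* 4 KiB page of 8-byte PTEs (assumption) */

struct pte_batch {
    struct pte_update req[PTES_PER_PAGE];
    unsigned int count;
    uint64_t cur_page;       /* pagetable page the queued updates belong to */
    unsigned long flushes;   /* number of batched "hypercalls" issued */
};

/* Hypothetical stand-in for one batched hypercall; a real guest would
 * hand b->req and b->count to the hypervisor here. */
static void batch_flush(struct pte_batch *b)
{
    if (b->count == 0)
        return;
    b->flushes++;
    b->count = 0;
}

/* Queue an update. Flush first if it targets a different pagetable page
 * (or the buffer is full), so each page costs one hypercall, no heuristics. */
void batch_queue(struct pte_batch *b, uint64_t pte_maddr, uint64_t new_val)
{
    uint64_t page = pte_maddr >> 12;             /* 4 KiB pages assumed */
    if (b->count > 0 && (page != b->cur_page || b->count == PTES_PER_PAGE))
        batch_flush(b);
    b->cur_page = page;
    b->req[b->count].ptr = pte_maddr;
    b->req[b->count].val = new_val;
    b->count++;
}

/* Call at the end of the fork copy loop to push out the last page. */
void batch_finish(struct pte_batch *b)
{
    batch_flush(b);
}
```

Because the buffer is flushed whenever the copy loop moves to the next pagetable page, a page with one dirty PTE and a page with hundreds both cost exactly one hypercall.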

[Although the batching strategy would (currently) work for Linux, we do
have to bear in mind that some OSes (possibly NetBSD) won't rely on a
lock to protect updates to pagetables and will use individual atomic
ops.]
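For contrast, an OS that updates PTEs with individual atomic operations instead of under a lock expects each update to take effect immediately, because another CPU may compare-and-swap the same entry concurrently. A minimal C11 sketch of such a lock-free write-protect; the function name is hypothetical, and `PTE_RW` is the x86 read/write bit (bit 1).

```c
#include <stdatomic.h>
#include <stdint.h>

#define PTE_RW 0x2ULL   /* x86 read/write bit */

/* Clear the R/W bit of one PTE without holding a pagetable lock, retrying
 * if another CPU changed the entry underneath us. Each successful CAS is an
 * individual, immediately visible update, so it cannot simply be parked in
 * a per-page batch. Returns the value installed. */
uint64_t pte_wrprotect_atomic(_Atomic uint64_t *pte)
{
    uint64_t old = atomic_load(pte);
    uint64_t new_val;
    do {
        new_val = old & ~PTE_RW;    /* recomputed from the latest value */
    } while (!atomic_compare_exchange_weak(pte, &old, new_val));
    return new_val;
}
```

On failure `atomic_compare_exchange_weak` reloads `old` with the current contents, so the loop always works from the entry's latest value.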

Ian

[parent not found: <E1G5sBV-0005eg-At@host-192-168-0-1-bcn-london>]
* RE: [PATCH] turn off writable page tables
@ 2006-07-26 21:38 Ian Pratt
  2006-07-27 14:43 ` Andrew Theurer
  0 siblings, 1 reply; 28+ messages in thread
From: Ian Pratt @ 2006-07-26 21:38 UTC (permalink / raw)
  To: Andrew Theurer, Keir Fraser; +Cc: Ian Pratt, Gerd Hoffmann, xen-devel

> And it does make a difference in this case.  I now have a test program
> which dirties a number of virtually contiguous pages then forks (it also
> resets xen perf counters before fork and collects perf counters right
> after fork), then records the elapsed time for the fork.  The difference
> is quite amazing in this case.  For both writable and emulate, I ran
> with a range of dirty pages, from 1280 to 128000.  The elapsed times for
> fork are quite linear from small number to large number of dirty pages.
> Below are the min and max:
> 
>          1280 pages    128000 pages
> wtpt:     813 usec      37552 usec
> emulate: 3279 usec     283879 usec

Good, at least that suggests that the code works for the usage it was
intended for. 
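Taking the observation that fork time grows roughly linearly with the number of dirty pages, the two endpoints in the table above give a per-dirty-page cost for each mode. This is only a two-point estimate, so treat it as approximate:

```latex
\begin{aligned}
s_{\text{emulate}} &= \frac{283879 - 3279}{128000 - 1280}\,\mu\text{s}
                    = \frac{280600}{126720}\,\mu\text{s} \approx 2.21\,\mu\text{s per page},\\
s_{\text{wrpt}}    &= \frac{37552 - 813}{128000 - 1280}\,\mu\text{s}
                    = \frac{36739}{126720}\,\mu\text{s} \approx 0.29\,\mu\text{s per page}.
\end{aligned}
```

So when many PTEs per pagetable page need write-protecting, wrpt comes out roughly 7.6 times cheaper per dirty page in this test.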

> So, in a -perfect-world- this works great.  The problem is that most
> workloads don't appear to have a vast percentage of entries that need
> to be updated.  I'll go ahead and expand this test to find out what the
> threshold is to break even.  I'll also see if we can implement a
> batched call in fork to update the parent.  I hope this will show just
> as good performance even when most entries need modification, and even
> better performance than wtpt when only a few entries are modified.

With license to make more invasive changes to core Linux mm it certainly
should be possible to optimize this specific case with a batched update
fairly easily. You could even go further and implement a 'make all PTEs
in pagetable RO' hypercall, possibly including a copy to the child. This
could potentially work better than the current 'late pin', since at least
the validation would be incremental rather than in one big hit at the end.
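Such an operation would let the hypervisor write-protect a whole pagetable page in one call, optionally filling in the child's copy, instead of taking a trap per PTE. Below is a sketch of just the entry-rewriting part, under the simplifying assumption that every present, writable entry is a private copy-on-write mapping. The function name is hypothetical, no such Xen hypercall is implied to exist, and a real version would also validate entries and adjust page reference counts.

```c
#include <stddef.h>
#include <stdint.h>

#define PTE_PRESENT   0x1ULL   /* x86 present bit */
#define PTE_RW        0x2ULL   /* x86 read/write bit */
#define PTES_PER_PAGE 512      /* one 4 KiB page of 8-byte PTEs (assumption) */

/* Write-protect every present, writable PTE in one pagetable page and,
 * if `child` is non-NULL, copy the resulting entries into it.
 * Returns the number of entries that had to be write-protected. */
size_t pt_wrprotect_and_copy(uint64_t parent[PTES_PER_PAGE],
                             uint64_t child[PTES_PER_PAGE])
{
    size_t changed = 0;
    for (size_t i = 0; i < PTES_PER_PAGE; i++) {
        uint64_t pte = parent[i];
        if ((pte & (PTE_PRESENT | PTE_RW)) == (PTE_PRESENT | PTE_RW)) {
            pte &= ~PTE_RW;          /* copy-on-write: parent goes read-only */
            parent[i] = pte;
            changed++;
        }
        if (child != NULL)
            child[i] = pte;          /* child gets the same read-only entry */
    }
    return changed;
}
```

Doing the whole page in one pass is what would make the validation incremental, one pagetable page at a time, rather than a single large pin at the end.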

Ian

* [PATCH] turn off writable page tables
@ 2006-07-25 22:14 Andrew Theurer
  2006-07-25 22:43 ` Nivedita Singhvi
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Theurer @ 2006-07-25 22:14 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 1498 bytes --]

At OLS I gave a talk on some of the Xen scalability inhibitors, and one
of these was writable page tables.  We went over why the feature does
not scale, but just as importantly, we found that it does not provide
any advantage in the uniprocessor case either.  These tests were done on
x86_64, so I wanted to run the 1-way test on 32-bit to show the same
problem.  So, I have run several benchmarks with writable PTs and with
emulation forced on:

on Xeon MP processor, uniprocessor dom0 kernel, pae=y:

benchmark                 c/s 10729 (wrpt)  force_emulate  units
------------------------  ----------------  -------------  --------------------------
lmbench fork+exit         469.5833          470.3913       usec, lower is better
lmbench fork+execve       1241.0000         1225.7778      usec, lower is better
lmbench fork+/sbin/bash   12190.000         12119.000      usec, lower is better
dbench 3.03               186.354           191.278        MB/sec, higher is better
reaim_aim9                1890.01           2055.97        jobs/min, higher is better
reaim_compute             2538.75           2522.90        jobs/min, higher is better
reaim_dbase               3852.14           3739.38        jobs/min, higher is better
reaim_fserver             4437.93           4389.71        jobs/min, higher is better
reaim_shared              2365.85           2362.97        jobs/min, higher is better
SPEC SDET                 4315.91           4312.02        scripts/hr, higher is better

These are all within the noise level (some slightly better, some
slightly worse for emulate).  There really isn't much of a difference
here.  I'd like to propose turning on the emulate path all the time in
Xen.

-Andrew Theurer

Applies to c/s 10729
Signed-off-by: Andrew Theurer <habanero@us.ibm.com>


[-- Attachment #2: force-emulate.patch --]
[-- Type: text/x-patch, Size: 493 bytes --]

diff -Naurp xen-unstable.hg-10729/xen/arch/x86/mm.c xen-unstable.hg-10729-emulate/xen/arch/x86/mm.c
--- xen-unstable.hg-10729/xen/arch/x86/mm.c	2006-07-25 17:05:33.000000000 -0500
+++ xen-unstable.hg-10729-emulate/xen/arch/x86/mm.c	2006-07-25 17:03:40.000000000 -0500
@@ -3582,7 +3582,7 @@ int ptwr_do_page_fault(struct domain *d,
         return 0;
     }
 
-#if 0 /* Leave this in as useful for debugging */ 
+#if 1 /* Leave this in as useful for debugging */ 
     goto emulate; 
 #endif
 

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel


end of thread, other threads:[~2006-08-02  9:21 UTC | newest]

Thread overview: 28+ messages
2006-07-25 22:41 [PATCH] turn off writable page tables Ian Pratt
2006-07-26  2:25 ` Andrew Theurer
2006-07-26  5:31   ` Jacob Gorm Hansen
2006-07-26  8:18 ` Gerd Hoffmann
2006-07-26  8:40   ` Keir Fraser
2006-07-26 21:10     ` Andrew Theurer
  -- strict thread matches above, loose matches on Subject: below --
2006-07-28 15:51 Ian Pratt
2006-07-28 16:31 ` Keir Fraser
2006-07-28 21:36   ` Zachary Amsden
2006-07-28 23:05     ` Andi Kleen
2006-07-28 23:10       ` Zachary Amsden
2006-07-31  9:14         ` Keir Fraser
2006-07-31  9:32           ` Zachary Amsden
2006-07-31  9:53             ` Keir Fraser
2006-07-31 19:56               ` Zachary Amsden
2006-07-31 22:07                 ` Keir Fraser
2006-07-31 22:40                   ` Zachary Amsden
2006-08-02  9:21                     ` Keir Fraser
2006-07-27 17:31 Ian Pratt
2006-07-28  8:55 ` Keir Fraser
2006-07-28 15:21   ` Andrew Theurer
     [not found] <E1G5sBV-0005eg-At@host-192-168-0-1-bcn-london>
2006-07-26 23:38 ` Joe Bonasera
2006-07-26 21:38 Ian Pratt
2006-07-27 14:43 ` Andrew Theurer
2006-07-27 15:30   ` Keir Fraser
2006-07-25 22:14 Andrew Theurer
2006-07-25 22:43 ` Nivedita Singhvi
2006-07-25 23:19   ` Andrew Theurer
