Zachary Amsden wrote:
> Chris Wright wrote:
>> Why memset was never done on PAE?
>
> That's a good point.  The memset() is redundant on PAE, since it
> allocates all 4 PMDs immediately after that (in pgd_alloc).  There are
> two reasons for moving the memset() - one is that it can potentially
> perform useful work ahead of the lock and effectively act as a
> prefetch.  The second is that, at least on a hypervisor,
> clone_pgd_range() is likely to be taken as a page allocation hint, and
> thus moving the memset() before this operation allows only the
> actually present page directory entry updates to be passed to the
> hypervisor.
>
> Actually, the memset() could be redundant on non-PAE as well, since we
> should have gone through free_pgtables(), which would have done a
> pmd_clear() on each user-level pmd, and the kernel-level pmds are
> copied in again inside the lock.
>
> I'll try it out to see if this is possible.
>
> Zach

So that turned out to be a really bad idea.  But I did notice that the
pmds in PAE mode could be cached with the pgds instead of being destroyed
and re-allocated.  Unfortunately, this spends three pages per cached PAE
pgd and doesn't look like a big win.  I ran microbenchmarks, stolen
mostly from lmbench (thank you, Larry!), and this patch shows almost no
improvement.  Judging by the fact that the kmem slab cache seems to work
very efficiently, I don't think the extra overhead from the memset() in
the constructor is of much significance.
Here are the benchmark results on native hardware (P4, 2.4 GHz, PAE kernel):

Before:
(getpid and segv truncated beyond my scrollback, but of no significance)
forkwait: 0.596u 3.932s 0:04.54 99.5% 0+0k 0+0io 0pf+0w
forkwait: 0.632u 3.876s 0:04.50 100.0% 0+0k 0+0io 0pf+0w
forkwait: 0.468u 4.048s 0:04.51 99.7% 0+0k 0+0io 0pf+0w
forkwait: 0.516u 3.988s 0:04.50 99.7% 0+0k 0+0io 0pf+0w
forkwait: 0.644u 3.908s 0:04.55 99.7% 0+0k 0+0io 0pf+0w
divzero: 1.356u 6.712s 0:08.07 99.8% 0+0k 0+0io 0pf+0w
divzero: 1.332u 6.620s 0:07.94 100.1% 0+0k 0+0io 0pf+0w
divzero: 1.300u 6.652s 0:07.95 100.0% 0+0k 0+0io 0pf+0w
divzero: 1.672u 6.312s 0:07.98 100.0% 0+0k 0+0io 0pf+0w
divzero: 1.128u 6.824s 0:07.95 99.8% 0+0k 0+0io 0pf+0w
lat_pipe: 0.228u 8.196s 0:16.98 49.5% 0+0k 0+0io 0pf+0w
lat_pipe: 0.220u 8.420s 0:17.15 50.3% 0+0k 0+0io 0pf+0w
lat_pipe: 0.236u 8.376s 0:17.00 50.5% 0+0k 0+0io 0pf+0w
lat_pipe: 0.220u 8.140s 0:16.97 49.2% 0+0k 0+0io 0pf+0w
lat_pipe: 0.232u 8.488s 0:16.86 51.6% 0+0k 0+0io 0pf+0w
Switch: 5.896u 7.172s 0:11.97 109.1% 0+0k 0+0io 53pf+0w
Switch: 6.168u 6.792s 0:11.23 115.3% 0+0k 0+0io 1pf+0w
Switch: 6.084u 7.044s 0:11.22 116.9% 0+0k 0+0io 1pf+0w
Switch: 6.044u 7.088s 0:11.34 115.6% 0+0k 0+0io 1pf+0w
Switch: 6.252u 7.212s 0:11.45 117.5% 0+0k 0+0io 1pf+0w

After:
zach-dev2:Micro-bench $ cat out.post-patch
getpid: 0.076u 0.000s 0:00.08 87.5% 0+0k 0+0io 0pf+0w
getpid: 0.076u 0.004s 0:00.07 100.0% 0+0k 0+0io 0pf+0w
getpid: 0.080u 0.000s 0:00.08 100.0% 0+0k 0+0io 0pf+0w
getpid: 0.076u 0.000s 0:00.07 100.0% 0+0k 0+0io 0pf+0w
getpid: 0.072u 0.004s 0:00.07 100.0% 0+0k 0+0io 0pf+0w
segv: 1.168u 8.552s 0:09.72 99.8% 0+0k 0+0io 0pf+0w
segv: 1.160u 8.544s 0:09.70 100.0% 0+0k 0+0io 0pf+0w
segv: 1.248u 8.364s 0:09.61 99.8% 0+0k 0+0io 0pf+0w
segv: 1.296u 8.368s 0:09.66 99.8% 0+0k 0+0io 0pf+0w
segv: 1.312u 8.288s 0:09.59 100.0% 0+0k 0+0io 0pf+0w
forkwait: 0.600u 3.932s 0:04.53 100.0% 0+0k 0+0io 0pf+0w
forkwait: 0.580u 3.940s 0:04.51 100.2% 0+0k 0+0io 0pf+0w
forkwait: 0.576u 3.948s 0:04.52 99.7% 0+0k 0+0io 0pf+0w
forkwait: 0.492u 3.996s 0:04.48 100.0% 0+0k 0+0io 0pf+0w
forkwait: 0.604u 3.908s 0:04.51 99.7% 0+0k 0+0io 0pf+0w
divzero: 1.304u 6.740s 0:08.04 100.0% 0+0k 0+0io 0pf+0w
divzero: 1.360u 6.704s 0:08.06 100.0% 0+0k 0+0io 0pf+0w
divzero: 1.344u 6.696s 0:08.03 100.0% 0+0k 0+0io 0pf+0w
divzero: 1.428u 6.600s 0:08.02 100.0% 0+0k 0+0io 0pf+0w
divzero: 1.308u 6.720s 0:08.02 100.0% 0+0k 0+0io 0pf+0w
lat_pipe: 0.212u 7.648s 0:16.40 47.8% 0+0k 0+0io 0pf+0w
lat_pipe: 0.268u 8.208s 0:16.78 50.4% 0+0k 0+0io 0pf+0w
lat_pipe: 0.188u 8.296s 0:16.42 51.5% 0+0k 0+0io 0pf+0w
lat_pipe: 0.180u 8.084s 0:16.91 48.8% 0+0k 0+0io 0pf+0w
lat_pipe: 0.160u 7.668s 0:16.85 46.4% 0+0k 0+0io 0pf+0w
Switch: 6.168u 6.740s 0:11.91 108.3% 0+0k 0+0io 53pf+0w
Switch: 5.860u 7.332s 0:11.45 115.1% 0+0k 0+0io 1pf+0w
Switch: 5.804u 7.140s 0:11.34 114.1% 0+0k 0+0io 1pf+0w
Switch: 6.168u 6.644s 0:11.12 115.1% 0+0k 0+0io 1pf+0w
Switch: 6.076u 6.896s 0:11.34 114.2% 0+0k 0+0io 1pf+0w

So lat_pipe seems to have improved slightly... but it could be noise.
Yeah, not worth it.  Plus, this patch is obviously broken - the panic()
could be avoided by reworking the code, but this seems like a large
amount of work for very little gain.  Nevertheless, I have attached the
patch for posterity's sake.

Zach