Zachary Amsden wrote:
> Chris Wright wrote:
>> Why memset was never done on PAE?
>
> That's a good point.  The memset() is redundant on PAE, since it
> allocates all 4 PMDs immediately after that (in pgd_alloc).  There are
> two reasons for moving the memset() - one is that it can potentially
> perform useful work ahead of the lock and effectively act as a
> prefetch.  The second is that, at least on a hypervisor,
> clone_pgd_range() is likely to be taken as a page allocation hint, and
> thus moving the memset() before this operation allows only the
> actually present page directory entry updates to be passed to the
> hypervisor.
>
> Actually, the memset() could be redundant on non-PAE as well, since we
> should have gone through free_pgtables(), which would have done a
> pmd_clear() on each user-level pmd, and the kernel-level pmds are
> copied in again inside the lock.
>
> I'll try it out to see if this is possible.
>
> Zach

So that turned out to be a really bad idea.  But I did notice that the
pmds in PAE mode could be cached with the pgds instead of being destroyed
and re-allocated.  Unfortunately, this spends three pages per cached PAE
pgd and doesn't look like a big win.  I ran microbenchmarks, stolen
mostly from lmbench (thank you, Larry!), and this patch shows almost no
improvement.  Judging by the fact that the kmem slab cache seems to work
very efficiently, I don't think the extra overhead from the memset() in
the constructor is of much significance.
Here are the benchmark results on native hardware (P4, 2.4 GHz, PAE kernel):

Before:
(getpid and segv truncated beyond my scrollback, but of no significance)
forkwait: 0.596u 3.932s 0:04.54 99.5% 0+0k 0+0io 0pf+0w
forkwait: 0.632u 3.876s 0:04.50 100.0% 0+0k 0+0io 0pf+0w
forkwait: 0.468u 4.048s 0:04.51 99.7% 0+0k 0+0io 0pf+0w
forkwait: 0.516u 3.988s 0:04.50 99.7% 0+0k 0+0io 0pf+0w
forkwait: 0.644u 3.908s 0:04.55 99.7% 0+0k 0+0io 0pf+0w
divzero: 1.356u 6.712s 0:08.07 99.8% 0+0k 0+0io 0pf+0w
divzero: 1.332u 6.620s 0:07.94 100.1% 0+0k 0+0io 0pf+0w
divzero: 1.300u 6.652s 0:07.95 100.0% 0+0k 0+0io 0pf+0w
divzero: 1.672u 6.312s 0:07.98 100.0% 0+0k 0+0io 0pf+0w
divzero: 1.128u 6.824s 0:07.95 99.8% 0+0k 0+0io 0pf+0w
lat_pipe: 0.228u 8.196s 0:16.98 49.5% 0+0k 0+0io 0pf+0w
lat_pipe: 0.220u 8.420s 0:17.15 50.3% 0+0k 0+0io 0pf+0w
lat_pipe: 0.236u 8.376s 0:17.00 50.5% 0+0k 0+0io 0pf+0w
lat_pipe: 0.220u 8.140s 0:16.97 49.2% 0+0k 0+0io 0pf+0w
lat_pipe: 0.232u 8.488s 0:16.86 51.6% 0+0k 0+0io 0pf+0w
Switch: 5.896u 7.172s 0:11.97 109.1% 0+0k 0+0io 53pf+0w
Switch: 6.168u 6.792s 0:11.23 115.3% 0+0k 0+0io 1pf+0w
Switch: 6.084u 7.044s 0:11.22 116.9% 0+0k 0+0io 1pf+0w
Switch: 6.044u 7.088s 0:11.34 115.6% 0+0k 0+0io 1pf+0w
Switch: 6.252u 7.212s 0:11.45 117.5% 0+0k 0+0io 1pf+0w

After:
zach-dev2:Micro-bench $ cat out.post-patch
getpid: 0.076u 0.000s 0:00.08 87.5% 0+0k 0+0io 0pf+0w
getpid: 0.076u 0.004s 0:00.07 100.0% 0+0k 0+0io 0pf+0w
getpid: 0.080u 0.000s 0:00.08 100.0% 0+0k 0+0io 0pf+0w
getpid: 0.076u 0.000s 0:00.07 100.0% 0+0k 0+0io 0pf+0w
getpid: 0.072u 0.004s 0:00.07 100.0% 0+0k 0+0io 0pf+0w
segv: 1.168u 8.552s 0:09.72 99.8% 0+0k 0+0io 0pf+0w
segv: 1.160u 8.544s 0:09.70 100.0% 0+0k 0+0io 0pf+0w
segv: 1.248u 8.364s 0:09.61 99.8% 0+0k 0+0io 0pf+0w
segv: 1.296u 8.368s 0:09.66 99.8% 0+0k 0+0io 0pf+0w
segv: 1.312u 8.288s 0:09.59 100.0% 0+0k 0+0io 0pf+0w
forkwait: 0.600u 3.932s 0:04.53 100.0% 0+0k 0+0io 0pf+0w
forkwait: 0.580u 3.940s 0:04.51 100.2% 0+0k 0+0io 0pf+0w
forkwait: 0.576u 3.948s 0:04.52 99.7% 0+0k 0+0io 0pf+0w
forkwait: 0.492u 3.996s 0:04.48 100.0% 0+0k 0+0io 0pf+0w
forkwait: 0.604u 3.908s 0:04.51 99.7% 0+0k 0+0io 0pf+0w
divzero: 1.304u 6.740s 0:08.04 100.0% 0+0k 0+0io 0pf+0w
divzero: 1.360u 6.704s 0:08.06 100.0% 0+0k 0+0io 0pf+0w
divzero: 1.344u 6.696s 0:08.03 100.0% 0+0k 0+0io 0pf+0w
divzero: 1.428u 6.600s 0:08.02 100.0% 0+0k 0+0io 0pf+0w
divzero: 1.308u 6.720s 0:08.02 100.0% 0+0k 0+0io 0pf+0w
lat_pipe: 0.212u 7.648s 0:16.40 47.8% 0+0k 0+0io 0pf+0w
lat_pipe: 0.268u 8.208s 0:16.78 50.4% 0+0k 0+0io 0pf+0w
lat_pipe: 0.188u 8.296s 0:16.42 51.5% 0+0k 0+0io 0pf+0w
lat_pipe: 0.180u 8.084s 0:16.91 48.8% 0+0k 0+0io 0pf+0w
lat_pipe: 0.160u 7.668s 0:16.85 46.4% 0+0k 0+0io 0pf+0w
Switch: 6.168u 6.740s 0:11.91 108.3% 0+0k 0+0io 53pf+0w
Switch: 5.860u 7.332s 0:11.45 115.1% 0+0k 0+0io 1pf+0w
Switch: 5.804u 7.140s 0:11.34 114.1% 0+0k 0+0io 1pf+0w
Switch: 6.168u 6.644s 0:11.12 115.1% 0+0k 0+0io 1pf+0w
Switch: 6.076u 6.896s 0:11.34 114.2% 0+0k 0+0io 1pf+0w

So lat_pipe seems to have improved slightly... but it could be noise.
Yeah, not worth it.  Plus, this patch is obviously broken - the panic()
could be avoided by reworking the code, but this seems like a large
amount of work for very little gain.  Nevertheless, I have attached the
patch for posterity's sake.

Zach