* 2.6.5-rc3-as1 patchset, cks5.2, cfq, aa1, and some wli
@ 2004-04-02  8:21 Antony Suter
  2004-04-02 10:14 ` William Lee Irwin III

From: Antony Suter @ 2004-04-02 8:21 UTC (permalink / raw)
To: linux-kernel

Here is an update to my tidy set of patches. I was a fan of WLI's
patchset until he discontinued it in the 2.6.0-test era. They were a
small set of patches with performance improvements for laptops and NUMA
machines, amongst others... (wish I had a laptop NUMA machine *cough*).
Some of those patches are now in the kernel proper. Some others have
been updated and are found elsewhere; for instance, the objrmap series
can be found in Andrea Arcangeli's -aa series. I'm starting to add some
of the other patches from WLI's last release, depending on my ability
to resolve rejections. The numbers relate directly to those from
linux-2.6.0-test11-wli-1.tar.bz2.

Patches were applied in the following order:
- Con Kolivas' new staircase cpu scheduler patch 5.2
- Jens Axboe's cfq io scheduler
- Andrea Arcangeli's 2.6.5-rc3-aa2.bz2

< from linux-2.6.0-test11-wli-1 >
- #17 convert copy_strings() to use kmap_atomic() instead of kmap()
- #19 node-local i386 per_cpu areas
- #22 increase static vfs hashtable and VM array sizes
- #24 /proc/ BKL gunk plus page wait hashtable sizing adjustment
- #25 invalidate_inodes() speedup

Links:
http://www.users.on.net/sutera/2.6.5-rc3-as1.patch.gz
http://www.users.on.net/sutera/2.6.5-rc3-as1.patch.gz.sign

Note that to use the cfq scheduler you need to add "elevator=cfq" to
your kernel command line. This is usually done in your lilo or grub (or
equivalent) config.

-- 
- Antony Suter (sutera internode on net) "Bonta"
  "...through shadows falling, out of memory and time..."
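To make the "elevator=cfq" note above concrete, a boot-loader stanza
might look like the following. The kernel image name, root device, and
title are illustrative guesses for a 2.6.5-era setup, not taken from
the original mail:

```
# grub (legacy) menu.lst entry -- image name and root device are examples
title 2.6.5-rc3-as1
    root (hd0,0)
    kernel /boot/vmlinuz-2.6.5-rc3-as1 root=/dev/hda1 elevator=cfq

# or for lilo.conf, inside the relevant image= section, then rerun lilo:
append="elevator=cfq"
```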
* Re: 2.6.5-rc3-as1 patchset, cks5.2, cfq, aa1, and some wli
  2004-04-02  8:21 2.6.5-rc3-as1 patchset, cks5.2, cfq, aa1, and some wli Antony Suter
@ 2004-04-02 10:14 ` William Lee Irwin III
  2004-04-02 16:43   ` Antony Suter

From: William Lee Irwin III @ 2004-04-02 10:14 UTC (permalink / raw)
To: Antony Suter; +Cc: linux-kernel

On Fri, Apr 02, 2004 at 06:21:25PM +1000, Antony Suter wrote:
> Here is an update to my tidy set of patches. I was a fan of WLI's
> patchset until he discontinued it in the 2.6.0-test era. They were a
> small set of patches with performance improvements for laptops and NUMA
> machines amongst others... (wish I had a laptop NUMA machine *cough*).
> Some of those patches are now in the kernel proper. Some others have
> been updated and are found elsewhere, like the objrmap series can be
> found in Andrea Arcangeli's -aa series. I'm starting to add some of the
> other patches from WLI's last release, depending on my ability to
> resolve rejections. The numbers relate directly to those from
> linux-2.6.0-test11-wli-1.tar.bz2

Ouch! Please, use either Hugh's or Andrea's up-to-date patches.
anobjrmap (and actually the vast majority of this material) was not
original. You're also unlikely to find highpmd and a number of others
useful without highmem and/or ia32 NUMA. I'm honestly not sure how you
managed to merge anything with all the highpmd/O(1)
proc_pid_statm()/anobjrmap bits in there. Heck, I lost the stamina to
keep it going myself.

On Fri, Apr 02, 2004 at 06:21:25PM +1000, Antony Suter wrote:
> Patches were applied in the following order:
> - Con Kolivas' new staircase cpu scheduler patch 5.2
> - Jens Axboe's cfq io scheduler
> - Andrea Arcangeli's 2.6.5-rc3-aa2.bz2

Phew, aa's bits should be maintained/updated/bugfixed.
On Fri, Apr 02, 2004 at 06:21:25PM +1000, Antony Suter wrote:
> < from linux-2.6.0-test11-wli-1 >
> - #17 convert copy_strings() to use kmap_atomic() instead of kmap()
> - #19 node-local i386 per_cpu areas

These two are completely useless unless you're running on bigfathighmem.

On Fri, Apr 02, 2004 at 06:21:25PM +1000, Antony Suter wrote:
> - #22 increase static vfs hashtable and VM array sizes
> - #24 /proc/ BKL gunk plus page wait hashtable sizing adjustment
> - #25 invalidate_inodes() speedup

#25 is in -mm and may have bugfixes/updates relative to whatever I had.
#22 and #24 don't have very definite impacts; I only saw any difference
with truly massive amounts of IO in-flight (e.g. 25GB/48GB), and while
it wasn't a clear win, it moved around the blocking points surrounding
IO submission to places where I'd rather they go. This is a rather
foggy issue you probably shouldn't be concerned with. If you do see it
make a difference, I'll be surprised, but happy to see the benchmark
numbers demonstrating it.

To understand what was really going on with all this, it helps to know
why this thing was put together. That was to optimize a benchmark I
can't name without having problems with respect to publication rules
and so on. This was actually done in somewhat of a hurry, as it was
meant more as a demonstration of benchmarking methodology than as a way
of producing important results in and of themselves, but then the
patches were maintained for far longer than the actual optimization
effort lasted, as they appeared to be valuable. I believed the results
of this to be potentially useful to end users because the benchmark was
a simulation of interactive workloads where people interacted with
shells and spawned short-running jobs meant to be typical for
shellservers in university-like environments, though the age of the
benchmark hurt its relevance quite a bit. Active development was
pursued on other things while the -wli patches were maintained.
As it aged, most of the highly experimental parts were backed out
instead of debugged to address stability issues, and eventually the
entire patch cocktail imploded, as the series of 10-15 patches in a row
that stomped over every line modifying a user pte developed poor
interactions I didn't have the bandwidth to address in addition to my
regular duties.

While the optimization effort was ongoing, my general approach was to
hunt for patches to forward port or "combine" as opposed to producing
anything original. Many of these things were motivated by a combination
of a priori reasoning about what was going on and various profile-based
hints. For instance, on ia32 NUMA, all lowmem is on node 0. I observed
that pagetable _teardown_ was expensive, specifically in the loops over
pmd's to remove pagetable pages. My attack was to hoist the pmd's into
node-local memory by shoving them into highmem, where Andrea's
pte-highmem and Arjan and Ingo's highpte were both strong precedents. I
combined the two approaches, as Andrea placed pmd's in highmem, while
Arjan and Ingo used kmap_atomic() and the like to avoid kmap_lock
overhead and so on. All that was, in fact, after some abortive attempts
to punt pagetable teardown to keventd, which, while mechanically
successful (i.e. the code worked), was not effective as a performance
improvement.

Many of the other patches were even more direct equivalents of some
predecessor. For instance, this unnameable benchmark spawns ps(1) very
frequently to simulate users monitoring their own workloads. This then
very heavily stresses the /proc/ VM reporting code and the /proc/ vfs
code. To address this, I ported whatever vfs RCU code I could that
maneesh and dipankar had written, and _also_ ported bcrl's O(1)
proc_pid_statm() from RH's 2.4.9, which resolved semaphore contention
issues and more general algorithmic efficiency issues in /proc/
reporting.
With that and various BKL-related /proc/ adjustments, the
/proc/-stressing components were sped up greatly. This differs a lot
from other attacks on this benchmark, where the benchmark is altered so
the parts stressing /proc/ are removed. I also had in the back of my
mind the notion that /proc/ performance improvements would be
appreciated by end users with limited cpu power to devote to the
monitoring of their workloads and machines' performance, which is part
of what motivated me to do it "the hard way" instead of modifying the
benchmark or replacing the userspace procps utilities with
/dev/kmem-diving utilities.

The general points this is all meant to illustrate are that some of the
cherrypicking going on doesn't really make sense, and that the
background on where all this stuff came from can help you understand
which parts are going to be useful to you if you do choose to
cherrypick them. I very much regret not arranging relative benchmark
results to post, as they are very impressive for not having exploited
the extreme NUMA characteristics of the test machines. In the very
strict sense of the slope of the curve as the number of processors
increases, the original patch set was measured to literally double the
kernel's scalability in this benchmark, which is something I'm rather
proud of. There were other approaches which exploited the NUMA hardware
aspects to achieve more drastic results with less code, but they had
more limited applicability, as non-NUMA machines didn't benefit from
them at all.

Also, there will be new -wli's. They will be vastly different in nature
from the prior -wli's. I don't like repeating myself. I already
acknowledged the precedents available to me in the 2.5.74 era. The new
-wli's won't be as heavily influenced by precedents and will be of a
substantially different character from the prior releases. I'm taking
my time to do this, and for a good reason: I don't want to do it
half-assed. It may not be VM. It may not be any one thing.
What it _will_ be (unlike some of the prior -wli code) is up to my own
personal coding standards, which you may rest assured are rather high.

And finally, even with all this longwinded harangue, congratulations on
your tree. There are very definite feelings of importance and
satisfaction in having done service by producing releases others rely
upon. And these are real, as real users do benefit from what you've
assembled. I'm more than happy to help if you have bugreports in any
code I maintained or other need to call on me. And whatever precedent I
may have provided, you do own this, and this is your own original work.

-- wli
* Re: 2.6.5-rc3-as1 patchset, cks5.2, cfq, aa1, and some wli
  2004-04-02 10:14 ` William Lee Irwin III
@ 2004-04-02 16:43   ` Antony Suter
  2004-04-02 23:44     ` William Lee Irwin III

From: Antony Suter @ 2004-04-02 16:43 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-kernel

On Fri, 2004-04-02 at 20:14, William Lee Irwin III wrote:
> On Fri, Apr 02, 2004 at 06:21:25PM +1000, Antony Suter wrote:
> > Some of those patches are now in the kernel proper. Some others have
> > been updated and are found elsewhere, like the objrmap series can be
> > found in Andrea Arcangeli's -aa series. I'm starting to add some of the
> > other patches from WLI's last release, depending on my ability to
> > resolve rejections. The numbers relate directly to those from
> > linux-2.6.0-test11-wli-1.tar.bz2
>
> Ouch! Please, use either Hugh's or Andrea's up-to-date patches.
> anobjrmap (and actually the vast majority of this material) was not
> original. You're also unlikely to find highpmd and a number of others
> useful without highmem and/or ia32 NUMA.

I certainly want to use the most up-to-date versions if others have
continued that work. Pointers appreciated.

> On Fri, Apr 02, 2004 at 06:21:25PM +1000, Antony Suter wrote:
> > < from linux-2.6.0-test11-wli-1 >
> > - #17 convert copy_strings() to use kmap_atomic() instead of kmap()
> > - #19 node-local i386 per_cpu areas
>
> These two are completely useless unless you're running on bigfathighmem.
>
> On Fri, Apr 02, 2004 at 06:21:25PM +1000, Antony Suter wrote:
> > - #22 increase static vfs hashtable and VM array sizes
> > - #24 /proc/ BKL gunk plus page wait hashtable sizing adjustment
> > - #25 invalidate_inodes() speedup
>
> #25 is in -mm and may have bugfixes/updates relative to whatever I had.
> #22 and #24 don't have very definite impacts; I only saw any difference
> with truly massive amounts of IO in-flight (e.g.
> 25GB/48GB), and while it wasn't a clear win, it moved around the
> blocking points surrounding IO submission to places where I'd rather
> they go. This is a rather foggy issue you probably shouldn't be
> concerned with. If you do see it make a difference, I'll be surprised,
> but happy to see the benchmark numbers demonstrating it.

Could you add some outlines of your patches #02, #03, #18 and #28
please?

> [...]
> altered so the parts stressing /proc/ are removed. I also had in the
> back of my mind the notion that /proc/ performance improvements would
> be appreciated by end users with limited cpu power to devote to the
> monitoring of their workloads and machines' performance, which is part
> of what motivated me to do it "the hard way" instead of modifying the
> benchmark or replacing the userspace procps utilities with
> /dev/kmem-diving utilities.

How important would improvements to /proc be now that we have /sys?

> The general points this is all meant to illustrate are that some of the
> cherrypicking going on doesn't really make sense, and that the
> background on where all this stuff came from can help you understand
> which parts are going to be useful to you if you do choose to
> cherrypick them.

I certainly want to grok the purpose of each and every patch. This
release might have been larger and out sooner if not for a reverse
patch cascade meltdown.

> I very much regret not arranging relative benchmark results to
> post, as they are very impressive for not having exploited the extreme
> NUMA characteristics of the test machines. In the very strict sense of
> the slope of the curve as the number of processors increases, the
> original patch set was measured to literally double the kernel's
> scalability in this benchmark, which is something I'm rather proud of.
> There were other approaches which exploited the NUMA hardware aspects
> to achieve more drastic results with less code, but had more limited
> applicability as non-NUMA machines didn't benefit from them at all.

For as long as this series continues, I would want to include patches
that improve performance in any area, so long as the overall effect is
positive. And no improvement is too great or too small. I must have
more power.

> Also, there will be new -wli's. They will be vastly different in nature
> from the prior -wli's. I don't like repeating myself. I already

I look forward to it ;)

> And finally, even with all this longwinded harangue, congratulations on
> your tree. There are very definite feelings of importance and
> satisfaction in having done service by producing releases others rely
> upon. And these are real, as real users do benefit from what you've
> assembled. I'm more than happy to help if you have bugreports in any
> code I maintained or other need to call on me. And whatever precedent I
> may have provided, you do own this, and this is your own original work.

Thanks for your kind words, and detailed notes! Again, any pointers to
similar sorts of work would be greatly appreciated. Can you recommend
any good tools for patch set management?

Cheers.

-- 
- Antony Suter (suterant users sourceforge net) "Bonta"
  "...through shadows falling, out of memory and time..."
* Re: 2.6.5-rc3-as1 patchset, cks5.2, cfq, aa1, and some wli
  2004-04-02 16:43   ` Antony Suter
@ 2004-04-02 23:44     ` William Lee Irwin III

From: William Lee Irwin III @ 2004-04-02 23:44 UTC (permalink / raw)
To: Antony Suter; +Cc: linux-kernel

On Sat, Apr 03, 2004 at 02:43:28AM +1000, Antony Suter wrote:
> Could you add some outlines of your patches #02, #03, #18 and #28
> please?

#2 rewrote the page allocator to do deferred coalescing, which had the
additional advantage of making transfers of groups of pages to and from
the lists protected by zone->lock into expected O(1) operations. It
also exported the new functionality of O(1) batched page freeing to
callers, which was utilized by #3.

#3 implemented caching of preconstructed leaf pagetable nodes in a
manner compatible with highpte. This was supposed to conserve cache,
and may improve performance on loads that repetitively fork() and
exit().

#18 just micro-optimized some page allocator logic and enlarged the
batches so as to take advantage of the operations newly made O(1) by
#2, which would otherwise have been expensive with large batches. It's
not actually useful without #2 in place.

#28 put all scheduling primitives into their own ELF sections,
delimited by new marker symbols, and used those to improve
/proc/$PID/wchan reporting, so that various scheduling primitives are
skipped over that aren't now, and scheduling functions don't need to be
contiguous in the text segment of the kernel. I've resubmitted this a
few more times, and it seems to be destined for mainline.

At some point in the past, I wrote:
>> altered so the parts stressing /proc/ are removed.
>> I also had in the
>> back of my mind the notion that /proc/ performance improvements would
>> be appreciated by end users with limited cpu power to devote to the
>> monitoring of their workloads and machines' performance, which is part
>> of what motivated me to do it "the hard way" instead of modifying the
>> benchmark or replacing the userspace procps utilities with
>> /dev/kmem-diving utilities.

On Sat, Apr 03, 2004 at 02:43:28AM +1000, Antony Suter wrote:
> How important would improvements to /proc be now that we have /sys?

These were largely performance-oriented, not functionality-oriented. A
rather unfortunate aspect of that benchmark was that it effectively
benchmarked dozens or hundreds of processes doing ps(1) in parallel.
End users may find that the overhead of running top(1) is reduced by
the patches meant to speed up /proc/ for the benchmark, as many of them
were single-threaded speedups, and not just locking improvements. The
most important of the /proc/ performance patches was actually #5, the
forward port of bcrl's O(1) proc_pid_statm(). #4, the rbtree-based
get_tgid_list()/get_tid_list(), may also prove useful, and could use
some benchmarking as a standalone patch.

At some point in the past, I wrote:
>> And finally, even with all this longwinded harangue, congratulations on
>> your tree. There are very definite feelings of importance and
>> satisfaction in having done service by producing releases others rely
>> upon. And these are real, as real users do benefit from what you've
>> assembled. I'm more than happy to help if you have bugreports in any
>> code I maintained or other need to call on me. And whatever precedent I
>> may have provided, you do own this, and this is your own original work.

On Sat, Apr 03, 2004 at 02:43:28AM +1000, Antony Suter wrote:
> Thanks for your kind words, and detailed notes! Again, any pointers to
> similar sorts of work would be greatly appreciated. Can you recommend
> any good tools for patch set management?
I've discovered quilt (based on akpm's patch scripts) is excellent and
am replacing my old scripts, which confused everyone but me, with it.

-- wli
Thread overview: 4+ messages
2004-04-02  8:21 2.6.5-rc3-as1 patchset, cks5.2, cfq, aa1, and some wli Antony Suter
2004-04-02 10:14 ` William Lee Irwin III
2004-04-02 16:43   ` Antony Suter
2004-04-02 23:44     ` William Lee Irwin III