* Re: Mainline kernel OLTP performance update
  [not found] ` <20090115201210.ca1a9542.akpm@linux-foundation.org>
@ 2009-01-16  6:46 ` Nick Piggin
  2009-01-16  6:55   ` Matthew Wilcox
  ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread

From: Nick Piggin @ 2009-01-16  6:46 UTC (permalink / raw)
To: Andrew Morton, netdev, sfr
Cc: matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi,
    arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner,
    peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi,
    andrew.vasquez, anirban.chakraborty

On Friday 16 January 2009 15:12:10 Andrew Morton wrote:
> On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > I would like to see SLQB merged in mainline, made default, and wait for
> > some number of releases. Then we take what we know, and try to make an
> > informed decision about the best one to take. I guess that is problematic
> > in that the rest of the kernel is moving underneath us. Do you have
> > another idea?
>
> Nope.  If it doesn't work out, we can remove it again I guess.

OK, I have these numbers to show I'm not completely off my rocker to suggest
we merge SLQB :)

Given these results, how about I ask to merge SLQB as default in linux-next,
and if nothing catastrophic happens, merge it upstream in the next merge
window; then, a couple of releases after that, given some time to test and
tweak SLQB, we bite the bullet and emerge with just one main slab allocator
(plus SLOB).

The system is a 2-socket, 4-core AMD. All debug and stats options are turned
off for all the allocators, with default parameters (i.e. SLUB using
higher-order pages, the others tending to use order-0). SLQB is the version I
recently posted, with some of the prefetching removed according to Pekka's
review (probably a good idea to only add things like that if/when they prove
to be an improvement).

time fio examples/netio (10 runs, lower is better):
SLAB  AVG=13.19  STD=0.40
SLQB  AVG=13.78  STD=0.24
SLUB  AVG=14.47  STD=0.23

SLAB makes a good showing here. The allocation/freeing pattern seems to be
very regular and easy (fast allocs and frees), so it could be some "lucky"
caching behaviour; I'm not exactly sure. I'll have to run more tests and
profiles here.

hackbench (10 runs, lower is better):
1 GROUP
SLAB  AVG=1.34  STD=0.05
SLQB  AVG=1.31  STD=0.06
SLUB  AVG=1.46  STD=0.07

2 GROUPS
SLAB  AVG=1.20  STD=0.09
SLQB  AVG=1.22  STD=0.12
SLUB  AVG=1.21  STD=0.06

4 GROUPS
SLAB  AVG=0.84  STD=0.05
SLQB  AVG=0.81  STD=0.10
SLUB  AVG=0.98  STD=0.07

8 GROUPS
SLAB  AVG=0.79  STD=0.10
SLQB  AVG=0.76  STD=0.15
SLUB  AVG=0.89  STD=0.08

16 GROUPS
SLAB  AVG=0.78  STD=0.08
SLQB  AVG=0.79  STD=0.10
SLUB  AVG=0.86  STD=0.05

32 GROUPS
SLAB  AVG=0.86  STD=0.05
SLQB  AVG=0.78  STD=0.06
SLUB  AVG=0.88  STD=0.06

64 GROUPS
SLAB  AVG=1.03  STD=0.05
SLQB  AVG=0.90  STD=0.04
SLUB  AVG=1.05  STD=0.06

128 GROUPS
SLAB  AVG=1.31  STD=0.19
SLQB  AVG=1.16  STD=0.36
SLUB  AVG=1.29  STD=0.11

SLQB tends to be the winner here. SLAB is close at lower numbers of groups,
but drops behind a bit more as they increase.
tbench (10 runs, higher is better):
1 THREAD
SLAB  AVG=239.25   STD=31.74
SLQB  AVG=257.75   STD=33.89
SLUB  AVG=223.02   STD=14.73

2 THREADS
SLAB  AVG=649.56   STD=9.77
SLQB  AVG=647.77   STD=7.48
SLUB  AVG=634.50   STD=7.66

4 THREADS
SLAB  AVG=1294.52  STD=13.19
SLQB  AVG=1266.58  STD=35.71
SLUB  AVG=1228.31  STD=48.08

8 THREADS
SLAB  AVG=2750.78  STD=26.67
SLQB  AVG=2758.90  STD=18.86
SLUB  AVG=2685.59  STD=22.41

16 THREADS
SLAB  AVG=2669.11  STD=58.34
SLQB  AVG=2671.69  STD=31.84
SLUB  AVG=2571.05  STD=45.39

SLAB and SLQB seem to be pretty close, winning some and losing some. They're
always within a standard deviation of one another, so we can't draw
conclusions between them. SLUB seems to be a bit slower.

Netperf UDP unidirectional send test (10 runs, higher is better):

Server and client bound to same CPU
SLAB  AVG=60.111  STD=1.59382
SLQB  AVG=60.167  STD=0.685347
SLUB  AVG=58.277  STD=0.788328

Server and client bound to same socket, different CPUs
SLAB  AVG=85.938  STD=0.875794
SLQB  AVG=93.662  STD=2.07434
SLUB  AVG=81.983  STD=0.864362

Server and client bound to different sockets
SLAB  AVG=78.801  STD=1.44118
SLQB  AVG=78.269  STD=1.10457
SLUB  AVG=71.334  STD=1.16809

SLQB is up with SLAB for the first and last cases, and faster in the second
case. SLUB trails in each case. (Any ideas for better types of netperf
tests?)

Kbuild numbers don't seem to be significantly different. SLAB and SLQB
actually got exactly the same average over 10 runs. The user+sys times tend
to be almost identical between allocators, with elapsed time depending mainly
on how much time the CPU was not idle.

Intel's OLTP shows SLQB is "neutral" to SLAB; that is, literally within their
measurement confidence interval. If it comes down to it, I think we could get
them to do more runs to narrow that down, but we're talking a couple of
tenths of a percent already.

I haven't done any non-local network tests. Networking is one of the
subsystems most heavily dependent on slab performance, so if anybody cares to
run their favourite tests, that would be really helpful.

Disclaimer
----------
Now remember this is just one specific HW configuration, and some allocators
for some reason give significantly (and sometimes perplexingly) different
results between different CPU and system architectures.

The other frustrating thing is that sometimes you happen to get a lucky or
unlucky cache or NUMA layout depending on the compile, the boot, etc., so
results sometimes get a little "skewed" in a way that isn't reflected in the
STDDEV. But I've tried to minimise that by dropping caches and restarting
services etc. between individual runs.

^ permalink raw reply	[flat|nested] 42+ messages in thread
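(For anyone wanting to reproduce the netperf numbers above: the exact command
lines aren't given in the thread, so the following is only a rough sketch of
a 4k UDP unidirectional send run; the host, CPU numbers and test length are
placeholders.)

	# start the receiver
	netserver
	# 4k UDP unidirectional send, 60 s, client and server pinned to CPUs 0 and 1
	netperf -t UDP_STREAM -H 127.0.0.1 -T 0,1 -l 60 -- -m 4096

The -T option pins netperf and netserver to the given CPUs, which is one way
to arrange the "same CPU" / "same socket" / "different sockets" bindings.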
* Re: Mainline kernel OLTP performance update 2009-01-16 6:46 ` Mainline kernel OLTP performance update Nick Piggin @ 2009-01-16 6:55 ` Matthew Wilcox 2009-01-16 7:06 ` Nick Piggin 2009-01-16 7:53 ` Zhang, Yanmin 2009-01-16 7:00 ` Mainline kernel OLTP performance update Andrew Morton 2009-01-16 18:11 ` Rick Jones 2 siblings, 2 replies; 42+ messages in thread From: Matthew Wilcox @ 2009-01-16 6:55 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Zhang, Yanmin On Fri, Jan 16, 2009 at 05:46:23PM +1100, Nick Piggin wrote: > Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within > their measurement confidence interval. If it comes down to it, I think we > could get them to do more runs to narrow that down, but we're talking a > couple of tenths of a percent already. I think I can speak with some measure of confidence for at least the OLTP-testing part of my company when I say that I have no objection to Nick's planned merge scheme. I believe the kernel benchmark group have also done some testing with SLQB and have generally positive things to say about it (Yanmin added to the gargantuan cc). Did slabtop get fixed to work with SLQB? -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 6:55 ` Matthew Wilcox @ 2009-01-16 7:06 ` Nick Piggin 2009-01-16 7:53 ` Zhang, Yanmin 1 sibling, 0 replies; 42+ messages in thread From: Nick Piggin @ 2009-01-16 7:06 UTC (permalink / raw) To: Matthew Wilcox Cc: Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Zhang, Yanmin On Friday 16 January 2009 17:55:47 Matthew Wilcox wrote: > On Fri, Jan 16, 2009 at 05:46:23PM +1100, Nick Piggin wrote: > > Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within > > their measurement confidence interval. If it comes down to it, I think we > > could get them to do more runs to narrow that down, but we're talking a > > couple of tenths of a percent already. > > I think I can speak with some measure of confidence for at least the > OLTP-testing part of my company when I say that I have no objection to > Nick's planned merge scheme. > > I believe the kernel benchmark group have also done some testing with > SLQB and have generally positive things to say about it (Yanmin added to > the gargantuan cc). > > Did slabtop get fixed to work with SLQB? Yes the old slabtop that works on /proc/slabinfo works with SLQB (ie. SLQB implements /proc/slabinfo). Lin Ming recently also ported the SLUB /sys/kernel/slab/ specific slabinfo tool to SLQB. Basically it reports in-depth internal event counts etc. and can operate on individual caches, making it very useful for performance "observability" and tuning. It is hard to come up with a single set of statistics that apply usefully to all the allocators. FWIW, it would be a useful tool to port over to SLAB too, if we end up deciding to go with SLAB. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update
  2009-01-16  6:55 ` Matthew Wilcox
  2009-01-16  7:06   ` Nick Piggin
@ 2009-01-16  7:53   ` Zhang, Yanmin
  2009-01-16 10:20     ` Andi Kleen
  1 sibling, 1 reply; 42+ messages in thread

From: Zhang, Yanmin @ 2009-01-16  7:53 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
    linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha,
    harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel,
    chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty

On Thu, 2009-01-15 at 23:55 -0700, Matthew Wilcox wrote:
> On Fri, Jan 16, 2009 at 05:46:23PM +1100, Nick Piggin wrote:
> > Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within
> > their measurement confidence interval. If it comes down to it, I think we
> > could get them to do more runs to narrow that down, but we're talking a
> > couple of tenths of a percent already.
>
> I think I can speak with some measure of confidence for at least the
> OLTP-testing part of my company when I say that I have no objection to
> Nick's planned merge scheme.
>
> I believe the kernel benchmark group have also done some testing with
> SLQB and have generally positive things to say about it (Yanmin added to
> the gargantuan cc).

We did run lots of benchmarks with SLQB. Compared with SLUB, one highlight of
SLQB is netperf UDP-U-4k. On my x86-64 machines, if I start 1 client and 1
server process and bind them to different physical CPUs, SLQB's result is
about 20% better than SLUB's. If I start CPU_NUM clients and the same number
of servers without binding, SLQB's result is about 100% better than SLUB's.
I think that's because SLQB doesn't pass big object allocations through to
the page allocator. netperf UDP-U-1k shows less improvement with SLQB.

The results of the other benchmarks vary: good on some machines, bad on
others. However, the variation is small. For example, hackbench's result
with SLQB was about 1 second worse than with SLUB on the 8-core Stoakley.
After we worked with Nick on a small code change, SLQB's hackbench result is
a little better than SLUB's on Stoakley. We consider the other variations to
be fluctuation.

All the testing uses the default SLUB and SLQB configuration.

>
> Did slabtop get fixed to work with SLQB?
>

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 7:53 ` Zhang, Yanmin @ 2009-01-16 10:20 ` Andi Kleen 2009-01-20 5:16 ` Zhang, Yanmin 0 siblings, 1 reply; 42+ messages in thread From: Andi Kleen @ 2009-01-16 10:20 UTC (permalink / raw) To: Zhang, Yanmin Cc: Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes: > I think that's because SLQB > doesn't pass through big object allocation to page allocator. > netperf UDP-U-1k has less improvement with SLQB. That sounds like just the page allocator needs to be improved. That would help everyone. We talked a bit about this earlier, some of the heuristics for hot/cold pages are quite outdated and have been tuned for obsolete machines and also its fast path is quite long. Unfortunately no code currently. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update
  2009-01-16 10:20 ` Andi Kleen
@ 2009-01-20  5:16   ` Zhang, Yanmin
  2009-01-21 23:58     ` Christoph Lameter
  0 siblings, 1 reply; 42+ messages in thread

From: Zhang, Yanmin @ 2009-01-20  5:16 UTC (permalink / raw)
To: Andi Kleen, Christoph Lameter, Pekka Enberg
Cc: Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr,
    matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
    suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang,
    hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez,
    anirban.chakraborty

On Fri, 2009-01-16 at 11:20 +0100, Andi Kleen wrote:
> "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes:
> >
> > I think that's because SLQB
> > doesn't pass through big object allocation to page allocator.
> > netperf UDP-U-1k has less improvement with SLQB.
>
> That sounds like just the page allocator needs to be improved.
> That would help everyone. We talked a bit about this earlier,
> some of the heuristics for hot/cold pages are quite outdated
> and have been tuned for obsolete machines and also its fast path
> is quite long. Unfortunately no code currently.

Andi,

Thanks for your kind information. I did more investigation of the netperf
UDP-U-4k issue with SLUB.

oprofile shows:

328058   30.1342  linux-2.6.29-rc2  copy_user_generic_string
134666   12.3699  linux-2.6.29-rc2  __free_pages_ok
125447   11.5231  linux-2.6.29-rc2  get_page_from_freelist
22611     2.0770  linux-2.6.29-rc2  __sk_mem_reclaim
21442     1.9696  linux-2.6.29-rc2  list_del
21187     1.9462  linux-2.6.29-rc2  __ip_route_output_key

So __free_pages_ok and get_page_from_freelist consume too much CPU time.
With SLQB, these 2 functions consume almost no time.

Command 'slabinfo -AD' shows:

Name          Objects     Alloc      Free    %Fast
:0000256         1685  29611065  29609548    99 99
:0000168         2987    164689    161859    94 39
:0004096         1471    114918    113490    99 97

So kmem_cache :0000256 is very active.

A kernel stack dump in __free_pages_ok shows:

 [<ffffffff8027010f>] __free_pages_ok+0x109/0x2e0
 [<ffffffff8024bb34>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8060f387>] __kfree_skb+0x9/0x6f
 [<ffffffff8061204b>] skb_free_datagram+0xc/0x31
 [<ffffffff8064b528>] udp_recvmsg+0x1e7/0x26f
 [<ffffffff8060b509>] sock_common_recvmsg+0x30/0x45
 [<ffffffff80609acd>] sock_recvmsg+0xd5/0xed

The callchain is:
__kfree_skb => kfree_skbmem => kmem_cache_free(skbuff_head_cache, skb);

kmem_cache skbuff_head_cache's object size is just 256, so it shares the
kmem_cache with :0000256. Their order is 1, which means every slab consists
of 2 physical pages.

netperf UDP-U-4k is a UDP stream test. The client process keeps sending
4k-size packets to the server process, and the server process just receives
the packets one by one. If we start CPU_NUM clients and the same number of
servers, every client sends lots of packets within one sched slice, then the
scheduler schedules the server to receive many packets within one sched
slice, and then the client sends again. So there are many packets in the
queue. When the server receives the packets, it frees skbuff_head_cache
objects. When all of a slab's objects are free, the slab is released by
calling __free_pages. Such batch sending/receiving creates lots of slab free
activity.

The page allocator has an array at zone_pcp(zone, cpu)->pcp to keep a buffer
of order-0 pages, but skbuff_head_cache's order here is 1, so UDP-U-4k can't
benefit from that page buffer.

SLQB has no such issue, because:
1) SLQB has a percpu freelist. Free objects are put on that list first and
   can be picked up again quickly without taking a lock.
   A batch parameter controls when free objects are recollected; it is
   typically 1024.
2) SLQB's slab order is mostly 0, so although it sometimes calls
   alloc_pages/free_pages, it can benefit from the zone_pcp(zone, cpu)->pcp
   page buffer.

So SLUB needs to resolve the case where one process allocates a batch of
objects and another process frees them in a batch.

yanmin

^ permalink raw reply	[flat|nested] 42+ messages in thread
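(A quick way to double-check the two facts this argument rests on - the slab
order of the 256-byte cache and the order-0-only per-CPU page lists - is
sketched below. The sysfs name is the alias reported by slabinfo above, so
treat the exact paths as an assumption and adjust to what /sys/kernel/slab
actually contains on the test machine.)

	# order of the merged 256-byte cache that skbuff_head_cache falls into
	cat /sys/kernel/slab/:0000256/order
	# per-CPU page buffer (count/high/batch); it only holds order-0 pages
	grep -A 6 pagesets /proc/zoneinfo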
* Re: Mainline kernel OLTP performance update 2009-01-20 5:16 ` Zhang, Yanmin @ 2009-01-21 23:58 ` Christoph Lameter 2009-01-22 8:36 ` Zhang, Yanmin 0 siblings, 1 reply; 42+ messages in thread From: Christoph Lameter @ 2009-01-21 23:58 UTC (permalink / raw) To: Zhang, Yanmin Cc: Andi Kleen, Pekka Enberg, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty [-- Attachment #1: Type: TEXT/PLAIN, Size: 1708 bytes --] On Tue, 20 Jan 2009, Zhang, Yanmin wrote: > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache > with :0000256. Their order is 1 which means every slab consists of 2 physical pages. That order can be changed. Try specifying slub_max_order=0 on the kernel command line to force an order 0 alloc. The queues of the page allocator are of limited use due to their overhead. Order-1 allocations can actually be 5% faster than order-0. order-0 makes sense if pages are pushed rapidly to the page allocator and are then reissues elsewhere. If there is a linear consumption then the page allocator queues are just overhead. > Page allocator has an array at zone_pcp(zone, cpu)->pcp to keep a page buffer for page order 0. > But here skbuff_head_cache's order is 1, so UDP-U-4k couldn't benefit from the page buffer. That usually does not matter because of partial list avoiding page allocator actions. > SLQB has no such issue, because: > 1) SLQB has a percpu freelist. Free objects are put to the list firstly and can be picked up > later on quickly without lock. A batch parameter to control the free object recollection is mostly > 1024. > 2) SLQB slab order mostly is 0, so although sometimes it calls alloc_pages/free_pages, it can > benefit from zone_pcp(zone, cpu)->pcp page buffer. > > So SLUB need resolve such issues that one process allocates a batch of objects and another process > frees them batchly. SLUB has a percpu freelist but its bounded by the basic allocation unit. You can increase that by modifying the allocation order. Writing a 3 or 5 into the order value in /sys/kernel/slab/xxx/order would do the trick. ^ permalink raw reply [flat|nested] 42+ messages in thread
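(Concretely, the two knobs Christoph mentions look roughly like this; the
cache name below is only an example, substitute whichever cache is being
tuned.)

	# boot parameter: force every SLUB cache down to order-0 slabs
	slub_max_order=0

	# runtime: raise the allocation order of one cache to 3
	echo 3 > /sys/kernel/slab/kmalloc-4096/order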
* Re: Mainline kernel OLTP performance update 2009-01-21 23:58 ` Christoph Lameter @ 2009-01-22 8:36 ` Zhang, Yanmin 2009-01-22 9:15 ` Pekka Enberg 0 siblings, 1 reply; 42+ messages in thread From: Zhang, Yanmin @ 2009-01-22 8:36 UTC (permalink / raw) To: Christoph Lameter Cc: Andi Kleen, Pekka Enberg, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote: > On Tue, 20 Jan 2009, Zhang, Yanmin wrote: > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages. > > That order can be changed. Try specifying slub_max_order=0 on the kernel > command line to force an order 0 alloc. I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue. Both get_page_from_freelist and __free_pages_ok's cpu time are still very high. I checked my instrumentation in kernel and found it's caused by large object allocation/free whose size is more than PAGE_SIZE. Here its order is 1. The right free callchain is __kfree_skb => skb_release_all => skb_release_data. So this case isn't the issue that batch of allocation/free might erase partial page functionality. '#slaninfo -AD' couldn't show statistics of large object allocation/free. Can we add such info? That will be more helpful. In addition, I didn't find such issue wih TCP stream testing. > > The queues of the page allocator are of limited use due to their overhead. > Order-1 allocations can actually be 5% faster than order-0. order-0 makes > sense if pages are pushed rapidly to the page allocator and are then > reissues elsewhere. If there is a linear consumption then the page > allocator queues are just overhead. > > > Page allocator has an array at zone_pcp(zone, cpu)->pcp to keep a page buffer for page order 0. > > But here skbuff_head_cache's order is 1, so UDP-U-4k couldn't benefit from the page buffer. > > That usually does not matter because of partial list avoiding page > allocator actions. > > > SLQB has no such issue, because: > > 1) SLQB has a percpu freelist. Free objects are put to the list firstly and can be picked up > > later on quickly without lock. A batch parameter to control the free object recollection is mostly > > 1024. > > 2) SLQB slab order mostly is 0, so although sometimes it calls alloc_pages/free_pages, it can > > benefit from zone_pcp(zone, cpu)->pcp page buffer. > > > > So SLUB need resolve such issues that one process allocates a batch of objects and another process > > frees them batchly. > > SLUB has a percpu freelist but its bounded by the basic allocation unit. > You can increase that by modifying the allocation order. Writing a 3 or 5 > into the order value in /sys/kernel/slab/xxx/order would do the trick. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-22 8:36 ` Zhang, Yanmin @ 2009-01-22 9:15 ` Pekka Enberg 2009-01-22 9:28 ` Zhang, Yanmin 0 siblings, 1 reply; 42+ messages in thread From: Pekka Enberg @ 2009-01-22 9:15 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote: > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote: > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote: > > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages. > > > > That order can be changed. Try specifying slub_max_order=0 on the kernel > > command line to force an order 0 alloc. > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue. > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high. > > I checked my instrumentation in kernel and found it's caused by large object allocation/free > whose size is more than PAGE_SIZE. Here its order is 1. > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data. > > So this case isn't the issue that batch of allocation/free might erase partial page > functionality. So is this the kfree(skb->head) in skb_release_data() or the put_page() calls in the same function in a loop? If it's the former, with big enough size passed to __alloc_skb(), the networking code might be taking a hit from the SLUB page allocator pass-through. Pekka -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-22 9:15 ` Pekka Enberg @ 2009-01-22 9:28 ` Zhang, Yanmin 2009-01-22 9:47 ` Pekka Enberg 0 siblings, 1 reply; 42+ messages in thread From: Zhang, Yanmin @ 2009-01-22 9:28 UTC (permalink / raw) To: Pekka Enberg Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote: > On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote: > > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote: > > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote: > > > > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache > > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages. > > > > > > That order can be changed. Try specifying slub_max_order=0 on the kernel > > > command line to force an order 0 alloc. > > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue. > > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high. > > > > I checked my instrumentation in kernel and found it's caused by large object allocation/free > > whose size is more than PAGE_SIZE. Here its order is 1. > > > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data. > > > > So this case isn't the issue that batch of allocation/free might erase partial page > > functionality. > > So is this the kfree(skb->head) in skb_release_data() or the put_page() > calls in the same function in a loop? It's kfree(skb->head). > > If it's the former, with big enough size passed to __alloc_skb(), the > networking code might be taking a hit from the SLUB page allocator > pass-through. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-22 9:28 ` Zhang, Yanmin @ 2009-01-22 9:47 ` Pekka Enberg 2009-01-23 3:02 ` Zhang, Yanmin 0 siblings, 1 reply; 42+ messages in thread From: Pekka Enberg @ 2009-01-22 9:47 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Thu, 2009-01-22 at 17:28 +0800, Zhang, Yanmin wrote: > On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote: > > On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote: > > > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote: > > > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote: > > > > > > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache > > > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages. > > > > > > > > That order can be changed. Try specifying slub_max_order=0 on the kernel > > > > command line to force an order 0 alloc. > > > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue. > > > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high. > > > > > > I checked my instrumentation in kernel and found it's caused by large object allocation/free > > > whose size is more than PAGE_SIZE. Here its order is 1. > > > > > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data. > > > > > > So this case isn't the issue that batch of allocation/free might erase partial page > > > functionality. > > > > So is this the kfree(skb->head) in skb_release_data() or the put_page() > > calls in the same function in a loop? > It's kfree(skb->head). > > > > > If it's the former, with big enough size passed to __alloc_skb(), the > > networking code might be taking a hit from the SLUB page allocator > > pass-through. Do we know what kind of size is being passed to __alloc_skb() in this case? Maybe we want to do something like this. Pekka SLUB: revert page allocator pass-through This is a revert of commit aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB: direct pass through of page size or higher kmalloc requests"). --- diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index 2f5c16b..3bd3662 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -124,7 +124,7 @@ struct kmem_cache { * We keep the general caches in an array of slab caches that are used for * 2^x bytes of allocations. */ -extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1]; +extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1]; /* * Sorry that the following has to be that ugly but some versions of GCC @@ -135,6 +135,9 @@ static __always_inline int kmalloc_index(size_t size) if (!size) return 0; + if (size > KMALLOC_MAX_SIZE) + return -1; + if (size <= KMALLOC_MIN_SIZE) return KMALLOC_SHIFT_LOW; @@ -154,10 +157,6 @@ static __always_inline int kmalloc_index(size_t size) if (size <= 1024) return 10; if (size <= 2 * 1024) return 11; if (size <= 4 * 1024) return 12; -/* - * The following is only needed to support architectures with a larger page - * size than 4k. 
- */ if (size <= 8 * 1024) return 13; if (size <= 16 * 1024) return 14; if (size <= 32 * 1024) return 15; @@ -167,6 +166,10 @@ static __always_inline int kmalloc_index(size_t size) if (size <= 512 * 1024) return 19; if (size <= 1024 * 1024) return 20; if (size <= 2 * 1024 * 1024) return 21; + if (size <= 4 * 1024 * 1024) return 22; + if (size <= 8 * 1024 * 1024) return 23; + if (size <= 16 * 1024 * 1024) return 24; + if (size <= 32 * 1024 * 1024) return 25; return -1; /* @@ -191,6 +194,19 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size) if (index == 0) return NULL; + /* + * This function only gets expanded if __builtin_constant_p(size), so + * testing it here shouldn't be needed. But some versions of gcc need + * help. + */ + if (__builtin_constant_p(size) && index < 0) { + /* + * Generate a link failure. Would be great if we could + * do something to stop the compile here. + */ + extern void __kmalloc_size_too_large(void); + __kmalloc_size_too_large(); + } return &kmalloc_caches[index]; } @@ -204,17 +220,9 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size) void *kmem_cache_alloc(struct kmem_cache *, gfp_t); void *__kmalloc(size_t size, gfp_t flags); -static __always_inline void *kmalloc_large(size_t size, gfp_t flags) -{ - return (void *)__get_free_pages(flags | __GFP_COMP, get_order(size)); -} - static __always_inline void *kmalloc(size_t size, gfp_t flags) { if (__builtin_constant_p(size)) { - if (size > PAGE_SIZE) - return kmalloc_large(size, flags); - if (!(flags & SLUB_DMA)) { struct kmem_cache *s = kmalloc_slab(size); diff --git a/mm/slub.c b/mm/slub.c index 6392ae5..8fad23f 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2475,7 +2475,7 @@ EXPORT_SYMBOL(kmem_cache_destroy); * Kmalloc subsystem *******************************************************************/ -struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1] __cacheline_aligned; +struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1] __cacheline_aligned; EXPORT_SYMBOL(kmalloc_caches); static int __init setup_slub_min_order(char *str) @@ -2537,7 +2537,7 @@ panic: } #ifdef CONFIG_ZONE_DMA -static struct kmem_cache *kmalloc_caches_dma[PAGE_SHIFT + 1]; +static struct kmem_cache *kmalloc_caches_dma[KMALLOC_SHIFT_HIGH + 1]; static void sysfs_add_func(struct work_struct *w) { @@ -2643,8 +2643,12 @@ static struct kmem_cache *get_slab(size_t size, gfp_t flags) return ZERO_SIZE_PTR; index = size_index[(size - 1) / 8]; - } else + } else { + if (size > KMALLOC_MAX_SIZE) + return NULL; + index = fls(size - 1); + } #ifdef CONFIG_ZONE_DMA if (unlikely((flags & SLUB_DMA))) @@ -2658,9 +2662,6 @@ void *__kmalloc(size_t size, gfp_t flags) { struct kmem_cache *s; - if (unlikely(size > PAGE_SIZE)) - return kmalloc_large(size, flags); - s = get_slab(size, flags); if (unlikely(ZERO_OR_NULL_PTR(s))) @@ -2670,25 +2671,11 @@ void *__kmalloc(size_t size, gfp_t flags) } EXPORT_SYMBOL(__kmalloc); -static void *kmalloc_large_node(size_t size, gfp_t flags, int node) -{ - struct page *page = alloc_pages_node(node, flags | __GFP_COMP, - get_order(size)); - - if (page) - return page_address(page); - else - return NULL; -} - #ifdef CONFIG_NUMA void *__kmalloc_node(size_t size, gfp_t flags, int node) { struct kmem_cache *s; - if (unlikely(size > PAGE_SIZE)) - return kmalloc_large_node(size, flags, node); - s = get_slab(size, flags); if (unlikely(ZERO_OR_NULL_PTR(s))) @@ -2746,11 +2733,8 @@ void kfree(const void *x) return; page = virt_to_head_page(x); - if (unlikely(!PageSlab(page))) { - BUG_ON(!PageCompound(page)); - 
put_page(page); + if (unlikely(WARN_ON(!PageSlab(page)))) /* XXX */ return; - } slab_free(page->slab, page, object, _RET_IP_); } EXPORT_SYMBOL(kfree); @@ -2985,7 +2969,7 @@ void __init kmem_cache_init(void) caches++; } - for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++) { + for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) { create_kmalloc_cache(&kmalloc_caches[i], "kmalloc", 1 << i, GFP_KERNEL); caches++; @@ -3022,7 +3006,7 @@ void __init kmem_cache_init(void) slab_state = UP; /* Provide the correct kmalloc names now that the caches are up */ - for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++) + for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) kmalloc_caches[i]. name = kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i); @@ -3222,9 +3206,6 @@ void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller) { struct kmem_cache *s; - if (unlikely(size > PAGE_SIZE)) - return kmalloc_large(size, gfpflags); - s = get_slab(size, gfpflags); if (unlikely(ZERO_OR_NULL_PTR(s))) @@ -3238,9 +3219,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags, { struct kmem_cache *s; - if (unlikely(size > PAGE_SIZE)) - return kmalloc_large_node(size, gfpflags, node); - s = get_slab(size, gfpflags); if (unlikely(ZERO_OR_NULL_PTR(s))) ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update
  2009-01-22  9:47 ` Pekka Enberg
@ 2009-01-23  3:02   ` Zhang, Yanmin
  2009-01-23  6:52     ` Pekka Enberg
  2009-01-23  8:33     ` Nick Piggin
  0 siblings, 2 replies; 42+ messages in thread

From: Zhang, Yanmin @ 2009-01-23  3:02 UTC (permalink / raw)
To: Pekka Enberg
Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
    Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel,
    sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri,
    douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason,
    srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty

On Thu, 2009-01-22 at 11:47 +0200, Pekka Enberg wrote:
> On Thu, 2009-01-22 at 17:28 +0800, Zhang, Yanmin wrote:
> > On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote:
> > > On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote:
> > > > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote:
> > > > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote:
> > > > >
> > > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
> > > > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages.
> > > > >
> > > > > That order can be changed. Try specifying slub_max_order=0 on the kernel
> > > > > command line to force an order 0 alloc.
> > > > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue.
> > > > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high.
> > > >
> > > > I checked my instrumentation in kernel and found it's caused by large object allocation/free
> > > > whose size is more than PAGE_SIZE. Here its order is 1.
> > > >
> > > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data.
> > > >
> > > > So this case isn't the issue that batch of allocation/free might erase partial page
> > > > functionality.
> > >
> > > So is this the kfree(skb->head) in skb_release_data() or the put_page()
> > > calls in the same function in a loop?
> > It's kfree(skb->head).
> >
> > > If it's the former, with big enough size passed to __alloc_skb(), the
> > > networking code might be taking a hit from the SLUB page allocator
> > > pass-through.
>
> Do we know what kind of size is being passed to __alloc_skb() in this
> case?
In __alloc_skb, the original parameter size=4155; SKB_DATA_ALIGN(size)=4224
and sizeof(struct skb_shared_info)=472, so __kmalloc_track_caller's size
parameter is 4696.

> Maybe we want to do something like this.
>
> Pekka
>
> SLUB: revert page allocator pass-through
This patch almost fixes the netperf UDP-U-4k issue.

#slabinfo -AD
Name          Objects     Alloc      Free    %Fast
:0000256         1658  70350463  70348946    99 99
kmalloc-8192       31  70322309  70322293    99 99
:0000168         2592    143154    140684    93 28
:0004096         1456     91072     89644    99 96
:0000192         3402     63838     60491    89 11
:0000064         6177     49635     43743    98 77

So kmalloc-8192 appears. Without the patch, kmalloc-8192 doesn't show up at
all. kmalloc-8192's default order on my 8-core Stoakley is 2.

1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better
   than SLQB's;
2) If I start 1 client and 1 server and bind them to different physical CPUs,
   SLQB's result is about 10% better than SLUB's.

I don't know why there is still a 10% difference in case 2). Maybe cache
misses cause it?

> This is a revert of commit aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB:
> direct pass through of page size or higher kmalloc requests").
> --- > > diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h > index 2f5c16b..3bd3662 100644 > --- a/include/linux/slub_def.h > +++ b/include/linux/slub_def.h -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 3:02 ` Zhang, Yanmin @ 2009-01-23 6:52 ` Pekka Enberg 2009-01-23 8:06 ` Pekka Enberg 2009-01-23 8:33 ` Nick Piggin 1 sibling, 1 reply; 42+ messages in thread From: Pekka Enberg @ 2009-01-23 6:52 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, mingo Zhang, Yanmin wrote: >>>> If it's the former, with big enough size passed to __alloc_skb(), the >>>> networking code might be taking a hit from the SLUB page allocator >>>> pass-through. >> Do we know what kind of size is being passed to __alloc_skb() in this >> case? > In function __alloc_skb, original parameter size=4155, > SKB_DATA_ALIGN(size)=4224, sizeof(struct skb_shared_info)=472, so > __kmalloc_track_caller's parameter size=4696. OK, so all allocations go straight to the page allocator. > >> Maybe we want to do something like this. >> >> SLUB: revert page allocator pass-through > This patch amost fixes the netperf UDP-U-4k issue. > > #slabinfo -AD > Name Objects Alloc Free %Fast > :0000256 1658 70350463 70348946 99 99 > kmalloc-8192 31 70322309 70322293 99 99 > :0000168 2592 143154 140684 93 28 > :0004096 1456 91072 89644 99 96 > :0000192 3402 63838 60491 89 11 > :0000064 6177 49635 43743 98 77 > > So kmalloc-8192 appears. Without the patch, kmalloc-8192 hides. > kmalloc-8192's default order on my 8-core stoakley is 2. Christoph, should we merge my patch as-is or do you have an alternative fix in mind? We could, of course, increase kmalloc() caches one level up to 8192 or higher. > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's; > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result > is about 10% better than SLUB's. > > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it? Maybe we can use the perfstat and/or kerneltop utilities of the new perf counters patch to diagnose this: http://lkml.org/lkml/2009/1/21/273 And do oprofile, of course. Thanks! Pekka -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 6:52 ` Pekka Enberg @ 2009-01-23 8:06 ` Pekka Enberg 2009-01-23 8:30 ` Zhang, Yanmin 0 siblings, 1 reply; 42+ messages in thread From: Pekka Enberg @ 2009-01-23 8:06 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, mingo On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote: > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's; > > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result > > is about 10% better than SLUB's. > > > > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it? > > Maybe we can use the perfstat and/or kerneltop utilities of the new perf > counters patch to diagnose this: > > http://lkml.org/lkml/2009/1/21/273 > > And do oprofile, of course. Thanks! I assume binding the client and the server to different physical CPUs also means that the SKB is always allocated on CPU 1 and freed on CPU 2? If so, we will be taking the __slab_free() slow path all the time on kfree() which will cause cache effects, no doubt. But there's another potential performance hit we're taking because the object size of the cache is so big. As allocations from CPU 1 keep coming in, we need to allocate new pages and unfreeze the per-cpu page. That in turn causes __slab_free() to be more eager to discard the slab (see the PageSlubFrozen check there). So before going for cache profiling, I'd really like to see an oprofile report. I suspect we're still going to see much more page allocator activity there than with SLAB or SLQB which is why we're still behaving so badly here. Pekka ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 8:06 ` Pekka Enberg @ 2009-01-23 8:30 ` Zhang, Yanmin 2009-01-23 8:40 ` Pekka Enberg 2009-01-23 9:46 ` Pekka Enberg 0 siblings, 2 replies; 42+ messages in thread From: Zhang, Yanmin @ 2009-01-23 8:30 UTC (permalink / raw) To: Pekka Enberg Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, mingo On Fri, 2009-01-23 at 10:06 +0200, Pekka Enberg wrote: > On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote: > > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's; > > > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result > > > is about 10% better than SLUB's. > > > > > > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it? > > > > Maybe we can use the perfstat and/or kerneltop utilities of the new perf > > counters patch to diagnose this: > > > > http://lkml.org/lkml/2009/1/21/273 > > > > And do oprofile, of course. Thanks! > > I assume binding the client and the server to different physical CPUs > also means that the SKB is always allocated on CPU 1 and freed on CPU > 2? If so, we will be taking the __slab_free() slow path all the time on > kfree() which will cause cache effects, no doubt. > > But there's another potential performance hit we're taking because the > object size of the cache is so big. As allocations from CPU 1 keep > coming in, we need to allocate new pages and unfreeze the per-cpu page. > That in turn causes __slab_free() to be more eager to discard the slab > (see the PageSlubFrozen check there). > > So before going for cache profiling, I'd really like to see an oprofile > report. I suspect we're still going to see much more page allocator > activity Theoretically, it should, but oprofile doesn't show that. > there than with SLAB or SLQB which is why we're still behaving > so badly here. 
oprofile output with 2.6.29-rc2-slubrevertlarge:
CPU: Core 2, speed 2666.71 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        app name  symbol name
132779   32.9951  vmlinux   copy_user_generic_string
25334     6.2954  vmlinux   schedule
21032     5.2264  vmlinux   tg_shares_up
17175     4.2679  vmlinux   __skb_recv_datagram
9091      2.2591  vmlinux   sock_def_readable
8934      2.2201  vmlinux   mwait_idle
8796      2.1858  vmlinux   try_to_wake_up
6940      1.7246  vmlinux   __slab_free

#slabinfo -AD
Name          Objects    Alloc     Free    %Fast
:0000256         1643  5215544  5214027    94  0
kmalloc-8192       28  5189576  5189560     0  0
:0000168         2631   141466   138976    92 28
:0004096         1452    88697    87269    99 96
:0000192         3402    63050    59732    89 11
:0000064         6265    46611    40721    98 82
:0000128         1895    30429    28654    93 32

oprofile output with kernel 2.6.29-rc2-slqb0121:
CPU: Core 2, speed 2666.76 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        image name  app name  symbol name
114793   28.7163  vmlinux     vmlinux   copy_user_generic_string
27880     6.9744  vmlinux     vmlinux   tg_shares_up
22218     5.5580  vmlinux     vmlinux   schedule
12238     3.0614  vmlinux     vmlinux   mwait_idle
7395      1.8499  vmlinux     vmlinux   task_rq_lock
7348      1.8382  vmlinux     vmlinux   sock_def_readable
7202      1.8016  vmlinux     vmlinux   sched_clock_cpu
6981      1.7464  vmlinux     vmlinux   __skb_recv_datagram
6566      1.6425  vmlinux     vmlinux   udp_queue_rcv_skb

^ permalink raw reply	[flat|nested] 42+ messages in thread
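(For anyone wanting to collect comparable profiles, a rough sketch using the
legacy opcontrol interface follows; the vmlinux path is a placeholder and the
default CPU_CLK_UNHALTED event is assumed, so adjust for the machine at
hand.)

	opcontrol --init
	opcontrol --vmlinux=/path/to/vmlinux --start
	# ... run the netperf UDP-U-4k workload ...
	opcontrol --dump
	opreport --symbols | head -20
	opcontrol --stop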
* Re: Mainline kernel OLTP performance update 2009-01-23 8:30 ` Zhang, Yanmin @ 2009-01-23 8:40 ` Pekka Enberg 2009-01-23 9:46 ` Pekka Enberg 1 sibling, 0 replies; 42+ messages in thread From: Pekka Enberg @ 2009-01-23 8:40 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, mingo On Fri, 2009-01-23 at 16:30 +0800, Zhang, Yanmin wrote: > > I assume binding the client and the server to different physical CPUs > > also means that the SKB is always allocated on CPU 1 and freed on CPU > > 2? If so, we will be taking the __slab_free() slow path all the time on > > kfree() which will cause cache effects, no doubt. > > > > But there's another potential performance hit we're taking because the > > object size of the cache is so big. As allocations from CPU 1 keep > > coming in, we need to allocate new pages and unfreeze the per-cpu page. > > That in turn causes __slab_free() to be more eager to discard the slab > > (see the PageSlubFrozen check there). > > > > So before going for cache profiling, I'd really like to see an oprofile > > report. I suspect we're still going to see much more page allocator > > activity > Theoretically, it should, but oprofile doesn't show that. > > > there than with SLAB or SLQB which is why we're still behaving > > so badly here. > > oprofile output with 2.6.29-rc2-slubrevertlarge: > CPU: Core 2, speed 2666.71 MHz (estimated) > Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 > samples % app name symbol name > 132779 32.9951 vmlinux copy_user_generic_string > 25334 6.2954 vmlinux schedule > 21032 5.2264 vmlinux tg_shares_up > 17175 4.2679 vmlinux __skb_recv_datagram > 9091 2.2591 vmlinux sock_def_readable > 8934 2.2201 vmlinux mwait_idle > 8796 2.1858 vmlinux try_to_wake_up > 6940 1.7246 vmlinux __slab_free > > #slaninfo -AD > Name Objects Alloc Free %Fast > :0000256 1643 5215544 5214027 94 0 > kmalloc-8192 28 5189576 5189560 0 0 ^^^^^^ This looks bit funny. Hmm. > :0000168 2631 141466 138976 92 28 > :0004096 1452 88697 87269 99 96 > :0000192 3402 63050 59732 89 11 > :0000064 6265 46611 40721 98 82 > :0000128 1895 30429 28654 93 32 ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 8:30 ` Zhang, Yanmin 2009-01-23 8:40 ` Pekka Enberg @ 2009-01-23 9:46 ` Pekka Enberg 2009-01-23 15:22 ` Christoph Lameter 1 sibling, 1 reply; 42+ messages in thread From: Pekka Enberg @ 2009-01-23 9:46 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, mingo On Fri, 2009-01-23 at 16:30 +0800, Zhang, Yanmin wrote: > On Fri, 2009-01-23 at 10:06 +0200, Pekka Enberg wrote: > > On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote: > > > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's; > > > > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result > > > > is about 10% better than SLUB's. > > > > > > > > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it? > > > > > > Maybe we can use the perfstat and/or kerneltop utilities of the new perf > > > counters patch to diagnose this: > > > > > > http://lkml.org/lkml/2009/1/21/273 > > > > > > And do oprofile, of course. Thanks! > > > > I assume binding the client and the server to different physical CPUs > > also means that the SKB is always allocated on CPU 1 and freed on CPU > > 2? If so, we will be taking the __slab_free() slow path all the time on > > kfree() which will cause cache effects, no doubt. > > > > But there's another potential performance hit we're taking because the > > object size of the cache is so big. As allocations from CPU 1 keep > > coming in, we need to allocate new pages and unfreeze the per-cpu page. > > That in turn causes __slab_free() to be more eager to discard the slab > > (see the PageSlubFrozen check there). > > > > So before going for cache profiling, I'd really like to see an oprofile > > report. I suspect we're still going to see much more page allocator > > activity > Theoretically, it should, but oprofile doesn't show that. That's bit surprising, actually. FWIW, I've included a patch for empty slab lists. But it's probably not going to help here. > > there than with SLAB or SLQB which is why we're still behaving > > so badly here. > > oprofile output with 2.6.29-rc2-slubrevertlarge: > CPU: Core 2, speed 2666.71 MHz (estimated) > Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 > samples % app name symbol name > 132779 32.9951 vmlinux copy_user_generic_string > 25334 6.2954 vmlinux schedule > 21032 5.2264 vmlinux tg_shares_up > 17175 4.2679 vmlinux __skb_recv_datagram > 9091 2.2591 vmlinux sock_def_readable > 8934 2.2201 vmlinux mwait_idle > 8796 2.1858 vmlinux try_to_wake_up > 6940 1.7246 vmlinux __slab_free > > #slaninfo -AD > Name Objects Alloc Free %Fast > :0000256 1643 5215544 5214027 94 0 > kmalloc-8192 28 5189576 5189560 0 0 > :0000168 2631 141466 138976 92 28 > :0004096 1452 88697 87269 99 96 > :0000192 3402 63050 59732 89 11 > :0000064 6265 46611 40721 98 82 > :0000128 1895 30429 28654 93 32 Looking at __slab_free(), unless page->inuse is constantly zero and we discard the slab, it really is just cache effects (10% sounds like a lot, though!). 
AFAICT, the only way to optimize that is with Christoph's unfinished pointer freelists patches or with a remote free list like in SLQB. Pekka diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index 3bd3662..41a4c1a 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -48,6 +48,9 @@ struct kmem_cache_node { unsigned long nr_partial; unsigned long min_partial; struct list_head partial; + unsigned long nr_empty; + unsigned long max_empty; + struct list_head empty; #ifdef CONFIG_SLUB_DEBUG atomic_long_t nr_slabs; atomic_long_t total_objects; diff --git a/mm/slub.c b/mm/slub.c index 8fad23f..5a12597 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -134,6 +134,11 @@ */ #define MAX_PARTIAL 10 +/* + * Maximum number of empty slabs. + */ +#define MAX_EMPTY 1 + #define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \ SLAB_POISON | SLAB_STORE_USER) @@ -1205,6 +1210,24 @@ static void discard_slab(struct kmem_cache *s, struct page *page) free_slab(s, page); } +static void discard_or_cache_slab(struct kmem_cache *s, struct page *page) +{ + struct kmem_cache_node *n; + int node; + + node = page_to_nid(page); + n = get_node(s, node); + + dec_slabs_node(s, node, page->objects); + + if (likely(n->nr_empty >= n->max_empty)) { + free_slab(s, page); + } else { + n->nr_empty++; + list_add(&page->lru, &n->partial); + } +} + /* * Per slab locking using the pagelock */ @@ -1252,7 +1275,7 @@ static void remove_partial(struct kmem_cache *s, struct page *page) } /* - * Lock slab and remove from the partial list. + * Lock slab and remove from the partial or empty list. * * Must hold list_lock. */ @@ -1261,7 +1284,6 @@ static inline int lock_and_freeze_slab(struct kmem_cache_node *n, { if (slab_trylock(page)) { list_del(&page->lru); - n->nr_partial--; __SetPageSlubFrozen(page); return 1; } @@ -1271,7 +1293,7 @@ static inline int lock_and_freeze_slab(struct kmem_cache_node *n, /* * Try to allocate a partial slab from a specific node. */ -static struct page *get_partial_node(struct kmem_cache_node *n) +static struct page *get_partial_or_empty_node(struct kmem_cache_node *n) { struct page *page; @@ -1281,13 +1303,22 @@ static struct page *get_partial_node(struct kmem_cache_node *n) * partial slab and there is none available then get_partials() * will return NULL. */ - if (!n || !n->nr_partial) + if (!n || (!n->nr_partial && !n->nr_empty)) return NULL; spin_lock(&n->list_lock); + list_for_each_entry(page, &n->partial, lru) - if (lock_and_freeze_slab(n, page)) + if (lock_and_freeze_slab(n, page)) { + n->nr_partial--; + goto out; + } + + list_for_each_entry(page, &n->empty, lru) + if (lock_and_freeze_slab(n, page)) { + n->nr_empty--; goto out; + } page = NULL; out: spin_unlock(&n->list_lock); @@ -1297,7 +1328,7 @@ out: /* * Get a page from somewhere. Search in increasing NUMA distances. */ -static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags) +static struct page *get_any_partial_or_empty(struct kmem_cache *s, gfp_t flags) { #ifdef CONFIG_NUMA struct zonelist *zonelist; @@ -1336,7 +1367,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags) if (n && cpuset_zone_allowed_hardwall(zone, flags) && n->nr_partial > n->min_partial) { - page = get_partial_node(n); + page = get_partial_or_empty_node(n); if (page) return page; } @@ -1346,18 +1377,19 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags) } /* - * Get a partial page, lock it and return it. + * Get a partial or empty page, lock it and return it. 
*/ -static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node) +static struct page * +get_partial_or_empty(struct kmem_cache *s, gfp_t flags, int node) { struct page *page; int searchnode = (node == -1) ? numa_node_id() : node; - page = get_partial_node(get_node(s, searchnode)); + page = get_partial_or_empty_node(get_node(s, searchnode)); if (page || (flags & __GFP_THISNODE)) return page; - return get_any_partial(s, flags); + return get_any_partial_or_empty(s, flags); } /* @@ -1403,7 +1435,7 @@ static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail) } else { slab_unlock(page); stat(get_cpu_slab(s, raw_smp_processor_id()), FREE_SLAB); - discard_slab(s, page); + discard_or_cache_slab(s, page); } } } @@ -1542,7 +1574,7 @@ another_slab: deactivate_slab(s, c); new_slab: - new = get_partial(s, gfpflags, node); + new = get_partial_or_empty(s, gfpflags, node); if (new) { c->page = new; stat(c, ALLOC_FROM_PARTIAL); @@ -1693,7 +1725,7 @@ slab_empty: } slab_unlock(page); stat(c, FREE_SLAB); - discard_slab(s, page); + discard_or_cache_slab(s, page); return; debug: @@ -1927,6 +1959,8 @@ static void init_kmem_cache_cpu(struct kmem_cache *s, static void init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s) { + spin_lock_init(&n->list_lock); + n->nr_partial = 0; /* @@ -1939,8 +1973,18 @@ init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s) else if (n->min_partial > MAX_PARTIAL) n->min_partial = MAX_PARTIAL; - spin_lock_init(&n->list_lock); INIT_LIST_HEAD(&n->partial); + + n->nr_empty = 0; + /* + * XXX: This needs to take object size into account. We don't need + * empty slabs for caches which will have plenty of partial slabs + * available. Only caches that have either full or empty slabs need + * this kind of optimization. + */ + n->max_empty = MAX_EMPTY; + INIT_LIST_HEAD(&n->empty); + #ifdef CONFIG_SLUB_DEBUG atomic_long_set(&n->nr_slabs, 0); atomic_long_set(&n->total_objects, 0); @@ -2427,6 +2471,32 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) spin_unlock_irqrestore(&n->list_lock, flags); } +static void free_empty_slabs(struct kmem_cache *s) +{ + int node; + + for_each_node_state(node, N_NORMAL_MEMORY) { + struct kmem_cache_node *n; + struct page *page, *t; + unsigned long flags; + + n = get_node(s, node); + + if (!n->nr_empty) + continue; + + spin_lock_irqsave(&n->list_lock, flags); + + list_for_each_entry_safe(page, t, &n->empty, lru) { + list_del(&page->lru); + n->nr_empty--; + + free_slab(s, page); + } + spin_unlock_irqrestore(&n->list_lock, flags); + } +} + /* * Release all resources used by a slab cache. */ @@ -2436,6 +2506,8 @@ static inline int kmem_cache_close(struct kmem_cache *s) flush_all(s); + free_empty_slabs(s); + /* Attempt to free all objects */ free_kmem_cache_cpus(s); for_each_node_state(node, N_NORMAL_MEMORY) { @@ -2765,6 +2837,7 @@ int kmem_cache_shrink(struct kmem_cache *s) return -ENOMEM; flush_all(s); + free_empty_slabs(s); for_each_node_state(node, N_NORMAL_MEMORY) { n = get_node(s, node); ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 9:46 ` Pekka Enberg @ 2009-01-23 15:22 ` Christoph Lameter 2009-01-23 15:31 ` Pekka Enberg 2009-01-24 2:55 ` Zhang, Yanmin 0 siblings, 2 replies; 42+ messages in thread From: Christoph Lameter @ 2009-01-23 15:22 UTC (permalink / raw) To: Pekka Enberg Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Fri, 23 Jan 2009, Pekka Enberg wrote: > Looking at __slab_free(), unless page->inuse is constantly zero and we > discard the slab, it really is just cache effects (10% sounds like a > lot, though!). AFAICT, the only way to optimize that is with Christoph's > unfinished pointer freelists patches or with a remote free list like in > SLQB. No there is another way. Increase the allocator order to 3 for the kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the larger chunks of data gotten from the page allocator. That will allow slub to do fast allocs. ^ permalink raw reply [flat|nested] 42+ messages in thread
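The arithmetic behind the order-3 suggestion is easy to check with a stand-alone snippet. This is only an illustration, not kernel code; it assumes 4 KiB pages and ignores per-slab metadata:

#include <stdio.h>

#define PAGE_SIZE 4096UL	/* assumption: 4 KiB pages, as on the test machines */

int main(void)
{
	unsigned long size = 8192;	/* kmalloc-8192 object size */
	int order;

	for (order = 1; order <= 5; order++) {
		unsigned long slab_bytes = PAGE_SIZE << order;
		printf("order %d: slab %6lu bytes -> %2lu objects per slab\n",
		       order, slab_bytes, slab_bytes / size);
	}
	return 0;
}

At the default order 2 a kmalloc-8192 slab holds only two objects, so every third allocation can mean another trip to the page allocator; at order 3 each slab serves four allocations, and at order 5 (tried later in the thread) sixteen.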
* Re: Mainline kernel OLTP performance update 2009-01-23 15:22 ` Christoph Lameter @ 2009-01-23 15:31 ` Pekka Enberg 2009-01-23 15:55 ` Christoph Lameter 2009-01-24 2:55 ` Zhang, Yanmin 1 sibling, 1 reply; 42+ messages in thread From: Pekka Enberg @ 2009-01-23 15:31 UTC (permalink / raw) To: Christoph Lameter Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote: > On Fri, 23 Jan 2009, Pekka Enberg wrote: > > > Looking at __slab_free(), unless page->inuse is constantly zero and we > > discard the slab, it really is just cache effects (10% sounds like a > > lot, though!). AFAICT, the only way to optimize that is with Christoph's > > unfinished pointer freelists patches or with a remote free list like in > > SLQB. > > No there is another way. Increase the allocator order to 3 for the > kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the > larger chunks of data gotten from the page allocator. That will allow slub > to do fast allocs. I wonder why that doesn't happen already, actually. The slub_max_order know is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default and obviously order 3 should be as good fit as order 2 so 'fraction' can't be too high either. Hmm. Pekka ^ permalink raw reply [flat|nested] 42+ messages in thread
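Pekka's puzzlement can be reproduced outside the kernel. The sketch below is a simplified user-space rendering of the 2.6.29-era order heuristic as I read it, so treat the details as approximate rather than a copy of mm/slub.c (assumptions: 4 KiB pages, slub_min_order=0, slub_max_order=3, no per-object metadata, MAX_OBJS_PER_PAGE clamp omitted):

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define SLUB_MAX_ORDER	3		/* PAGE_ALLOC_COSTLY_ORDER default */

static int fls_ul(unsigned long x)	/* highest set bit, 1-based */
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* accept the smallest order whose leftover space is <= slab_size/fraction */
static int slab_order(int size, int min_objects, int max_order, int fraction)
{
	int order = fls_ul((unsigned long)min_objects * size - 1) - PAGE_SHIFT;

	if (order < 0)
		order = 0;
	for (; order <= max_order; order++) {
		unsigned long slab_size = PAGE_SIZE << order;

		if (slab_size < (unsigned long)min_objects * size)
			continue;
		if (slab_size % size <= slab_size / fraction)
			break;
	}
	return order;			/* > max_order means "no fit" */
}

static int calculate_order(int size, int nr_cpu_ids)
{
	int min_objects = 4 * (fls_ul(nr_cpu_ids) + 1);
	int fraction, order;

	while (min_objects > 1) {
		for (fraction = 16; fraction >= 4; fraction /= 2) {
			order = slab_order(size, min_objects,
					   SLUB_MAX_ORDER, fraction);
			if (order <= SLUB_MAX_ORDER)
				return order;
		}
		min_objects /= 2;	/* the step Yanmin's patch below refines */
	}
	return slab_order(size, 1, SLUB_MAX_ORDER, 1);
}

int main(void)
{
	/* 2 sockets x 4 cores, as in the reported runs */
	printf("kmalloc-8192 -> order %d\n", calculate_order(8192, 8));
	printf("4096-byte cache (sgpool-128 sized) -> order %d\n",
	       calculate_order(4096, 8));
	return 0;
}

Run with 8 CPUs this prints order 2 for the 8192-byte cache and order 3 for the 4096-byte one, matching the slabinfo table Yanmin posts further down: min_objects starts at 4 * (fls(8) + 1) = 20, and halving it steps 20 -> 10 -> 5 -> 2, skipping 4, which is the largest object count that still fits in an order-3 slab.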
* Re: Mainline kernel OLTP performance update 2009-01-23 15:31 ` Pekka Enberg @ 2009-01-23 15:55 ` Christoph Lameter 2009-01-23 16:01 ` Pekka Enberg 0 siblings, 1 reply; 42+ messages in thread From: Christoph Lameter @ 2009-01-23 15:55 UTC (permalink / raw) To: Pekka Enberg Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Fri, 23 Jan 2009, Pekka Enberg wrote: > I wonder why that doesn't happen already, actually. The slub_max_order > know is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default and obviously > order 3 should be as good fit as order 2 so 'fraction' can't be too high > either. Hmm. The kmalloc-8192 is new. Look at slabinfo output to see what allocation orders are chosen. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 15:55 ` Christoph Lameter @ 2009-01-23 16:01 ` Pekka Enberg 0 siblings, 0 replies; 42+ messages in thread From: Pekka Enberg @ 2009-01-23 16:01 UTC (permalink / raw) To: Christoph Lameter Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Fri, 23 Jan 2009, Pekka Enberg wrote: > > I wonder why that doesn't happen already, actually. The slub_max_order > > know is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default and obviously > > order 3 should be as good fit as order 2 so 'fraction' can't be too high > > either. Hmm. On Fri, 2009-01-23 at 10:55 -0500, Christoph Lameter wrote: > The kmalloc-8192 is new. Look at slabinfo output to see what allocation > orders are chosen. Yes, yes, I know the new cache a result of my patch. I'm just saying that AFAICT, the existing logic should set the order to 3 but IIRC Yanmin said it's 2. Pekka ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 15:22 ` Christoph Lameter 2009-01-23 15:31 ` Pekka Enberg @ 2009-01-24 2:55 ` Zhang, Yanmin 2009-01-24 7:36 ` Pekka Enberg 2009-01-26 17:36 ` Christoph Lameter 1 sibling, 2 replies; 42+ messages in thread From: Zhang, Yanmin @ 2009-01-24 2:55 UTC (permalink / raw) To: Christoph Lameter Cc: Pekka Enberg, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote: > On Fri, 23 Jan 2009, Pekka Enberg wrote: > > > Looking at __slab_free(), unless page->inuse is constantly zero and we > > discard the slab, it really is just cache effects (10% sounds like a > > lot, though!). AFAICT, the only way to optimize that is with Christoph's > > unfinished pointer freelists patches or with a remote free list like in > > SLQB. > > No there is another way. Increase the allocator order to 3 for the > kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the > larger chunks of data gotten from the page allocator. That will allow slub > to do fast allocs. After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k) difference between SLUB and SLQB becomes 1% which can be considered as fluctuation. But when trying to increased it to 4, I got: [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order -bash: echo: write error: Invalid argument Comparing with SLQB, it seems SLUB needs too many investigation/manual finer-tuning against specific benchmarks. One hard is to tune page order number. Although SLQB also has many tuning options, I almost doesn't tune it manually, just run benchmark and collect results to compare. Does that mean the scalability of SLQB is better? ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-24 2:55 ` Zhang, Yanmin @ 2009-01-24 7:36 ` Pekka Enberg 2009-02-12 5:22 ` Zhang, Yanmin 2009-01-26 17:36 ` Christoph Lameter 1 sibling, 1 reply; 42+ messages in thread From: Pekka Enberg @ 2009-01-24 7:36 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote: >> No there is another way. Increase the allocator order to 3 for the >> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the >> larger chunks of data gotten from the page allocator. That will allow slub >> to do fast allocs. On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k) > difference between SLUB and SLQB becomes 1% which can be considered as fluctuation. Great. We should fix calculate_order() to be order 3 for kmalloc-8192. Are you interested in doing that? On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > But when trying to increased it to 4, I got: > [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order > [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order > -bash: echo: write error: Invalid argument That's probably because max order is capped to 3. You can change that by passing slub_max_order=<n> as kernel parameter. On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > Comparing with SLQB, it seems SLUB needs too many investigation/manual finer-tuning > against specific benchmarks. One hard is to tune page order number. Although SLQB also > has many tuning options, I almost doesn't tune it manually, just run benchmark and > collect results to compare. Does that mean the scalability of SLQB is better? One thing is sure, SLUB seems to be hard to tune. Probably because it's dependent on the page order so much. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-24 7:36 ` Pekka Enberg @ 2009-02-12 5:22 ` Zhang, Yanmin 2009-02-12 5:47 ` Zhang, Yanmin 0 siblings, 1 reply; 42+ messages in thread From: Zhang, Yanmin @ 2009-02-12 5:22 UTC (permalink / raw) To: Pekka Enberg Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Sat, 2009-01-24 at 09:36 +0200, Pekka Enberg wrote: > On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote: > >> No there is another way. Increase the allocator order to 3 for the > >> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the > >> larger chunks of data gotten from the page allocator. That will allow slub > >> to do fast allocs. > > On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin > <yanmin_zhang@linux.intel.com> wrote: > > After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k) > > difference between SLUB and SLQB becomes 1% which can be considered as fluctuation. > > Great. We should fix calculate_order() to be order 3 for kmalloc-8192. > Are you interested in doing that? Pekka, Sorry for the late update. The default order of kmalloc-8192 on 2*4 stoakley is really an issue of calculate_order. slab_size order name ------------------------------------------------- 4096 3 sgpool-128 8192 2 kmalloc-8192 16384 3 kmalloc-16384 kmalloc-8192's default order is smaller than sgpool-128's. On 4*4 tigerton machine, a similiar issue appears on another kmem_cache. Function calculate_order uses 'min_objects /= 2;' to shrink. Plus size calculation/checking in slab_order, sometimes above issue appear. Below patch against 2.6.29-rc2 fixes it. I checked the default orders of all kmem_cache and they don't become smaller than before. So the patch wouldn't hurt performance. Signed-off-by Zhang Yanmin <yanmin.zhang@linux.intel.com> --- diff -Nraup linux-2.6.29-rc2/mm/slub.c linux-2.6.29-rc2_slubcalc_order/mm/slub.c --- linux-2.6.29-rc2/mm/slub.c 2009-02-11 00:49:48.000000000 -0500 +++ linux-2.6.29-rc2_slubcalc_order/mm/slub.c 2009-02-12 00:08:24.000000000 -0500 @@ -1856,6 +1856,7 @@ static inline int calculate_order(int si min_objects = slub_min_objects; if (!min_objects) min_objects = 4 * (fls(nr_cpu_ids) + 1); + min_objects = min(min_objects, (PAGE_SIZE << slub_max_order)/size); while (min_objects > 1) { fraction = 16; while (fraction >= 4) { @@ -1865,7 +1866,7 @@ static inline int calculate_order(int si return order; fraction /= 2; } - min_objects /= 2; + min_objects --; } /* ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update
  2009-02-12  5:22         ` Zhang, Yanmin
@ 2009-02-12  5:47         ` Zhang, Yanmin
  2009-02-12 15:25           ` Christoph Lameter
  2009-02-12 16:03           ` Pekka Enberg
  0 siblings, 2 replies; 42+ messages in thread
From: Zhang, Yanmin @ 2009-02-12  5:47 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
	Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
	chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
	suresh.b.siddha, harita.chilukuri, douglas.w.styner,
	peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
	linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Thu, 2009-02-12 at 13:22 +0800, Zhang, Yanmin wrote:
> On Sat, 2009-01-24 at 09:36 +0200, Pekka Enberg wrote:
> > On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
> > >> No there is another way. Increase the allocator order to 3 for the
> > >> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
> > >> larger chunks of data gotten from the page allocator. That will allow slub
> > >> to do fast allocs.
> >
> > On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
> > <yanmin_zhang@linux.intel.com> wrote:
> > > After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k)
> > > difference between SLUB and SLQB becomes 1% which can be considered as fluctuation.
> >
> > Great. We should fix calculate_order() to be order 3 for kmalloc-8192.
> > Are you interested in doing that?
> Pekka,
>
> Sorry for the late update.
> The default order of kmalloc-8192 on 2*4 stoakley is really an issue of calculate_order.

Oh, the previous patch had a compile warning. Please use the patch below instead.

From: Zhang Yanmin <yanmin.zhang@linux.intel.com>

The default order of kmalloc-8192 on the 2*4 stoakley machine is an issue
of calculate_order:

  slab_size   order   name
  -------------------------------------------
       4096     3     sgpool-128
       8192     2     kmalloc-8192
      16384     3     kmalloc-16384

kmalloc-8192's default order is smaller than sgpool-128's. On the 4*4
tigerton machine, a similar issue appears on another kmem_cache.

Function calculate_order uses 'min_objects /= 2;' to shrink min_objects.
Combined with the size calculation/checking in slab_order, the above issue
sometimes appears. The patch below, against 2.6.29-rc2, fixes it. I checked
the default orders of all kmem_caches and none of them become smaller than
before, so the patch shouldn't hurt performance.

Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>

---

--- linux-2.6.29-rc2/mm/slub.c	2009-02-11 00:49:48.000000000 -0500
+++ linux-2.6.29-rc2_slubcalc_order/mm/slub.c	2009-02-12 00:47:52.000000000 -0500
@@ -1844,6 +1844,7 @@ static inline int calculate_order(int si
 	int order;
 	int min_objects;
 	int fraction;
+	int max_objects;
 
 	/*
 	 * Attempt to find best configuration for a slab. This
@@ -1856,6 +1857,9 @@ static inline int calculate_order(int si
 	min_objects = slub_min_objects;
 	if (!min_objects)
 		min_objects = 4 * (fls(nr_cpu_ids) + 1);
+	max_objects = (PAGE_SIZE << slub_max_order)/size;
+	min_objects = min(min_objects, max_objects);
+
 	while (min_objects > 1) {
 		fraction = 16;
 		while (fraction >= 4) {
@@ -1865,7 +1869,7 @@ static inline int calculate_order(int si
 				return order;
 			fraction /= 2;
 		}
-		min_objects /= 2;
+		min_objects --;
 	}
 
 	/*

^ permalink raw reply	[flat|nested] 42+ messages in thread
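For readers following the arithmetic, this is what the two hunks change for kmalloc-8192 on the 2*4 machine, assuming 4 KiB pages and the default slub_max_order of 3:

  min_objects = 4 * (fls(8) + 1) = 20        (2 sockets x 4 cores)

  20 * 8192 = 163840 bytes  -> needs order 6   (> slub_max_order, rejected)
  10 * 8192 =  81920 bytes  -> needs order 5   (rejected)
   5 * 8192 =  40960 bytes  -> needs order 4   (rejected)
   2 * 8192 =  16384 bytes  -> order 2 fits    (old loop stops here)

  with the patch:
  max_objects = (4096 << 3) / 8192 = 4
  min_objects = min(20, 4)          = 4
   4 * 8192 =  32768 bytes  -> order 3 fits    (first attempt succeeds)

Capping min_objects at what an order-slub_max_order slab can actually hold makes the first slab_order() attempt succeed, and stepping min_objects down by one instead of halving it keeps other object sizes from overshooting their best fit in the same way.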
* Re: Mainline kernel OLTP performance update 2009-02-12 5:47 ` Zhang, Yanmin @ 2009-02-12 15:25 ` Christoph Lameter 2009-02-12 16:07 ` Pekka Enberg 2009-02-12 16:03 ` Pekka Enberg 1 sibling, 1 reply; 42+ messages in thread From: Christoph Lameter @ 2009-02-12 15:25 UTC (permalink / raw) To: Zhang, Yanmin Cc: Pekka Enberg, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar [-- Attachment #1: Type: TEXT/PLAIN, Size: 679 bytes --] On Thu, 12 Feb 2009, Zhang, Yanmin wrote: > The default order of kmalloc-8192 on 2*4 stoakley is an issue of calculate_order. > > > slab_size order name > ------------------------------------------------- > 4096 3 sgpool-128 > 8192 2 kmalloc-8192 > 16384 3 kmalloc-16384 > > kmalloc-8192's default order is smaller than sgpool-128's. You reverted the page allocator passthrough patch before this right? Otherwise kmalloc-8192 should not exist and allocation calls for 8192 bytes would be converted inline to request of an order 1 page from the page allocator. ^ permalink raw reply [flat|nested] 42+ messages in thread
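For anyone who lost the thread of that remark: with SLUB's large-kmalloc passthrough in place there is no kmalloc-8192 kmem_cache at all, and requests bigger than a page go straight to the page allocator. A toy model of just that dispatch decision (the one-page threshold is my reading of the 2.6.29 code, so treat it as an assumption):

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* smallest order such that (PAGE_SIZE << order) >= size */
static int passthrough_order(size_t size)
{
	int order = 0;

	while ((PAGE_SIZE << order) < size)
		order++;
	return order;
}

int main(void)
{
	printf("kmalloc(8192) -> page allocator, order %d\n",
	       passthrough_order(8192));
	return 0;
}

An 8192-byte request maps to an order-1 page allocation, which is the behaviour Christoph describes; the kmalloc-8192 cache being tuned in this sub-thread exists only because the tree under test changes that behaviour, which is what he is checking here.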
* Re: Mainline kernel OLTP performance update 2009-02-12 15:25 ` Christoph Lameter @ 2009-02-12 16:07 ` Pekka Enberg 0 siblings, 0 replies; 42+ messages in thread From: Pekka Enberg @ 2009-02-12 16:07 UTC (permalink / raw) To: Christoph Lameter Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar Hi Christoph, On Thu, 12 Feb 2009, Zhang, Yanmin wrote: >> The default order of kmalloc-8192 on 2*4 stoakley is an issue of calculate_order. >> >> >> slab_size order name >> ------------------------------------------------- >> 4096 3 sgpool-128 >> 8192 2 kmalloc-8192 >> 16384 3 kmalloc-16384 >> >> kmalloc-8192's default order is smaller than sgpool-128's. On Thu, Feb 12, 2009 at 5:25 PM, Christoph Lameter <cl@linux-foundation.org> wrote: > You reverted the page allocator passthrough patch before this right? > Otherwise kmalloc-8192 should not exist and allocation calls for 8192 > bytes would be converted inline to request of an order 1 page from the > page allocator. Yup, I assume that's the case here. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-02-12 5:47 ` Zhang, Yanmin 2009-02-12 15:25 ` Christoph Lameter @ 2009-02-12 16:03 ` Pekka Enberg 1 sibling, 0 replies; 42+ messages in thread From: Pekka Enberg @ 2009-02-12 16:03 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Sat, 2009-01-24 at 09:36 +0200, Pekka Enberg wrote: > > > On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote: > > > >> No there is another way. Increase the allocator order to 3 for the > > > >> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the > > > >> larger chunks of data gotten from the page allocator. That will allow slub > > > >> to do fast allocs. > > > > > > On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin > > > <yanmin_zhang@linux.intel.com> wrote: > > > > After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k) > > > > difference between SLUB and SLQB becomes 1% which can be considered as fluctuation. > > > > > > Great. We should fix calculate_order() to be order 3 for kmalloc-8192. > > > Are you interested in doing that? On Thu, 2009-02-12 at 13:22 +0800, Zhang, Yanmin wrote: > > Pekka, > > > > Sorry for the late update. > > The default order of kmalloc-8192 on 2*4 stoakley is really an issue of calculate_order. On Thu, 2009-02-12 at 13:47 +0800, Zhang, Yanmin wrote: > Oh, previous patch has a compiling warning. Pls. use below patch. > > From: Zhang Yanmin <yanmin.zhang@linux.intel.com> > > The default order of kmalloc-8192 on 2*4 stoakley is an issue of calculate_order. Applied to the 'topic/slub/perf' branch. Thanks! Pekka ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-24 2:55 ` Zhang, Yanmin 2009-01-24 7:36 ` Pekka Enberg @ 2009-01-26 17:36 ` Christoph Lameter 2009-02-01 2:52 ` Zhang, Yanmin 1 sibling, 1 reply; 42+ messages in thread From: Christoph Lameter @ 2009-01-26 17:36 UTC (permalink / raw) To: Zhang, Yanmin Cc: Pekka Enberg, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Sat, 24 Jan 2009, Zhang, Yanmin wrote: > But when trying to increased it to 4, I got: > [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order > [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order > -bash: echo: write error: Invalid argument This is because 4 is more than the maximum allowed order. You can reconfigure that by setting slub_max_order=5 or so on boot. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-26 17:36 ` Christoph Lameter @ 2009-02-01 2:52 ` Zhang, Yanmin 0 siblings, 0 replies; 42+ messages in thread From: Zhang, Yanmin @ 2009-02-01 2:52 UTC (permalink / raw) To: Christoph Lameter Cc: Pekka Enberg, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Mon, 2009-01-26 at 12:36 -0500, Christoph Lameter wrote: > On Sat, 24 Jan 2009, Zhang, Yanmin wrote: > > > But when trying to increased it to 4, I got: > > [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order > > [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order > > -bash: echo: write error: Invalid argument > > This is because 4 is more than the maximum allowed order. You can > reconfigure that by setting > > slub_max_order=5 > > or so on boot. With slub_max_order=5, the default order of kmalloc-8192 becomes 5. I tested it with netperf UDP-U-4k and the result difference from SLAB/SLQB is less than 1% which is really fluctuation. ^ permalink raw reply [flat|nested] 42+ messages in thread
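To put the order-5 result in perspective (4 KiB pages assumed):

  slub_max_order=5:  slab size        = 4096 << 5 = 131072 bytes (32 pages)
                     objects per slab = 131072 / 8192 = 16

Only one allocation in sixteen then has to go back to the page allocator, at the price of requiring order-5 contiguous pages; this is exactly the kind of per-cache, per-machine tuning that Yanmin's earlier question about SLUB's tunability was getting at.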
* Re: Mainline kernel OLTP performance update 2009-01-23 3:02 ` Zhang, Yanmin 2009-01-23 6:52 ` Pekka Enberg @ 2009-01-23 8:33 ` Nick Piggin 2009-01-23 9:02 ` Zhang, Yanmin 1 sibling, 1 reply; 42+ messages in thread From: Nick Piggin @ 2009-01-23 8:33 UTC (permalink / raw) To: Zhang, Yanmin Cc: Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Friday 23 January 2009 14:02:53 Zhang, Yanmin wrote: > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better > than SLQB's; I'll have to look into this too. Could be evidence of the possible TLB improvement from using bigger pages and/or page-specific freelist, I suppose. Do you have a scripted used to start netperf in that configuration? ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 8:33 ` Nick Piggin @ 2009-01-23 9:02 ` Zhang, Yanmin 2009-01-23 18:40 ` care and feeding of netperf (Re: Mainline kernel OLTP performance update) Rick Jones 0 siblings, 1 reply; 42+ messages in thread From: Zhang, Yanmin @ 2009-01-23 9:02 UTC (permalink / raw) To: Nick Piggin Cc: Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty [-- Attachment #1: Type: text/plain, Size: 622 bytes --] On Fri, 2009-01-23 at 19:33 +1100, Nick Piggin wrote: > On Friday 23 January 2009 14:02:53 Zhang, Yanmin wrote: > > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better > > than SLQB's; > > I'll have to look into this too. Could be evidence of the possible > TLB improvement from using bigger pages and/or page-specific freelist, > I suppose. > > Do you have a scripted used to start netperf in that configuration? See the attachment. Steps to run testing: 1) compile netperf; 2) Change PROG_DIR to path/to/netperf/src; 3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus. [-- Attachment #2: start_netperf_udp_v4.sh --] [-- Type: application/x-shellscript, Size: 1361 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
* care and feeding of netperf (Re: Mainline kernel OLTP performance update) 2009-01-23 9:02 ` Zhang, Yanmin @ 2009-01-23 18:40 ` Rick Jones 2009-01-23 18:51 ` Grant Grundler 2009-01-24 3:03 ` Zhang, Yanmin 0 siblings, 2 replies; 42+ messages in thread From: Rick Jones @ 2009-01-23 18:40 UTC (permalink / raw) To: Zhang, Yanmin Cc: Nick Piggin, Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty > 3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus. Some comments on the script: > #!/bin/sh > > PROG_DIR=/home/ymzhang/test/netperf/src > date=`date +%H%M%N` > #PROG_DIR=/root/netperf/netperf/src > client_num=$1 > pin_cpu=$2 > > start_port_server=12384 > start_port_client=15888 > > killall netserver > ${PROG_DIR}/netserver > sleep 2 Any particular reason for killing-off the netserver daemon? > if [ ! -d result ]; then > mkdir result > fi > > all_result_files="" > for i in `seq 1 ${client_num}`; do > if [ "${pin_cpu}" == "pin" ]; then > pin_param="-T ${i} ${i}" The -T option takes arguments of the form: N - bind both netperf and netserver to core N N, - bind only netperf to core N, float netserver ,M - float netperf, bind only netserver to core M N,M - bind netperf to core N and netserver to core M Without a comma between N and M knuth only knows what the command line parser will do :) > fi > result_file=result/netperf_${start_port_client}.${date} > #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096 > #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384 12888 -s 32768 -S 32768 -m 4096 > #${PROG_DIR}/netperf -p ${port_num} -t TCP_RR -l 60 -H 127.0.0.1 ${pin_param} -- -r 1,1 >${result_file} & > ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 ${pin_param} -- -P ${start_port_client} ${start_port_server} -s 32768 -S 32768 -m 4096 >${result_file} & Same thing here for the -P option - there needs to be a comma between the two port numbers otherwise, the best case is that the second port number is ignored. Worst case is that netperf starts doing knuth only knows what. To get quick profiles, that form of aggregate netperf is OK - just the one iteration with background processes using a moderatly long run time. However, for result reporting, it is best to (ab)use the confidence intervals functionality to try to avoid skew errors. I tend to add-in a global -i 30 option to get each netperf to repeat its measurments 30 times. That way one is reasonably confident that skew issues are minimized. http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance And I would probably add the -c and -C options to have netperf report service demands. 
> sub_pid="${sub_pid} `echo $!`" > port_num=$((${port_num}+1)) > all_result_files="${all_result_files} ${result_file}" > start_port_server=$((${start_port_server}+1)) > start_port_client=$((${start_port_client}+1)) > done; > > wait ${sub_pid} > killall netserver > > result="0" > for i in `echo ${all_result_files}`; do > sub_result=`awk '/Throughput/ {getline; getline; getline; print " "$6}' ${i}` > result=`echo "${result}+${sub_result}"|bc` > done; The documented-only-in-source :( "omni" tests in top-of-trunk netperf: http://www.netperf.org/svn/netperf2/trunk ./configure --enable-omni allow one to specify which result values one wants, in which order, either as more or less traditional netperf output (test-specific -O), CSV (test-specific -o) or keyval (test-specific -k). All three take an optional filename as an argument with the file containing a list of desired output values. You can give a "filename" of '?' to get the list of output values known to that version of netperf. Might help simplify parsing and whatnot. happy benchmarking, rick jones > > echo $result > ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: care and feeding of netperf (Re: Mainline kernel OLTP performance update) 2009-01-23 18:40 ` care and feeding of netperf (Re: Mainline kernel OLTP performance update) Rick Jones @ 2009-01-23 18:51 ` Grant Grundler 2009-01-24 3:03 ` Zhang, Yanmin 1 sibling, 0 replies; 42+ messages in thread From: Grant Grundler @ 2009-01-23 18:51 UTC (permalink / raw) To: Rick Jones Cc: Zhang, Yanmin, Nick Piggin, Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Fri, Jan 23, 2009 at 10:40 AM, Rick Jones <rick.jones2@hp.com> wrote: ... > And I would probably add the -c and -C options to have netperf report > service demands. For performance analysis, the service demand is often more interesting than the absolute performance (which typically only varies a few Mb/s for gigE NICs). I strongly encourage adding -c and -C. grant ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: care and feeding of netperf (Re: Mainline kernel OLTP performance update) 2009-01-23 18:40 ` care and feeding of netperf (Re: Mainline kernel OLTP performance update) Rick Jones 2009-01-23 18:51 ` Grant Grundler @ 2009-01-24 3:03 ` Zhang, Yanmin 2009-01-26 18:26 ` Rick Jones 1 sibling, 1 reply; 42+ messages in thread From: Zhang, Yanmin @ 2009-01-24 3:03 UTC (permalink / raw) To: Rick Jones Cc: Nick Piggin, Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Fri, 2009-01-23 at 10:40 -0800, Rick Jones wrote: > > 3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus. > > Some comments on the script: Thanks. I wanted to run the testing to get result quickly as long as the result has no big fluctuation. > > > #!/bin/sh > > > > PROG_DIR=/home/ymzhang/test/netperf/src > > date=`date +%H%M%N` > > #PROG_DIR=/root/netperf/netperf/src > > client_num=$1 > > pin_cpu=$2 > > > > start_port_server=12384 > > start_port_client=15888 > > > > killall netserver > > ${PROG_DIR}/netserver > > sleep 2 > > Any particular reason for killing-off the netserver daemon? I'm not sure if prior running might leave any impact on later running, so just kill netserver. > > > if [ ! -d result ]; then > > mkdir result > > fi > > > > all_result_files="" > > for i in `seq 1 ${client_num}`; do > > if [ "${pin_cpu}" == "pin" ]; then > > pin_param="-T ${i} ${i}" > > The -T option takes arguments of the form: > > N - bind both netperf and netserver to core N > N, - bind only netperf to core N, float netserver > ,M - float netperf, bind only netserver to core M > N,M - bind netperf to core N and netserver to core M > > Without a comma between N and M knuth only knows what the command line parser > will do :) > > > fi > > result_file=result/netperf_${start_port_client}.${date} > > #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096 > > #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384 12888 -s 32768 -S 32768 -m 4096 > > #${PROG_DIR}/netperf -p ${port_num} -t TCP_RR -l 60 -H 127.0.0.1 ${pin_param} -- -r 1,1 >${result_file} & > > ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 ${pin_param} -- -P ${start_port_client} ${start_port_server} -s 32768 -S 32768 -m 4096 >${result_file} & > > Same thing here for the -P option - there needs to be a comma between the two > port numbers otherwise, the best case is that the second port number is ignored. > Worst case is that netperf starts doing knuth only knows what. Thanks. > > > To get quick profiles, that form of aggregate netperf is OK - just the one > iteration with background processes using a moderatly long run time. However, > for result reporting, it is best to (ab)use the confidence intervals > functionality to try to avoid skew errors. Yes. My formal testing uses -i 50. I just wanted a quick testing. If I need finer-tuning or investigation, I would turn on more options. > I tend to add-in a global -i 30 > option to get each netperf to repeat its measurments 30 times. That way one is > reasonably confident that skew issues are minimized. > > http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance > > And I would probably add the -c and -C options to have netperf report service > demands. 
Yes. That's good. I'm used to start vmstat or mpstat to monitor cpu utilization in real time. > > > > sub_pid="${sub_pid} `echo $!`" > > port_num=$((${port_num}+1)) > > all_result_files="${all_result_files} ${result_file}" > > start_port_server=$((${start_port_server}+1)) > > start_port_client=$((${start_port_client}+1)) > > done; > > > > wait ${sub_pid} > > killall netserver > > > > result="0" > > for i in `echo ${all_result_files}`; do > > sub_result=`awk '/Throughput/ {getline; getline; getline; print " "$6}' ${i}` > > result=`echo "${result}+${sub_result}"|bc` > > done; > > The documented-only-in-source :( "omni" tests in top-of-trunk netperf: > > http://www.netperf.org/svn/netperf2/trunk > > ./configure --enable-omni > > allow one to specify which result values one wants, in which order, either as > more or less traditional netperf output (test-specific -O), CSV (test-specific > -o) or keyval (test-specific -k). All three take an optional filename as an > argument with the file containing a list of desired output values. You can give > a "filename" of '?' to get the list of output values known to that version of > netperf. > > Might help simplify parsing and whatnot. Yes, it does. > > happy benchmarking, > > rick jones Thanks again. I learned a lot. > > > > > echo $result > > > ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: care and feeding of netperf (Re: Mainline kernel OLTP performance update) 2009-01-24 3:03 ` Zhang, Yanmin @ 2009-01-26 18:26 ` Rick Jones 0 siblings, 0 replies; 42+ messages in thread From: Rick Jones @ 2009-01-26 18:26 UTC (permalink / raw) To: Zhang, Yanmin Cc: Nick Piggin, Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty >>To get quick profiles, that form of aggregate netperf is OK - just the one >>iteration with background processes using a moderatly long run time. However, >>for result reporting, it is best to (ab)use the confidence intervals >>functionality to try to avoid skew errors. > > Yes. My formal testing uses -i 50. I just wanted a quick testing. If I need > finer-tuning or investigation, I would turn on more options. Netperf will silently clip that to 30 as that is all the built-in tables know. > Thanks again. I learned a lot. Feel free to wander over to netperf-talk over at netperf.org if you want to talk some more about the care and feeding of netperf. happy benchmarking, rick jones ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 6:46 ` Mainline kernel OLTP performance update Nick Piggin 2009-01-16 6:55 ` Matthew Wilcox @ 2009-01-16 7:00 ` Andrew Morton 2009-01-16 7:25 ` Nick Piggin 2009-01-16 8:59 ` Nick Piggin 2009-01-16 18:11 ` Rick Jones 2 siblings, 2 replies; 42+ messages in thread From: Andrew Morton @ 2009-01-16 7:00 UTC (permalink / raw) To: Nick Piggin Cc: netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Fri, 16 Jan 2009 17:46:23 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > On Friday 16 January 2009 15:12:10 Andrew Morton wrote: > > On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin <nickpiggin@yahoo.com.au> > wrote: > > > I would like to see SLQB merged in mainline, made default, and wait for > > > some number releases. Then we take what we know, and try to make an > > > informed decision about the best one to take. I guess that is problematic > > > in that the rest of the kernel is moving underneath us. Do you have > > > another idea? > > > > Nope. If it doesn't work out, we can remove it again I guess. > > OK, I have these numbers to show I'm not completely off my rocker to suggest > we merge SLQB :) Given these results, how about I ask to merge SLQB as default > in linux-next, then if nothing catastrophic happens, merge it upstream in the > next merge window, then a couple of releases after that, given some time to > test and tweak SLQB, then we plan to bite the bullet and emerge with just one > main slab allocator (plus SLOB). That's a plan. > SLQB tends to be the winner here. Can you think of anything with which it will be the loser? ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 7:00 ` Mainline kernel OLTP performance update Andrew Morton @ 2009-01-16 7:25 ` Nick Piggin 2009-01-16 8:59 ` Nick Piggin 1 sibling, 0 replies; 42+ messages in thread From: Nick Piggin @ 2009-01-16 7:25 UTC (permalink / raw) To: Andrew Morton Cc: netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Friday 16 January 2009 18:00:43 Andrew Morton wrote: > On Fri, 16 Jan 2009 17:46:23 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > On Friday 16 January 2009 15:12:10 Andrew Morton wrote: > > > On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin > > > <nickpiggin@yahoo.com.au> > > > > wrote: > > > > I would like to see SLQB merged in mainline, made default, and wait > > > > for some number releases. Then we take what we know, and try to make > > > > an informed decision about the best one to take. I guess that is > > > > problematic in that the rest of the kernel is moving underneath us. > > > > Do you have another idea? > > > > > > Nope. If it doesn't work out, we can remove it again I guess. > > > > OK, I have these numbers to show I'm not completely off my rocker to > > suggest we merge SLQB :) Given these results, how about I ask to merge > > SLQB as default in linux-next, then if nothing catastrophic happens, > > merge it upstream in the next merge window, then a couple of releases > > after that, given some time to test and tweak SLQB, then we plan to bite > > the bullet and emerge with just one main slab allocator (plus SLOB). > > That's a plan. > > > SLQB tends to be the winner here. > > Can you think of anything with which it will be the loser? Well, that fio test showed it was behind SLAB. I just discovered that yesterday during running these tests, so I'll take a look at that. The Intel performance guys I think have one or two cases where it is slower. They don't seem to be too serious, and tend to be specific to some machines (eg. the same test with a different CPU architecture turns out to be faster). So I'll be looking into these things, but I haven't seen anything too serious yet. I'm mostly interested in macro benchmarks and more real world workloads. At a higher level, SLAB has some interesting features. It basically has "crossbars" of queues, that basically provide queues for allocating and freeing to and from different CPUs and nodes. This is what bloats up the kmem_cache data structures to tens or hundreds of gigabytes each on SGI size systems. But it is also has good properties. On smaller multiprocessor and NUMA systems, it might be the case that SLAB does better in workloads that involve objects being allocated on one CPU and freed on another. I haven't actually observed problems here, but I don't have a lot of good tests. SLAB is also fundamentally different from SLUB and SLQB in that it uses arrays to store pointers to objects in its queues, rather than having a linked list using pointers embedded in the objects. This might in some cases make it easier to prefetch objects in parallel with finding the object itself. I haven't actually been able to attribute a particular regression to this interesting difference, but it might turn up as an issue. These are two big differences between SLAB and SLQB. 
The linked lists of objects were chosen over arrays again because of the
memory overhead, because they make it easier to tune the size of the queues,
because they avoid the overhead of copying arrays of pointers around (SLQB
can just splice the head of one list onto the tail of another in order to
move objects around), and because they eliminate the need for any metadata
beyond the struct page for each slab.

The crossbars of queues were removed because of the bloat and memory
overhead issues. The fact that we now have linked lists helps a little bit
with this, because moving lists of objects around gets easier.

^ permalink raw reply	[flat|nested] 42+ messages in thread
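A deliberately stripped-down sketch of the structural difference being described; the names and bounds here are made up for illustration and are not the real kernel structures:

/* Illustrative only -- not the kernel's definitions. */

#define NR_CPUS		64	/* made-up bounds for the sketch */
#define MAX_NUMNODES	8

/* SLAB-style: arrays of object pointers, one queue per CPU plus "alien"
 * queues for remote nodes -- the crossbar whose size grows with both CPU
 * and node count, and which moves objects by copying pointer arrays. */
struct array_queue {
	unsigned int avail;
	unsigned int limit;
	void *objects[];		/* pointers copied in and out */
};

struct slab_style_cache {
	struct array_queue *cpu_queue[NR_CPUS];
	struct array_queue *alien[NR_CPUS][MAX_NUMNODES];
};

/* SLQB/SLUB-style: the free objects themselves form a linked list, using
 * the first word of each free object as the "next" pointer, so moving a
 * whole queue is just splicing list heads. */
struct object_list {
	void *head;			/* first free object */
	unsigned long nr;
};

struct slqb_style_cpu {
	struct object_list freelist;	/* local allocs and frees */
	struct object_list remote_free;	/* objects freed by other CPUs */
};

The first layout copies pointers between per-CPU and per-node arrays and grows with CPUs times nodes; the second only ever splices list heads, at the cost of touching the first word of each free object.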
* Re: Mainline kernel OLTP performance update
  2009-01-16  7:00     ` Mainline kernel OLTP performance update Andrew Morton
  2009-01-16  7:25       ` Nick Piggin
@ 2009-01-16  8:59       ` Nick Piggin
  1 sibling, 0 replies; 42+ messages in thread
From: Nick Piggin @ 2009-01-16  8:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel,
	sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha,
	harita.chilukuri, douglas.w.styner, peter.xihong.wang,
	hubert.nueckel, chris.mason, srostedt, linux-scsi,
	andrew.vasquez, anirban.chakraborty

On Friday 16 January 2009 18:00:43 Andrew Morton wrote:
> On Fri, 16 Jan 2009 17:46:23 +1100 Nick Piggin <nickpiggin@yahoo.com.au>
> > SLQB tends to be the winner here.
>
> Can you think of anything with which it will be the loser?

Here are some more performance numbers, from the "slub_test" kernel module.
It's basically a really tiny microbenchmark, so I don't consider its results
very useful on their own, except that it does show up some problems in SLAB's
scalability that may start to bite as we continue to get more threads per
socket. (I ran a few of these tests on one of Dave's 2-socket, 128-thread
systems, and SLAB gets really painful... these kinds of thread counts may
only be a couple of years away for x86.)

All numbers are in CPU cycles.

Single thread testing
=====================

1. Kmalloc: Repeatedly allocate 10000 objs then free them

obj size       SLAB          SLQB          SLUB
      8       77+ 128       69+  47       61+  77
     16       69+ 104      116+  70       77+  80
     32       66+ 101       82+  81       71+  89
     64       82+ 116       95+  81       94+ 105
    128      100+ 148      106+  94      114+ 163
    256      153+ 136      134+  98      124+ 186
    512      209+ 161      170+ 186      134+ 276
   1024      331+ 249      236+ 245      134+ 283
   2048      608+ 443      380+ 386      172+ 312
   4096     1109+ 624      678+ 661      239+ 372
   8192     1166+1077      767+ 683      535+ 433
  16384     1213+1160      914+ 731      577+ 682

We can see SLAB has a fair bit more overhead in this case. I think SLUB
starts doing higher-order allocations around size 256, which reduces costs
there. I don't know what causes the SLQB artifact at 16...

2. Kmalloc: alloc/free test (repeatedly allocate and free)

              SLAB     SLQB     SLUB
      8         98       90       94
     16         98       90       93
     32         98       90       93
     64         99       90       94
    128        100       92       93
    256        104       93       95
    512        105       94       97
   1024        106       93       97
   2048        107       95       95
   4096        111       92       97
   8192        111       94      631
  16384        114       92      741

Here we see SLUB's allocator passthrough (or is that the lack of queueing?).
Straight-line speed at small sizes is probably down to the instruction counts
in the fastpaths. It's pretty meaningless though, because it probably changes
with any actual load on the CPU, or on another CPU architecture. Doesn't look
bad for SLQB though :)

Concurrent allocs
=================

1. Like the first single thread test, lots of allocs, then lots of frees.
   But running on all CPUs. Average over all CPUs.

              SLAB           SLQB          SLUB
      8      251+ 322       73+  47       65+  76
     16      240+ 331       84+  53       67+  82
     32      235+ 316       94+  57       77+  92
     64      338+ 303      120+  66      105+ 136
    128      549+ 355      139+ 166      127+ 344
    256     1129+ 456      189+ 178      236+ 404
    512     2085+ 872      240+ 217      244+ 419
   1024     3895+1373      347+ 333      251+ 440
   2048     7725+2579      616+ 695      373+ 588
   4096    15320+4534     1245+1442      689+1002

A problem with SLAB scalability starts showing up on this system with only
4 threads per socket. Again, SLUB sees a benefit from higher-order
allocations.

2. Same as 2nd single threaded test, alloc then free, on all CPUs.

              SLAB     SLQB     SLUB
      8         99       90       93
     16         99       90       93
     32         99       90       93
     64        100       91       94
    128        102       90       93
    256        105       94       97
    512        106       93       97
   1024        108       93       97
   2048        109       93       96
   4096        110       93       96

No surprises. Objects always fit in queues (or unqueues, in the case of
SLUB), so there is no cross-cache traffic.

Remote free test
================

1. Allocate N objects on CPUs 1-7, then free them all from CPU 0.
   Average cost of all kmalloc+kfree.

              SLAB          SLQB         SLUB
      8      191+ 142       53+ 64       56+  99
     16      180+ 141       82+ 69       60+ 117
     32      173+ 142      100+ 71       78+ 151
     64      240+ 147      131+ 73      117+ 216
    128      441+ 162      158+114      114+ 251
    256      833+ 181      179+119      185+ 263
    512     1546+ 243      220+132      194+ 292
   1024     2886+ 341      299+135      201+ 312
   2048     5737+ 577      517+139      291+ 370
   4096    11288+1201      976+153      528+ 482

2. Objects are allocated on CPU N (on all CPUs), then freed by
   CPU N+1 % NR_CPUS (ie. CPU1 frees the objects allocated by CPU0).

              SLAB          SLQB         SLUB
      8      236+ 331       72+123       64+ 114
     16      232+ 345       80+125       71+ 139
     32      227+ 342       85+134       82+ 183
     64      324+ 336      140+138      111+ 219
    128      569+ 384      245+201      145+ 337
    256     1111+ 448      243+222      238+ 447
    512     2091+ 871      249+244      247+ 470
   1024     3923+1593      254+256      254+ 503
   2048     7700+2968      273+277      369+ 699
   4096    15154+5061      310+323      693+1220

SLAB's concurrent allocation bottlenecks show up again in these tests.
Unfortunately these are not very realistic tests of the remote freeing
pattern, because normally you would expect remote freeing and allocation to
happen concurrently, rather than all allocations up front and then all frees.
If the test behaved like that, then objects could probably fit in SLAB's
queues and it might see some good numbers.

^ permalink raw reply	[flat|nested] 42+ messages in thread
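The two figures in each cell read most naturally as per-object allocation cost plus free cost, going by the test descriptions (that is an interpretation of the slub_test output, not something stated in the post). For readers unfamiliar with the module, the remote-free cases have roughly this shape; the sketch below uses malloc/free stand-ins and elides the CPU pinning, so it only shows the allocation/free pattern, not the NUMA behaviour itself:

#include <stdlib.h>

#define NR_OBJS 10000

int main(void)
{
	static void *objs[NR_OBJS];
	int i;

	/* Test 1 shape: objects allocated on CPUs 1-7 (here: one thread)... */
	for (i = 0; i < NR_OBJS; i++)
		objs[i] = malloc(8192);

	/* ...then all freed from CPU 0, so every free hands an object back
	 * to a queue it did not come from. Test 2 is the same idea with
	 * CPU N+1 freeing what CPU N allocated. In the real module these
	 * are kmem_cache_alloc()/kmem_cache_free() calls issued from kernel
	 * threads bound to the CPUs named in the results. */
	for (i = 0; i < NR_OBJS; i++)
		free(objs[i]);

	return 0;
}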
* Re: Mainline kernel OLTP performance update 2009-01-16 6:46 ` Mainline kernel OLTP performance update Nick Piggin 2009-01-16 6:55 ` Matthew Wilcox 2009-01-16 7:00 ` Mainline kernel OLTP performance update Andrew Morton @ 2009-01-16 18:11 ` Rick Jones 2009-01-19 7:43 ` Nick Piggin 2 siblings, 1 reply; 42+ messages in thread From: Rick Jones @ 2009-01-16 18:11 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty Nick Piggin wrote: > OK, I have these numbers to show I'm not completely off my rocker to suggest > we merge SLQB :) Given these results, how about I ask to merge SLQB as default > in linux-next, then if nothing catastrophic happens, merge it upstream in the > next merge window, then a couple of releases after that, given some time to > test and tweak SLQB, then we plan to bite the bullet and emerge with just one > main slab allocator (plus SLOB). > > > System is a 2socket, 4 core AMD. Not exactly a large system :) Barely NUMA even with just two sockets. > All debug and stats options turned off for > all the allocators; default parameters (ie. SLUB using higher order pages, > and the others tend to be using order-0). SLQB is the version I recently > posted, with some of the prefetching removed according to Pekka's review > (probably a good idea to only add things like that in if/when they prove to > be an improvement). > > ... > > Netperf UDP unidirectional send test (10 runs, higher better): > > Server and client bound to same CPU > SLAB AVG=60.111 STD=1.59382 > SLQB AVG=60.167 STD=0.685347 > SLUB AVG=58.277 STD=0.788328 > > Server and client bound to same socket, different CPUs > SLAB AVG=85.938 STD=0.875794 > SLQB AVG=93.662 STD=2.07434 > SLUB AVG=81.983 STD=0.864362 > > Server and client bound to different sockets > SLAB AVG=78.801 STD=1.44118 > SLQB AVG=78.269 STD=1.10457 > SLUB AVG=71.334 STD=1.16809 > ... > I haven't done any non-local network tests. Networking is the one of the > subsystems most heavily dependent on slab performance, so if anybody > cares to run their favourite tests, that would be really helpful. I'm guessing, but then are these Mbit/s figures? Would that be the sending throughput or the receiving throughput? I love to see netperf used, but why UDP and loopback? Also, how about the service demands? rick jones ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 18:11 ` Rick Jones @ 2009-01-19 7:43 ` Nick Piggin 2009-01-19 22:19 ` Rick Jones 0 siblings, 1 reply; 42+ messages in thread From: Nick Piggin @ 2009-01-19 7:43 UTC (permalink / raw) To: Rick Jones Cc: Andrew Morton, netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Saturday 17 January 2009 05:11:02 Rick Jones wrote: > Nick Piggin wrote: > > OK, I have these numbers to show I'm not completely off my rocker to > > suggest we merge SLQB :) Given these results, how about I ask to merge > > SLQB as default in linux-next, then if nothing catastrophic happens, > > merge it upstream in the next merge window, then a couple of releases > > after that, given some time to test and tweak SLQB, then we plan to bite > > the bullet and emerge with just one main slab allocator (plus SLOB). > > > > > > System is a 2socket, 4 core AMD. > > Not exactly a large system :) Barely NUMA even with just two sockets. You're right ;) But at least it is exercising the NUMA paths in the allocator, and represents a pretty common size of system... I can run some tests on bigger systems at SUSE, but it is not always easy to set up "real" meaningful workloads on them or configure significant IO for them. > > Netperf UDP unidirectional send test (10 runs, higher better): > > > > Server and client bound to same CPU > > SLAB AVG=60.111 STD=1.59382 > > SLQB AVG=60.167 STD=0.685347 > > SLUB AVG=58.277 STD=0.788328 > > > > Server and client bound to same socket, different CPUs > > SLAB AVG=85.938 STD=0.875794 > > SLQB AVG=93.662 STD=2.07434 > > SLUB AVG=81.983 STD=0.864362 > > > > Server and client bound to different sockets > > SLAB AVG=78.801 STD=1.44118 > > SLQB AVG=78.269 STD=1.10457 > > SLUB AVG=71.334 STD=1.16809 > > > > ... > > > > I haven't done any non-local network tests. Networking is the one of the > > subsystems most heavily dependent on slab performance, so if anybody > > cares to run their favourite tests, that would be really helpful. > > I'm guessing, but then are these Mbit/s figures? Would that be the sending > throughput or the receiving throughput? Yes, Mbit/s. They were... hmm, sending throughput I think, but each pair of numbers seemed to be identical IIRC? > I love to see netperf used, but why UDP and loopback? No really good reason. I guess I was hoping to keep other variables as small as possible. But I guess a real remote test would be a lot more realistic as a networking test. Hmm, but I could probably set up a test over a simple GbE link here. I'll try that. > Also, how about the > service demands? Well, over loopback and using CPU binding, I was hoping it wouldn't change much... but I see netperf does some measurements for you. I will consider those in future too. BTW. is it possible to do parallel netperf tests? ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-19 7:43 ` Nick Piggin @ 2009-01-19 22:19 ` Rick Jones 0 siblings, 0 replies; 42+ messages in thread From: Rick Jones @ 2009-01-19 22:19 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty >>>System is a 2socket, 4 core AMD. >> >>Not exactly a large system :) Barely NUMA even with just two sockets. > > > You're right ;) > > But at least it is exercising the NUMA paths in the allocator, and > represents a pretty common size of system... > > I can run some tests on bigger systems at SUSE, but it is not always > easy to set up "real" meaningful workloads on them or configure > significant IO for them. Not sure if I know enough git to pull your trees, or if this cobbler's child will have much in the way of bigger systems, but there is a chance I might - contact me offline with some pointers on how to pull and build the bits and such. >>>Netperf UDP unidirectional send test (10 runs, higher better): >>> >>>Server and client bound to same CPU >>>SLAB AVG=60.111 STD=1.59382 >>>SLQB AVG=60.167 STD=0.685347 >>>SLUB AVG=58.277 STD=0.788328 >>> >>>Server and client bound to same socket, different CPUs >>>SLAB AVG=85.938 STD=0.875794 >>>SLQB AVG=93.662 STD=2.07434 >>>SLUB AVG=81.983 STD=0.864362 >>> >>>Server and client bound to different sockets >>>SLAB AVG=78.801 STD=1.44118 >>>SLQB AVG=78.269 STD=1.10457 >>>SLUB AVG=71.334 STD=1.16809 >>> >> >> > ... >> >>>I haven't done any non-local network tests. Networking is the one of the >>>subsystems most heavily dependent on slab performance, so if anybody >>>cares to run their favourite tests, that would be really helpful. >> >>I'm guessing, but then are these Mbit/s figures? Would that be the sending >>throughput or the receiving throughput? > > > Yes, Mbit/s. They were... hmm, sending throughput I think, but each pair > of numbers seemed to be identical IIRC? Mega *bits* per second? And those were 4K sends right? That seems rather low for loopback - I would have expected nearly two orders of magnitude more. I wonder if the intra-stack flow control kicked-in? You might try adding test specific -S and -s options to set much larger socket buffers to try to avoid that. Or simply use TCP. netperf -H <foo> ... -- -s 1M -S 1M -m 4K >>I love to see netperf used, but why UDP and loopback? > > > No really good reason. I guess I was hoping to keep other variables as > small as possible. But I guess a real remote test would be a lot more > realistic as a networking test. Hmm, but I could probably set up a test > over a simple GbE link here. I'll try that. If bandwidth is an issue, that is to say one saturates the link before much of anything "interesting" happens in the host you can use something like aggregate TCP_RR - ./configure with --enable_burst and then something like netperf -H <remote> -t TCP_RR -- -D -b 32 and it will have as many as 33 discrete transactions in flight at one time on the one connection. The -D is there to set TCP_NODELAY to preclude TCP chunking the single-byte (default, take your pick of a more reasonable size) transactions into one segment. >>Also, how about the service demands? > > > Well, over loopback and using CPU binding, I was hoping it wouldn't > change much... Hope... 
but verify :) > but I see netperf does some measurements for you. I > will consider those in future too. > > BTW. is it possible to do parallel netperf tests? Yes, by (ab)using the confidence intervals code. Poke around in http://www.netperf.org/svn/netperf2/doc/netperf.html in the "Aggregates" section, and I can go into further details offline (or here if folks want to see the discussion). rick jones ^ permalink raw reply [flat|nested] 42+ messages in thread
Thread overview: 42+ messages
[not found] <BC02C49EEB98354DBA7F5DD76F2A9E800317003CB0@azsmsx501.amr.corp.intel.com>
[not found] ` <200901161503.13730.nickpiggin@yahoo.com.au>
[not found] ` <20090115201210.ca1a9542.akpm@linux-foundation.org>
2009-01-16 6:46 ` Mainline kernel OLTP performance update Nick Piggin
2009-01-16 6:55 ` Matthew Wilcox
2009-01-16 7:06 ` Nick Piggin
2009-01-16 7:53 ` Zhang, Yanmin
2009-01-16 10:20 ` Andi Kleen
2009-01-20 5:16 ` Zhang, Yanmin
2009-01-21 23:58 ` Christoph Lameter
2009-01-22 8:36 ` Zhang, Yanmin
2009-01-22 9:15 ` Pekka Enberg
2009-01-22 9:28 ` Zhang, Yanmin
2009-01-22 9:47 ` Pekka Enberg
2009-01-23 3:02 ` Zhang, Yanmin
2009-01-23 6:52 ` Pekka Enberg
2009-01-23 8:06 ` Pekka Enberg
2009-01-23 8:30 ` Zhang, Yanmin
2009-01-23 8:40 ` Pekka Enberg
2009-01-23 9:46 ` Pekka Enberg
2009-01-23 15:22 ` Christoph Lameter
2009-01-23 15:31 ` Pekka Enberg
2009-01-23 15:55 ` Christoph Lameter
2009-01-23 16:01 ` Pekka Enberg
2009-01-24 2:55 ` Zhang, Yanmin
2009-01-24 7:36 ` Pekka Enberg
2009-02-12 5:22 ` Zhang, Yanmin
2009-02-12 5:47 ` Zhang, Yanmin
2009-02-12 15:25 ` Christoph Lameter
2009-02-12 16:07 ` Pekka Enberg
2009-02-12 16:03 ` Pekka Enberg
2009-01-26 17:36 ` Christoph Lameter
2009-02-01 2:52 ` Zhang, Yanmin
2009-01-23 8:33 ` Nick Piggin
2009-01-23 9:02 ` Zhang, Yanmin
2009-01-23 18:40 ` care and feeding of netperf (Re: Mainline kernel OLTP performance update) Rick Jones
2009-01-23 18:51 ` Grant Grundler
2009-01-24 3:03 ` Zhang, Yanmin
2009-01-26 18:26 ` Rick Jones
2009-01-16 7:00 ` Mainline kernel OLTP performance update Andrew Morton
2009-01-16 7:25 ` Nick Piggin
2009-01-16 8:59 ` Nick Piggin
2009-01-16 18:11 ` Rick Jones
2009-01-19 7:43 ` Nick Piggin
2009-01-19 22:19 ` Rick Jones