From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Zhang, Yanmin"
Subject: Re: Mainline kernel OLTP performance update
Date: Tue, 20 Jan 2009 13:16:23 +0800
Message-ID: <1232428583.11429.83.camel@ymzhang>
References: <200901161503.13730.nickpiggin@yahoo.com.au> <20090115201210.ca1a9542.akpm@linux-foundation.org> <200901161746.25205.nickpiggin@yahoo.com.au> <20090116065546.GJ31013@parisc-linux.org> <1232092430.11429.52.camel@ymzhang> <87sknjeemn.fsf@basil.nowhere.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Matthew Wilcox , Nick Piggin , Andrew Morton , netdev@vger.kernel.org, sfr@canb.auug.org.au, matthew.r.wilcox@intel.com, chinang.ma@intel.com, linux-kernel@vger.kernel.org, sharad.c.tripathi@intel.com, arjan@linux.intel.com, suresh.b.siddha@intel.com, harita.chilukuri@intel.com, douglas.w.styner@intel.com, peter.xihong.wang@intel.com, hubert.nueckel@intel.com, chris.mason@oracle.com, srostedt@redhat.com, linux-scsi@vger.kernel.org, andrew.vasquez@qlogic.com, anirban.chakraborty@qlogic.com
To: Andi Kleen , Christoph Lameter , Pekka Enberg
Return-path:
Received: from mga05.intel.com ([192.55.52.89]:19988 "EHLO fmsmga101.fm.intel.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1750769AbZATFQb (ORCPT ); Tue, 20 Jan 2009 00:16:31 -0500
In-Reply-To: <87sknjeemn.fsf@basil.nowhere.org>
Sender: netdev-owner@vger.kernel.org
List-ID:

On Fri, 2009-01-16 at 11:20 +0100, Andi Kleen wrote:
> "Zhang, Yanmin" writes:
>
>
> > I think that's because SLQB
> > doesn't pass through big object allocation to page allocator.
> > netperf UDP-U-1k has less improvement with SLQB.
>
> That sounds like just the page allocator needs to be improved.
> That would help everyone. We talked a bit about this earlier,
> some of the heuristics for hot/cold pages are quite outdated
> and have been tuned for obsolete machines and also its fast path
> is quite long. Unfortunately no code currently.
Andi,

Thanks for your kind information. I did more investigation with SLUB on the
netperf UDP-U-4k issue.

oprofile shows:
328058   30.1342  linux-2.6.29-rc2   copy_user_generic_string
134666   12.3699  linux-2.6.29-rc2   __free_pages_ok
125447   11.5231  linux-2.6.29-rc2   get_page_from_freelist
 22611    2.0770  linux-2.6.29-rc2   __sk_mem_reclaim
 21442    1.9696  linux-2.6.29-rc2   list_del
 21187    1.9462  linux-2.6.29-rc2   __ip_route_output_key

So __free_pages_ok and get_page_from_freelist consume too much cpu time.
With SLQB, these 2 functions consume almost no time.

Command 'slabinfo -AD' shows:
Name        Objects      Alloc       Free   %Fast
:0000256       1685   29611065   29609548   99 99
:0000168       2987     164689     161859   94 39
:0004096       1471     114918     113490   99 97

So kmem_cache :0000256 is very active.

Kernel stack dump in __free_pages_ok shows:
 [] __free_pages_ok+0x109/0x2e0
 [] autoremove_wake_function+0x0/0x2e
 [] __kfree_skb+0x9/0x6f
 [] skb_free_datagram+0xc/0x31
 [] udp_recvmsg+0x1e7/0x26f
 [] sock_common_recvmsg+0x30/0x45
 [] sock_recvmsg+0xd5/0xed

The callchain is:
__kfree_skb => kfree_skbmem => kmem_cache_free(skbuff_head_cache, skb);

kmem_cache skbuff_head_cache's object size is just 256, so it shares the
kmem_cache with :0000256. Their order is 1, which means every slab consists
of 2 physical pages.

netperf UDP-U-4k is a UDP stream test. The client process keeps sending
4k-size packets to the server process, and the server process just receives
the packets one by one.

If we start CPU_NUM clients and the same number of servers, every client
sends lots of packets within one sched slice, then the process scheduler
schedules the server to receive many packets within one sched slice; then the
client sends again. So there are many packets in the queue. When the server
receives the packets, it frees skbuff_head_cache objects.
When the slab's objects are all free, the slab is released by calling
__free_pages. Such batch sending/receiving creates lots of slab free activity.

Page allocator has an array at zone_pcp(zone, cpu)->pcp to keep a page buffer
for page order 0. But here skbuff_head_cache's order is 1, so UDP-U-4k couldn't
benefit from the page buffer.

SLQB has no such issue, because:
1) SLQB has a percpu freelist. Free objects are put on the list first and can
be picked up later quickly without a lock. The batch parameter that controls
free object recollection is mostly 1024.
2) SLQB slab order is mostly 0, so although it sometimes calls
alloc_pages/free_pages, it can benefit from the zone_pcp(zone, cpu)->pcp page
buffer.

So SLUB needs to resolve the issue that one process allocates a batch of
objects and another process frees them in a batch.

yanmin