From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id D569C6B0038 for ; Thu, 14 Sep 2017 12:49:44 -0400 (EDT)
Received: by mail-pf0-f198.google.com with SMTP id q76so6121670pfq.5 for ; Thu, 14 Sep 2017 09:49:44 -0700 (PDT)
Received: from EUR01-DB5-obe.outbound.protection.outlook.com (mail-db5eur01on0059.outbound.protection.outlook.com. [104.47.2.59]) by mx.google.com with ESMTPS id 87si11830267pft.107.2017.09.14.09.49.43 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Thu, 14 Sep 2017 09:49:43 -0700 (PDT)
From: Tariq Toukan
Subject: Page allocator bottleneck
Message-ID:
Date: Thu, 14 Sep 2017 19:49:31 +0300
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: David Miller, Jesper Dangaard Brouer, Mel Gorman, Eric Dumazet, Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha, Linux Kernel Network Developers, Andrew Morton, Michal Hocko, linux-mm

Hi all,

As part of the efforts to support increasing next-generation NIC speeds,
I am investigating SW bottlenecks in network stack receive flow.

Here I share some numbers I got for a simple experiment, in which I
simulate the page allocation rate needed in 200Gbps NICs.

I ran the test below over 3 different (modified) mlx5 driver versions,
loaded on server side (RX):
1) RX page cache disabled, 2 packets per page.
2) RX page cache disabled, one packet per page.
3) Huge RX page cache, one packet per page.

All page allocations are of order 0.

NIC: Connectx-5 100 Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Test:
128 TCP streams (using super_netperf).
Changing num of RX queues.
HW LRO OFF, GRO ON, MTU 1500.
Observe: BW as a function of num RX queues.

Results:

Driver #1:
#rings  BW (Mbps)
1       23,813
2       44,086
3       62,128
4       78,058
6       94,210 (linerate)
8       94,205 (linerate)
12      94,202 (linerate)
16      94,191 (linerate)

Driver #2:
#rings  BW (Mbps)
1       18,835
2       36,716
3       50,521
4       61,746
6       63,637
8       60,299
12      51,048
16      43,337

Driver #3:
#rings  BW (Mbps)
1       19,316
2       44,850
3       69,549
4       87,434
6       94,342 (linerate)
8       94,350 (linerate)
12      94,327 (linerate)
16      94,327 (linerate)

Insights:
Major degradation between #1 and #2, not getting any close to linerate!
Degradation is fixed between #2 and #3.
This is because page allocator cannot stand the higher allocation rate.
In #2, we also see that the addition of rings (cores) reduces BW (!!),
as result of increasing congestion over shared resources.

Congestion in this case is very clear.
When monitored in perf top:
85.58% [kernel] [k] queued_spin_lock_slowpath

I think that page allocator issues should be discussed separately:
1) Rate: Increase the allocation rate on a single core.
2) Scalability: Reduce congestion and sync overhead between cores.

This is clearly the current bottleneck in the network stack receive flow.

I know about some efforts that were made in the past two years.
For example the ones from Jesper et al.:
- Page-pool (not accepted AFAIK).
- Page-allocation bulking.
- Optimize order-0 allocations in Per-Cpu-Pages.

I am not an mm expert, but wanted to raise the issue again, to combine
the efforts and hear from you guys about status and possible directions.

Best regards,
Tariq Toukan

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
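
As an aside to the thread, the arithmetic behind "simulating" a 200Gbps page
allocation rate on a 100Gbps NIC can be sketched roughly as follows. This is
not from the original mail. It assumes every packet is a full 1500-byte frame
and ignores Ethernet framing overhead, and the function name
pages_per_second is made up for illustration. Under those assumptions, one
packet per page at 100Gbps consumes order-0 pages about as fast as two
packets per page would at 200Gbps.

```python
def pages_per_second(link_bps, packet_bytes, packets_per_page):
    """Order-0 pages consumed per second at a given link rate.

    Simplification: every packet is exactly packet_bytes long and the
    link is fully utilised; framing overhead is ignored.
    """
    packets_per_second = link_bps / (packet_bytes * 8)
    return packets_per_second / packets_per_page


MTU = 1500  # bytes, as in the test setup

rate_100g_two = pages_per_second(100e9, MTU, 2)  # like driver #1
rate_100g_one = pages_per_second(100e9, MTU, 1)  # like drivers #2 and #3
rate_200g_two = pages_per_second(200e9, MTU, 2)  # hypothetical 200G NIC

for label, rate in [
    ("100G, 2 packets/page", rate_100g_two),
    ("100G, 1 packet/page", rate_100g_one),
    ("200G, 2 packets/page", rate_200g_two),
]:
    print(f"{label}: {rate / 1e6:.2f} Mpages/s")
```

With these idealised inputs the first case needs roughly 4.17 million pages
per second and the other two roughly 8.33 million, which is one way to read
why the one-packet-per-page drivers stand in for a faster NIC.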

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id 5E05F6B0069 for ; Thu, 14 Sep 2017 16:19:20 -0400 (EDT)
Received: by mail-pf0-f199.google.com with SMTP id y77so618532pfd.2 for ; Thu, 14 Sep 2017 13:19:20 -0700 (PDT)
Received: from mga09.intel.com (mga09.intel.com. [134.134.136.24]) by mx.google.com with ESMTPS id v81si11208116pgb.504.2017.09.14.13.19.18 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 14 Sep 2017 13:19:18 -0700 (PDT)
From: Andi Kleen
Subject: Re: Page allocator bottleneck
References:
Date: Thu, 14 Sep 2017 13:19:17 -0700
In-Reply-To: (Tariq Toukan's message of "Thu, 14 Sep 2017 19:49:31 +0300")
Message-ID: <87vaklyqwq.fsf@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain
Sender: owner-linux-mm@kvack.org
List-ID:
To: Tariq Toukan
Cc: David Miller, Jesper Dangaard Brouer, Mel Gorman, Eric Dumazet, Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha, Linux Kernel Network Developers, Andrew Morton, Michal Hocko, linux-mm

Tariq Toukan writes:
>
> Congestion in this case is very clear.
> When monitored in perf top:
> 85.58% [kernel] [k] queued_spin_lock_slowpath

Please look at the callers. Spinlock profiles without callers
are usually useless because it's just blaming the messenger.

Most likely the PCP lists are too small for your extreme allocation
rate, so it goes back too often to the shared pool.

You can play with the vm.percpu_pagelist_fraction setting.

-Andi
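
As an aside to the thread, one way to see how large the per-CPU page (PCP)
lists currently are is to read the "pagesets" entries in /proc/zoneinfo. The
sketch below is not from the original mails. The helper name parse_pcp and
the SAMPLE text are illustrative assumptions, and the sample only imitates
the general layout of that file; real output has many more fields and may
differ between kernel versions.

```python
import re

# Illustrative excerpt only; the numbers are invented for the example.
SAMPLE = """\
Node 0, zone   Normal
  pages free     3841
        min      1000
        low      1250
        high     1500
  pagesets
    cpu: 0
              count: 155
              high:  186
              batch: 31
    cpu: 1
              count: 20
              high:  186
              batch: 31
"""


def parse_pcp(zoneinfo_text):
    """Map (node, zone) to a list of per-CPU dicts: cpu, count, high, batch."""
    result = {}
    zone_key = None
    current = None
    for line in zoneinfo_text.splitlines():
        zone = re.match(r"Node (\d+), zone\s+(\S+)", line)
        if zone:
            zone_key = (int(zone.group(1)), zone.group(2))
            result[zone_key] = []
            current = None
            continue
        # Only "name: value" lines are per-CPU fields; the zone watermarks
        # ("high     1500") have no colon and are skipped.
        field = re.match(r"\s*(cpu|count|high|batch):\s*(\d+)", line)
        if field and zone_key is not None:
            name, value = field.group(1), int(field.group(2))
            if name == "cpu":
                current = {"cpu": value}
                result[zone_key].append(current)
            elif current is not None:
                current[name] = value
    return result


pcp = parse_pcp(SAMPLE)
for (node, zone), cpus in pcp.items():
    for entry in cpus:
        print(node, zone, entry)
```

On a live system the same function could be fed the contents of
/proc/zoneinfo instead of SAMPLE, for example before and after changing
vm.percpu_pagelist_fraction.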

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-io0-f200.google.com (mail-io0-f200.google.com [209.85.223.200]) by kanga.kvack.org (Postfix) with ESMTP id 4F2296B0253 for ; Fri, 15 Sep 2017 03:28:51 -0400 (EDT)
Received: by mail-io0-f200.google.com with SMTP id e9so4729576iod.4 for ; Fri, 15 Sep 2017 00:28:51 -0700 (PDT)
Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id 76si191441oic.515.2017.09.15.00.28.49 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 15 Sep 2017 00:28:49 -0700 (PDT)
Date: Fri, 15 Sep 2017 09:28:39 +0200
From: Jesper Dangaard Brouer
Subject: Re: Page allocator bottleneck
Message-ID: <20170915092839.690ea9e9@redhat.com>
In-Reply-To:
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
List-ID:
To: Tariq Toukan
Cc: David Miller, Mel Gorman, Eric Dumazet, Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha, Linux Kernel Network Developers, Andrew Morton, Michal Hocko, linux-mm, brouer@redhat.com

On Thu, 14 Sep 2017 19:49:31 +0300
Tariq Toukan wrote:

> Hi all,
>
> As part of the efforts to support increasing next-generation NIC speeds,
> I am investigating SW bottlenecks in network stack receive flow.
>
> Here I share some numbers I got for a simple experiment, in which I
> simulate the page allocation rate needed in 200Gbps NICs.

Thanks for bringing this up again.

> I ran the test below over 3 different (modified) mlx5 driver versions,
> loaded on server side (RX):
> 1) RX page cache disabled, 2 packets per page.

2 packets per page basically reduces the overhead you see from the page
allocator by half.

> 2) RX page cache disabled, one packet per page.

This should stress the page allocator.

> 3) Huge RX page cache, one packet per page.

A driver level page-cache will look nice, as long as it "works".

Drivers usually have no other option than to base their recycle facility
on the page-refcnt (as there is no destructor callback).
Which implies packets/pages need to be returned quickly enough for it
to work.

> All page allocations are of order 0.
>
> NIC: Connectx-5 100 Gbps.
> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>
> Test:
> 128 TCP streams (using super_netperf).
> Changing num of RX queues.
> HW LRO OFF, GRO ON, MTU 1500.

TCP streams with GRO is actually a good stress test for the page
allocator (or the driver's page-recycle cache). Eric Dumazet has made
some nice optimizations that (in most situations) cause us to quickly
free/recycle the SKB (coming from the driver) and store the pages in 1-SKB.
This causes us to hit the SLUB fastpath for the SKBs, but once the pages
need to be free'ed this stresses the page allocator more.

Also be aware that with TCP flows, the packets are likely delivered
into a socket that is consumed on another CPU. Thus, the pages are
allocated on one CPU and free'ed on another. AFAIK this stresses the
order-0 cache PCP (Per-Cpu-Pages).

> Observe: BW as a function of num RX queues.
>
> Results:
>
> Driver #1:
> #rings  BW (Mbps)
> 1       23,813
> 2       44,086
> 3       62,128
> 4       78,058
> 6       94,210 (linerate)
> 8       94,205 (linerate)
> 12      94,202 (linerate)
> 16      94,191 (linerate)
>
> Driver #2:
> #rings  BW (Mbps)
> 1       18,835
> 2       36,716
> 3       50,521
> 4       61,746
> 6       63,637
> 8       60,299
> 12      51,048
> 16      43,337
>
> Driver #3:
> #rings  BW (Mbps)
> 1       19,316
> 2       44,850
> 3       69,549
> 4       87,434
> 6       94,342 (linerate)
> 8       94,350 (linerate)
> 12      94,327 (linerate)
> 16      94,327 (linerate)
>
>
> Insights:
> Major degradation between #1 and #2, not getting any close to linerate!
> Degradation is fixed between #2 and #3.
> This is because page allocator cannot stand the higher allocation rate.
> In #2, we also see that the addition of rings (cores) reduces BW (!!),
> as result of increasing congestion over shared resources.
>
> Congestion in this case is very clear.
> When monitored in perf top:
> 85.58% [kernel] [k] queued_spin_lock_slowpath

Well, we obviously need to know the caller of the spin_lock. In this
case it is likely the page allocator lock. It could also be the TCP
socket locks, but given GRO is enabled, they should be hit much less.

> I think that page allocator issues should be discussed separately:
> 1) Rate: Increase the allocation rate on a single core.
> 2) Scalability: Reduce congestion and sync overhead between cores.

Yes, but this is no small task. It is on my TODO-list (emacs org-mode),
but I have other tasks that have higher priority atm. I'll be working
on XDP_REDIRECT for the next many months. Currently trying to convince
people that we do an explicit packet-page return/free callback (which
would avoid many of these issues).

> This is clearly the current bottleneck in the network stack receive
> flow.
>
> I know about some efforts that were made in the past two years.
> For example the ones from Jesper et al.:
>
> - Page-pool (not accepted AFAIK).

The page-pool has many purposes.
 1. generic page-cache for drivers,
 2. keep pages DMA-mapped
 3. facilitate drivers to change RX-ring memory model

From a MM-point-of-view the page pool is just a destructor callback,
that can "steal" the page.

If I can convince XDP_REDIRECT to use an explicit destructor callback,
then I almost get what I need. Except for the generic part, and the
normal network path will not see the benefit. Thus, not helping your
use-case, I guess.

> - Page-allocation bulking.

Note that page-allocator bulking would still be needed by the
page-pool and other page-cache facilities. We should implement it
regardless of the page_pool.

Without a page pool facility to hide the use of page bulking, you
could use page-bulk-alloc in driver RX-ring refill, find where TCP
frees the GRO packets, and do page-bulk-free there.

> - Optimize order-0 allocations in Per-Cpu-Pages.

There is a need to optimize PCP some more for the single-core XDP
performance target (~14Mpps). I guess, the easiest way around this is
to implement/integrate a page bulk API into PCP.

The TCP-GRO use-case you are hitting is a different bottleneck.
It is a multi-CPU parallel workload, that exceeds the PCP cache size,
and causes you to hit the page buddy allocator.

I wonder if you could "solve"/mitigate the issue if you tune the size
of the PCP cache?
AFAIK it only keeps 128 pages cached per CPU... I know you can see this
via a proc file, but I cannot remember which(?). And I'm not sure how
you tune this(?)

> I am not an mm expert, but wanted to raise the issue again, to combine
> the efforts and hear from you guys about status and possible directions.

Regarding recent changes... if you have your kernel compiled with
CONFIG_NUMA then the page-allocator is slower (due to keeping
numa-stats), except that this was recently optimized and merged(?)

What (exact) kernel git tree did you run these tests on?

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f69.google.com (mail-wm0-f69.google.com [74.125.82.69]) by kanga.kvack.org (Postfix) with ESMTP id 7BBAE6B0033 for ; Fri, 15 Sep 2017 06:23:32 -0400 (EDT)
Received: by mail-wm0-f69.google.com with SMTP id b195so2712583wmb.6 for ; Fri, 15 Sep 2017 03:23:32 -0700 (PDT)
Received: from outbound-smtp05.blacknight.com (outbound-smtp05.blacknight.com.
 [81.17.249.38]) by mx.google.com with ESMTPS id q30si543577wra.109.2017.09.15.03.23.31 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 15 Sep 2017 03:23:31 -0700 (PDT)
Received: from mail.blacknight.com (pemlinmail03.blacknight.ie [81.17.254.16]) by outbound-smtp05.blacknight.com (Postfix) with ESMTPS id 149DB99682 for ; Fri, 15 Sep 2017 10:23:31 +0000 (UTC)
Date: Fri, 15 Sep 2017 11:23:20 +0100
From: Mel Gorman
Subject: Re: Page allocator bottleneck
Message-ID: <20170915102320.zqceocmvvkyybekj@techsingularity.net>
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:
To: Tariq Toukan
Cc: David Miller, Jesper Dangaard Brouer, Eric Dumazet, Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha, Linux Kernel Network Developers, Andrew Morton, Michal Hocko, linux-mm

On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote:
> Insights:
> Major degradation between #1 and #2, not getting any close to linerate!
> Degradation is fixed between #2 and #3.
> This is because page allocator cannot stand the higher allocation rate.
> In #2, we also see that the addition of rings (cores) reduces BW (!!), as
> result of increasing congestion over shared resources.
>

Unfortunately, no surprises there.

> Congestion in this case is very clear.
> When monitored in perf top:
> 85.58% [kernel] [k] queued_spin_lock_slowpath
>

While it's not proven, the most likely candidate is the zone lock and
that should be confirmed using a call-graph profile. If so, then the
suggestion to tune the size of the per-cpu allocator would mitigate
the problem.

> I think that page allocator issues should be discussed separately:
> 1) Rate: Increase the allocation rate on a single core.
> 2) Scalability: Reduce congestion and sync overhead between cores.
>
> This is clearly the current bottleneck in the network stack receive flow.
>
> I know about some efforts that were made in the past two years.
> For example the ones from Jesper et al.:
> - Page-pool (not accepted AFAIK).

Indeed not and it would also need driver conversion.

> - Page-allocation bulking.

Prototypes exist but it's pointless without the pool or driver
conversion so it's on the back burner for the moment.

> - Optimize order-0 allocations in Per-Cpu-Pages.
>

This had a prototype that was reverted as it must be able to cope with
both irq and noirq contexts. Unfortunately I never found the time to
revisit it but a split there to handle both would mitigate the problem.
Probably not enough to actually reach line speed though so tuning of the
per-cpu allocator sizes would still be needed. I don't know when I'll
get the chance to revisit it. I'm travelling all next week and am mostly
occupied with other work at the moment that is consuming all my
concentration.

> I am not an mm expert, but wanted to raise the issue again, to combine the
> efforts and hear from you guys about status and possible directions.

The recent effort to reduce overhead from stats will help mitigate the
problem. Finishing the page pool, the bulk allocator and converting drivers
would be the most likely successful path forward but it's currently stalled
as everyone that was previously involved is too busy.

--
Mel Gorman
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id 71DC86B0038 for ; Sun, 17 Sep 2017 11:43:23 -0400 (EDT)
Received: by mail-pg0-f71.google.com with SMTP id m30so13655160pgn.2 for ; Sun, 17 Sep 2017 08:43:23 -0700 (PDT)
Received: from EUR01-VE1-obe.outbound.protection.outlook.com (mail-ve1eur01on0065.outbound.protection.outlook.com.
 [104.47.1.65]) by mx.google.com with ESMTPS id k2si3432548pgc.704.2017.09.17.08.43.21 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Sun, 17 Sep 2017 08:43:21 -0700 (PDT)
Subject: Re: Page allocator bottleneck
References: <87vaklyqwq.fsf@linux.intel.com>
From: Tariq Toukan
Message-ID:
Date: Sun, 17 Sep 2017 18:43:09 +0300
MIME-Version: 1.0
In-Reply-To: <87vaklyqwq.fsf@linux.intel.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andi Kleen, Tariq Toukan
Cc: David Miller, Jesper Dangaard Brouer, Mel Gorman, Eric Dumazet, Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha, Linux Kernel Network Developers, Andrew Morton, Michal Hocko, linux-mm

On 14/09/2017 11:19 PM, Andi Kleen wrote:
> Tariq Toukan writes:
>>
>> Congestion in this case is very clear.
>> When monitored in perf top:
>> 85.58% [kernel] [k] queued_spin_lock_slowpath
>
> Please look at the callers. Spinlock profiles without callers
> are usually useless because it's just blaming the messenger.
>
> Most likely the PCP lists are too small for your extreme allocation
> rate, so it goes back too often to the shared pool.
>
> You can play with the vm.percpu_pagelist_fraction setting.

Thanks Andi.
That was my initial guess, but I wasn't familiar with these VM tunables
to verify that.
Indeed, the bottleneck is relieved when increasing the PCP size, and BW
becomes significantly better.

>
> -Andi
>
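
As an aside to the thread, the way the fraction setting translates into PCP
list sizes can be sketched roughly as below. This is not from the original
mails. It assumes the behaviour described in the kernel's vm sysctl
documentation of roughly that era: pcp->high is the zone size divided by the
fraction, the batch is high/4 capped at PAGE_SHIFT * 8, fractions below 8
are rejected, and 0 keeps the default sizing. The default path uses the
"batch 31, upper limit 6*batch" figures quoted elsewhere in this thread. The
function name pcp_limits and the example zone size are hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def pcp_limits(zone_managed_pages, fraction, default_batch=31):
    """Return (high, batch) for one per-CPU page list in one zone."""
    if fraction == 0:
        # Default sizing: the list may hold up to 6 * batch pages.
        batch = max(1, default_batch)
        return 6 * batch, batch
    if fraction < 8:
        raise ValueError("fraction must be 0 or at least 8")
    high = zone_managed_pages // fraction
    # The batch follows high, but is capped so refills stay bounded.
    batch = max(1, min(high // 4, PAGE_SHIFT * 8))
    return high, batch


ZONE_PAGES = 4_000_000  # invented zone size, about 15 GiB of 4 KiB pages

for fraction in (0, 8, 1000, 100000):
    high, batch = pcp_limits(ZONE_PAGES, fraction)
    print(f"fraction={fraction}: high={high} batch={batch}")
```

The point of the sketch is only that a smaller fraction lets each CPU keep
far more pages locally before it has to go back to the shared zone lists.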

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id 279CE6B0038 for ; Sun, 17 Sep 2017 12:16:28 -0400 (EDT)
Received: by mail-pg0-f72.google.com with SMTP id v82so13723957pgb.5 for ; Sun, 17 Sep 2017 09:16:28 -0700 (PDT)
Received: from EUR01-VE1-obe.outbound.protection.outlook.com (mail-ve1eur01on0082.outbound.protection.outlook.com. [104.47.1.82]) by mx.google.com with ESMTPS id r69si3364371pfg.503.2017.09.17.09.16.25 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Sun, 17 Sep 2017 09:16:26 -0700 (PDT)
Subject: Re: Page allocator bottleneck
References: <20170915092839.690ea9e9@redhat.com>
From: Tariq Toukan
Message-ID: <6069fd36-ed0e-145c-3134-35232bf951a7@mellanox.com>
Date: Sun, 17 Sep 2017 19:16:15 +0300
MIME-Version: 1.0
In-Reply-To: <20170915092839.690ea9e9@redhat.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jesper Dangaard Brouer, Tariq Toukan
Cc: David Miller, Mel Gorman, Eric Dumazet, Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha, Linux Kernel Network Developers, Andrew Morton, Michal Hocko, linux-mm

On 15/09/2017 10:28 AM, Jesper Dangaard Brouer wrote:
> On Thu, 14 Sep 2017 19:49:31 +0300
> Tariq Toukan wrote:
>
>> Hi all,
>>
>> As part of the efforts to support increasing next-generation NIC speeds,
>> I am investigating SW bottlenecks in network stack receive flow.
>>
>> Here I share some numbers I got for a simple experiment, in which I
>> simulate the page allocation rate needed in 200Gbps NICs.
>
> Thanks for bringing this up again.

Sure. We need to keep up with the increasing NIC speeds.

>
>> I ran the test below over 3 different (modified) mlx5 driver versions,
>> loaded on server side (RX):
>> 1) RX page cache disabled, 2 packets per page.
>
> 2 packets per page basically reduces the overhead you see from the page
> allocator by half.
>
>> 2) RX page cache disabled, one packet per page.
>
> This should stress the page allocator.
>
>> 3) Huge RX page cache, one packet per page.
>
> A driver level page-cache will look nice, as long as it "works".

I verified that it worked in the experiment.

>
> Drivers usually have no other option than to base their recycle facility
> on the page-refcnt (as there is no destructor callback).
> Which implies packets/pages need to be returned quickly enough for it
> to work.

Yes, that's how our current default (small) RX page-cache is implemented.
Unfortunately, the timing and terms for a fair reuse rate are not always
satisfied.

>
>> All page allocations are of order 0.
>>
>> NIC: Connectx-5 100 Gbps.
>> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>>
>> Test:
>> 128 TCP streams (using super_netperf).
>> Changing num of RX queues.
>> HW LRO OFF, GRO ON, MTU 1500.
>
> TCP streams with GRO is actually a good stress test for the page
> allocator (or the driver's page-recycle cache). Eric Dumazet has made
> some nice optimizations that (in most situations) cause us to quickly
> free/recycle the SKB (coming from the driver) and store the pages in 1-SKB.
> This causes us to hit the SLUB fastpath for the SKBs, but once the pages
> need to be free'ed this stresses the page allocator more.

Yep, bulking would help here, as you mention below.

>
> Also be aware that with TCP flows, the packets are likely delivered
> into a socket that is consumed on another CPU. Thus, the pages are
> allocated on one CPU and free'ed on another. AFAIK this stresses the
> order-0 cache PCP (Per-Cpu-Pages).
>

Good point. Do you know of any tool/kernel counters that help observe and
quantify this behavior?

>
>> Observe: BW as a function of num RX queues.
>>
>> Results:
>>
>> Driver #1:
>> #rings  BW (Mbps)
>> 1       23,813
>> 2       44,086
>> 3       62,128
>> 4       78,058
>> 6       94,210 (linerate)
>> 8       94,205 (linerate)
>> 12      94,202 (linerate)
>> 16      94,191 (linerate)
>>
>> Driver #2:
>> #rings  BW (Mbps)
>> 1       18,835
>> 2       36,716
>> 3       50,521
>> 4       61,746
>> 6       63,637
>> 8       60,299
>> 12      51,048
>> 16      43,337
>>
>> Driver #3:
>> #rings  BW (Mbps)
>> 1       19,316
>> 2       44,850
>> 3       69,549
>> 4       87,434
>> 6       94,342 (linerate)
>> 8       94,350 (linerate)
>> 12      94,327 (linerate)
>> 16      94,327 (linerate)
>>
>>
>> Insights:
>> Major degradation between #1 and #2, not getting any close to linerate!
>> Degradation is fixed between #2 and #3.
>> This is because page allocator cannot stand the higher allocation rate.
>> In #2, we also see that the addition of rings (cores) reduces BW (!!),
>> as result of increasing congestion over shared resources.
>>
>> Congestion in this case is very clear.
>> When monitored in perf top:
>> 85.58% [kernel] [k] queued_spin_lock_slowpath
>
> Well, we obviously need to know the caller of the spin_lock. In this
> case it is likely the page allocator lock. It could also be the TCP
> socket locks, but given GRO is enabled, they should be hit much less.
>

It is the page allocator lock. I verified this based on Andi's suggestion,
see other mail.
It's nice to have the option to dynamically play with the parameter.
But maybe we should also think of changing the default fraction guaranteed
to the PCP, so that unaware admins of networking servers would also benefit.

>
>> I think that page allocator issues should be discussed separately:
>> 1) Rate: Increase the allocation rate on a single core.
>> 2) Scalability: Reduce congestion and sync overhead between cores.
>
> Yes, but this is no small task. It is on my TODO-list (emacs org-mode),
> but I have other tasks that have higher priority atm. I'll be working
> on XDP_REDIRECT for the next many months.
> Currently trying to convince people that we do an explicit packet-page
> return/free callback (which would avoid many of these issues).
>
>
>> This is clearly the current bottleneck in the network stack receive
>> flow.
>>
>> I know about some efforts that were made in the past two years.
>> For example the ones from Jesper et al.:
>>
>> - Page-pool (not accepted AFAIK).
>
> The page-pool has many purposes.
>  1. generic page-cache for drivers,
>  2. keep pages DMA-mapped
>  3. facilitate drivers to change RX-ring memory model
>
> From a MM-point-of-view the page pool is just a destructor callback,
> that can "steal" the page.
>
> If I can convince XDP_REDIRECT to use an explicit destructor callback,
> then I almost get what I need. Except for the generic part, and the
> normal network path will not see the benefit. Thus, not helping your
> use-case, I guess.
>

I see.

>
>> - Page-allocation bulking.
>
> Note that page-allocator bulking would still be needed by the
> page-pool and other page-cache facilities. We should implement it
> regardless of the page_pool.

I agree. It fits perfectly with our Striding RQ feature, in which each RX
descriptor is relatively large and serves multiple received packets,
requiring the allocation of many order-0 pages.

>
> Without a page pool facility to hide the use of page bulking, you
> could use page-bulk-alloc in driver RX-ring refill, find where TCP
> frees the GRO packets, and do page-bulk-free there.
>

Exactly.

>
>> - Optimize order-0 allocations in Per-Cpu-Pages.
>
> There is a need to optimize PCP some more for the single-core XDP
> performance target (~14Mpps). I guess, the easiest way around this is
> to implement/integrate a page bulk API into PCP.
>
> The TCP-GRO use-case you are hitting is a different bottleneck.
> It is a multi-CPU parallel workload, that exceeds the PCP cache size,
> and causes you to hit the page buddy allocator.
>

Indeed, I verified that.

> I wonder if you could "solve"/mitigate the issue if you tune the size
> of the PCP cache?
> AFAIK it only keeps 128 pages cached per CPU... I know you can see this
> via a proc file, but I cannot remember which(?). And I'm not sure how
> you tune this(?)
>

/proc/sys/vm/percpu_pagelist_fraction

>
>> I am not an mm expert, but wanted to raise the issue again, to combine
>> the efforts and hear from you guys about status and possible directions.
>
> Regarding recent changes... if you have your kernel compiled with
> CONFIG_NUMA then the page-allocator is slower (due to keeping

Yes it is.

> numa-stats), except that this was recently optimized and merged(?)
>

Sounds useful, I should get familiar with these stats.
Do you know how to observe them?

> What (exact) kernel git tree did you run these tests on?
>

I had a few mlx5 driver patches on top of:
96e5ae4e76f1 bpf: fix numa_node validation

Many thanks!

Regards,
Tariq

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id 6603B6B0038 for ; Mon, 18 Sep 2017 03:35:06 -0400 (EDT)
Received: by mail-pg0-f72.google.com with SMTP id 6so16365523pgh.0 for ; Mon, 18 Sep 2017 00:35:06 -0700 (PDT)
Received: from mga03.intel.com (mga03.intel.com.
 [134.134.136.65]) by mx.google.com with ESMTPS id t185si4253977pgd.542.2017.09.18.00.35.04 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 18 Sep 2017 00:35:04 -0700 (PDT)
Date: Mon, 18 Sep 2017 15:34:48 +0800
From: Aaron Lu
Subject: Re: Page allocator bottleneck
Message-ID: <20170918073447.GB4107@intel.com>
References: <20170915092839.690ea9e9@redhat.com> <6069fd36-ed0e-145c-3134-35232bf951a7@mellanox.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="rwEMma7ioTxnRzrJ"
Content-Disposition: inline
In-Reply-To: <6069fd36-ed0e-145c-3134-35232bf951a7@mellanox.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Tariq Toukan
Cc: Jesper Dangaard Brouer, David Miller, Mel Gorman, Eric Dumazet, Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha, Linux Kernel Network Developers, Andrew Morton, Michal Hocko, linux-mm, Dave Hansen

--rwEMma7ioTxnRzrJ
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Sun, Sep 17, 2017 at 07:16:15PM +0300, Tariq Toukan wrote:
>
> It's nice to have the option to dynamically play with the parameter.
> But maybe we should also think of changing the default fraction guaranteed
> to the PCP, so that unaware admins of networking servers would also benefit.

I collected some performance data with will-it-scale/page_fault1 process
mode on different machines with different pcp->batch sizes, starting
from the default 31 (calculated by zone_batchsize(), 31 is the standard
value for any zone that has more than 1/2MiB memory), then incremented
by 31 upwards till 527. PCP's upper limit is 6*batch.

An image is plotted and attached: batch_full.png (full here means the
number of processes started equals the number of CPUs).
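
As an aside to the thread, the sweep described above can be written out
explicitly. The sketch below is not from the original mail. It only
enumerates the batch values from 31 to 527 in steps of 31 and applies the
stated "upper limit is 6*batch" rule; the variable names are made up for
illustration.

```python
DEFAULT_BATCH = 31  # default pcp->batch quoted in the mail
LAST_BATCH = 527    # largest batch size tested
HIGH_FACTOR = 6     # "PCP's upper limit is 6*batch"

# Every batch size in the sweep, paired with the resulting per-CPU limit.
batches = list(range(DEFAULT_BATCH, LAST_BATCH + 1, DEFAULT_BATCH))
sweep = [(batch, HIGH_FACTOR * batch) for batch in batches]

print(f"{len(sweep)} configurations")
for batch, high in sweep:
    print(f"batch={batch:3d}  high={high:4d} pages per CPU")
```

So the sweep covers 17 configurations, from 186 pages per CPU at the default
batch up to 3162 pages per CPU at batch 527.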

From the image:
- For EX machines, they all see throughput increase with increased batch
  size, peak at around batch_size=310, then fall;
- For EP machines, Haswell-EP and Broadwell-EP also see throughput
  increase with increased batch size, peak at batch_size=279, then
  fall; batch_size=310 also delivers a pretty good result. Skylake-EP is
  quite different in that it doesn't see any obvious throughput increase
  after batch_size=93, though the trend is still increasing, but in a very
  small way, and it finally peaks at batch_size=403, then falls.
  Ivybridge EP behaves much like the desktop ones.
- For Desktop machines, they do not see any obvious changes with
  increased batch_size.

So the default batch size (31) doesn't deliver a good enough result; we
probably should change the default value.

--rwEMma7ioTxnRzrJ
Content-Type: image/png
Content-Disposition: attachment; filename="batch_full.png"
Content-Transfer-Encoding: base64

[base64-encoded image data for batch_full.png omitted]
3PLEF6xvVV+7/Qf/IP/99ueJr+PB7Qnbg7CqtoBn7wP+Z/BxLqHWE4qCg/pZ1r/+YYor95it aVrBn+fe7PuTPz1u7M6Rkr8lSKqzIm0Ktihdv/rBrx8GNy1cuna99T9vbhw82CU8HAG7DdZz j8GigHFtT6eRzv5Xgiv3uJ1pVsHf6nfBUw0fPHrvZsHfElTlpb/DYNoMau9j29oPdV0B+9Jt N0g3P8/T7rSjujv5KG+f5za69/efdUaTCrWeUBQc0c/zSaQL/qXayj16YZpU8Oc7Liye5OAf jDaw3G8Jum73ws3KOi2aL/LCD29dAfvxbN2MgNufv25i/lbAUxgB48KCziCVWrlbtGX0Cv40 x5a3b/vPus/GG1jrtwRJTZlm9+bMyqRt2LJIb81327YtmgJuf97v/W1QwMBCP6FXcNA5Hsto g3HUp/0ws1a2b/tvm/9OFDAVjG/Kqk7ym9u47HbqNg2bpU25tJug81tXwM3P3VHQSZ49xsGr C5ijoHFRL2vsdatpjbW72TFT8QbB0zMqpH3bJ/D/mWxgkV8SVOXuICu/ndkfVVW7Ks6bgbAv 4LxwpwF3Bdz+vCjv/yTvnqK7FGW+sIDXEGo9oSg4lp8o/ZtIrN0tj1mOU8GTMym4fdtncR+n G5jrU2Khuj/HV41Q6wlFwaG8XTwp4Ll2r2Db83bDK3JqBsVp3/apks8NvPfvCIeSv+7kFSLU ekJRcCCvw9/QIdK+q3fzS1eF9eTEzInYvu3zfSrgvX9HOJQqrfK9M0wSaj2hKDiOkesmhT7l jqv3LS4duX4a4zMmdvu2T/qxgalgnIJQ6wlFwVG8D3+j7CLc65DobW6esK4wx+eJSfu2z/z5 51Qwjk+o9YSi4CBGRklvq+XH7WeX2WP9vtn9A5e35sbt2z79559TwTg6odYTioJDGBn+jo1/ //j/W2779fuGN/BdVp1js8K4fRP35xUVjHMTaj2hKDiC0Yv2jz1wdVVsvH7fsH+TJRU8Mhvs 29f5YRSMj5q7Jfjb+37z4VDoLB3cKak/D7h7+lt4zklCrScUBfrGhr/jA+DGuqlsuX7ftn+T uRX8Pgu2aV/nJ6GC8YG/UkZze99vlhRwPvgsn307wjWEWk8oCuRNXK9/7Jvr69c/5Vbr9837 d9Y0317+du3r/DST/PwgKviymq6cdaGNtQXszmNaH/CbnVqvu9SXQBQc0Pjwd7R/XWH8CTr7 dZtDovfo36/d9vrKt23f5PF31tcKtk8CRX0BZ2VeuStO1lXanPd76zdO38q0qrubEt7cgDZ3 n+Xuod1tCj8UcGF4QcodW694+YuEAsZMUzfLef+Wb1//P4vrT8S0TwF/nDEvr3rz9nW6X/XC vxRwDf0m6KwsXJfWaZHfv86bHzcbp+s0u3/PjYBd9xbu6s+3tLkHQ565Jo40Av63segF7Nh6 6cu+bQoYs0wMf0f692mtHdYe1mv4vfo3mey2t/rdIsu7xy+bCsab5ngp17dZsxnaN25384V2 pOs3Pueud91npb8nUtU+1H9r/CCsqi3g2fuANy/gunjajFxX95H+beoh7o4T1eDRt9cd5xQw Zpiq3/cCfl1jBzaI6Rp+x/5NhnPqt/v4/Gp3Gfw2Br9uKhgvmhFwcW/gtkCbsvTd6m6LlLph XvO9smnaOnWl7G6X1Hzb3RPpwwg4bW+tNMfWBVxXt2xQwG7wf38Zt4mHlFXuZ1Snen1ZFDC+ +3CfuucvR9bWoS1it4bft3+HAX79/z+90h3b1xn+xqlgPGm78j7i9TcXTPKmf4rS3/i3+br9 XuXvS1hn1b16XAfn6eNso66A6+7ko/z56V/8u87oCwhsvWEBN39ClK/HmrUPaV5KP5iv314a BYxvpoe/yawNpqFNYrSGnxfrd/AxfoQuw+/zQWc7t6/z9DungjHQNuS9dZoCHoyA/cC1fh4B 
3795H/hlldsFPCijL/uA59phH/B4AfuXWrdj4fYhRdl/9J+9HRROAeOLuTeJnV5LR7gbX/xV /NxQv0nE/v199edP91kfbP/2dZ5/7VQwHroR8KOA2/29RduvhWuhfh/wvXbuY9869d/pz0s6 RwHnpd8EXfujzu5fVU8Pab7sX+XrIVgUML74MPxNnjeZfniSCI0SfRU//5JUk/37VqdfjeVw h4z//vltAom0b/L+dxcVjFa7D9ifhuS/0Rzx7L5blHV+8wXcHwXtduom7ixYV87uKOgkz7Kz FHBSl80+b39E2uNI8K6Ai6fHvx2CRQHjs0+3p3vaZvr5aWK0StxV/MwN0EvrdFWUP+2Fs3Xa 13n71X+vYNPt9VDRHCXlTuZtC7g/D/hey/choS+k7jzgbhxcNUdMu+OCyyyfvBRlvrCA14hY wHVzWlWzwzt9XJtkvIDjR8GpfRr+Dle033sjSrVErOBPcV4qdnoEHMtv0IU7zbz/8r9WcNTt 9YCNiAXcnlblP97S6uUhr5ugR6L0wkLhfL7cnb37ZF5zRKmXWBX8FmZ6YLtBp2j27+jvf+4o GFATreuGBdxVbO52B1ePfbwTB2GNhAqLgvP6PPx9rGhnF0ecgomy5ffP47m+bk3eZquqYv1O /AH2rYJNkgARxR8Bp/7T/PGTidOQokfBaX2u325Fu6Q4YnVMWAX7Y4+j7sMNt/bmydbG3wOf fue/HI0FeXEK2J9ydGv2AWftVcG6W0RNXYgjdhSc1Jfhb9u/C8dt0YZ5i1fyz2Ndta770/1P zsS7YPIX6bfXU8HQFtR6j0tmNuf8dpeirLuDvIcPcQelPV2KMm4UnNWX+m36d0WfRisZt5L/ tnV4aguzYtOpmnojjP/uu98IFQxlQq0nFAUqvg1//Zp23XA2Xvm1x/u8HjL1fb8u/bvA9Dvh 8++fCoYuodYTigIRX+v33nqrtyZHrL/f7pjbRafm0r+LfHgzUMFotGfu5nPvYORNnAfcnWb8 ds2oiIRaTygKJHwf/rrzVtc/f8zjfVdcEIP+XejT24EKhhOngPPBZ/ns2xGuIdR6QlGg4Hv9 BldoxM3Qy896oYCX+viOoIJhUcDtmT1GhFpPKAr2N2P4G6HCoh0MnSw+75T+XezLW4IKxrCA 3QZkd4OC9mb0t6R2p+fc3I/c/QiT3F210l2x8UsBF4YXpBRqPaEo2N0Gw9/mSSI8x6prZNC/ K3x7V1DBV/c6As796bBVlSf1vYrddwt3Uo67H2HenDdbxxwB/12aV6j1hKJgZzOGv/d1bZT1 6U4XfqJ/V/n6vqCCr607gOqxCdqPdbvrMrobMpSuat3dcNsrR1VTB2FVbQHP3wf811mWV6j1 hKJgX7PqN9qlBvfoQvp3pXlvjWlU8Lk9jYAzd38+V75lefPfvrdsnbpKLrPusozuLkofRsBu K/bs3cl/k8VjYKHWE4qCPX0f/vp17Db3IzJCAa80Y9eEf3tMX9KLCj6zYQEXpdvR6wrY7+29 +eFwViVV5jo472+H0BVw3Z18lD8/24u/64zmFWo9oSjY0cytjDFXo5vXIf272pwGTtrbGk/M Zir4vIYF7Ieudbv5uS7ctugyq7Ikq9wu4MGNCb7sA56NETCO7uvwt93CGHcdunEh0r8BZjVw 8vmuilTwWQ0L2Dds8bgvnx8UV/carlO349fv/W1EK2D2AePY5h7mGnsFuumhWPRviAUFTAVf zssm6PzmCjiv6vz+E38Okh/7dncrcJfZyOIVMEdB49C+DX/7FWr81ed2rUj/hpnXwMnH/k2o 4HMaFnBeuNOA3Qi4u0/Q/ft+7Ou6uDk7uMzyyUtR5ssLeDGh1hOKgl3Mv86Cxbpzs16kgAPN 2w3s/u/zrKaCsTeh1hOKgh18Gf4OV6Y2K86NipH+DTbvIml/vu9ZoIKxL6HWE4qC7X0b/g4+ 
t1prbrIjmP4NN3MjtEMFQ5lQ6wlFwdYWDH/N+jfZoh3p3xgWNPDXOU4FYz9CrScUBRtbUL+W /Wvfj/RvHDEbmArGboRaTygKNuPWpJ+Hv68bEW3XlrYNSf/GsqiBqWCIEmo9oSjYzM/Su7xa rypNdwRTwLEsKeA5850Kxh6EWk8oCjbz83H4+96GG6wn7VqS/o0ndgNTwUc1dbJuVr583TdM fxGs8SfsLhLd3Izh9unBgYRaTygKtvHTmPrxyGB0k5WkVU/SvzEtbGAq+KwMCjgffJbPvx3h CkKtJxQFm3DdO70aHVtlbrSGtGlK+jeuZQ08a/ZTwQc0t4AHlhSwu3DWqlyzCLWeUBSY60a+ U2vR8RHLVutHi66kfyNbWMDzfgFU8OHcW7KtzKpsrvjcNGxW5lVzM1//WdGMgG9lWtW+gPvP /N0K6+ETPn1WGF6QUqj1hKLA1mCz8/hadGKD4XYrR4NDsSjg2BY3MBV8Rq4l27sruHsQFkl7 E+CsLOp7NbeftZug6zTL88rVbv+ZuytDnqX18AmHny0YAf9naXih1hOKAkOfD7vypu7jGjvK igwyz4flDTzzl0AFH4prycJtbs5cXfqabT+4SnU/aT7z32xGvO6WDIPP/LbqfrN0exBW1Rbw /H3A/3GWhRdqPaEosDKjfScHKhuvFuM2Jv1rwaiBqeAjcS1Zu4ptK/XWDoPbTc73nzaf+Y9N mZbV+2f9LuPhCNgdBT37GCwKGMrmtO/0dsLN14kxO5P+NbG4gGfvW6CCD8P35b1yffX6Gq77 Ea//vOlW93X/mOFn3WlHdXfyUT586jf/WWc0vFDrCUVBdJ9ONxqYXj/usEKMtyOY/jWyvIFn /y6o4INotxN3pxndvywehduOgPuvp0fAL0/48tksjIAhaV75fm68XVaHsYqTArZi2MBUcFR/ uv/F5lvyPqD1250TNxhuBreDfcB9AY/uA65GnvDls1koYOiZ276fB5w7rQvjrC/oXzMrCnjB lo2LVPDv4KOdP/7/4msPVS67w5iztKkSf+xzexR08/3k7SjoojsKOsmz7PkJnz+biaOgIWV2 +35ZK+62IoyxxqB/Da1p4AW/kWtU8G+yxSL2x+Y6601L3tLuIKp7n/r/Ds8D9l+Pngfsz1oq yvsD8/4JG/nyAl5MqPWEoiCGmbt9vS+L5o5rwfCVBv1ral0DU8FPfu0XsT8N68kk7SFYM80/ xNmGUOsJRUGweUc8t//7tljuuwoMXGfQv8ZWNfCSX8v5K/h3i5e4Tf0m+efrTPbcdufKfIj7 hVDrCUVBmPm7fX37flss917/Ba026F9r6wp40S/mxBX867gR8Ab3+bTZB/ysSqt83iNvzxeg 3IdQ6wlFQYAFG55n/lW8+9ovZLVBAZtb28AXr+Dfpnv9p4n1KNgv5zZHQR+aUOsJRcFKS3b7 JnN3Cwms+tZvO2OFs4GVDbzsl3OmCu6rt/my/Wj3Cv+MfIZEqvWEomCNZeXbmDEA1ljvrVxv sLrZxCYNfI4Kfune1x+aTHO4kLNEPBFqPaEoWGxN+87aLaSy0lu14mBts421Bbx008axK/hj 93aPMZjunw9fXZ1Q6wlFwTKr2rdZ+33bLaSzxlux4mBds5XVDbz0d3TMCv6d073dQ2NP/HUO L5vj7am4+exbEjnulN+sL5TmngrFbeLB5fj3NyHUekJRsMDK9p23ICqt7pbXKQW8mc0a+GgV PL96+38Rc/oj2xgWzfE4BXx/jjxLx89P6gs4276DhFpPKApmWnjQ1ZN5d0df++wWlh6KRf9u KKCBz1rBy7u3+4fxMozO2yUzPFoBuyt0FKMPpoA9oSiYI6B856711NZ0i1bV9O+WAt6Ly39T 6hW8YJPzxL+Pk2NqKV8ww4cFfPNbkt3X7uKR1a256NXN/Sh3n+Xdmb3jBZwU7luPB7XP0RSw 
e5LCb6m+F7V7hPsX/cUs51pcYkKtJxQFXwW179wlUG81t2DNQf9ua9MG9hW1zT0MFgqs3v5p wp/jw2ydP8NfR8B55fqyqvKkvlex+27hNi3f0uamCrm/B9JEAd/uP+sf1D2HK+DM30Cpu4Pw /RFVmTe3c8hv6cS+4xHp4hYTaj2hKPgssH2P278L1hz079aCGnhVBW9zD4PZInVv92yhT/Bx ns6e392dER6boP1YtytFd9nJ0vVm4W4rWHbfmijg+v7P+gd1z3Ev4ObOhe0/8Y/wd3Tob2g4 EwUMayG7fVsz13ZCq7aBmeHp3+1tsE3mye/vBvcwmCdu93bPGfTPv8zQufP7aQScla6K78VZ ljf/7Xtl1qmrZLehOM1+/WN+p0fAt6QpcjfqfTxHWZTNFSmbf9I8oiof90+afb3oNFlcY0Kt JxQFE9vXwss3mb3oiaza3s3KTwFvL+y9ueY39vsQNOkQlpMPeNqvf6euKWBflLkrYL8j9+aH w1mVVFnuNy63+hFw3Xw52AQ9eFD3HFnaHR7tCzdvRsZF2W+SfrlgdLrO6MsTaj2hKEhG7hEa pX3nL3myBTznFdC/ewhs4BWboR/v0t9XQVFmTt18OqufPt4iMixgfzxU3W46rgtXjGVWZUlW 3fzgM+t+Ix8Owno6mto/R1Z2h0cHjoDTwce5hFpPKAqcp+1rkdr3DP074zXQv/vY5NCE3qd9 wLEaefxiNZuNuldNZd6fMvPm9rCAfTUWjyDisEgAACAASURBVIOi/KC4uldonfpbELoPzWz5 cBrSy80K78+Rld0JSu0+4CpZtQ/481B36h8tebAtoSh4XsSjte/8UYZy/359GfTvTjY6OqG1 7CjolYX86y/XOrxzwsZbvJdPa+5MnPW4l03Q+c0VcF7VuT+m2R2jlbjuy/xj/ue/v/n//u/4 PuD2QhzuKOj751n/HG5/cNPA/hnvn2d5c6NgdxR0suQo6OWEWk8oyrV164ffH/9JvPZd0Eza /Zt8fiX0726C36rb/e7mDpF///xp+3envc0LJ7ngj5g5jxwWcF6404BdIdZVcw7v/ft+7Oub 8/e/xf9Ny+y/ye//frgUpTv9t8zy/jn8ecC1fx73o6DzgBcTaj2hKFf1tIT/3tdm9/aNeZ/Q 8/Tvx9dCAe/nQA38aryR21t27nyg1/zH7natmqczs+PMrQ2uEi3UekJRLmfkb3A38v356f/0 jjCV+YvbAQp4+tXQv3sKb2CZ359b7ObdMts+yszH7Xa11vcVVIRVFgUMa9ObtV42PQd38IJl 8wj9O/mCdl9ZXtt258nZ88375/fP7/4VPG/p3+t+JePpgiuYAoadz3uUXPu+rsuCOnjBknaM /k3GX9Pea8rLO0sDt6X75/cexx2JtXeoGQv/TnfsnE6253b7eYRaTyjKuc05CvPn8eH9366a 6Bn7d+xV7b2aRJQG3vvX2A95+0Otdw/1ZcFcly/4RX1ZIYlXsFDrCUU5q7mHUX4+7nlNB5+z f0de194rSURp4F1/j5MbnPeu4I8L/tpsQa9pzrpIuoKFWk8oyvksOoPh+wps4UB4yYpDeWkZ 8efjl9jDphdsi+3z7t7dK3jqBwHB1v/LuSsh4S3RQq0nFOVMFp88OPe03/nPuuLqBsfxtOah fyXEaeAdfpkzDrbauYInFvqgUK//eOrajy/HRPmzfVsv17d6e8LBecBfn3dTQq0nFOUU1l0B b9FVN+Y9/6n7Nxm+PvpXRJxLx2z865x7qLNeBYcmevnnswr493d4qY1vBXx/wv/+bzrxoMfz Ztt3kFDrCUU5uIAr5ixfc32b0pqr+x3Mn5f/YneHa+BFJxrtXcEvX4eneX6GGQX8utKZUcBJ 8v+l//P5eSlghAm8WN3Ki05+muayhfOQ/dusEAVOFMFDrDuHbPMrXX6a777vtafFPUqUp+e4 
92VbmVXZXp/ZN+zjypC//+s+K9qbFZVpVfsC7j/zdxqsh0/o/1ukLvnjh+6ilI9rUd7SrPBb qpP+WpTLL0W5+H0n1HpCUcSNXAR+3fbmZ0HXfJ6Y+CX616+E/jAAVhLt7iGRnufDFFaeviNS wZFiDJ/G9WV7dwV3D8IiaW/S6+6NcK/m//fb3CWhGbA2t05wtdt/5m64kGf9bXwH9wNOfv/7 +GFV5Uld5L6AMz+BpoNq94iqzLsJzr8Zw/Lr5gu1nlAUdcPboEW7QnvwKmskxkX6N1m9GoWV n0Nshg552+xcwXEzvBawvwdg5m9JlCaPD/fW/P2fsrtPoP9me/PA6ukzv0253yzdFbC/lXD1 f916yv2wa9Z7AQ+esP3n7n6Ey25HmFDAl/HbXnA83vH1cW559Bxo6dJ53AIWuWIvBqLdwNrs 1xr8jtn1HeeW9IjTHzyV68u6GaIm7eDXD4Oz1K1e3E+bpvQf/VA5Kav3z/pdxv0I+Nb88PfX /bAsb3nzwKKsH0/Y/fOqbL+e2iX9zl8+f9nLFmo9oSjCvt++bJV4dxzsB8Jrbm9+WNSvmnh3 0TT5zcZ5w+z6ros78f7JfOPdKzdvhqj3GvZ1nDSnHdXtLXybvuwfM/ysdX+s+/i0Cbr/YbMz 2G95bo+Pbgq3faKiK+C635bd+Fln9FULtZ5QFCkjtyhrRsDxRLzjr+eDXqt/2Qes5dNKb7Ho v9mIf67tV8F/Il9j6vFCbu2W53ZAev/SbQW+j1ofA9K+gD+MgHv9QVhJ/0O/Mq0LX+d1Wjye cP0IeM17Tqj1hKLsbKxynx+QxOyr2PXb+LNw6Tx4/7b/g46feO/sqC0Xe2PJThXspxq1grvX 4RvvPg71250TNxhOb25C/S7ZQQGP7gN+OSmprdCmZgc/dE96b9v7szU/6vYBN0+0dB/w56Hu BKHWE4qytd+vlfv88MHHcDb16xamJdvJD92/UOTe2PEqONLzmNTlDhX8crOISE/a/Kfpy6rs Nv1maeqn4o99vqW3pwJ+OQq66I6CTvLsMQ5uGj1rNjQ/fphXdf77/9oN2r6B21OemqdM88EE zQi1nlCUADPLcVnjGjIa/j4+m/nq6F/E1ryzpY7FMjtSYOsKHkwu5uqredqmgG9pN/L8b7OB eHBa7qCA384D9mctFeX9gXn3tM+Xonz8sL53fOVPakpc61ZusBt4HvBiQq0nFCXE1OZhmcod sqnfl+HC/EE9EF+sXcGhHWd7nN6WFfzndQGP98zv3/r9fTsK6pPlfTm9etrgItFCrScUJcjj ACnJyn1iU7/jS9HHGSA5d3AacSo4pOI2OEp+swp+n45ZA7v1Rv75OpM9t2m5mn3O0PNkRr9N AR+RdOUOWQ1/J9YDH2aJ/KzC0UV5s6+tuK3OUdtkMqMTiba2e3py/6yV3xo8x+35ApSLjL4A CvhomuI9RJ8Y1e/nYcL0Qd2ArShbolc03KZniG8wzp74fvwG3ngIs8+QSaj1hKKs0o7w4p4i ZMaofr+vod7f5weYWziFCBW8sOC2vz6L7QQ/PHus/voT9+mW2GGaQq0nFGWx4SUy+o+6zIa/ s5b+x9z6bb6yyQK8CX/jLyi4na6OZjjVz88caUn+E++pFtt8ukKtJxRlkSPs7n2x2/C3d6jN BTiPCBU871E7XiXSaNLfnzZSA++5Qt14bS7UekJRZjtg+doNf5dunjvQDnOcSPDO4O/v892v DG5yuY8Zj4mxOvz93XnmbblOF2o9oSizHLJ8Det31Vkah5yFOLrAheBzve3evl7sEHOfL3SJ 9v9+7xm43WpJqPWEonx10PJN7LY+r1zgGQFjF6EVPPkDifb14l7Aev5DA5bo1XdSi26rFbxQ 6wlF+ei45as2/GUfMPYTtiyMvt+F2teLeM+lJQ9eu34c/DuF+bjJal6o9YSiTDtw+VrWb0D/ 
0sDYSdDO4LdyU2tfL06mxc+yZi35/G8k5uUGK3uh1hOKMu7IQ1/Hrn41FhdgqdAK7m5CKdm+ XoRgqw7uWPr413+gMT/NV/lCrScU5d3Ry9d2+KuxsADLBSwX9/f9H1/B2gtAYLqV/3zR+nLs wSrz1HbFL9R6QlGeHb98E4a/wIT1W6L/NKKmsRAScf0/nX8v8PFHysxXy9W/UOsJRemdonxN h786ywmw0rrl4yj9m4TcRiJgojNvBT75KJ05a9cCQq0nFKVxkvK1rV+hpQRYbXUFxw5iZVXS 0Fc340bgnx6hNHONumCn1ku9XCHKuNOUb2K69ZndvziLFRXc7gM+iMXLaoyF+/NtwL+tZKVm rkkj7Nh6xcttlmUK+Ezly/AXmGnpzuDBUdDHsKxR47yw6XXpnLWs1sz1xRD33MkdWy+9vXy9 T4wXpypf4/oVWzyAULbLy/7mV3C8bVvja9R561m5NUx79fpoJRHUenXxvBm5rtIym3pIXpRp NXj0rYwZJYpzDX0d29WJ3NIBhKKCm4fFnOb7WnX2ilZvHRP3DjIhrVdXt2xYwLf7mDbPbhMP Kav83sH9w6uXqt65gM9XvubDX71lAwhHBUdftl/WrUtWtYJrmZhNEdh6wwLOX7cpPz3k5v+T Plq3fjkEa88CPmP5Jgx/gXWCb1go7lu/Gizbv/2+04UrW70VjcoIOHku4OzxXKU7vKpu+7h9 SFH2H/1nL4dg7VXAJy1f87/j9RYLIJ4LV7DNpq1mPfu7YviotqrR2Qf8XMBVlZVp6Vq3vg90 87J6ekjzZV/S78Pl7Qv4tOVrX79qCwUQ2VUr2GzRdvtOV61wtVY2UkdBDwv43r51fvMbmbO0 rrrdvV0BF0+PfzsEa4MCfppzJy7fxHrrs9gSAZg4+Zbo8Qq2XLbXrnJPvL6JWsD1/WPlm7VK /ReDh7wUcPwoM3TbDs5dvvZ/u594eQCGrlbBppu21u87Pe8aJ+Ym6LT/zi2tXh7yugl6JEov LNQ0vw3k3OW7wTrjvEsD8OpSFWy6aIfsOz3ZOida1w0LuOgLOC+rxz7eiYOwRkKFRZnl7OWb mG99ZvcvLub8Fdz+z3bRDtp3etqVTsQCrvtN0FWZP34ycRpS9ChzxDx+XBPDXyC2c+8Mbm9r rP2XtXS4AHEKuDnlqCprfzEOdwxWtzO47+jXC3HEjjJD1OPHFdmvJ866HAAf7VnB9kv1AW6r KB5vraDWazZiV49zfovmNKTaj3PztBg+JMmL9OlSlHGjzBH3+HFB9uuIky4FwFf7VfBGBSy+ cIvHW2n/CzA/CEU5pA2Gv+dcBoBZ9qrgDSYrX7/JSRtYqPWEohwRw1/A2MY7g39GmEzoGDc2 1k+4nFDrCUU5ng1WDGd8+wML2S5po2X7M/2IOM18lBsb6ydcTKj1hKIczRZ/l5/wzQ+sEHVp m9Oi86YX1szHOMz7fCshodYTinIwm+wjsp8GcAxB483l7Ri4fM9qZgp4H0KtJxTlUBj+Altr l7p5BWq/E3eVyBuyt3C6FZFQ6wlFOZBNFpnTve2BUH7Jm9hmfKhS8x77m8WTn21VJNR6QlGO Y5P6PdubHojhXlBH6a2vDvOXxMlWRkKtJxTlKBj+ArsRq6Ygx9mWfq71kVDrCUXR9/P4YO1c 73cgpuM373rL6zjO3DrVGkmo9YSi6PsxH/52d0gxnQhwaFcu4GdztlZHmltnWicJtZ5QFH0b bAHSv0MKsDMKeMJoHceaWydaKwm1nlAUcRvtgznC9WEByIu96/g8Kyah1hOKIs2/iTe48uQR 7pAC4DCaI1ciNPF5VktCrScURdfPkgsArNZUL/ULIJ7ntVZIE59mzSTUekJRRA3eq2YF3A97 j3GHFAAHMbHWWtPEZ1kzCbWeUBRBG+307d/XR7lDCoBzWNLEJ1kzCbWeUBQ19uXL/l4AIuY0 
8TlWV0KtJxRFinX70r0AJH06ePoUKy2h1hOKIsN4wzPdC+AIRpr4DKsuodYTiqLBtHzpXgDH M2jisRXYwS6NItR6QlEEWLYv3Qvg4O6ryD/vW6cp4LWEogSIcfCw4YZnBr4ATqJZlT1tnd6i gCNOQ6j1hKKECD191qx86V4Ap/K8Qvv5JOJUKWBlITVn1L50L4ATelutTa5AP7bzssqmgHX9 6Sz+lzYbnuleAKf1unKLuA7dYkAt1HpCUdb583QJ5T9/lnSxWfnGf1YAUPGyimMf8FpCUZYa 9uzIPuBvXWzQvgx8AVzB9us5CljGW61+OQr6rYvjb3imewFcx+ZrOwpYQPDtcpdto57/lLGe DAAO4MDrPKHWE4ryWaTSHAx9/4SXMd0L4JIOvOITaj2hKFOijVc/bHie0cWvm7npXgDXddzV n1DrCUV5F3NT8ey9vpNd3B/oxcAXwNUddh0o1HpCUYbi7qVde8TzSxe3pzvRvQBw2AYWaj2h KI3I1RvriOfoh24BwLEddHUo1HpCUSwKLurpRtQvAPS2WiPGuN1OT6j1JKIYjS0jn+wbesMH ADiX7Ro44tpXovUaO0cx26xrcKmNqH+DAcDxbdbAEXuCAjas3sTw5oIAgAHbBv7zJNKTXqmA 3weONtX7qFzaFwA2Em9V/mfE8IfRJnSlAh5uvDcd9foPtC8AbGjVzrmPZTs6EfYBr/Q4fdZ0 W8UPQ18A2Ny3clxatqOT4CjodTY5fTbu7ZoBAHMNVvARytbelQrY9PTZn755aV8A2Jx22Y65 VAFbnD778z7kpYABYAdHKd7OlQo46sb7keJ9/CjOFAAACxzuEkVXKuA42MkLAIKOd4kiodYT ijLmw5AXAIClhFpPKMoQxQsAMCDUekJRHIoXAGBIqPVEolC8AIANiLSes3cUihcAsJ29W29g iyijZw3RvACAzV24gCleAMB+LljAFC8AYH9XKmCKFwAg40oFnHCVSACACgoYAIAdUMAAAOzg YgUMAIAGodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAx odYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIA gDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOK AgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHW E4oCAICxnVov9XKFKAAA7GDH1iuq568pYADAdezYeunt5et9YgAAsIOg1quL183IRVpMPSQv yrQaPPpWxowCAMChhLReXd2y5wK+lS8FPHhIWeX3Du4fXmURowAAcCyBrfdcwHl5ex0BPx5y 8/9JH61bv4ydKWAAwJVELeCiSnwBl+7wqrrdx9s+pCj7j+1j40YBAOBAYhZwndZNAdf3gW5e Vk8Pab7MHtN7PQSLAgYAXEnMAi7v5dtsgs7Suup293YFXDw9/u0QLAoYAHAlEQvYH2HV7gOu 0vto+OkhLwUcPwoAAAcSr4Bzv1G5LeBbWr085HUT9EiUXlgoAABEReu6QQHX3TPW7nDo6rGP d+IgrJFQYVEAADiQqEdBJ90IuCrzx08mTkOKHgUAgAOJU8D1Y7zrCzhzo+CqfHrI24U4YkcB AOBAglqv2eRcvRZw7ce5ue/ix0OSvEifLkUZNwoAAIci1HpCUQAAMCbUekJRAAAwJtR6QlEA ADAm1HpCUQAAMCbUekJRAAAwJtR6QlEAADAm1HpCUQAAMCbUekJRAAAwJtR6QlEAADAm1HpC UQAAMCbUekJRAAAwJtR6QlEAADAm1HpCUQAAMCbUekJRAAAwJtR6QlEAADAm1HpCUQAAMCbU 
ekJRAAAwJtR6QlEAADAm1HpCUQAAMCbUekJRAAAwJtR6QlEAADAm1HpCUQAAMCbUekJRAAAw JtR6QlEAADAm1HpCUQAAMCbUekJRAAAwJtR6QlEAADAm1HpCUQAAMCbUekJRAAAwJtR6QlEA ADAm1HpCUQAAMCbUekJRAAAwJtR6QlEAADAm1HpCUQAAMCbUekJRAAAwJtR6QlEAADAm1HpC UQAAMCbUekJRAAAwJtR6QlEAADAm1HpCUQAAMCbUekJRAAAwJtR6QlEAADAm1HpCUQAAMCbU ekJRAAAwJtR6QlEAADAm1HpCUQAAMCbUekJRAAAwJtR6QlEAADAm1HpCUQAAMCbUekJRAAAw JtR6QlEAADAm1HpCUQAAMCbUekJRAAAwJtR6QlEAADAm1HpCUQAAMCbUekJRAAAwJtR6QlEA ADAm1HpCUQAAMCbUekJRAAAwJtR6QlEAADAm1HpCUQAAMCbUekJRAAAwJtR6QlEAADAm1HpC UQAAMCbUekJRAAAwJtR6QlEAADAm1HpCUQAAMCbUekJRAAAwJtR6QlEAADAm1HpCUQAAMCbU ekJRAAAwJtR6QlEAADAm1HpCUQAAMCbUekJRAAAwJtR6QlEAADAm1HpCUQAAMCbUekJRAAAw JtR6QlEAADAm1HpCUQAAMCbUekJRAAAwJtR6QlEAADAm1HpCUQAAMCbUekJRAAAwJtR6QlEA ADAm1HpCUQAAMCbUekJRAAAwJtR6QlEAADC2U+ulXq4QBQCAHezYekX1/DUFDAC4jh1bL729 fL1PDAAAdhDUenUx3IycZ2VaFvnUQ/KiTKvBT29lzCgAABxKSOvV1S0bFHBV1kldllMPKav8 3sGDh2cRowAAcCyBrTcsYF+o9euG5e4hN/+f9NG69cshWBQwAOBKIhawl7uKLd3hVV0Vtw8p yv6j/+zlECwKGABwJbEL+OZqt763cF5WTw9pvswe03sbKVPAAIALiVzAebMPOEvrqtvd2xVw 8fT4t0OwKGAAwJVELuAqrdv/tp9MFXD8KAAAHEjcAi66Dcu3tHp5yOsm6JEovbBQAACIitZ1 zwWcdf2bl9VjH+/EQVgjocKiAABwIDELOHucZFSV+eMnE6chRY8CAMCBxClgf8rR7dGumdsB XJVPD3m7EEfsKAAAHEhQ6zUbsau2gMvmy8KfheTOCC6GD0nyIn26FGXcKAAAHIpQ6wlFAQDA mFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUB AMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJ RQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ 6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDA mFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUB AMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJ RQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ 6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDA mFDrCUUBAMCYUOsJRQGu6d/G3jGAaxBqPaEowDVRwMCGhFpPKApwTVsUMCUPtIRaTygKcDn/ fhN5QrGeDjguodYTigKc3Xi/3v/7d6ocvzb0zMqmgIGWUOsJRQEWOUCnzCzHv06sF7LJgBo4 
LqHWE4oCLLJJpyyayMq+8+PfaAX8NZrpdAB5Qq0nFAVYZPcCjjW+/Ntan/K7QUJqGNcm1HpC UYBFPm5rjbURdvivIvXts656//ZFbNHEb5FpYVyVUOsJRQHmCm/UDbr7u65sX1vXsomf0MK4 IKHWE4oCfPc+IN1iejbdO+jXiardpIm3mI2ADqHWE4oCfPLWE4ct4BWNat/EtDCuQqj1hKIA 46Y70HwbbfQCDi1R2yamhXEBQq0nFAV49bn9NtlNGk/cuH/NqphN0jg3odYTigL05rRAe/Dw EVj+qWDTxLQwzkqo9YSiAM7sAdhGRwoH2yzklyZeEYIWxgkJtZ5QFFzdwm2f7emzyj28U7Kx Jl4fhU3SOBeh1hOKggtbvI53bfLUKXI1rBBn0MSh2+tpYZyFUOsJRcElrRlfPartrVE0hsMK GZ5F2k1MC+MEhFpPKAouZu2mzRk1smMPy3Vvy4+A47UwNYzDEmo9oSi4jIBV+KL+2LqHVcvX GcwJBsO4NKHWE4qCCwgbPa2sjS1qWG+r85uXeDHmystvc5NrkwGBhFpPKApO5H1VHL5uDmwL u+GwfvdOizBT+t8sBYwjEGo9oSg4keGqOMpaOVrHRe7hI5dvL1oNn6SAT/NCMEao9YSi4ET+ jbo+Nii5CJVzgK3OywTPk9P01mleCMYItZ5QFJxIzDWYYcutHg6frXsHAmr4NL11mheCMUKt JxQFJ6I8+B2dyILSOXH59lbV8Gm2QZ/ldWCUUOsJRcFZxNv6vG3TTfTw35dHbBlpZwtruPut H766KOBTE2o9oSg4hWa9FWMNtlfXPbfO44uLde/AitHwscvrRIeTYYRQ6wlFwQlEW2ft3nbd cPhxBal94+xv8WjYMIspP4Q/yVAeY3ZqvdTLFaLgjOINGXTqbuuLaalbMDuO2V4vqY/5IvDR jq1XVM9fU8CI5DyD3yehdxE6pZk1fLzyGnkTU8Gns2PrpbeXr/eJgZOJNvjVat+Ezc8fjNXw 85cHK69h3L/j38YJRG29unjdrDz8Vl6UadX/+FZaRsE1nXHT84BiJiHDGn5v5AOV11P9Pr+Q A70KfBez9erqlr0U8PBbZZXfO/jx8yozjIJLOummZyzS7yt//SUepLyeY769kIO8CswRufVe C3jwrZv/b9rVbv32SAoYIU4++MUizWHjb7/IA5TXS0R/9Ptxx/L4zKaAS3d8Vd3s5O0KuCj7 j8n7IVgUMEIw+MWTvxN/SImX13u8o/4hgTlsCri+D3Tzshp+qynl+1fdhG+v/5YCxkrnPe4K a00fsSbcXSPv46mTz6jgUzDaBJ2ldVXmT98qi+FXb4dgUcBYiU3PGDM1BpbtrrFcf5PJ089U XwYWsNoHXKVp/fyt5wK2j4JLYPCLTyYGwYLdNZqpjz/29lR8GVjEqoBvafXyrZdN0CNRenFD 4axoX3wz8YtV667xPH8nv/j4zyDOrOu6ts3LqtvJO3EQ1kiouFFwdmx6xgxTv1yl7prI8vHK Ip//KY7BqICrMu93Bzf/fTkNyTwKTo3BL2aa/P2qdNdU/b7fjZIKPhubAs7cDuCqHH7r7UIc 1lFwYrQv5pv+JUt010SG0dDH2aWNWaK2XrNRu/JnISVJnhb9t9zXRTq8FKVtFJxXxMFvnOeB uulf9O7VNfVmnoh8kF3amEmo9YSiQBabnrHChwbetbompz4ZeOptSwUfklDrCUXBZpa1IMdd YZ0Pv+/9qmt6yp/enkc4qgwzCbWeUBRsZNEN9hj8Yr1Pv/J9qutD/X5+f2rv0sYSQq0nFAUb WXCLedoXQT7+3revrg9T/PoGFT+qDPMJtZ5QFGzj74vJB3LcFcJ9/N1vW12fpjbnLaq6TxsL CbWeUBSYG71p62shf2/mmRN7PH3gE+HQPv/6t6uuj1Oa9x6V3KeNxYRaTygKDA0rdUa7+tXJ 
ZDPPGi2094alfS/v23ttkxCfC3LuG0oTRAAAIABJREFUu/TT25kKPgyh1hOKAhNjlfltUPJt VTKjmf0wm/ZF8r2B7ZvrS/0ueJ9q7dPGKkKtJxQFka3bkBy0Fom8FRvnMGuDi50vT7/sXSq0 TxsrCbWeUBREs7r9oh13teRQa5zft7eCYXN9e+ql79LPCxYVfABCrScUBREEDTxjrjwYAGPo 63vBqLm+705Z/pwqh5VhJaHWE4qCIKHbfOMNflvULwa+vx0Mmuv7U656l35ZzqhgcUKtJxQF K0XY3Rq9fYEXM96hkd+FM+p37WKz/3FlWE+o9YSiYLk4G3pZXWADc96oEd+JM97VAYvOjju1 EUqo9YSiYIloRxkz+MVG5rxfY70b5zxP0OLz9cWwXMkSaj2hKJgn6gk+rCWwoVmD4AhvyVnP EelSb4EpsD2h1hOKgm/iVO+/jYTBLzY3690b/Lac9e/D/4bd5bgyhBNqPaEomBZz1PvvQ5zn A+ab9yYOem/OG/7GWJpmPAlLmSCh1hOKAu/1qpHRT6ilfbEj6xthzvuHsZao7Q/tRjih1hOK guTpsGarizlSwNjT3Pf0qnfozPd1vMVqq73aiEio9YSiIOnuYWB5GSm2P2NnZoPguf8i5rI1 a0FlcZMi1HpCUeCXZttLON7XBBQwdjb3Db7sXTr70ZGXr20OLEM8Qq0nFAVJ8/e0Wf8+jn2m gLGr2W/x+e/T+fUbffna4MAyxCTUekJRYLrxmc6Fjvnv8Xlv2/lvboula+YiyxIoQqj1hKJc XrMYW9WvydMC6yz4M/P7e3fBu9to85L1sd2ISaj1hKJcnOGeXwa/0LPg/f75/bvk3W22kNns 1oYJodYTinJpdvVL+0LTkrf89Ht40dvb8PDG2Uswy+PuhFpPKMqFWdav1TMDgRY18PgbeVn9 2t6i2uDIMpgQaj2hKJdlecqv0TMDESx644+8mZe9v23rN1myILvgnI2wG6HWE4pyUVb1y7IN ecve/C/v6IVvcPP+XTQNLsi+I6HWE4pySZxzhEtb9v4fXEdm6Tt8i/5deGQZBbwTodYTinJB ZvVr8rRAfAsXgbUjx236d9lmaAp4J0KtJxTlcmzql2Uah7KugRdOY6v+TZackEQD70So9YSi XIzJOoHlGYezdCv08trasH7nT63fls4yuzGh1hOKcilG9WvwpICxhUdiLS7gbft37sI9eCF0 8LaEWk8oyoUY3eY3/pMCG1iyPCwv4K37d9UkWXw3JNR6QlEuw6B+WXxxaIvP37F46ojWTJSF eCtCrScU5SLi1y8LLg7Pqia3PPwqfLosypsQaj2hKNdgUL+xnxHYnk1R7lS/AZOmg+0JtZ5Q lCuI/fc4SyvOwuRGvQbPaT9xTk4yJtR6QlHOL3L9spjiTOJvLd61f8NeEAu3IaHWE4pydtHr N+rTAbuLXJg7929oAjrYilDrCUU5t7j1y7KJM4q5kOx1+NWTwAws5yaEWk8oyplFXRmwVOKs 4i0nCvWbRFjyWdrjE2o9oSjnFbl+Iz4ZoCXWoiLSv0mMJHRwZEKtJxTlrGLWL4siTi7O4qLT v3GysODHJNR6QlHOKWL9shDiCiIsMUr9G2sdwOIfjVDrCUU5o3j1y+KHqwhdaCQOv3oSKRAr gTiEWk8oyvnErN9YzwTIC1tu5Oo3ibgqoIMjEGo9oShnwzIHrBOy6Cj2bxIzFuuDUEKtJxTl XGLVL0sbLmj94iPav3GDsVYIItR6QlHOJF79xnke4GBWLkGy/cu1eHQItZ5QlPOItKSxiOHC 1ixEeodfPeFWLBqEWk8oylkEXoK9vRUKCxcubvlypF2/icnNSFlNLCfUekJRziHsb/CugFms gKWLknz/mgzR6eDFhFpPKMoZhC5f//ZDYODqli1NB+jfxCYla4xlhFpPKMoJRKpflibAWfIH 
7TH61yonq40FhFpPKMrhrRz+/jvoXQoYGJq7TIkffjVkFpU1x0xCrScU5eAWLlb/jg53KWDg ybzF6jj169ilHVl5HGvWbEKo9YSiHNrM+h2v3befRwwGHNucJetoJWOZ93n98dcxnNp2Ir4K odYTinJgX97i32oXwKTva97oDfNPI/bT9mxLcbCm+Zts8efJBlOI+YeEUOsJRTmsiTcGtYuL MeqtpwVsZBrxV///DD4aMWvGv3PFmuC//tls13L//r2vR6NNQ6j1hKIc1MsbmdqFIvMxnZ/I 4GNUg2XsbRoWg8kNCjhZ14ILetVfUOBDcc1u6i+Vvaocl0/877/RftNCrScUJcAWlTc+jfY9 Qe1ivS3KcYtKMZxIv+p9nca8lfI/q8R9Da/cuuPv67fWdeCUf+OOTle0ZpzXwQhY204FfH93 H7B2txkLbTGRLZylHNdNY11vWfi78Psv1s2t5f9wgXsJhdbSd762rP3dYCpR/5AQaj2hKAHa X9I20xi8J0wnaOXgY6HBNFatW5dOZPBRahrhnWbWWwv/1TzdwvY8DaNFsJ+I4Vtsi+OjNlkz Ri3HiWk0FR/p2YRaTyhKAOO/vp610zSp3y065TwFvME0dMtxi9exlOlE2gXuaRpW/fUyj22W SZsx77NNCjhqOU5OI+LrEGo9oSgrtb+ajd8CRktO3DXYFj0Q0yYz6zQvRG8a1n8//n2Zxqab oCxemX3+jQr4YNMQaj2hKMs9ficbbQR5vAXMlvxPq8lYPaA7FopSfutmStwXojiNVa9ezN8P X23i+LMQiVTrCUVZ5PnvobjHyH2Yop+mTf3Gbo7pCQ0+mjlLb1GOUv5OfL4lfleHJ9R6QlHm eu7eWMe5z/G3mWCU5xrv19Os7reYyBm2qmKZvyOf7YE3xZEJtZ5QlBnao9Hfe3bkgMLYZ6Il K/9ZY+Z4dpNOOQu68YK6k2cVzkDg3XdQQq0nFGXM3NZcU41fG/r5Cd0Zh3//zl3gVm4/plOA z8w2cK3C8npAQq23RZQlN9Ueq8BZx7+ZbHJ+izO14JvtsAXw5MNiuBeW+GO5VAF/Hbh+GuDa H98+2z9+w1d/iRz6Ftie61+5Bk4YCh/JtQo4eQxPF+2GFepe7x+fmL4FdqWyB3gMK4dDuFIB D694OvOfbHBe93JuyfrL8VHAzqR2AY/gT3R5VyrgZX+wKlavc1+iOEAZkKBcvy1KWNm1Cnjm H6yq3dsMfjlAGcASrC9UXaqAZ/zBqtu9XfsCwHKUsKCLFfAnyt2b+E3PABCAobCYvVtvYMco 4t3L4BdALLSwjssXsHz30r4AYhuWMAeV7OZKBfx6I8cDdG/CpmcANrra5bSK3Vy0gI/RvQx+ Adh6jH1Z1ezgggV8kO6lfQFsoRkJs7bZwQUL2HoycdC+ADbhVzXsBd7DXgVcV2lZP3+LAu6x KADYyGMTNCW8tZ0KuE5veV5sHOUoBcxCAGA7T0dBU8Jb2qmAq+L9exSww7sfwL4o4a1Ebb26 KtPq9vytIk3z5tO8uP+0/TzN7p9vvQn6AHjbA1BACW8hZuvVaZEnWTps4Lq6ZV0Bl1V+72D/ Re52AD9+YBDlkHjDAxBCCVuL2XqZf7Kyev1u7v978/9Ns2ay7j/l82D52gXMOx2AHkrYkk0B +xKum7FwV8BF2X9MSgp4iLc4AFmUsJWYrZeXfhN07bo3u3/VDIW7Am6+bEo6ubEJusd7G4A4 OthC3IOwyjTthr111ezu7Qu4GH7FQVgt3tYADoGBcHSxD8LK24OwqjRt+3W8gI2jHATvZwBH QglHFbP1qrL/eEu7Y7FGN0GPRulFDCWMNzKA46GEA9l0XVexudsdXHWnI40ehDUaKmIUfbyF ARwWJRyFwQg49Z/mj729Y6chWUcRx3sXwNFRwsFitt6t2QecJc2h0E0fJ2MX4jCPIo03LYBz 
oISDmFyKsvbj3Dwtuo3dftt0XqSPS1GaR5HF2xXAqVDCqwm1nlAUK7xPAZwR67ZVhFpPKIoJ 3qEAzouB8HJCrScUxQBvTQBnRwkvI9R6QlFi400J4CIo4fmEWk8oSlS8GwFcCyU8j1DrCUWJ iLchgCuihL8Taj2hKLHwBgRwYX0J/9PYN44codYTihKgf5vxZgOAZoXYrA1ZJz4Taj2hKAG6 txntCwCNbvDLavGZUOsJRQnQvMuoXwDo/dNtHWRD9IBQ6wlFCeDfZvydBwADT5ug/xmxY7b9 CLWeUJT12NACAG++7gOOUMrHq3Kh1hOKslI/9j3SWwAArK0qx4WlfLy1r1DrCUVZoxv7Hu5v MAA4jA+lTAEHEIqyGJULADs57B5lodYTirLMcX7bAHBagxHwQepYqPWEoiyg/MsFgOuY3AQt W8dCrScUZS6t3yUAXNjMghWqY6HWE4oyx+6/OgBAoIU7j+PWtlDrCUX5ivIFgPP5Vsdxj7QW aj2hKJ/RvgBwAe91TAHvivYFgCvqi5gC3gPtCwAXxgh4J7QvAFwbBbwH2hcALo+joDdH+wIA YhNqPaEoQ7QvAMCAUOsJRXmgfQEANoRaTyhKg/YFAJgRaj2hKAntCwCwJdR6QlFoXwCAMaHW U4lC+wIA7Km0XiIShfYFAGxCovUa+0ehfQEAW9m/9R52jkL7AgA2RAE7ES8tBgDAHBQwQ18A wA6uXsC0LwBgF5cuYNoXALCX6xYw7QsA2NGVCnhwI0faFwCwr0sVcPuR9gUA7O5yBUz7AgAU XKyA7+1L/wIABFysgB8fAQDYFQUMAMAOLlXA/VHQAADs60oFDACADKHWE4oCAIAxodYTigIA gDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOK AgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHW E4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAx odYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIA gDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOK AgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAsb1ar67Ssn7+FgUMALiOnVqvTm95XkhEAQBg Bzu1XlW8f48CBgBcR9zWcxuWs+fvFGmaN5/mRZlW7edpdv+cTdAAgMuK2nq39Jbk2W3wnbq6 ZV0Bl1V+72D/Re52AD9+YBEFAABpMVsvT28j3+169ub/mzYDZP+f8vnhFDAA4Dpitl72eLKy SvyBVs13c/+9ouw/JiUFDAC4tJitV1VZmfpWre8j3Ny3cF/AzZdtSd/YBA0AuLSYrXdv3zq/ +a3LWVpXze7evoCL4VcchAUAuLS4BewqtfIbmas0bft1vICNowAAoC3qJmj/ZE3F3tKq/e7o JujRKL2IoQAA0GHTdUVfwHlZdYdEjx6ENRoqYhQAALTFbL263wRdlfljb+/YaUjWUQAA0Ba1 9aqy9hfjcMdgdTuDk7ELcdhHAQBAWtzWK5rTkGo/zs3TotvY7Xf/5kX6uBSlfRQAAJQJtZ5Q FAAAjAm1nlAUAACMCbWeUBQAAIwJtZ5QFAAAjAm1nlAUAACMCbWeUBQAAIwJtZ5QFAAAjAm1 nlAUAACMCbWeUBQAAIwJtZ5QFAAAjAm1nlAUAACMCbWeUBQAAIwJtZ5QFAAAjAm1nlAUAACM CbWeUBQAAIwJtZ5QFAAAjAm1nlAUAACMCbWeUBQAAIwJtZ5QFAAAjAm1nlAUAACMCbWeUBQA 
AIwJtZ5QFAAAjAm1nlAUAACMCbWeUBQAAIwJtZ5QFAAAjAm1nlAUAACMCbWeUBQAAIwJtZ5Q FAAAjAm1nlAUAACMCbWeUBQAAIwJtZ5QFAAAjAm1nlAUAACMCbWeUBQAAIwJtZ5QFAAAjAm1 3hZR/rPBNAAA+O5SBfwfx3wqtDwA4DsK+JATAQAc3ZUK+D+rrJqKRXwAwJlcqYDXdeO62g7q 8DmpDJ4TALAhCniDiTDKBgC8ulQBbzJwXFOOSwfRFDAAHN7FCngTJpucd9jMzXZuADAk1HpC UdSt2czNdm4AkCLUekJR1K0vVLnt3FQ8gKsSaj2hKPpsNjnPFXeK0Z5tejL2kwCApYRaTygK ko/lOLupv1X2JgW8UcsDwDJCrScUBUnk3lpe2YvLfJsXMj0V8ykAOBmh1hOKAm+D1lpdqPt0 edwXAuDahFpPKAo2sklvfZhItMqmgAEsJtR6QlGwGdFLo3x+su1G1gBOTKj1hKLgXDbflm7V zDQ7cCZCrScUBVhmbrsGDZoZXAPnItR6QlGApYIHt9+bmQIGvthiAYk4DaHWE4oCyGAvM+xt 8oYSPa1iz2kItZ5QFEDQcMk/dikfKOpn9JbWRA43DaHWE4oCCPq25Mcp5ZOs7k/zQg7XKTEn MrljZneRZopQ6wlFASQtX+yXrjtirlz2nMa1X4h5/Wxli5m16zSEWk8oCnBin9Z0a9Yup1kV b/JCtrDFzFpsi4kcbhpCrScUBbgY+dV9SECpF6I4kdO8EI6CXk8oCnBRrO7FpnGi3jrPsXfx CLWeUBTgok6zuj/NC6G3zkyo9YSiAJd1mtX9aV4Izkuo9YSiAABgTKj1hKIAAGBMqPWEogAA YEyo9YSiAABgTKj1hKIAAGBMqPWEogAAYEyo9YSiAABgTKj1hKIAAGBMqPWEogAAYEyo9YSi AABgTKj1hKIAAGBMqPWEogAAYEyo9YSiAABgTKj1hKIAAGBMqPWEogAAYEyo9YSiAABgTKj1 hKIAAGBMqPWEogAAYEyo9YSiAABgTKj1hKIAAGBMqPWEogAAYEyo9YSiAABgTKj1hKIAAGBM qPWEogAAYEyo9YSiAABgbK/Wq6u0rJ+/RQEDAK5jp9ar01ueFxJRAADYwU6tVxXv36OAAQDX Ebv1ivS5WusiTfPm07wo06r9PM3un7MJGgBwWZFb71Y+F3Bd3bKugMsqv3ew/yJ3O4AfP7CJ AgCAsLitl5e39HXjctezN//fNGsm6/5T3gyjAACgLG7rFVXiC7isEn+glfteV8BF2X9MSgoY AHBpUVuvTuumgOv7CDf3LdwXcPNl1kzwxiZoAMClRW298l6+zSboLK2rZndvX8DF8CsOwgIA XFrM1vNHWLX7gKs0bft1vICNowAAoC1i6+V+n29bwLe0ar89ugl6NEovXigAAISYdF3dPWXt Doeu0vYQq9GDsEZDxYsCAIC46K3XjICrMn/s7R07DWmTKAAAyLIp4MyNgqtmsDt2IY5togAA IMukgGs/zs3dp81Gab/7Ny/Sx6UoN4kCAIAsodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYT igIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh 1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4qij5kFAAcntCIXihJmgxeSbjK3 zjKN87y1AJyJ0KpJKEqQLcpxkwLmhSybyAbOsowAcISWaKEoQTZY3afJFrOLAhabxkYbPgBs RGiBPskGzzXlmG5hixeiOI2tJnKWAj7JcniiF4LzEnoDHXYFtnk3poOPW8a8kIXzlr8k5KZx nheyidP8SXQwQrNEdJlcuLJeV47rEllOZIsXssk01k3EsNtNxJxdJ+mt07yQs8ys82zAiUgo 
7haLy9tUYq/aDNaI+9jihWwys7aZyODjLhOJVtnseBCbxon+kjjNC4lIKO0Wi6ThyAHXdcS/ JJZXNpaI81vyv6lz/LVynj0oMSch1ECbvAGkXjEg5to7HmJOI1plyx7VuYWFr/t4JS9UR+bb i9b+UoGL2GIZ2WQ5POIL2a+3dP8k2m+efHwlFDAAnN9ZjurcZiKDj5MBQis76ihbqPWEogCA gE166ywiz6xP1Rxts0ek54lAKAoAAG/ibpIQaj2hKAAAvIo9yo7zNDEIRQEAwJhQ6wlFAQDA mFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUB AMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJ RQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ 6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDA mFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUB AMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJ RQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ 6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDAmFDrCUUBAMCYUOsJRQEAwJhQ6wlFAQDA mFDrCUUBAMCYUOsJRQEAwNherVdXaVk/f4sCBgBcx06tV6e3PC8kogAAsIOdWq8q3r9HAQMA riNm6+VZmZZF/vS9ukjT9jt5UaZV+3l6f2jFJmgAwGXFbL2qrJO6LIffqqtb1hVwWeX3DvZf 5G4H8OMHBlEAANAWs/Uy96FOby/fbXv25v+bZs1k3X/K50dSwACA64jeernr1rJKHlXcFXBR 9h+TkgIGAFxa9Na7udqt7y2c+xbuC7j5MmsmeGMTNADg0mK3Xt7sA87Sump29/YFXAy/4iAs ALiSn70D6IndelVat/9tP5ko4A2iALgwVvdifn42+JVs8VuPOI3IrVd0h2Dd0qr91ugm6NEo vbihgJM4S6dsspY8y+r+aJ0yPY0NfiPHmIZV12Vd/+Zl1X06ehDWaKioUcaxvKhNhBeyZBon 6ZRtXschVsUiE9loGvZTOdrMitp6WXuSkTslOH/s7R07Dck8yjiWF7WJ8ELUJnKiaZxjdX+8 38hPPKsmHe+V2E8jZuvdHu2auR3AVdl+njffHFyIwzzKBJYXtYnwQpZN4xSdsn7VamyTF6I4 kU/TiDUbV83jGL9VG8tex6SYrVc2G7ULfxaSOyO46DZ2+92/eZE+LkVpHWVczBm360R4IWoT WTUNxfXEmn+0xetYmUnwhRzNwhn/dd5Ger5PE7GcRORpCB3vZB5l87cu8Mm696/em36L17Fu Gsdf3R+wU/azxW898jQuWMDHnwgvRG0idMriSRx/fb/lL+RInYIFLlTALC9qE+GFqDnNCwEO 4UIFDACADqHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCA MaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oC AIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYT igIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh 
1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCA MaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oC AIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYT igIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh 1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCA MaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oC AIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYT igIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh 1hOKAgCAMaHWE4oCAIAxodYTigIAgDGh1hOKMoGE4UgYjoThSBiOhOGEEgpFmUDCcCQMR8Jw JAxHwnBCCYWiTCBhOBKGI2E4EoYjYbhdEuZFmVb563eZWeFIGI6E4UgYjoThSDiqrPJ7B+cK URYhYTgShiNhOBKGI2G4PRLe0txNOROIsgwJw5EwHAnDkTAcCcPtkbAo+487R1mGhOFIGI6E 4UgYjoTh9khYVu5j9jppZlY4EoYjYTgShiNhOBKOKQv3MfMboneOsgwJw5EwHAnDkTAcCcNR wAuQMBwJw5EwHAnDkTCc0iZoAACuY4cCnjgICwAAWJo4DQkAAJgavxAHAAAwlRfpyKUoAQAA AAAAAAAAAAAAAAAAAAAAAAAAAKjJi1L13ODaX7PTXzlTMGVdpO1tLfpwWjEfCVVnZJ6VaVnk /lPNedgnVJ2Hya1Ky6p2n4nOwz6h7Dy8K1J/qxrVeZh0CVXn4Wgu0YRS81D46lh1f98mvZR1 devuK9WHk4rZJ1SdkVVZJ3XpFwTRedgnVJ2Ht6x2F9hx/SY6D/uEqvPw7lY2BSw6D5NHQtV5 OJpLNKHSPFS+PnQ/nzRTtvXWh5OL+VbAWgl9jDq96c7DPqHqPGy4dbPqPGy4hLrzMC9vvt50 52GXUHUejuVSTag0D5XvkNTsyZ7PAAACQ0lEQVTPJ82Ubb314eRivhWwXML7msW9/YXnYZtQ eh4m8vMw0Z6HReX/QhCeh11C1Xk4lks1odI8nLhHsIQ67XbAaaZs660PJxfzUcDCM/LmxpfC 87BNqDwPm41oyvOwSSg7D+u0bupNdh4+EqrOw7FcqgmV5mFZNAny/SJMqrPc7YDLVVN2BfwI JxezK2DhGZn7PazC87BNqDsPqzQt/S5g2XnYJZSdhy5OU8Cq8/CRUHUejuVSTag0D/dP8IXf /qeZ8jAF7InOyCrVLo8uoSc6D5tDnJTnYXuYmP9Ubx764bl0AfcJPcF56L3kUk3Yf7Z/wv3H 4N+4hJopD7MJuiE5Iwu3eVd6HrYJG5LzMHnNpZqw/0wqYe5/w8qboAcJG3LzsMX7cKn990J/ 4+aRZsrDHITVUJyRWdtuuvMwG/av5Dx0qlJ5HjrVI43cPGxOC72rVefhIGFDbh62nnOpJuw/ 2z/h/sdhf5HrHc/eOcxpSJ7ijMy6JLLzMHtKojgPPbcykZ2H3mPFpzoPxU9DSt42QQsmfMml mrD/TCDh7mciTytu3b5yzZTiF+JI+jG66Iy89e980XnYJ1Sdhy5XXilfiKNPqDoPPfULcbQJ VefhaC7RhFLzMC/StNozwDR3mbDHlQrlUjbbhNzf9X04rZiPhKozsmwSuvWK6DzsE6rOw9Fc 
oglV56HXXYpSch567YKiOQ95HwIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAgN39/1WWGfFRwdLUAAAAAElFTkSuQmCC --rwEMma7ioTxnRzrJ-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id 1A4F26B0038 for ; Mon, 18 Sep 2017 03:44:15 -0400 (EDT) Received: by mail-pf0-f198.google.com with SMTP id f84so14351187pfj.0 for ; Mon, 18 Sep 2017 00:44:15 -0700 (PDT) Received: from mga05.intel.com (mga05.intel.com. [192.55.52.43]) by mx.google.com with ESMTPS id f25si4294856pga.566.2017.09.18.00.44.13 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 18 Sep 2017 00:44:13 -0700 (PDT) Date: Mon, 18 Sep 2017 15:44:04 +0800 From: Aaron Lu Subject: Re: Page allocator bottleneck Message-ID: <20170918074404.GD4107@intel.com> References: <20170915092839.690ea9e9@redhat.com> <6069fd36-ed0e-145c-3134-35232bf951a7@mellanox.com> <20170918073447.GB4107@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170918073447.GB4107@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Tariq Toukan Cc: Jesper Dangaard Brouer , David Miller , Mel Gorman , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Linux Kernel Network Developers , Andrew Morton , Michal Hocko , linux-mm , Dave Hansen On Mon, Sep 18, 2017 at 03:34:47PM +0800, Aaron Lu wrote: > On Sun, Sep 17, 2017 at 07:16:15PM +0300, Tariq Toukan wrote: > > > > It's nice to have the option to dynamically play with the parameter. > > But maybe we should also think of changing the default fraction guaranteed > > to the PCP, so that unaware admins of networking servers would also benefit. 
> > I collected some performance data with will-it-scale/page_fault1 process > mode on different machines with different pcp->batch sizes, starting > from the default 31(calculated by zone_batchsize(), 31 is the standard > value for any zone that has more than 1/2MiB memory), then incremented > by 31 upwards till 527. PCP's upper limit is 6*batch. > > An image is plotted and attached: batch_full.png(full here means the > number of process started equals to CPU number). To be clear: X-axis is the value of batch size(31, 62, 93, ..., 527), Y-axis is the value of per_process_ops, generated by will-it-scale, higher is better. > > From the image: > - For EX machines, they all see throughput increase with increased batch > size and peaked at around batch_size=310, then fall; > - For EP machines, Haswell-EP and Broadwell-EP also see throughput > increase with increased batch size and peaked at batch_size=279, then > fall, batch_size=310 also delivers pretty good result. Skylake-EP is > quite different in that it doesn't see any obvious throughput increase > after batch_size=93, though the trend is still increasing, but in a very > small way and finally peaked at batch_size=403, then fall. > Ivybridge EP behaves much like desktop ones. > - For Desktop machines, they do not see any obvious changes with > increased batch_size. > > So the default batch size(31) doesn't deliver good enough result, we > probably should change the default value. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
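The sweep described in the message above (batch sizes from the default 31 up to 527 in steps of 31, with the PCP upper limit at 6*batch) can be written out as a short Python sketch. This only illustrates the arithmetic quoted in the thread; the function and constant names are made up for the example and do not correspond to any kernel symbol.

```python
# Illustration of the batch sweep quoted in the thread; not kernel code.
# DEFAULT_BATCH and HIGH_MULTIPLIER are hypothetical names for the two
# numbers given in the message: the default batch (31) and "6*batch".
DEFAULT_BATCH = 31
HIGH_MULTIPLIER = 6


def batch_sweep(start=DEFAULT_BATCH, stop=527, step=31):
    """Return (batch, high) pairs for every batch size in the sweep."""
    return [(b, b * HIGH_MULTIPLIER) for b in range(start, stop + 1, step)]


if __name__ == "__main__":
    for batch, high in batch_sweep():
        print(f"batch={batch:3d}  high={high:4d}")
```

Under these assumptions, batch=310 gives high=1860, which is the pcp->high value used in the later comparison in this thread.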
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id AA3756B0038 for ; Mon, 18 Sep 2017 05:16:23 -0400 (EDT) Received: by mail-pg0-f71.google.com with SMTP id 188so16646987pgb.3 for ; Mon, 18 Sep 2017 02:16:23 -0700 (PDT) Received: from EUR01-HE1-obe.outbound.protection.outlook.com (mail-he1eur01on0041.outbound.protection.outlook.com. [104.47.0.41]) by mx.google.com with ESMTPS id x86si4182317pfk.293.2017.09.18.02.16.21 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Mon, 18 Sep 2017 02:16:22 -0700 (PDT) Subject: Re: Page allocator bottleneck References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> From: Tariq Toukan Message-ID: Date: Mon, 18 Sep 2017 12:16:09 +0300 MIME-Version: 1.0 In-Reply-To: <20170915102320.zqceocmvvkyybekj@techsingularity.net> Content-Type: text/plain; charset=iso-8859-15; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman , Tariq Toukan Cc: David Miller , Jesper Dangaard Brouer , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Linux Kernel Network Developers , Andrew Morton , Michal Hocko , linux-mm On 15/09/2017 1:23 PM, Mel Gorman wrote: > On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote: >> Insights: Major degradation between #1 and #2, not getting any >> close to linerate! Degradation is fixed between #2 and #3. This is >> because page allocator cannot stand the higher allocation rate. In >> #2, we also see that the addition of rings (cores) reduces BW (!!), >> as result of increasing congestion over shared resources. >> > > Unfortunately, no surprises there. > >> Congestion in this case is very clear. 
When monitored in perf top: >> 85.58% [kernel] [k] queued_spin_lock_slowpath >> > > While it's not proven, the most likely candidate is the zone lock > and that should be confirmed using a call-graph profile. If so, then > the suggestion to tune to the size of the per-cpu allocator would > mitigate the problem. > Indeed, I tuned the per-cpu allocator and bottleneck is released. >> I think that page allocator issues should be discussed separately: >> 1) Rate: Increase the allocation rate on a single core. 2) >> Scalability: Reduce congestion and sync overhead between cores. >> >> This is clearly the current bottleneck in the network stack receive >> flow. >> >> I know about some efforts that were made in the past two years. For >> example the ones from Jesper et al.: - Page-pool (not accepted >> AFAIK). > > Indeed not and it would also need driver conversion. > >> - Page-allocation bulking. > > Prototypes exist but it's pointless without the pool or driver > conversion so it's in the back burner for the moment. > As I already mentioned in another reply (to Jesper), this would perfectly fit with our Striding RQ feature, as we have large descriptors that serve several packets, requiring the allocation of several pages at once. I'd gladly move to using the bulking API. >> - Optimize order-0 allocations in Per-Cpu-Pages. >> > > This had a prototype that was reverted as it must be able to cope > with both irq and noirq contexts. Yeah, I remember that I tested and reported the issue. Unfortunately I never found the time to > revisit it but a split there to handle both would mitigate the > problem. Probably not enough to actually reach line speed though so > tuning of the per-cpu allocator sizes would still be needed. I don't > know when I'll get the chance to revisit it. I'm travelling all next > week and am mostly occupied with other work at the moment that is > consuming all my concentration. 
> >> I am not an mm expert, but wanted to raise the issue again, to >> combine the efforts and hear from you guys about status and >> possible directions. > > The recent effort to reduce overhead from stats will help mitigate > the problem. I should get more familiar with these stats, check how costly they are, and whether they can be turned off in Kconfig. > Finishing the page pool, the bulk allocator and converting drivers > would be the most likely successful path forward but it's currently > stalled as everyone that was previously involved is too busy. > I think we should consider changing the default allocation of PCP fraction as well, or implement some smart dynamic heuristic. This turned out to have a significant effect on networking performance. Many thanks Mel! Regards, Tariq -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f70.google.com (mail-pg0-f70.google.com [74.125.83.70]) by kanga.kvack.org (Postfix) with ESMTP id BE8F76B0033 for ; Mon, 18 Sep 2017 11:33:39 -0400 (EDT) Received: by mail-pg0-f70.google.com with SMTP id i130so880412pgc.5 for ; Mon, 18 Sep 2017 08:33:39 -0700 (PDT) Received: from EUR01-HE1-obe.outbound.protection.outlook.com (mail-he1eur01on0066.outbound.protection.outlook.com.
[104.47.0.66]) by mx.google.com with ESMTPS id g24si5081123plj.233.2017.09.18.08.33.35 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Mon, 18 Sep 2017 08:33:38 -0700 (PDT) Subject: Re: Page allocator bottleneck References: <20170915092839.690ea9e9@redhat.com> <6069fd36-ed0e-145c-3134-35232bf951a7@mellanox.com> <20170918073447.GB4107@intel.com> <20170918074404.GD4107@intel.com> From: Tariq Toukan Message-ID: <082e7901-7842-e9d9-221d-45322da0fcff@mellanox.com> Date: Mon, 18 Sep 2017 18:33:20 +0300 MIME-Version: 1.0 In-Reply-To: <20170918074404.GD4107@intel.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu , Tariq Toukan Cc: Jesper Dangaard Brouer , David Miller , Mel Gorman , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Linux Kernel Network Developers , Andrew Morton , Michal Hocko , linux-mm , Dave Hansen On 18/09/2017 10:44 AM, Aaron Lu wrote: > On Mon, Sep 18, 2017 at 03:34:47PM +0800, Aaron Lu wrote: >> On Sun, Sep 17, 2017 at 07:16:15PM +0300, Tariq Toukan wrote: >>> >>> It's nice to have the option to dynamically play with the parameter. >>> But maybe we should also think of changing the default fraction guaranteed >>> to the PCP, so that unaware admins of networking servers would also benefit. >> >> I collected some performance data with will-it-scale/page_fault1 process >> mode on different machines with different pcp->batch sizes, starting >> from the default 31(calculated by zone_batchsize(), 31 is the standard >> value for any zone that has more than 1/2MiB memory), then incremented >> by 31 upwards till 527. PCP's upper limit is 6*batch. >> >> An image is plotted and attached: batch_full.png(full here means the >> number of process started equals to CPU number). 
> > To be clear: X-axis is the value of batch size(31, 62, 93, ..., 527), > Y-axis is the value of per_process_ops, generated by will-it-scale, > higher is better. > >> >> From the image: >> - For EX machines, they all see throughput increase with increased batch >> size and peaked at around batch_size=310, then fall; >> - For EP machines, Haswell-EP and Broadwell-EP also see throughput >> increase with increased batch size and peaked at batch_size=279, then >> fall, batch_size=310 also delivers pretty good result. Skylake-EP is >> quite different in that it doesn't see any obvious throughput increase >> after batch_size=93, though the trend is still increasing, but in a very >> small way and finally peaked at batch_size=403, then fall. >> Ivybridge EP behaves much like desktop ones. >> - For Desktop machines, they do not see any obvious changes with >> increased batch_size. >> >> So the default batch size(31) doesn't deliver good enough result, we >> probbaly should change the default value. Thanks Aaron for sharing your experiment results. That's a good analysis of the effect of the batch value. I agree with your conclusion. From networking perspective, we should reconsider the defaults to be able to reach the increasing NICs linerates. Not only for pcp->batch, but also for pcp->high. Regards, Tariq -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id 003E46B025E for ; Tue, 19 Sep 2017 03:24:02 -0400 (EDT) Received: by mail-pf0-f198.google.com with SMTP id p87so4514313pfj.4 for ; Tue, 19 Sep 2017 00:24:02 -0700 (PDT) Received: from mga01.intel.com (mga01.intel.com. 
[192.55.52.88]) by mx.google.com with ESMTPS id h69si927145pfa.198.2017.09.19.00.24.00 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 19 Sep 2017 00:24:01 -0700 (PDT) Date: Tue, 19 Sep 2017 15:23:43 +0800 From: Aaron Lu Subject: Re: Page allocator bottleneck Message-ID: <20170919072342.GB7263@intel.com> References: <20170915092839.690ea9e9@redhat.com> <6069fd36-ed0e-145c-3134-35232bf951a7@mellanox.com> <20170918073447.GB4107@intel.com> <20170918074404.GD4107@intel.com> <082e7901-7842-e9d9-221d-45322da0fcff@mellanox.com> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="ew6BAiZeqk4r7MaW" Content-Disposition: inline In-Reply-To: <082e7901-7842-e9d9-221d-45322da0fcff@mellanox.com> Sender: owner-linux-mm@kvack.org List-ID: To: Tariq Toukan Cc: Jesper Dangaard Brouer , David Miller , Mel Gorman , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Linux Kernel Network Developers , Andrew Morton , Michal Hocko , linux-mm , Dave Hansen --ew6BAiZeqk4r7MaW Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Mon, Sep 18, 2017 at 06:33:20PM +0300, Tariq Toukan wrote: > > > On 18/09/2017 10:44 AM, Aaron Lu wrote: > > On Mon, Sep 18, 2017 at 03:34:47PM +0800, Aaron Lu wrote: > > > On Sun, Sep 17, 2017 at 07:16:15PM +0300, Tariq Toukan wrote: > > > > > > > > It's nice to have the option to dynamically play with the parameter. > > > > But maybe we should also think of changing the default fraction guaranteed > > > > to the PCP, so that unaware admins of networking servers would also benefit. > > > > > > I collected some performance data with will-it-scale/page_fault1 process > > > mode on different machines with different pcp->batch sizes, starting > > > from the default 31(calculated by zone_batchsize(), 31 is the standard > > > value for any zone that has more than 1/2MiB memory), then incremented > > > by 31 upwards till 527. PCP's upper limit is 6*batch. 
> > > > > > An image is plotted and attached: batch_full.png(full here means the > > > number of process started equals to CPU number). > > > > To be clear: X-axis is the value of batch size(31, 62, 93, ..., 527), > > Y-axis is the value of per_process_ops, generated by will-it-scale, One correction here, Y-axis isn't per_process_ops but per_process_ops * nr_processes. Still, higher is better. > > higher is better. > > > > > > > > From the image: > > > - For EX machines, they all see throughput increase with increased batch > > > size and peaked at around batch_size=310, then fall; > > > - For EP machines, Haswell-EP and Broadwell-EP also see throughput > > > increase with increased batch size and peaked at batch_size=279, then > > > fall, batch_size=310 also delivers pretty good result. Skylake-EP is > > > quite different in that it doesn't see any obvious throughput increase > > > after batch_size=93, though the trend is still increasing, but in a very > > > small way and finally peaked at batch_size=403, then fall. > > > Ivybridge EP behaves much like desktop ones. > > > - For Desktop machines, they do not see any obvious changes with > > > increased batch_size. > > > > > > So the default batch size(31) doesn't deliver good enough result, we > > > probbaly should change the default value. > > Thanks Aaron for sharing your experiment results. > That's a good analysis of the effect of the batch value. > I agree with your conclusion. > > From networking perspective, we should reconsider the defaults to be able to > reach the increasing NICs linerates. > Not only for pcp->batch, but also for pcp->high. I guess I didn't make it clear in my last email: when pcp->batch is changed, pcp->high is also changed. Their relationship is: pcp->high = pcp->batch * 6. Manipulating percpu_pagelist_fraction could increase pcp->high, but not pcp->batch(it has an upper limit as 96 currently). 
My test shows that even when pcp->high is the same, changing pcp->batch could further improve will-it-scale's performance. e.g. in the below two cases, pcp->high are both set to 1860 but with different pcp->batch:

                  will-it-scale        native_queued_spin_lock_slowpath(perf)
pcp->batch=96     15762348             79.95%
pcp->batch=310    19291492 +22.3%      74.87% -5.1%

Granted, this is the case for will-it-scale and may not apply to your case. I have a small patch that adds a batch interface for debug purposes: echoing a value sets batch, and high will be batch * 6. You are welcome to give it a try if you think it's worth it (attached). Regards, Aaron --ew6BAiZeqk4r7MaW Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="0001-percpu_pagelist_batch-add-a-batch-interface.patch" --ew6BAiZeqk4r7MaW-- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f69.google.com (mail-pg0-f69.google.com [74.125.83.69]) by kanga.kvack.org (Postfix) with ESMTP id 0BB996B0033 for ; Thu, 2 Nov 2017 13:21:21 -0400 (EDT) Received: by mail-pg0-f69.google.com with SMTP id 15so196057pgc.21 for ; Thu, 02 Nov 2017 10:21:21 -0700 (PDT) Received: from EUR01-VE1-obe.outbound.protection.outlook.com (mail-ve1eur01on0085.outbound.protection.outlook.com.
[104.47.1.85]) by mx.google.com with ESMTPS id t61si2667786plb.707.2017.11.02.10.21.19 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Thu, 02 Nov 2017 10:21:19 -0700 (PDT) Subject: Re: Page allocator bottleneck References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> From: Tariq Toukan Message-ID: <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> Date: Thu, 2 Nov 2017 19:21:09 +0200 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=iso-8859-15; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Tariq Toukan , Linux Kernel Network Developers , linux-mm Cc: Mel Gorman , David Miller , Jesper Dangaard Brouer , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Andrew Morton , Michal Hocko On 18/09/2017 12:16 PM, Tariq Toukan wrote: > > > On 15/09/2017 1:23 PM, Mel Gorman wrote: >> On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote: >>> Insights: Major degradation between #1 and #2, not getting any >>> close to linerate! Degradation is fixed between #2 and #3. This is >>> because page allocator cannot stand the higher allocation rate. In >>> #2, we also see that the addition of rings (cores) reduces BW (!!), >>> as result of increasing congestion over shared resources. >>> >> >> Unfortunately, no surprises there. >> >>> Congestion in this case is very clear. When monitored in perf top: >>> 85.58% [kernel] [k] queued_spin_lock_slowpath >>> >> >> While it's not proven, the most likely candidate is the zone lock >> and that should be confirmed using a call-graph profile. If so, then >> the suggestion to tune to the size of the per-cpu allocator would >> mitigate the problem. >> > Indeed, I tuned the per-cpu allocator and bottleneck is released. > Hi all, After leaving this task for a while doing other tasks, I got back to it now and see that the good behavior I observed earlier was not stable. 
Recall: I work with a modified driver that allocates a page (4K) per packet (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps NICs. Performance is good as long as pages are available in the allocating core's PCP. Issue is that pages are allocated in one core, then free'd in another, making it hard for the PCP to work efficiently, and both the allocator core and the freeing core need to access the buddy allocator very often. I'd like to share with you some testing numbers:

Test: ./super_netperf 128 -H 24.134.0.51 -l 1000

100% cpu on all cores, top func in perf:
   84.98%  [kernel]  [k] queued_spin_lock_slowpath

system wide (all cores):
   1135941  kmem:mm_page_alloc
   2606629  kmem:mm_page_free
         0  kmem:mm_page_alloc_extfrag
   4784616  kmem:mm_page_alloc_zone_locked
      1337  kmem:mm_page_free_batched
   6488213  kmem:mm_page_pcpu_drain
   8925503  net:napi_gro_receive_entry

Two types of cores:

A core mostly running napi (8 such cores):
    221875  kmem:mm_page_alloc
     17100  kmem:mm_page_free
         0  kmem:mm_page_alloc_extfrag
    766584  kmem:mm_page_alloc_zone_locked
        16  kmem:mm_page_free_batched
        35  kmem:mm_page_pcpu_drain
   1340139  net:napi_gro_receive_entry

Other core, mostly running user application (40 such):
         2  kmem:mm_page_alloc
     38922  kmem:mm_page_free
         0  kmem:mm_page_alloc_extfrag
         1  kmem:mm_page_alloc_zone_locked
         8  kmem:mm_page_free_batched
    107289  kmem:mm_page_pcpu_drain
        34  net:napi_gro_receive_entry

As you can see, sync overhead is enormous. PCP-wise, a key improvement in such scenarios would be reached if we could (1) keep and handle the allocated page on same cpu, or (2) somehow get the page back to the allocating core's PCP in a fast-path, without going through the regular buddy allocator paths. Regards, Tariq >>> I think that page allocator issues should be discussed separately: 1) >>> Rate: Increase the allocation rate on a single core. 2) >>> Scalability: Reduce congestion and sync overhead between cores.
>>> >>> This is clearly the current bottleneck in the network stack receive >>> flow. >>> >>> I know about some efforts that were made in the past two years. For >>> example the ones from Jesper et al.: - Page-pool (not accepted >>> AFAIK). >> >> Indeed not and it would also need driver conversion. >> >>> - Page-allocation bulking. >> >> Prototypes exist but it's pointless without the pool or driver >> conversion so it's in the back burner for the moment. >> > > As I already mentioned in another reply (to Jesper), this would > perfectly fit with our Striding RQ feature, as we have large descriptors > that serve several packets, requiring the allocation of several pages at > once. I'd gladly move to using the bulking API. > >>> - Optimize order-0 allocations in Per-Cpu-Pages. >>> >> >> This had a prototype that was reverted as it must be able to cope >> with both irq and noirq contexts. > Yeah, I remember that I tested and reported the issue. > > Unfortunately I never found the time to >> revisit it but a split there to handle both would mitigate the >> problem. Probably not enough to actually reach line speed though so >> tuning of the per-cpu allocator sizes would still be needed. I don't >> know when I'll get the chance to revisit it. I'm travelling all next >> week and am mostly occupied with other work at the moment that is >> consuming all my concentration. >> >>> I am not an mm expert, but wanted to raise the issue again, to >>> combine the efforts and hear from you guys about status and >>> possible directions. >> >> The recent effort to reduce overhead from stats will help mitigate >> the problem. > I should get more familiar with these stats, check how costly they are, > and whether they can be turned off in Kconfig. > >> Finishing the page pool, the bulk allocator and converting drivers >> would be the most likely successful path forward but it's currently >> stalled as everyone that was previously involved is too busy. 
>> > I think we should consider changing the default allocation of PCP > fraction as well, or implement some smart dynamic heuristic. > This turned on to have significant effect over networking performance. > > Many thanks Mel! > > Regards, > Tariq -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id 889A86B025F for ; Fri, 3 Nov 2017 09:40:43 -0400 (EDT) Received: by mail-wr0-f198.google.com with SMTP id k15so1636113wrc.1 for ; Fri, 03 Nov 2017 06:40:43 -0700 (PDT) Received: from outbound-smtp09.blacknight.com (outbound-smtp09.blacknight.com. [46.22.139.14]) by mx.google.com with ESMTPS id s52si5075809eda.2.2017.11.03.06.40.42 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 03 Nov 2017 06:40:42 -0700 (PDT) Received: from outbound-smtp14.blacknight.com (outbound-smtp14.blacknight.com [46.22.139.231]) by outbound-smtp09.blacknight.com (Postfix) with ESMTPS id E35A11C2959 for ; Fri, 3 Nov 2017 13:40:41 +0000 (GMT) Received: from mail.blacknight.com (unknown [81.17.254.17]) by outbound-smtp14.blacknight.com (Postfix) with ESMTPS id D2AB51C29FA for ; Fri, 3 Nov 2017 13:40:41 +0000 (GMT) Date: Fri, 3 Nov 2017 13:40:20 +0000 From: Mel Gorman Subject: Re: Page allocator bottleneck Message-ID: <20171103134020.3hwquerifnc6k6qw@techsingularity.net> References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> Sender: owner-linux-mm@kvack.org List-ID: To: Tariq Toukan Cc: Linux Kernel Network Developers , linux-mm , David Miller , 
Jesper Dangaard Brouer , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Andrew Morton , Michal Hocko On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote: > > > On 18/09/2017 12:16 PM, Tariq Toukan wrote: > > > > > > On 15/09/2017 1:23 PM, Mel Gorman wrote: > > > On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote: > > > > Insights: Major degradation between #1 and #2, not getting any > > > > close to linerate! Degradation is fixed between #2 and #3. This is > > > > because page allocator cannot stand the higher allocation rate. In > > > > #2, we also see that the addition of rings (cores) reduces BW (!!), > > > > as result of increasing congestion over shared resources. > > > > > > > > > > Unfortunately, no surprises there. > > > > > > > Congestion in this case is very clear. When monitored in perf > > > > top: 85.58% [kernel] [k] queued_spin_lock_slowpath > > > > > > > > > > While it's not proven, the most likely candidate is the zone lock > > > and that should be confirmed using a call-graph profile. If so, then > > > the suggestion to tune to the size of the per-cpu allocator would > > > mitigate the problem. > > > > > Indeed, I tuned the per-cpu allocator and bottleneck is released. > > > > Hi all, > > After leaving this task for a while doing other tasks, I got back to it now > and see that the good behavior I observed earlier was not stable. > > Recall: I work with a modified driver that allocates a page (4K) per packet > (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps > NICs. > There is almost nothing new in the data that hasn't been discussed before. The suggestion to free on a remote per-cpu list would be expensive as it would require per-cpu lists to have a lock for safe remote access. However, I'd be curious if you could test the mm-pagealloc-irqpvec-v1r4 branch https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git . It's an unfinished prototype I worked on a few weeks ago.
I was going to revisit in about a month's time when 4.15-rc1 was out. I'd be interested in seeing if it has a positive gain in normal page allocations without destroying the performance of interrupt and softirq allocation contexts. The interrupt/softirq context testing is crucial as that is something that hurt us before when trying to improve page allocator performance. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id 2A3484403E0 for ; Wed, 8 Nov 2017 00:42:27 -0500 (EST) Received: by mail-pf0-f198.google.com with SMTP id 76so1381524pfr.3 for ; Tue, 07 Nov 2017 21:42:27 -0800 (PST) Received: from EUR02-AM5-obe.outbound.protection.outlook.com (mail-eopbgr00050.outbound.protection.outlook.com.
[40.107.0.50]) by mx.google.com with ESMTPS id w19si3135997pfa.59.2017.11.07.21.42.20 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Tue, 07 Nov 2017 21:42:21 -0800 (PST) Subject: Re: Page allocator bottleneck References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> <20171103134020.3hwquerifnc6k6qw@techsingularity.net> From: Tariq Toukan Message-ID: Date: Wed, 8 Nov 2017 14:42:04 +0900 MIME-Version: 1.0 In-Reply-To: <20171103134020.3hwquerifnc6k6qw@techsingularity.net> Content-Type: text/plain; charset=iso-8859-15; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman , Tariq Toukan Cc: Linux Kernel Network Developers , linux-mm , David Miller , Jesper Dangaard Brouer , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Andrew Morton , Michal Hocko On 03/11/2017 10:40 PM, Mel Gorman wrote: > On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote: >> >> >> On 18/09/2017 12:16 PM, Tariq Toukan wrote: >>> >>> >>> On 15/09/2017 1:23 PM, Mel Gorman wrote: >>>> On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote: >>>>> Insights: Major degradation between #1 and #2, not getting any >>>>> close to linerate! Degradation is fixed between #2 and #3. This is >>>>> because page allocator cannot stand the higher allocation rate. In >>>>> #2, we also see that the addition of rings (cores) reduces BW (!!), >>>>> as result of increasing congestion over shared resources. >>>>> >>>> >>>> Unfortunately, no surprises there. >>>> >>>>> Congestion in this case is very clear. When monitored in perf >>>>> top: 85.58% [kernel] [k] queued_spin_lock_slowpath >>>>> >>>> >>>> While it's not proven, the most likely candidate is the zone lock >>>> and that should be confirmed using a call-graph profile. If so, then >>>> the suggestion to tune to the size of the per-cpu allocator would >>>> mitigate the problem. 
>>>> >>> Indeed, I tuned the per-cpu allocator and bottleneck is released. >>> >> >> Hi all, >> >> After leaving this task for a while doing other tasks, I got back to it now >> and see that the good behavior I observed earlier was not stable. >> >> Recall: I work with a modified driver that allocates a page (4K) per packet >> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps >> NICs. >> > > There is almost new in the data that hasn't been discussed before. The > suggestion to free on a remote per-cpu list would be expensive as it would > require per-cpu lists to have a lock for safe remote access. That's right, but each such lock will be significantly less congested than the buddy allocator lock. In the flow in subject two cores need to synchronize (one allocates, one frees). We also need to evaluate the cost of acquiring and releasing the lock in the case of no congestion at all. > However, > I'd be curious if you could test the mm-pagealloc-irqpvec-v1r4 branch > ttps://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git . It's an > unfinished prototype I worked on a few weeks ago. I was going to revisit > in about a months time when 4.15-rc1 was out. I'd be interested in seeing > if it has a postive gain in normal page allocations without destroying > the performance of interrupt and softirq allocation contexts. The > interrupt/softirq context testing is crucial as that is something that > hurt us before when trying to improve page allocator performance. > Yes, I will test that once I get back in office (after netdev conference and vacation). Can you please elaborate in a few words about the idea behind the prototype? Does it address page-allocator scalability issues, or only the rate of single core page allocations? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
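The idea debated in the message above, giving each per-cpu list its own lock so that a remote core can hand pages back to the allocating core, can be pictured with a toy Python model. This is a sketch under stated assumptions, not kernel code: `LockedPerCpuList` and `remote_free` are hypothetical names, pages are plain Python objects, and `threading.Lock` only models mutual exclusion, not the cost of a real spinlock.

```python
import threading
from collections import deque


class LockedPerCpuList:
    """Toy stand-in for one CPU's page list, guarded by its own lock."""

    def __init__(self, cpu):
        self.cpu = cpu
        self.lock = threading.Lock()
        self.pages = deque()

    def alloc(self):
        # Local fast path: take a page from this CPU's list if one exists.
        with self.lock:
            return self.pages.pop() if self.pages else None

    def free(self, page):
        # The same lock makes a free from another CPU safe.
        with self.lock:
            self.pages.append(page)


def remote_free(lists, page, home_cpu):
    """Return a page to the list of the CPU that originally allocated it."""
    lists[home_cpu].free(page)
```

In this model only the allocating core and the freeing core ever touch a given list's lock, which is the two-party synchronization the message refers to; how that compares with zone lock contention on real hardware is exactly the open question in the thread.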
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f69.google.com (mail-wm0-f69.google.com [74.125.82.69]) by kanga.kvack.org (Postfix) with ESMTP id 7C5AE4403E0 for ; Wed, 8 Nov 2017 04:35:50 -0500 (EST) Received: by mail-wm0-f69.google.com with SMTP id e8so2054935wmc.2 for ; Wed, 08 Nov 2017 01:35:50 -0800 (PST) Received: from outbound-smtp04.blacknight.com (outbound-smtp04.blacknight.com. [81.17.249.35]) by mx.google.com with ESMTPS id a23si3886440edn.387.2017.11.08.01.35.48 for (version=TLS1 cipher=AES128-SHA bits=128/128); Wed, 08 Nov 2017 01:35:49 -0800 (PST) Received: from mail.blacknight.com (pemlinmail06.blacknight.ie [81.17.255.152]) by outbound-smtp04.blacknight.com (Postfix) with ESMTPS id A51D298C20 for ; Wed, 8 Nov 2017 09:35:48 +0000 (UTC) Date: Wed, 8 Nov 2017 09:35:47 +0000 From: Mel Gorman Subject: Re: Page allocator bottleneck Message-ID: <20171108093547.ctsjv4a42xjvfsf7@techsingularity.net> References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> <20171103134020.3hwquerifnc6k6qw@techsingularity.net> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Tariq Toukan Cc: Linux Kernel Network Developers , linux-mm , David Miller , Jesper Dangaard Brouer , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Andrew Morton , Michal Hocko On Wed, Nov 08, 2017 at 02:42:04PM +0900, Tariq Toukan wrote: > > > Hi all, > > > > > > After leaving this task for a while doing other tasks, I got back to it now > > > and see that the good behavior I observed earlier was not stable. > > > > > > Recall: I work with a modified driver that allocates a page (4K) per packet > > > (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps > > > NICs. > > > > > > > There is almost new in the data that hasn't been discussed before. 
The > > suggestion to free on a remote per-cpu list would be expensive as it would > > require per-cpu lists to have a lock for safe remote access. > > That's right, but each such lock will be significantly less congested than > the buddy allocator lock. That is not necessarily true if all the allocations and frees always happen on the same CPUs. The contention will be equivalent to the zone lock. Your point will only hold true if there are also heavy allocation streams from other CPUs that are unrelated. > In the flow in subject two cores need to > synchronize (one allocates, one frees). > We also need to evaluate the cost of acquiring and releasing the lock in the > case of no congestion at all. > If the per-cpu structures have a lock, there will be a light amount of overhead. Nothing too severe, but it shouldn't be done lightly either. > > However, > > I'd be curious if you could test the mm-pagealloc-irqpvec-v1r4 branch > > ttps://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git . It's an > > unfinished prototype I worked on a few weeks ago. I was going to revisit > > in about a months time when 4.15-rc1 was out. I'd be interested in seeing > > if it has a postive gain in normal page allocations without destroying > > the performance of interrupt and softirq allocation contexts. The > > interrupt/softirq context testing is crucial as that is something that > > hurt us before when trying to improve page allocator performance. > > > Yes, I will test that once I get back in office (after netdev conference and > vacation). Thanks. > Can you please elaborate in a few words about the idea behind the prototype? > Does it address page-allocator scalability issues, or only the rate of > single core page allocations? Short answer -- maybe. All scalability issues or rates of allocation are context and workload dependant so the question is impossible to answer for the general case. 
Broadly speaking, the patch reintroduces the per-cpu lists being for !irq context allocations again. The last time we did this, hard and soft IRQ allocations went through the buddy allocator which couldn't scale and the patch was reverted. With this patch, it goes through a very large pagevec-like structure that is protected by a lock but the fast paths for alloc/free are extremely simple operations so the lock hold times are very small. Potentially, a development path is that the current per-cpu allocator is replaced with pagevec-like structures that are dynamically allocated which would also allow pages to be freed to remote CPU lists (if we could detect when that is appropriate which is unclear). We could also drain remote lists without using IPIs. The downside is that the memory footprint of the allocator would be higher and the size could no longer be tuned so there would need to be excellent justification for such a move. I haven't posted the patches properly yet because mmotm is carrying too many patches as it is and this patch indirectly depends on the contents. I also didn't write memory hot-remove support which would be a requirement before merging. I hadn't intended to put further effort into it until I had some evidence the approach had promise. My own testing indicated it worked but the drivers I was using for network tests did not allocate intensely enough to show any major gain/loss. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f71.google.com (mail-oi0-f71.google.com [209.85.218.71]) by kanga.kvack.org (Postfix) with ESMTP id A46F6440460 for ; Wed, 8 Nov 2017 22:51:04 -0500 (EST) Received: by mail-oi0-f71.google.com with SMTP id 82so3603795oid.11 for ; Wed, 08 Nov 2017 19:51:04 -0800 (PST) Received: from mail-sor-f41.google.com (mail-sor-f41.google.com. [209.85.220.41]) by mx.google.com with SMTPS id e29sor952872oth.163.2017.11.08.19.51.03 for (Google Transport Security); Wed, 08 Nov 2017 19:51:03 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20171108093547.ctsjv4a42xjvfsf7@techsingularity.net> References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> <20171103134020.3hwquerifnc6k6qw@techsingularity.net> <20171108093547.ctsjv4a42xjvfsf7@techsingularity.net> From: "Figo.zhang" Date: Thu, 9 Nov 2017 11:51:02 +0800 Message-ID: Subject: Re: Page allocator bottleneck Content-Type: multipart/alternative; boundary="001a113e55ec8eb87c055d84b66e" Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Tariq Toukan , Linux Kernel Network Developers , linux-mm , David Miller , Jesper Dangaard Brouer , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Andrew Morton , Michal Hocko --001a113e55ec8eb87c055d84b66e Content-Type: text/plain; charset="UTF-8" @Tariq, some ideas would steal from DPDK to improve the high speed network card? such as a physical CPU dedicated for the RX and TX thread (no context switch and interrupt latency), and the memory has prepared and allocated. 2017-11-08 17:35 GMT+08:00 Mel Gorman : > On Wed, Nov 08, 2017 at 02:42:04PM +0900, Tariq Toukan wrote: > > > > Hi all, > > > > > > > > After leaving this task for a while doing other tasks, I got back to > it now > > > > and see that the good behavior I observed earlier was not stable. 
> > > > > > > > Recall: I work with a modified driver that allocates a page (4K) per > packet > > > > (MTU=1500), in order to simulate the stress on page-allocator in > 200Gbps > > > > NICs. > > > > > > > > > > There is almost new in the data that hasn't been discussed before. The > > > suggestion to free on a remote per-cpu list would be expensive as it > would > > > require per-cpu lists to have a lock for safe remote access. > > > > That's right, but each such lock will be significantly less congested > than > > the buddy allocator lock. > > That is not necessarily true if all the allocations and frees always happen > on the same CPUs. The contention will be equivalent to the zone lock. > Your point will only hold true if there are also heavy allocation streams > from other CPUs that are unrelated. > > > In the flow in subject two cores need to > > synchronize (one allocates, one frees). > > We also need to evaluate the cost of acquiring and releasing the lock in > the > > case of no congestion at all. > > > > If the per-cpu structures have a lock, there will be a light amount of > overhead. Nothing too severe, but it shouldn't be done lightly either. > > > > However, > > > I'd be curious if you could test the mm-pagealloc-irqpvec-v1r4 branch > > > ttps://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git . It's > an > > > unfinished prototype I worked on a few weeks ago. I was going to > revisit > > > in about a months time when 4.15-rc1 was out. I'd be interested in > seeing > > > if it has a postive gain in normal page allocations without destroying > > > the performance of interrupt and softirq allocation contexts. The > > > interrupt/softirq context testing is crucial as that is something that > > > hurt us before when trying to improve page allocator performance. > > > > > Yes, I will test that once I get back in office (after netdev conference > and > > vacation). > > Thanks. 
> > > Can you please elaborate in a few words about the idea behind the > prototype? > > Does it address page-allocator scalability issues, or only the rate of > > single core page allocations? > > Short answer -- maybe. All scalability issues or rates of allocation are > context and workload dependant so the question is impossible to answer > for the general case. > > Broadly speaking, the patch reintroduces the per-cpu lists being for !irq > context allocations again. The last time we did this, hard and soft IRQ > allocations went through the buddy allocator which couldn't scale and > the patch was reverted. With this patch, it goes through a very large > pagevec-like structure that is protected by a lock but the fast paths > for alloc/free are extremely simple operations so the lock hold times are > very small. Potentially, a development path is that the current per-cpu > allocator is replaced with pagevec-like structures that are dynamically > allocated which would also allow pages to be freed to remote CPU lists > (if we could detect when that is appropriate which is unclear). We could > also drain remote lists without using IPIs. The downside is that the memory > footprint of the allocator would be higher and the size could no longer > be tuned so there would need to be excellent justification for such a move. > > I haven't posted the patches properly yet because mmotm is carrying too > many patches as it is and this patch indirectly depends on the contents. I > also didn't write memory hot-remove support which would be a requirement > before merging. I hadn't intended to put further effort into it until I > had some evidence the approach had promise. My own testing indicated it > worked but the drivers I was using for network tests did not allocate > intensely enough to show any major gain/loss. > > -- > Mel Gorman > SUSE Labs > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. 
For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org > --001a113e55ec8eb87c055d84b66e Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
--001a113e55ec8eb87c055d84b66e-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id C11EF440460 for ; Thu, 9 Nov 2017 00:07:10 -0500 (EST) Received: by mail-pf0-f199.google.com with SMTP id p87so4250596pfj.21 for ; Wed, 08 Nov 2017 21:07:10 -0800 (PST) Received: from EUR01-HE1-obe.outbound.protection.outlook.com (mail-he1eur01on0054.outbound.protection.outlook.com. [104.47.0.54]) by mx.google.com with ESMTPS id m8si5319212pgt.327.2017.11.08.21.07.09 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Wed, 08 Nov 2017 21:07:09 -0800 (PST) Subject: Re: Page allocator bottleneck References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> <20171103134020.3hwquerifnc6k6qw@techsingularity.net> <20171108093547.ctsjv4a42xjvfsf7@techsingularity.net> From: Tariq Toukan Message-ID: Date: Thu, 9 Nov 2017 14:06:33 +0900 MIME-Version: 1.0 In-Reply-To: <20171108093547.ctsjv4a42xjvfsf7@techsingularity.net> Content-Type: text/plain; charset=iso-8859-15; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman , Tariq Toukan Cc: Linux Kernel Network Developers , linux-mm , David Miller , Jesper Dangaard Brouer , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Andrew Morton , Michal Hocko On 08/11/2017 6:35 PM, Mel Gorman wrote: > On Wed, Nov 08, 2017 at 02:42:04PM +0900, Tariq Toukan wrote: >>>> Hi all, >>>> >>>> After leaving this task for a while doing other tasks, I got back to it now >>>> and see that the good behavior I observed earlier was not stable. 
>>>> >>>> Recall: I work with a modified driver that allocates a page (4K) per packet >>>> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps >>>> NICs. >>>> >>> >>> There is almost new in the data that hasn't been discussed before. The >>> suggestion to free on a remote per-cpu list would be expensive as it would >>> require per-cpu lists to have a lock for safe remote access. >> >> That's right, but each such lock will be significantly less congested than >> the buddy allocator lock. > > That is not necessarily true if all the allocations and frees always happen > on the same CPUs. The contention will be equivalent to the zone lock. > Your point will only hold true if there are also heavy allocation streams > from other CPUs that are unrelated. That's exactly the case. I saw no issues when working with a single core allocating pages (and many others consuming the SKBs), this does not stress the buddy allocator enough to expose the problem. On my server, problem becomes visible when working with >= 4 allocator cores (RX rings). So "distributing" the locks between the different PCPs and doing remote-free (instead of using the centralized buddy allocator lock), would give a huge performance under high load (although it might cause a slight degradation when load is low). > >> In the flow in subject two cores need to >> synchronize (one allocates, one frees). >> We also need to evaluate the cost of acquiring and releasing the lock in the >> case of no congestion at all. >> > > If the per-cpu structures have a lock, there will be a light amount of > overhead. Nothing too severe, but it shouldn't be done lightly either. > If the trade-off is a huge gain under load, it might be worth it. >>> However, >>> I'd be curious if you could test the mm-pagealloc-irqpvec-v1r4 branch >>> ttps://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git . It's an >>> unfinished prototype I worked on a few weeks ago. 
I was going to revisit >>> in about a months time when 4.15-rc1 was out. I'd be interested in seeing >>> if it has a postive gain in normal page allocations without destroying >>> the performance of interrupt and softirq allocation contexts. The >>> interrupt/softirq context testing is crucial as that is something that >>> hurt us before when trying to improve page allocator performance. >>> >> Yes, I will test that once I get back in office (after netdev conference and >> vacation). > > Thanks. > >> Can you please elaborate in a few words about the idea behind the prototype? >> Does it address page-allocator scalability issues, or only the rate of >> single core page allocations? > > Short answer -- maybe. All scalability issues or rates of allocation are > context and workload dependant so the question is impossible to answer > for the general case. > > Broadly speaking, the patch reintroduces the per-cpu lists being for !irq > context allocations again. The last time we did this, hard and soft IRQ > allocations went through the buddy allocator which couldn't scale and > the patch was reverted. With this patch, it goes through a very large > pagevec-like structure that is protected by a lock but the fast paths > for alloc/free are extremely simple operations so the lock hold times are > very small. Potentially, a development path is that the current per-cpu > allocator is replaced with pagevec-like structures that are dynamically > allocated which would also allow pages to be freed to remote CPU lists > (if we could detect when that is appropriate which is unclear). We could > also drain remote lists without using IPIs. The downside is that the memory > footprint of the allocator would be higher and the size could no longer > be tuned so there would need to be excellent justification for such a move. > > I haven't posted the patches properly yet because mmotm is carrying too > many patches as it is and this patch indirectly depends on the contents. 
I > also didn't write memory hot-remove support which would be a requirement > before merging. I hadn't intended to put further effort into it until I > had some evidence the approach had promise. My own testing indicated it > worked but the drivers I was using for network tests did not allocate > intensely enough to show any major gain/loss. > Thanks for the description. This sounds intriguing. Once I'll get to testing it, I'll magnify the effect by stressing the page-allocator the same way I did earlier to simulate a load of 200Gbps. Regards, Tariq -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ot0-f198.google.com (mail-ot0-f198.google.com [74.125.82.198]) by kanga.kvack.org (Postfix) with ESMTP id DDD6A440460 for ; Thu, 9 Nov 2017 00:21:37 -0500 (EST) Received: by mail-ot0-f198.google.com with SMTP id w17so1233186oti.22 for ; Wed, 08 Nov 2017 21:21:37 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id l128si1018821oig.518.2017.11.08.21.21.36 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 08 Nov 2017 21:21:36 -0800 (PST) Date: Thu, 9 Nov 2017 06:21:01 +0100 From: Jesper Dangaard Brouer Subject: Re: Page allocator bottleneck Message-ID: <20171109062101.64bde3b6@redhat.com> In-Reply-To: <20171108093547.ctsjv4a42xjvfsf7@techsingularity.net> References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> <20171103134020.3hwquerifnc6k6qw@techsingularity.net> <20171108093547.ctsjv4a42xjvfsf7@techsingularity.net> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Tariq Toukan , Linux Kernel Network Developers , linux-mm , David Miller , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Andrew Morton , Michal Hocko , brouer@redhat.com, "Michael S. Tsirkin" On Wed, 8 Nov 2017 09:35:47 +0000 Mel Gorman wrote: > On Wed, Nov 08, 2017 at 02:42:04PM +0900, Tariq Toukan wrote: > > > > Hi all, > > > > > > > > After leaving this task for a while doing other tasks, I got back to it now > > > > and see that the good behavior I observed earlier was not stable. > > > > > > > > Recall: I work with a modified driver that allocates a page (4K) per packet > > > > (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps > > > > NICs. > > > > > > > > > > There is almost new in the data that hasn't been discussed before. The > > > suggestion to free on a remote per-cpu list would be expensive as it would > > > require per-cpu lists to have a lock for safe remote access. > > > > That's right, but each such lock will be significantly less congested than > > the buddy allocator lock. > > That is not necessarily true if all the allocations and frees always happen > on the same CPUs. 
The contention will be equivalent to the zone lock. > Your point will only hold true if there are also heavy allocation streams > from other CPUs that are unrelated. > > > In the flow in subject two cores need to > > synchronize (one allocates, one frees). > > We also need to evaluate the cost of acquiring and releasing the lock in the > > case of no congestion at all. > > > > If the per-cpu structures have a lock, there will be a light amount of > overhead. Nothing too severe, but it shouldn't be done lightly either. > > > > However, > > > I'd be curious if you could test the mm-pagealloc-irqpvec-v1r4 branch > > > ttps://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git . It's an > > > unfinished prototype I worked on a few weeks ago. I was going to revisit > > > in about a months time when 4.15-rc1 was out. I'd be interested in seeing > > > if it has a postive gain in normal page allocations without destroying > > > the performance of interrupt and softirq allocation contexts. The > > > interrupt/softirq context testing is crucial as that is something that > > > hurt us before when trying to improve page allocator performance. > > > > > Yes, I will test that once I get back in office (after netdev conference and > > vacation). > > Thanks. I'll also commit to testing this (when I return home, as Tariq I'm also in Seoul ATM). > > Can you please elaborate in a few words about the idea behind the prototype? > > Does it address page-allocator scalability issues, or only the rate of > > single core page allocations? > > Short answer -- maybe. All scalability issues or rates of allocation are > context and workload dependant so the question is impossible to answer > for the general case. > > Broadly speaking, the patch reintroduces the per-cpu lists being for !irq > context allocations again. The last time we did this, hard and soft IRQ > allocations went through the buddy allocator which couldn't scale and > the patch was reverted. 
With this patch, it goes through a very large > pagevec-like structure that is protected by a lock but the fast paths > for alloc/free are extremely simple operations so the lock hold times are > very small. Potentially, a development path is that the current per-cpu > allocator is replaced with pagevec-like structures that are dynamically > allocated which would also allow pages to be freed to remote CPU lists I've had huge success using ptr_ring, as a queue between CPUs, to minimize cross-CPU cache-line touching. With the recently accepted BPF map called "cpumap" used for XDP_REDIRECT. It's important to handle the two borderline cases in ptr_ring, of the queue being almost full (default handled in ptr_ring) or almost empty. Like describe in[1] slide 14: [1] http://people.netfilter.org/hawk/presentations/NetConf2017_Seoul/XDP_devel_update_NetConf2017_Seoul.pdf The use of XDP_REDIRECT + cpumap, do expose issues with the page allocator. E.g. slide 19 show ixgbe recycle scheme failing, but still hitting the PCP. Also notice slide 22 deducing the overhead. Scale stressing ptr_ring is showed in extra slides 35-39. > (if we could detect when that is appropriate which is unclear). We could > also drain remote lists without using IPIs. The downside is that the memory > footprint of the allocator would be higher and the size could no longer > be tuned so there would need to be excellent justification for such a move. > > I haven't posted the patches properly yet because mmotm is carrying too > many patches as it is and this patch indirectly depends on the contents. I > also didn't write memory hot-remove support which would be a requirement > before merging. I hadn't intended to put further effort into it until I > had some evidence the approach had promise. My own testing indicated it > worked but the drivers I was using for network tests did not allocate > intensely enough to show any major gain/loss. 
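The ptr_ring-style queue between CPUs mentioned above can be pictured with a small model. This is only a sketch of the general idea as I understand it (an empty slot marks "free", so producer and consumer each keep a private index and never read the other's); the class and method names are made up for illustration and are not the kernel API. It deliberately leaves out locking, memory barriers and the almost-full/almost-empty handling.

```python
class PtrRing:
    """Toy single-producer/single-consumer ring, loosely modelled on the
    idea behind the kernel's ptr_ring. Names here are hypothetical."""

    def __init__(self, size):
        self.slots = [None] * size  # None plays the role of a NULL pointer
        self.producer = 0           # index touched only by the producing CPU
        self.consumer = 0           # index touched only by the consuming CPU

    def produce(self, item):
        """Enqueue item; return False if the ring is full."""
        if item is None:
            raise ValueError("None is reserved to mark an empty slot")
        if self.slots[self.producer] is not None:
            return False            # slot still occupied: ring is full
        self.slots[self.producer] = item
        self.producer = (self.producer + 1) % len(self.slots)
        return True

    def consume(self):
        """Dequeue the oldest item, or return None if the ring is empty."""
        item = self.slots[self.consumer]
        if item is None:
            return None             # nothing queued at our index
        self.slots[self.consumer] = None  # hand the slot back to the producer
        self.consumer = (self.consumer + 1) % len(self.slots)
        return item
```

Fullness and emptiness are both decided by looking at the slot itself, so neither side has to read the other side's index. That is the property that keeps cross-CPU cache-line traffic down in this kind of design.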
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f197.google.com (mail-pf0-f197.google.com [209.85.192.197]) by kanga.kvack.org (Postfix) with ESMTP id 73FC46B0005 for ; Sat, 21 Apr 2018 04:14:41 -0400 (EDT) Received: by mail-pf0-f197.google.com with SMTP id p189so5916992pfp.1 for ; Sat, 21 Apr 2018 01:14:41 -0700 (PDT) Received: from mga01.intel.com (mga01.intel.com. [192.55.52.88]) by mx.google.com with ESMTPS id 33-v6si7799064plb.19.2018.04.21.01.14.39 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 21 Apr 2018 01:14:39 -0700 (PDT) Date: Sat, 21 Apr 2018 16:15:05 +0800 From: Aaron Lu Subject: Re: Page allocator bottleneck Message-ID: <20180421081505.GA24916@intel.com> References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> Sender: owner-linux-mm@kvack.org List-ID: To: Tariq Toukan Cc: Linux Kernel Network Developers , linux-mm , Mel Gorman , David Miller , Jesper Dangaard Brouer , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Andrew Morton , Michal Hocko Sorry to bring up an old thread... On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote: > > > On 18/09/2017 12:16 PM, Tariq Toukan wrote: > > > > > > On 15/09/2017 1:23 PM, Mel Gorman wrote: > > > On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote: > > > > Insights: Major degradation between #1 and #2, not getting any > > > > close to linerate! Degradation is fixed between #2 and #3. 
This is > > > > because page allocator cannot stand the higher allocation rate. In > > > > #2, we also see that the addition of rings (cores) reduces BW (!!), > > > > as result of increasing congestion over shared resources. > > > > > > > > > > Unfortunately, no surprises there. > > > > > > > Congestion in this case is very clear. When monitored in perf > > > > top: 85.58% [kernel] [k] queued_spin_lock_slowpath > > > > > > > > > > While it's not proven, the most likely candidate is the zone lock > > > and that should be confirmed using a call-graph profile. If so, then > > > the suggestion to tune to the size of the per-cpu allocator would > > > mitigate the problem. > > > > > Indeed, I tuned the per-cpu allocator and bottleneck is released. > > > > Hi all, > > After leaving this task for a while doing other tasks, I got back to it now > and see that the good behavior I observed earlier was not stable. I posted a patchset to improve zone->lock contention for order-0 pages recently, it can almost eliminate 80% zone->lock contention for will-it-scale/page_fault1 testcase when tested on a 2 sockets Intel Skylake server and it doesn't require PCP size tune, so should have some effects on your workload where one CPU does allocation while another does free. It did this by some disruptive changes: 1 on free path, it skipped doing merge(so could be bad for mixed workloads where both 4K and high order pages are needed); 2 on allocation path, it avoided touching multiple cachelines. RFC v2 patchset: https://lkml.org/lkml/2018/3/20/171 repo: https://github.com/aaronlu/linux zone_lock_rfc_v2 > Recall: I work with a modified driver that allocates a page (4K) per packet > (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps > NICs. > > Performance is good as long as pages are available in the allocating cores's > PCP. 
> Issue is that pages are allocated in one core, then free'd in another, > making it's hard for the PCP to work efficiently, and both the allocator > core and the freeing core need to access the buddy allocator very often. > > I'd like to share with you some testing numbers: > > Test: ./super_netperf 128 -H 24.134.0.51 -l 1000 > > 100% cpu on all cores, top func in perf: > 84.98% [kernel] [k] queued_spin_lock_slowpath > > system wide (all cores) > 1135941 kmem:mm_page_alloc > > 2606629 kmem:mm_page_free > > 0 kmem:mm_page_alloc_extfrag > 4784616 kmem:mm_page_alloc_zone_locked > > 1337 kmem:mm_page_free_batched > > 6488213 kmem:mm_page_pcpu_drain > > 8925503 net:napi_gro_receive_entry > > > Two types of cores: > A core mostly running napi (8 such cores): > 221875 kmem:mm_page_alloc > > 17100 kmem:mm_page_free > > 0 kmem:mm_page_alloc_extfrag > 766584 kmem:mm_page_alloc_zone_locked > > 16 kmem:mm_page_free_batched > > 35 kmem:mm_page_pcpu_drain > > 1340139 net:napi_gro_receive_entry > > > Other core, mostly running user application (40 such): > 2 kmem:mm_page_alloc > > 38922 kmem:mm_page_free > > 0 kmem:mm_page_alloc_extfrag > 1 kmem:mm_page_alloc_zone_locked > > 8 kmem:mm_page_free_batched > > 107289 kmem:mm_page_pcpu_drain > > 34 net:napi_gro_receive_entry > > > As you can see, sync overhead is enormous. > > PCP-wise, a key improvement in such scenarios would be reached if we could > (1) keep and handle the allocated page on same cpu, or (2) somehow get the > page back to the allocating core's PCP in a fast-path, without going through > the regular buddy allocator paths. 
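To make option (2) above a bit more concrete, here is a toy model that counts how often the shared zone lock would be taken when one core allocates and another frees. Everything in it is an assumption for illustration: the class names, the `batch=31` / `high=186` PCP settings, the `in_flight` depth, and the unbounded return queue (a real one would be bounded). It counts zone-lock acquisitions only, so it says nothing about the cost of the return queue itself, which would have its own synchronization.

```python
from collections import deque


class Zone:
    """Stand-in for the buddy allocator; only counts lock acquisitions."""

    def __init__(self):
        self.lock_acquisitions = 0
        self.next_page = 0

    def take(self, n):
        self.lock_acquisitions += 1          # one lock round-trip per refill
        pages = list(range(self.next_page, self.next_page + n))
        self.next_page += n
        return pages

    def give(self, pages):
        self.lock_acquisitions += 1          # one lock round-trip per drain


class Pcp:
    """Per-CPU page list with an optional (hypothetical) return queue."""

    def __init__(self, zone, batch=31, high=186):
        self.zone, self.batch, self.high = zone, batch, high
        self.pages = []
        self.returned = deque()              # pages freed remotely, if enabled

    def alloc(self, use_return_queue):
        if not self.pages and use_return_queue and self.returned:
            n = min(self.batch, len(self.returned))
            self.pages = [self.returned.popleft() for _ in range(n)]
        if not self.pages:
            self.pages = self.zone.take(self.batch)
        return self.pages.pop()

    def free_local(self, page):
        self.pages.append(page)
        if len(self.pages) >= self.high:     # list too long: drain one batch
            drained = self.pages[:self.batch]
            del self.pages[:self.batch]
            self.zone.give(drained)


def simulate(n_packets, use_return_queue, in_flight=64):
    """RX core allocates one page per packet; the app core frees it later."""
    zone = Zone()
    rx, app = Pcp(zone), Pcp(zone)
    pending = deque()
    for _ in range(n_packets):
        pending.append(rx.alloc(use_return_queue))
        if len(pending) > in_flight:
            page = pending.popleft()
            if use_return_queue:
                rx.returned.append(page)     # hand back to the allocating core
            else:
                app.free_local(page)         # ends up draining to the zone
    return zone.lock_acquisitions


if __name__ == "__main__":
    print("free to local PCP :", simulate(100_000, use_return_queue=False))
    print("return to RX core :", simulate(100_000, use_return_queue=True))
```

In this model the baseline takes the zone lock roughly twice per batch of pages (one refill on the RX core, one drain on the freeing core), while the return-queue variant only touches it during warm-up. Real behaviour depends on the workload and on how the return path is synchronized, which the model ignores.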
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id 12D136B0005 for ; Sun, 22 Apr 2018 12:43:44 -0400 (EDT) Received: by mail-pg0-f71.google.com with SMTP id y129so4954207pgb.5 for ; Sun, 22 Apr 2018 09:43:44 -0700 (PDT) Received: from EUR01-DB5-obe.outbound.protection.outlook.com (mail-db5eur01on0055.outbound.protection.outlook.com. [104.47.2.55]) by mx.google.com with ESMTPS id v23si9540956pfk.116.2018.04.22.09.43.42 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Sun, 22 Apr 2018 09:43:42 -0700 (PDT) Subject: Re: Page allocator bottleneck References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> <20180421081505.GA24916@intel.com> From: Tariq Toukan Message-ID: <127df719-b978-60b7-5d77-3c8efbf2ecff@mellanox.com> Date: Sun, 22 Apr 2018 19:43:29 +0300 MIME-Version: 1.0 In-Reply-To: <20180421081505.GA24916@intel.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu Cc: Linux Kernel Network Developers , linux-mm , Mel Gorman , David Miller , Jesper Dangaard Brouer , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Andrew Morton , Michal Hocko On 21/04/2018 11:15 AM, Aaron Lu wrote: > Sorry to bring up an old thread... > I want to thank you very much for bringing this up! > On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote: >> >> >> On 18/09/2017 12:16 PM, Tariq Toukan wrote: >>> >>> >>> On 15/09/2017 1:23 PM, Mel Gorman wrote: >>>> On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote: >>>>> Insights: Major degradation between #1 and #2, not getting any >>>>> close to linerate! Degradation is fixed between #2 and #3. This is >>>>> because page allocator cannot stand the higher allocation rate. 
In >>>>> #2, we also see that the addition of rings (cores) reduces BW (!!), >>>>> as result of increasing congestion over shared resources. >>>>> >>>> >>>> Unfortunately, no surprises there. >>>> >>>>> Congestion in this case is very clear. When monitored in perf >>>>> top: 85.58% [kernel] [k] queued_spin_lock_slowpath >>>>> >>>> >>>> While it's not proven, the most likely candidate is the zone lock >>>> and that should be confirmed using a call-graph profile. If so, then >>>> the suggestion to tune to the size of the per-cpu allocator would >>>> mitigate the problem. >>>> >>> Indeed, I tuned the per-cpu allocator and bottleneck is released. >>> >> >> Hi all, >> >> After leaving this task for a while doing other tasks, I got back to it now >> and see that the good behavior I observed earlier was not stable. > > I posted a patchset to improve zone->lock contention for order-0 pages > recently, it can almost eliminate 80% zone->lock contention for > will-it-scale/page_fault1 testcase when tested on a 2 sockets Intel > Skylake server and it doesn't require PCP size tune, so should have > some effects on your workload where one CPU does allocation while > another does free. > That is great news. In our driver's memory scheme (and many others as well) we allocate only order-0 pages (the only flow that does not do that yet in upstream will do so very soon, we already have the patches in our internal branch). Allocation of order-0 pages is not only the common case, but is the only type of allocation in our data-path. Let's optimize it! > It did this by some disruptive changes: > 1 on free path, it skipped doing merge(so could be bad for mixed > workloads where both 4K and high order pages are needed); I think there are so many advantages to not using high order allocations, especially in production servers that are not rebooted for long periods and become fragmented. 
AFAIK, the community direction (at least in networking) is using order-0 pages in datapath, so optimizing their allocation is a very good idea. Of course, we need to evaluate possible performance degradations, and see how important these use cases are. > 2 on allocation path, it avoided touching multiple cachelines. > Great! > RFC v2 patchset: > https://lkml.org/lkml/2018/3/20/171 > > repo: > https://github.com/aaronlu/linux zone_lock_rfc_v2 > I will check them out first thing tomorrow! p.s., I will be on vacation for a week starting Tuesday. I hope I can make some progress before that :) Thanks, Tariq > >> Recall: I work with a modified driver that allocates a page (4K) per packet >> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps >> NICs. >> >> Performance is good as long as pages are available in the allocating core's >> PCP. >> Issue is that pages are allocated in one core, then free'd in another, >> making it hard for the PCP to work efficiently, and both the allocator >> core and the freeing core need to access the buddy allocator very often.
>> >> I'd like to share with you some testing numbers: >> >> Test: ./super_netperf 128 -H 24.134.0.51 -l 1000 >> >> 100% cpu on all cores, top func in perf: >> 84.98% [kernel] [k] queued_spin_lock_slowpath >> >> system wide (all cores) >> 1135941 kmem:mm_page_alloc >> >> 2606629 kmem:mm_page_free >> >> 0 kmem:mm_page_alloc_extfrag >> 4784616 kmem:mm_page_alloc_zone_locked >> >> 1337 kmem:mm_page_free_batched >> >> 6488213 kmem:mm_page_pcpu_drain >> >> 8925503 net:napi_gro_receive_entry >> >> >> Two types of cores: >> A core mostly running napi (8 such cores): >> 221875 kmem:mm_page_alloc >> >> 17100 kmem:mm_page_free >> >> 0 kmem:mm_page_alloc_extfrag >> 766584 kmem:mm_page_alloc_zone_locked >> >> 16 kmem:mm_page_free_batched >> >> 35 kmem:mm_page_pcpu_drain >> >> 1340139 net:napi_gro_receive_entry >> >> >> Other core, mostly running user application (40 such): >> 2 kmem:mm_page_alloc >> >> 38922 kmem:mm_page_free >> >> 0 kmem:mm_page_alloc_extfrag >> 1 kmem:mm_page_alloc_zone_locked >> >> 8 kmem:mm_page_free_batched >> >> 107289 kmem:mm_page_pcpu_drain >> >> 34 net:napi_gro_receive_entry >> >> >> As you can see, sync overhead is enormous. >> >> PCP-wise, a key improvement in such scenarios would be reached if we could >> (1) keep and handle the allocated page on same cpu, or (2) somehow get the >> page back to the allocating core's PCP in a fast-path, without going through >> the regular buddy allocator paths. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f197.google.com (mail-io0-f197.google.com [209.85.223.197]) by kanga.kvack.org (Postfix) with ESMTP id 0F8826B0005 for ; Mon, 23 Apr 2018 04:55:10 -0400 (EDT) Received: by mail-io0-f197.google.com with SMTP id x7-v6so13708966iob.21 for ; Mon, 23 Apr 2018 01:55:10 -0700 (PDT) Received: from EUR01-VE1-obe.outbound.protection.outlook.com (mail-ve1eur01on0050.outbound.protection.outlook.com. 
[104.47.1.50]) by mx.google.com with ESMTPS id d7-v6si6364215itf.77.2018.04.23.01.55.08 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Mon, 23 Apr 2018 01:55:08 -0700 (PDT) Subject: Re: Page allocator bottleneck From: Tariq Toukan References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> <20180421081505.GA24916@intel.com> <127df719-b978-60b7-5d77-3c8efbf2ecff@mellanox.com> Message-ID: <0dea4da6-8756-22d4-c586-267217a5fa63@mellanox.com> Date: Mon, 23 Apr 2018 11:54:57 +0300 MIME-Version: 1.0 In-Reply-To: <127df719-b978-60b7-5d77-3c8efbf2ecff@mellanox.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu Cc: Linux Kernel Network Developers , linux-mm , Mel Gorman , David Miller , Jesper Dangaard Brouer , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Andrew Morton , Michal Hocko On 22/04/2018 7:43 PM, Tariq Toukan wrote: > > > On 21/04/2018 11:15 AM, Aaron Lu wrote: >> Sorry to bring up an old thread... >> > > I want to thank you very much for bringing this up! > >> On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote: >>> >>> >>> On 18/09/2017 12:16 PM, Tariq Toukan wrote: >>>> >>>> >>>> On 15/09/2017 1:23 PM, Mel Gorman wrote: >>>>> On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote: >>>>>> Insights: Major degradation between #1 and #2, not getting any >>>>>> close to linerate! Degradation is fixed between #2 and #3. This is >>>>>> because page allocator cannot stand the higher allocation rate. In >>>>>> #2, we also see that the addition of rings (cores) reduces BW (!!), >>>>>> as result of increasing congestion over shared resources. >>>>>> >>>>> >>>>> Unfortunately, no surprises there. >>>>> >>>>>> Congestion in this case is very clear. 
When monitored in perf >>>>>> top: 85.58% [kernel] [k] queued_spin_lock_slowpath >>>>>> >>>>> >>>>> While it's not proven, the most likely candidate is the zone lock >>>>> and that should be confirmed using a call-graph profile. If so, then >>>>> the suggestion to tune to the size of the per-cpu allocator would >>>>> mitigate the problem. >>>>> >>>> Indeed, I tuned the per-cpu allocator and bottleneck is released. >>>> >>> >>> Hi all, >>> >>> After leaving this task for a while doing other tasks, I got back to >>> it now >>> and see that the good behavior I observed earlier was not stable. >> >> I posted a patchset to improve zone->lock contention for order-0 pages >> recently, it can almost eliminate 80% zone->lock contention for >> will-it-scale/page_fault1 testcase when tested on a 2 sockets Intel >> Skylake server and it doesn't require PCP size tune, so should have >> some effects on your workload where one CPU does allocation while >> another does free. >> > > That is great news. In our driver's memory scheme (and many others as > well) we allocate only order-0 pages (the only flow that does not do > that yet in upstream will do so very soon, we already have the patches > in our internal branch). > Allocation of order-0 pages is not only the common case, but is the only > type of allocation in our data-path. Let's optimize it! > > >> It did this by some disruptive changes: >> 1 on free path, it skipped doing merge(so could be bad for mixed >> workloads where both 4K and high order pages are needed); > > I think there are so many advantages to not using high order > allocations, especially in production servers that are not rebooted for > long periods and become fragmented. > AFAIK, the community direction (at least in networking) is using order-0 > pages in datapath, so optimizing their allocation is a very good idea. > Need of course to perf evaluate possible degradations, and see how > important these use cases are.
> >> 2 on allocation path, it avoided touching multiple cachelines. >> > > Great! > >> RFC v2 patchset: >> https://lkml.org/lkml/2018/3/20/171 >> >> repo: >> https://github.com/aaronlu/linux zone_lock_rfc_v2 >> > > I will check them out first thing tomorrow! > > p.s., I will be on vacation for a week starting Tuesday. > I hope I can make some progress before that :) > > Thanks, > Tariq > Hi, I ran my tests with your patches. Initial BW numbers are significantly higher than I documented back then in this mail-thread. For example, in driver #2 (see original mail thread), with 6 rings, I now get 92Gbps (slightly less than linerate) in comparison to 64Gbps back then. However, there were many kernel changes since then, I need to isolate your changes. I am not sure I can finish this today, but I will surely get to it next week after I'm back from vacation. Still, when I increase the scale (more rings, i.e. more cpus), I see that queued_spin_lock_slowpath gets to 60%+ cpu. Still high, but lower than it used to be. This should be root solved by the (orthogonal) changes planned in network subsystem, which will change the SKB allocation/free scheme so that SKBs are released on the originating cpu. Thanks, Tariq >>> Recall: I work with a modified driver that allocates a page (4K) per >>> packet >>> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps >>> NICs. >>> >>> Performance is good as long as pages are available in the allocating >>> cores's >>> PCP. >>> Issue is that pages are allocated in one core, then free'd in another, >>> making it's hard for the PCP to work efficiently, and both the allocator >>> core and the freeing core need to access the buddy allocator very often. 
>>> >>> I'd like to share with you some testing numbers: >>> >>> Test: ./super_netperf 128 -H 24.134.0.51 -l 1000 >>> >>> 100% cpu on all cores, top func in perf: >>> 84.98% [kernel] [k] queued_spin_lock_slowpath >>> >>> system wide (all cores) >>> 1135941 kmem:mm_page_alloc >>> >>> 2606629 kmem:mm_page_free >>> >>> 0 kmem:mm_page_alloc_extfrag >>> 4784616 kmem:mm_page_alloc_zone_locked >>> >>> 1337 kmem:mm_page_free_batched >>> >>> 6488213 kmem:mm_page_pcpu_drain >>> >>> 8925503 net:napi_gro_receive_entry >>> >>> >>> Two types of cores: >>> A core mostly running napi (8 such cores): >>> 221875 kmem:mm_page_alloc >>> >>> 17100 kmem:mm_page_free >>> >>> 0 kmem:mm_page_alloc_extfrag >>> 766584 kmem:mm_page_alloc_zone_locked >>> >>> 16 kmem:mm_page_free_batched >>> >>> 35 kmem:mm_page_pcpu_drain >>> >>> 1340139 net:napi_gro_receive_entry >>> >>> >>> Other core, mostly running user application (40 such): >>> 2 kmem:mm_page_alloc >>> >>> 38922 kmem:mm_page_free >>> >>> 0 kmem:mm_page_alloc_extfrag >>> 1 kmem:mm_page_alloc_zone_locked >>> >>> 8 kmem:mm_page_free_batched >>> >>> 107289 kmem:mm_page_pcpu_drain >>> >>> 34 net:napi_gro_receive_entry >>> >>> >>> As you can see, sync
overhead is enormous. >>> >>> PCP-wise, a key improvement in such scenarios would be reached if we >>> could >>> (1) keep and handle the allocated page on same cpu, or (2) somehow >>> get the >>> page back to the allocating core's PCP in a fast-path, without going >>> through >>> the regular buddy allocator paths. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f70.google.com (mail-pg0-f70.google.com [74.125.83.70]) by kanga.kvack.org (Postfix) with ESMTP id ABAC16B0003 for ; Mon, 23 Apr 2018 09:10:12 -0400 (EDT) Received: by mail-pg0-f70.google.com with SMTP id i127so6483344pgc.22 for ; Mon, 23 Apr 2018 06:10:12 -0700 (PDT) Received: from mga09.intel.com (mga09.intel.com. [134.134.136.24]) by mx.google.com with ESMTPS id z125si11341335pfz.335.2018.04.23.06.10.11 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 23 Apr 2018 06:10:11 -0700 (PDT) Date: Mon, 23 Apr 2018 21:10:33 +0800 From: Aaron Lu Subject: Re: Page allocator bottleneck Message-ID: <20180423131033.GA13792@intel.com> References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> <20180421081505.GA24916@intel.com> <127df719-b978-60b7-5d77-3c8efbf2ecff@mellanox.com> <0dea4da6-8756-22d4-c586-267217a5fa63@mellanox.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <0dea4da6-8756-22d4-c586-267217a5fa63@mellanox.com> Sender: owner-linux-mm@kvack.org List-ID: To: Tariq Toukan Cc: Linux Kernel Network Developers , linux-mm , Mel Gorman , David Miller , Jesper Dangaard Brouer , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Andrew Morton , Michal Hocko On Mon, Apr 23, 2018 at 11:54:57AM +0300, Tariq Toukan wrote: > Hi, > > I ran my tests with your patches. > Initial BW numbers are significantly higher than I documented back then in > this mail-thread. 
> For example, in driver #2 (see original mail thread), with 6 rings, I now > get 92Gbps (slightly less than linerate) in comparison to 64Gbps back then. > > However, there were many kernel changes since then, I need to isolate your > changes. I am not sure I can finish this today, but I will surely get to it > next week after I'm back from vacation. > > Still, when I increase the scale (more rings, i.e. more cpus), I see that > queued_spin_lock_slowpath gets to 60%+ cpu. Still high, but lower than it > used to be. I wonder if it is on allocation path or free path? Also, increasing PCP size through vm.percpu_pagelist_fraction would still help with my patches since it can avoid touching even more cache lines on allocation path with a higher PCP->batch(which has an upper limit of 96 though at the moment). > > This should be root solved by the (orthogonal) changes planned in network > subsystem, which will change the SKB allocation/free scheme so that SKBs are > released on the originating cpu. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f70.google.com (mail-pg0-f70.google.com [74.125.83.70]) by kanga.kvack.org (Postfix) with ESMTP id 7151A6B0005 for ; Fri, 27 Apr 2018 04:44:40 -0400 (EDT) Received: by mail-pg0-f70.google.com with SMTP id j6-v6so1109928pgn.7 for ; Fri, 27 Apr 2018 01:44:40 -0700 (PDT) Received: from mga18.intel.com (mga18.intel.com. 
[134.134.136.126]) by mx.google.com with ESMTPS id h8-v6si882160pln.54.2018.04.27.01.44.38 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 27 Apr 2018 01:44:38 -0700 (PDT) Date: Fri, 27 Apr 2018 16:45:58 +0800 From: Aaron Lu Subject: Re: Page allocator bottleneck Message-ID: <20180427084558.GB4009@intel.com> References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> <20180421081505.GA24916@intel.com> <127df719-b978-60b7-5d77-3c8efbf2ecff@mellanox.com> <0dea4da6-8756-22d4-c586-267217a5fa63@mellanox.com> <20180423131033.GA13792@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180423131033.GA13792@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Tariq Toukan Cc: Linux Kernel Network Developers , linux-mm , Mel Gorman , David Miller , Jesper Dangaard Brouer , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Andrew Morton , Michal Hocko On Mon, Apr 23, 2018 at 09:10:33PM +0800, Aaron Lu wrote: > On Mon, Apr 23, 2018 at 11:54:57AM +0300, Tariq Toukan wrote: > > Hi, > > > > I ran my tests with your patches. > > Initial BW numbers are significantly higher than I documented back then in > > this mail-thread. > > For example, in driver #2 (see original mail thread), with 6 rings, I now > > get 92Gbps (slightly less than linerate) in comparison to 64Gbps back then. > > > > However, there were many kernel changes since then, I need to isolate your > > changes. I am not sure I can finish this today, but I will surely get to it > > next week after I'm back from vacation. > > > > Still, when I increase the scale (more rings, i.e. more cpus), I see that > > queued_spin_lock_slowpath gets to 60%+ cpu. Still high, but lower than it > > used to be. > > I wonder if it is on allocation path or free path? Just FYI, I have pushed two more commits on top of the branch. 
They should improve free path zone lock contention for MIGRATE_UNMOVABLE pages(most kernel code alloc such pages), you may consider apply them if free path contention is a problem. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id 94DF06B0005 for ; Wed, 2 May 2018 09:38:43 -0400 (EDT) Received: by mail-pf0-f200.google.com with SMTP id k3so12921588pff.23 for ; Wed, 02 May 2018 06:38:43 -0700 (PDT) Received: from EUR02-AM5-obe.outbound.protection.outlook.com (mail-eopbgr00073.outbound.protection.outlook.com. [40.107.0.73]) by mx.google.com with ESMTPS id y23si11639997pff.177.2018.05.02.06.38.42 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Wed, 02 May 2018 06:38:42 -0700 (PDT) Subject: Re: Page allocator bottleneck References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> <20180421081505.GA24916@intel.com> <127df719-b978-60b7-5d77-3c8efbf2ecff@mellanox.com> <0dea4da6-8756-22d4-c586-267217a5fa63@mellanox.com> <20180423131033.GA13792@intel.com> <20180427084558.GB4009@intel.com> From: Tariq Toukan Message-ID: Date: Wed, 2 May 2018 16:38:31 +0300 MIME-Version: 1.0 In-Reply-To: <20180427084558.GB4009@intel.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu , Tariq Toukan Cc: Linux Kernel Network Developers , linux-mm , Mel Gorman , David Miller , Jesper Dangaard Brouer , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Andrew Morton , Michal Hocko On 27/04/2018 11:45 AM, Aaron Lu wrote: > On Mon, Apr 23, 2018 at 09:10:33PM +0800, Aaron Lu wrote: >> On Mon, Apr 23, 2018 at 11:54:57AM +0300, Tariq Toukan wrote: >>> Hi, >>> >>> I ran my tests with your patches. 
>>> Initial BW numbers are significantly higher than I documented back then in >>> this mail-thread. >>> For example, in driver #2 (see original mail thread), with 6 rings, I now >>> get 92Gbps (slightly less than linerate) in comparison to 64Gbps back then. >>> >>> However, there were many kernel changes since then, I need to isolate your >>> changes. I am not sure I can finish this today, but I will surely get to it >>> next week after I'm back from vacation. >>> >>> Still, when I increase the scale (more rings, i.e. more cpus), I see that >>> queued_spin_lock_slowpath gets to 60%+ cpu. Still high, but lower than it >>> used to be. >> >> I wonder if it is on allocation path or free path? > > Just FYI, I have pushed two more commits on top of the branch. > They should improve free path zone lock contention for MIGRATE_UNMOVABLE > pages(most kernel code alloc such pages), you may consider apply them if > free path contention is a problem. > Hi Aaron, Thanks for the update, I did not analyze the contention yet. I am back in office and will start testing soon. 
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tariq Toukan Subject: Page allocator bottleneck Date: Thu, 14 Sep 2017 19:49:31 +0300 Message-ID: Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit To: David Miller , Jesper Dangaard Brouer , Mel Gorman , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Linux Kernel Network Developers , Andrew Morton , Michal Hocko , linux-mm Return-path: Received: from mail-db5eur01on0070.outbound.protection.outlook.com ([104.47.2.70]:60800 "EHLO EUR01-DB5-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1751434AbdINQtn (ORCPT ); Thu, 14 Sep 2017 12:49:43 -0400 Content-Language: en-US Sender: netdev-owner@vger.kernel.org List-ID: Hi all, As part of the efforts to support increasing next-generation NIC speeds, I am investigating SW bottlenecks in network stack receive flow. Here I share some numbers I got for a simple experiment, in which I simulate the page allocation rate needed in 200Gbps NICs. I ran the test below over 3 different (modified) mlx5 driver versions, loaded on server side (RX): 1) RX page cache disabled, 2 packets per page. 2) RX page cache disabled, one packet per page. 3) Huge RX page cache, one packet per page. All page allocations are of order 0. NIC: ConnectX-5 100 Gbps. CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Test: 128 TCP streams (using super_netperf). Changing num of RX queues. HW LRO OFF, GRO ON, MTU 1500. Observe: BW as a function of num RX queues.
Results:

Driver #1:
#rings    BW (Mbps)
1         23,813
2         44,086
3         62,128
4         78,058
6         94,210 (linerate)
8         94,205 (linerate)
12        94,202 (linerate)
16        94,191 (linerate)

Driver #2:
#rings    BW (Mbps)
1         18,835
2         36,716
3         50,521
4         61,746
6         63,637
8         60,299
12        51,048
16        43,337

Driver #3:
#rings    BW (Mbps)
1         19,316
2         44,850
3         69,549
4         87,434
6         94,342 (linerate)
8         94,350 (linerate)
12        94,327 (linerate)
16        94,327 (linerate)

Insights:
Major degradation between #1 and #2, not getting any close to linerate!
Degradation is fixed between #2 and #3.
This is because page allocator cannot stand the higher allocation rate.
In #2, we also see that the addition of rings (cores) reduces BW (!!), as result of increasing congestion over shared resources.

Congestion in this case is very clear.
When monitored in perf top:
85.58% [kernel] [k] queued_spin_lock_slowpath

I think that page allocator issues should be discussed separately:
1) Rate: Increase the allocation rate on a single core.
2) Scalability: Reduce congestion and sync overhead between cores.

This is clearly the current bottleneck in the network stack receive flow.

I know about some efforts that were made in the past two years.
For example the ones from Jesper et al.:
- Page-pool (not accepted AFAIK).
- Page-allocation bulking.
- Optimize order-0 allocations in Per-Cpu-Pages.

I am not an mm expert, but wanted to raise the issue again, to combine the efforts and hear from you guys about status and possible directions.
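A back-of-the-envelope sketch of why one packet per page on a 100 Gbps NIC can stand in for a 200 Gbps NIC. The helper name toy_pages_per_sec is hypothetical, and the arithmetic assumes every packet is a full 1500-byte MTU frame, ignores Ethernet framing overhead, and assumes no page reuse (RX page cache disabled), so the results are rough upper bounds rather than measurements.

```c
#include <stdint.h>

/*
 * Pages per second the RX path must obtain from the page allocator.
 * Assumptions (not from the original mail): full-size packets, no
 * framing overhead, no page recycling.
 */
static uint64_t toy_pages_per_sec(uint64_t link_bps,
				  uint64_t bytes_per_packet,
				  uint64_t packets_per_page)
{
	uint64_t packets_per_sec = link_bps / (8 * bytes_per_packet);

	return packets_per_sec / packets_per_page;
}
```

With these assumptions, 100 Gbps at one packet per page (driver #2) and 200 Gbps at two packets per page both come out at about 8.3 million order-0 pages per second, while 100 Gbps at two packets per page (driver #1) needs about 4.2 million.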
Best regards, Tariq Toukan From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andi Kleen Subject: Re: Page allocator bottleneck Date: Thu, 14 Sep 2017 13:19:17 -0700 Message-ID: <87vaklyqwq.fsf@linux.intel.com> References: Mime-Version: 1.0 Content-Type: text/plain Cc: David Miller , Jesper Dangaard Brouer , Mel Gorman , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Linux Kernel Network Developers , Andrew Morton , Michal Hocko , linux-mm To: Tariq Toukan Return-path: Received: from mga06.intel.com ([134.134.136.31]:45246 "EHLO mga06.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751342AbdINUTS (ORCPT ); Thu, 14 Sep 2017 16:19:18 -0400 In-Reply-To: (Tariq Toukan's message of "Thu, 14 Sep 2017 19:49:31 +0300") Sender: netdev-owner@vger.kernel.org List-ID: Tariq Toukan writes: > > Congestion in this case is very clear. > When monitored in perf top: > 85.58% [kernel] [k] queued_spin_lock_slowpath Please look at the callers. Spinlock profiles without callers are usually useless because it's just blaming the messenger. Most likely the PCP lists are too small for your extreme allocation rate, so it goes back too often to the shared pool. You can play with the vm.percpu_pagelist_fraction setting. 
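A rough sketch of how the PCP limits discussed in this thread are derived. The toy_* names and TOY_PAGE_SHIFT are hypothetical; the default rule (high = 6 * batch) and the cap of 96 on batch are stated by Aaron Lu later in the thread, while the high / 4 step is my reading of mm/page_alloc.c in kernels of that period and should be treated as an assumption.

```c
#define TOY_PAGE_SHIFT 12 /* assumption: 4K pages, so the cap is 12 * 8 = 96 */

struct toy_pcp_limits {
	unsigned long high;  /* drain threshold of the per-CPU list */
	unsigned long batch; /* pages moved per refill or drain */
};

/* Default path: batch is derived from the zone size, high is 6 * batch. */
static struct toy_pcp_limits toy_limits_from_batch(unsigned long batch)
{
	struct toy_pcp_limits l;

	l.high = 6 * batch;
	l.batch = batch ? batch : 1;
	return l;
}

/*
 * vm.percpu_pagelist_fraction path: high grows with the zone size,
 * but batch is capped (assumed: high / 4, at most TOY_PAGE_SHIFT * 8).
 * The caller must pass a non-zero fraction.
 */
static struct toy_pcp_limits toy_limits_from_fraction(unsigned long managed_pages,
						      unsigned long fraction)
{
	struct toy_pcp_limits l;
	unsigned long cap = TOY_PAGE_SHIFT * 8;

	l.high = managed_pages / fraction;
	l.batch = l.high / 4;
	if (l.batch > cap)
		l.batch = cap;
	if (l.batch < 1)
		l.batch = 1;
	return l;
}
```

For example, a zone with 4,000,000 managed pages and a fraction of 8 would get high = 500000 in this model while batch stays at 96. This is why tuning the fraction alone enlarges the per-CPU list without changing how many pages move per zone lock acquisition.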
-Andi From mboxrd@z Thu Jan 1 00:00:00 1970 From: Aaron Lu Subject: Re: Page allocator bottleneck Date: Tue, 19 Sep 2017 15:23:43 +0800 Message-ID: <20170919072342.GB7263@intel.com> References: <20170915092839.690ea9e9@redhat.com> <6069fd36-ed0e-145c-3134-35232bf951a7@mellanox.com> <20170918073447.GB4107@intel.com> <20170918074404.GD4107@intel.com> <082e7901-7842-e9d9-221d-45322da0fcff@mellanox.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="ew6BAiZeqk4r7MaW" Cc: Jesper Dangaard Brouer , David Miller , Mel Gorman , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Linux Kernel Network Developers , Andrew Morton , Michal Hocko , linux-mm , Dave Hansen To: Tariq Toukan Return-path: Content-Disposition: inline In-Reply-To: <082e7901-7842-e9d9-221d-45322da0fcff@mellanox.com> Sender: owner-linux-mm@kvack.org List-Id: netdev.vger.kernel.org --ew6BAiZeqk4r7MaW Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Mon, Sep 18, 2017 at 06:33:20PM +0300, Tariq Toukan wrote: > > > On 18/09/2017 10:44 AM, Aaron Lu wrote: > > On Mon, Sep 18, 2017 at 03:34:47PM +0800, Aaron Lu wrote: > > > On Sun, Sep 17, 2017 at 07:16:15PM +0300, Tariq Toukan wrote: > > > > > > > > It's nice to have the option to dynamically play with the parameter. > > > > But maybe we should also think of changing the default fraction guaranteed > > > > to the PCP, so that unaware admins of networking servers would also benefit. > > > > > > I collected some performance data with will-it-scale/page_fault1 process > > > mode on different machines with different pcp->batch sizes, starting > > > from the default 31(calculated by zone_batchsize(), 31 is the standard > > > value for any zone that has more than 1/2MiB memory), then incremented > > > by 31 upwards till 527. PCP's upper limit is 6*batch. > > > > > > An image is plotted and attached: batch_full.png(full here means the > > > number of process started equals to CPU number). 
> > > > To be clear: X-axis is the value of batch size(31, 62, 93, ..., 527), > > Y-axis is the value of per_process_ops, generated by will-it-scale, One correction here, Y-axis isn't per_process_ops but per_process_ops * nr_processes. Still, higher is better. > > higher is better. > > > > > > > > From the image: > > > - For EX machines, they all see throughput increase with increased batch > > > size and peaked at around batch_size=310, then fall; > > > - For EP machines, Haswell-EP and Broadwell-EP also see throughput > > > increase with increased batch size and peaked at batch_size=279, then > > > fall, batch_size=310 also delivers pretty good result. Skylake-EP is > > > quite different in that it doesn't see any obvious throughput increase > > > after batch_size=93, though the trend is still increasing, but in a very > > > small way and finally peaked at batch_size=403, then fall. > > > Ivybridge EP behaves much like desktop ones. > > > - For Desktop machines, they do not see any obvious changes with > > > increased batch_size. > > > > > > So the default batch size(31) doesn't deliver good enough result, we > > > probably should change the default value. > > Thanks Aaron for sharing your experiment results. > That's a good analysis of the effect of the batch value. > I agree with your conclusion. > > From networking perspective, we should reconsider the defaults to be able to > reach the increasing NICs linerates. > Not only for pcp->batch, but also for pcp->high. I guess I didn't make it clear in my last email: when pcp->batch is changed, pcp->high is also changed. Their relationship is: pcp->high = pcp->batch * 6. Manipulating percpu_pagelist_fraction could increase pcp->high, but not pcp->batch (it has an upper limit of 96 currently). My test shows that even with pcp->high being the same, changing pcp->batch could further improve will-it-scale's performance. e.g.
in the below two cases, pcp->high are both set to 1860 but with different pcp->batch:

                  will-it-scale       native_queued_spin_lock_slowpath(perf)
pcp->batch=96     15762348            79.95%
pcp->batch=310    19291492 +22.3%     74.87% -5.1%

Granted, this is the case for will-it-scale and may not apply to your case. I have a small patch that adds a batch interface for debug purpose, echo a value could set batch and high will be batch * 6. You are welcome to give it a try if you think it's worth(attached).

Regards,
Aaron

--ew6BAiZeqk4r7MaW
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="0001-percpu_pagelist_batch-add-a-batch-interface.patch"

>>From e3c9516beb8302cb8fb2f5ab866bbe2686fda5fb Mon Sep 17 00:00:00 2001
From: Aaron Lu
Date: Thu, 6 Jul 2017 15:00:07 +0800
Subject: [PATCH] percpu_pagelist_batch: add a batch interface

Signed-off-by: Aaron Lu
---
 include/linux/mmzone.h |  2 ++
 kernel/sysctl.c        |  9 +++++++++
 mm/page_alloc.c        | 40 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ef6a13b7bd3e..0548d038b7cd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -875,6 +875,8 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
+int percpu_pagelist_batch_sysctl_handler(struct ctl_table *, int,
+					void __user *, size_t *, loff_t *);
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4dfba1a76cc3..85cc4544db1b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -108,6 +108,7 @@ extern unsigned int core_pipe_limit;
 extern int pid_max;
 extern int pid_max_min, pid_max_max;
 extern int percpu_pagelist_fraction;
+extern int percpu_pagelist_batch;
 extern int latencytop_enabled;
 extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
 #ifndef CONFIG_MMU
@@ -1440,6 +1441,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= percpu_pagelist_fraction_sysctl_handler,
 		.extra1		= &zero,
 	},
+	{
+		.procname	= "percpu_pagelist_batch",
+		.data		= &percpu_pagelist_batch,
+		.maxlen		= sizeof(percpu_pagelist_batch),
+		.mode		= 0644,
+		.proc_handler	= percpu_pagelist_batch_sysctl_handler,
+		.extra1		= &zero,
+	},
 #ifdef CONFIG_MMU
 	{
 		.procname	= "max_map_count",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2302f250d6b1..aa96a4bd6467 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -129,6 +129,7 @@ unsigned long totalreserve_pages __read_mostly;
 unsigned long totalcma_pages __read_mostly;
 
 int percpu_pagelist_fraction;
+int percpu_pagelist_batch;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 
 /*
@@ -5477,7 +5478,8 @@ static void pageset_set_high_and_batch(struct zone *zone,
 			(zone->managed_pages /
 				percpu_pagelist_fraction));
 	else
-		pageset_set_batch(pcp, zone_batchsize(zone));
+		pageset_set_batch(pcp, percpu_pagelist_batch ?
+				percpu_pagelist_batch : zone_batchsize(zone));
 }
 
 static void __meminit zone_pageset_init(struct zone *zone, int cpu)
@@ -7157,6 +7159,42 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
 	return ret;
 }
 
+int percpu_pagelist_batch_sysctl_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	struct zone *zone;
+	int old_percpu_pagelist_batch;
+	int ret;
+
+	mutex_lock(&pcp_batch_high_lock);
+	old_percpu_pagelist_batch = percpu_pagelist_batch;
+
+	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (!write || ret < 0)
+		goto out;
+
+	/* Sanity checking to avoid pcp imbalance */
+	if (percpu_pagelist_batch <= 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* No change? */
+	if (percpu_pagelist_batch == old_percpu_pagelist_batch)
+		goto out;
+
+	for_each_populated_zone(zone) {
+		unsigned int cpu;
+
+		for_each_possible_cpu(cpu)
+			pageset_set_high_and_batch(zone,
+					per_cpu_ptr(zone->pageset, cpu));
+	}
+out:
+	mutex_unlock(&pcp_batch_high_lock);
+	return ret;
+}
+
 #ifdef CONFIG_NUMA
 int hashdist = HASHDIST_DEFAULT;
 
-- 
2.9.5

--ew6BAiZeqk4r7MaW--

-- 
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tariq Toukan Subject: Re: Page allocator bottleneck Date: Thu, 2 Nov 2017 19:21:09 +0200 Message-ID: <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15; format=flowed Content-Transfer-Encoding: 7bit Cc: Mel Gorman , David Miller , Jesper Dangaard Brouer , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Andrew Morton , Michal Hocko To: Tariq Toukan , Linux Kernel Network Developers , linux-mm Return-path: Received: from mail-ve1eur01on0049.outbound.protection.outlook.com ([104.47.1.49]:47088 "EHLO EUR01-VE1-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1754004AbdKBRVT (ORCPT ); Thu, 2 Nov 2017 13:21:19 -0400 In-Reply-To: Content-Language: en-US Sender: netdev-owner@vger.kernel.org List-ID: On 18/09/2017 12:16 PM, Tariq Toukan wrote: > > > On 15/09/2017 1:23 PM, Mel Gorman wrote: >> On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote: >>> Insights: Major degradation between #1 and #2, not getting any >>> close to linerate! Degradation is fixed between #2 and #3. This is >>> because page allocator cannot stand the higher allocation rate.
>>> In #2, we also see that the addition of rings (cores) reduces BW (!!),
>>> as result of increasing congestion over shared resources.
>>>
>>
>> Unfortunately, no surprises there.
>>
>>> Congestion in this case is very clear. When monitored in perf top:
>>> 85.58% [kernel] [k] queued_spin_lock_slowpath
>>>
>>
>> While it's not proven, the most likely candidate is the zone lock
>> and that should be confirmed using a call-graph profile. If so, then
>> the suggestion to tune to the size of the per-cpu allocator would
>> mitigate the problem.
>>
> Indeed, I tuned the per-cpu allocator and bottleneck is released.
>

Hi all,

After leaving this task for a while to work on other things, I got back to
it now and see that the good behavior I observed earlier was not stable.

Recall: I work with a modified driver that allocates a page (4K) per packet
(MTU=1500), in order to simulate the stress on the page allocator in 200Gbps
NICs.

Performance is good as long as pages are available in the allocating core's
PCP.
The issue is that pages are allocated on one core, then freed on another,
making it hard for the PCP to work efficiently, and both the allocating
core and the freeing core need to access the buddy allocator very often.
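
To illustrate the cross-core pattern described above, here is a toy,
single-threaded Python model (not kernel code). It assumes that a per-CPU
list pulls `batch` pages from the buddy allocator under the zone lock when
it runs empty, and hands `batch` pages back under the zone lock once it
holds `high` pages. The names ToyPcp and count_zone_lock_trips, and the
values batch=31 and high=186 (i.e. batch * 6), are illustrative assumptions.
The model only counts zone-lock acquisitions; it does not model contention
between cores.

```python
class ToyPcp:
    """Toy per-CPU page list: only tracks how many pages it holds."""

    def __init__(self, batch, high):
        self.batch = batch  # pages moved per trip to the buddy allocator
        self.high = high    # list length that triggers a drain
        self.count = 0      # pages currently on the list


def count_zone_lock_trips(n_pages, batch, high, same_cpu):
    """Count zone-lock acquisitions for n_pages alloc+free pairs.

    same_cpu=True:  the allocating CPU also frees the page.
    same_cpu=False: one CPU allocates, a different CPU frees.
    """
    alloc_pcp = ToyPcp(batch, high)
    free_pcp = alloc_pcp if same_cpu else ToyPcp(batch, high)
    trips = 0
    for _ in range(n_pages):
        # Allocation: refill from the buddy allocator if the list is empty.
        if alloc_pcp.count == 0:
            trips += 1
            alloc_pcp.count += alloc_pcp.batch
        alloc_pcp.count -= 1
        # Free: put the page on the freeing CPU's list, drain if too long.
        free_pcp.count += 1
        if free_pcp.count >= free_pcp.high:
            trips += 1
            free_pcp.count -= free_pcp.batch
    return trips


if __name__ == "__main__":
    for same_cpu in (True, False):
        trips = count_zone_lock_trips(100_000, batch=31, high=186,
                                      same_cpu=same_cpu)
        print(f"same_cpu={same_cpu}: {trips} zone-lock acquisitions")
```

In this model the same-CPU case takes the zone lock only for the initial
refill, while the cross-CPU case takes it roughly 2 * n_pages / batch times
(one refill and one drain per batch). With several allocating and freeing
cores, all of those acquisitions for a given zone would land on the same
lock.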

I'd like to share with you some testing numbers:

Test: ./super_netperf 128 -H 24.134.0.51 -l 1000

100% cpu on all cores, top func in perf:
    84.98%  [kernel]             [k] queued_spin_lock_slowpath

system wide (all cores)
            1135941      kmem:mm_page_alloc
            2606629      kmem:mm_page_free
                  0      kmem:mm_page_alloc_extfrag
            4784616      kmem:mm_page_alloc_zone_locked
               1337      kmem:mm_page_free_batched
            6488213      kmem:mm_page_pcpu_drain
            8925503      net:napi_gro_receive_entry

Two types of cores:
A core mostly running napi (8 such cores):
             221875      kmem:mm_page_alloc
              17100      kmem:mm_page_free
                  0      kmem:mm_page_alloc_extfrag
             766584      kmem:mm_page_alloc_zone_locked
                 16      kmem:mm_page_free_batched
                 35      kmem:mm_page_pcpu_drain
            1340139      net:napi_gro_receive_entry

Other core, mostly running user application (40 such):
                  2      kmem:mm_page_alloc
              38922      kmem:mm_page_free
                  0      kmem:mm_page_alloc_extfrag
                  1      kmem:mm_page_alloc_zone_locked
                  8      kmem:mm_page_free_batched
             107289      kmem:mm_page_pcpu_drain
                 34      net:napi_gro_receive_entry

As you can see, sync overhead is enormous.

PCP-wise, a key improvement in such scenarios would be reached if we could
(1) keep and handle the allocated page on same cpu, or (2) somehow get the
page back to the allocating core's PCP in a fast-path, without going through
the regular buddy allocator paths.

Regards,
Tariq

>>> I think that page allocator issues should be discussed separately: 1)
>>> Rate: Increase the allocation rate on a single core. 2)
>>> Scalability: Reduce congestion and sync overhead between cores.
>>>
>>> This is clearly the current bottleneck in the network stack receive
>>> flow.
>>>
>>> I know about some efforts that were made in the past two years. For
>>> example the ones from Jesper et al.: - Page-pool (not accepted
>>> AFAIK).
>>
>> Indeed not and it would also need driver conversion.
>>
>>> - Page-allocation bulking.
>>
>> Prototypes exist but it's pointless without the pool or driver
>> conversion so it's in the back burner for the moment.
>> > > As I already mentioned in another reply (to Jesper), this would > perfectly fit with our Striding RQ feature, as we have large descriptors > that serve several packets, requiring the allocation of several pages at > once. I'd gladly move to using the bulking API. > >>> - Optimize order-0 allocations in Per-Cpu-Pages. >>> >> >> This had a prototype that was reverted as it must be able to cope >> with both irq and noirq contexts. > Yeah, I remember that I tested and reported the issue. > > Unfortunately I never found the time to >> revisit it but a split there to handle both would mitigate the >> problem. Probably not enough to actually reach line speed though so >> tuning of the per-cpu allocator sizes would still be needed. I don't >> know when I'll get the chance to revisit it. I'm travelling all next >> week and am mostly occupied with other work at the moment that is >> consuming all my concentration. >> >>> I am not an mm expert, but wanted to raise the issue again, to >>> combine the efforts and hear from you guys about status and >>> possible directions. >> >> The recent effort to reduce overhead from stats will help mitigate >> the problem. > I should get more familiar with these stats, check how costly they are, > and whether they can be turned off in Kconfig. > >> Finishing the page pool, the bulk allocator and converting drivers >> would be the most likely successful path forward but it's currently >> stalled as everyone that was previously involved is too busy. >> > I think we should consider changing the default allocation of PCP > fraction as well, or implement some smart dynamic heuristic. > This turned on to have significant effect over networking performance. > > Many thanks Mel! 
> > Regards, > Tariq From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tariq Toukan Subject: Re: Page allocator bottleneck Date: Wed, 8 Nov 2017 14:42:04 +0900 Message-ID: References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> <20171103134020.3hwquerifnc6k6qw@techsingularity.net> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15; format=flowed Content-Transfer-Encoding: 7bit Cc: Linux Kernel Network Developers , linux-mm , David Miller , Jesper Dangaard Brouer , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Andrew Morton , Michal Hocko To: Mel Gorman , Tariq Toukan Return-path: Received: from mail-eopbgr00059.outbound.protection.outlook.com ([40.107.0.59]:39648 "EHLO EUR02-AM5-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1753211AbdKHFmV (ORCPT ); Wed, 8 Nov 2017 00:42:21 -0500 In-Reply-To: <20171103134020.3hwquerifnc6k6qw@techsingularity.net> Content-Language: en-US Sender: netdev-owner@vger.kernel.org List-ID: On 03/11/2017 10:40 PM, Mel Gorman wrote: > On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote: >> >> >> On 18/09/2017 12:16 PM, Tariq Toukan wrote: >>> >>> >>> On 15/09/2017 1:23 PM, Mel Gorman wrote: >>>> On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote: >>>>> Insights: Major degradation between #1 and #2, not getting any >>>>> close to linerate! Degradation is fixed between #2 and #3. This is >>>>> because page allocator cannot stand the higher allocation rate. In >>>>> #2, we also see that the addition of rings (cores) reduces BW (!!), >>>>> as result of increasing congestion over shared resources. >>>>> >>>> >>>> Unfortunately, no surprises there. >>>> >>>>> Congestion in this case is very clear. 
When monitored in perf >>>>> top: 85.58% [kernel] [k] queued_spin_lock_slowpath >>>>> >>>> >>>> While it's not proven, the most likely candidate is the zone lock >>>> and that should be confirmed using a call-graph profile. If so, then >>>> the suggestion to tune to the size of the per-cpu allocator would >>>> mitigate the problem. >>>> >>> Indeed, I tuned the per-cpu allocator and bottleneck is released. >>> >> >> Hi all, >> >> After leaving this task for a while doing other tasks, I got back to it now >> and see that the good behavior I observed earlier was not stable. >> >> Recall: I work with a modified driver that allocates a page (4K) per packet >> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps >> NICs. >> > > There is almost nothing new in the data that hasn't been discussed before. The > suggestion to free on a remote per-cpu list would be expensive as it would > require per-cpu lists to have a lock for safe remote access. That's right, but each such lock will be significantly less congested than the buddy allocator lock. In the flow under discussion, two cores need to synchronize (one allocates, one frees). We also need to evaluate the cost of acquiring and releasing the lock in the case of no congestion at all. > However, > I'd be curious if you could test the mm-pagealloc-irqpvec-v1r4 branch > https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git . It's an > unfinished prototype I worked on a few weeks ago. I was going to revisit > in about a month's time when 4.15-rc1 was out. I'd be interested in seeing > if it has a positive gain in normal page allocations without destroying > the performance of interrupt and softirq allocation contexts. The > interrupt/softirq context testing is crucial as that is something that > hurt us before when trying to improve page allocator performance. > Yes, I will test that once I get back to the office (after the netdev conference and vacation).
Can you please elaborate in a few words about the idea behind the prototype? Does it address page-allocator scalability issues, or only the rate of single core page allocations? From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jesper Dangaard Brouer Subject: Re: Page allocator bottleneck Date: Thu, 9 Nov 2017 06:21:01 +0100 Message-ID: <20171109062101.64bde3b6@redhat.com> References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> <20171103134020.3hwquerifnc6k6qw@techsingularity.net> <20171108093547.ctsjv4a42xjvfsf7@techsingularity.net> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: Tariq Toukan , Linux Kernel Network Developers , linux-mm , David Miller , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Andrew Morton , Michal Hocko , brouer@redhat.com, "Michael S. Tsirkin" To: Mel Gorman Return-path: Received: from mx1.redhat.com ([209.132.183.28]:43910 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750723AbdKIFVg (ORCPT ); Thu, 9 Nov 2017 00:21:36 -0500 In-Reply-To: <20171108093547.ctsjv4a42xjvfsf7@techsingularity.net> Sender: netdev-owner@vger.kernel.org List-ID: On Wed, 8 Nov 2017 09:35:47 +0000 Mel Gorman wrote: > On Wed, Nov 08, 2017 at 02:42:04PM +0900, Tariq Toukan wrote: > > > > Hi all, > > > > > > > > After leaving this task for a while doing other tasks, I got back to it now > > > > and see that the good behavior I observed earlier was not stable. > > > > > > > > Recall: I work with a modified driver that allocates a page (4K) per packet > > > > (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps > > > > NICs. > > > > > > > > > > There is almost new in the data that hasn't been discussed before. The > > > suggestion to free on a remote per-cpu list would be expensive as it would > > > require per-cpu lists to have a lock for safe remote access. 
> > > > That's right, but each such lock will be significantly less congested than > > the buddy allocator lock. > > That is not necessarily true if all the allocations and frees always happen > on the same CPUs. The contention will be equivalent to the zone lock. > Your point will only hold true if there are also heavy allocation streams > from other CPUs that are unrelated. > > > In the flow in subject two cores need to > > synchronize (one allocates, one frees). > > We also need to evaluate the cost of acquiring and releasing the lock in the > > case of no congestion at all. > > > > If the per-cpu structures have a lock, there will be a light amount of > overhead. Nothing too severe, but it shouldn't be done lightly either. > > > > However, > > > I'd be curious if you could test the mm-pagealloc-irqpvec-v1r4 branch > > > https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git . It's an > > > unfinished prototype I worked on a few weeks ago. I was going to revisit > > > in about a month's time when 4.15-rc1 was out. I'd be interested in seeing > > > if it has a positive gain in normal page allocations without destroying > > > the performance of interrupt and softirq allocation contexts. The > > > interrupt/softirq context testing is crucial as that is something that > > > hurt us before when trying to improve page allocator performance. > > > > > Yes, I will test that once I get back in office (after netdev conference and > > vacation). > > Thanks. I'll also commit to testing this (when I return home; like Tariq, I'm also in Seoul ATM). > > Can you please elaborate in a few words about the idea behind the prototype? > > Does it address page-allocator scalability issues, or only the rate of > > single core page allocations? > > Short answer -- maybe. All scalability issues or rates of allocation are > context and workload dependant so the question is impossible to answer > for the general case.
> > Broadly speaking, the patch reintroduces the per-cpu lists being for !irq > context allocations again. The last time we did this, hard and soft IRQ > allocations went through the buddy allocator which couldn't scale and > the patch was reverted. With this patch, it goes through a very large > pagevec-like structure that is protected by a lock but the fast paths > for alloc/free are extremely simple operations so the lock hold times are > very small. Potentially, a development path is that the current per-cpu > allocator is replaced with pagevec-like structures that are dynamically > allocated which would also allow pages to be freed to remote CPU lists I've had huge success using ptr_ring as a queue between CPUs to minimize cross-CPU cache-line touching, with the recently accepted BPF map called "cpumap" used for XDP_REDIRECT. It's important to handle the two borderline cases in ptr_ring: the queue being almost full (handled by default in ptr_ring) or almost empty, as described in [1], slide 14: [1] http://people.netfilter.org/hawk/presentations/NetConf2017_Seoul/XDP_devel_update_NetConf2017_Seoul.pdf The use of XDP_REDIRECT + cpumap does expose issues with the page allocator. E.g. slide 19 shows the ixgbe recycle scheme failing, but still hitting the PCP. Also notice slide 22, deducing the overhead. Scale stress-testing of ptr_ring is shown in extra slides 35-39. > (if we could detect when that is appropriate which is unclear). We could > also drain remote lists without using IPIs. The downside is that the memory > footprint of the allocator would be higher and the size could no longer > be tuned so there would need to be excellent justification for such a move. > > I haven't posted the patches properly yet because mmotm is carrying too > many patches as it is and this patch indirectly depends on the contents. I > also didn't write memory hot-remove support which would be a requirement > before merging.
I hadn't intended to put further effort into it until I > had some evidence the approach had promise. My own testing indicated it > worked but the drivers I was using for network tests did not allocate > intensely enough to show any major gain/loss. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tariq Toukan Subject: Re: Page allocator bottleneck Date: Mon, 23 Apr 2018 11:54:57 +0300 Message-ID: <0dea4da6-8756-22d4-c586-267217a5fa63@mellanox.com> References: <20170915102320.zqceocmvvkyybekj@techsingularity.net> <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com> <20180421081505.GA24916@intel.com> <127df719-b978-60b7-5d77-3c8efbf2ecff@mellanox.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Cc: Linux Kernel Network Developers , linux-mm , Mel Gorman , David Miller , Jesper Dangaard Brouer , Eric Dumazet , Alexei Starovoitov , Saeed Mahameed , Eran Ben Elisha , Andrew Morton , Michal Hocko To: Aaron Lu Return-path: Received: from mail-ve1eur01on0053.outbound.protection.outlook.com ([104.47.1.53]:63303 "EHLO EUR01-VE1-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1754186AbeDWIzI (ORCPT ); Mon, 23 Apr 2018 04:55:08 -0400 In-Reply-To: <127df719-b978-60b7-5d77-3c8efbf2ecff@mellanox.com> Content-Language: en-US Sender: netdev-owner@vger.kernel.org List-ID: On 22/04/2018 7:43 PM, Tariq Toukan wrote: > > > On 21/04/2018 11:15 AM, Aaron Lu wrote: >> Sorry to bring up an old thread... >> > > I want to thank you very much for bringing this up! 
> >> On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote: >>> >>> >>> On 18/09/2017 12:16 PM, Tariq Toukan wrote: >>>> >>>> >>>> On 15/09/2017 1:23 PM, Mel Gorman wrote: >>>>> On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote: >>>>>> Insights: Major degradation between #1 and #2, not getting any >>>>>> close to linerate! Degradation is fixed between #2 and #3. This is >>>>>> because page allocator cannot stand the higher allocation rate. In >>>>>> #2, we also see that the addition of rings (cores) reduces BW (!!), >>>>>> as result of increasing congestion over shared resources. >>>>>> >>>>> >>>>> Unfortunately, no surprises there. >>>>> >>>>>> Congestion in this case is very clear. When monitored in perf >>>>>> top: 85.58% [kernel] [k] queued_spin_lock_slowpath >>>>>> >>>>> >>>>> While it's not proven, the most likely candidate is the zone lock >>>>> and that should be confirmed using a call-graph profile. If so, then >>>>> the suggestion to tune to the size of the per-cpu allocator would >>>>> mitigate the problem. >>>>> >>>> Indeed, I tuned the per-cpu allocator and bottleneck is released. >>>> >>> >>> Hi all, >>> >>> After leaving this task for a while doing other tasks, I got back to >>> it now >>> and see that the good behavior I observed earlier was not stable. >> >> I posted a patchset to improve zone->lock contention for order-0 pages >> recently, it can almost eliminate 80% zone->lock contention for >> will-it-scale/page_fault1 testcase when tested on a 2 sockets Intel >> Skylake server and it doesn't require PCP size tune, so should have >> some effects on your workload where one CPU does allocation while >> another does free. >> > > That is great news. In our driver's memory scheme (and many others as > well) we allocate only order-0 pages (the only flow that does not do > that yet in upstream will do so very soon, we already have the patches > in our internal branch). 
> Allocation of order-0 pages is not only the common case, but is the only > type of allocation in our data-path. Let's optimize it! > > >> It did this by some disruptive changes: >> 1 on free path, it skipped doing merge(so could be bad for mixed >>    workloads where both 4K and high order pages are needed); > > I think there are so many advantages to not using high order > allocations, especially in production servers that are not rebooted for > long periods and become fragmented. > AFAIK, the community direction (at least in networking) is using order-0 > pages in datapath, so optimizing their allocation is a very good idea. > Need of course to perf evaluate possible degradations, and see how > important these use cases are. > >> 2 on allocation path, it avoided touching multiple cachelines. >> > > Great! > >> RFC v2 patchset: >> https://lkml.org/lkml/2018/3/20/171 >> >> repo: >> https://github.com/aaronlu/linux zone_lock_rfc_v2 >> > > I will check them out first thing tomorrow! > > p.s., I will be on vacation for a week starting Tuesday. > I hope I can make some progress before that :) > > Thanks, > Tariq > Hi, I ran my tests with your patches. Initial BW numbers are significantly higher than what I documented earlier in this mail thread. For example, in driver #2 (see original mail thread), with 6 rings, I now get 92Gbps (slightly less than linerate) in comparison to 64Gbps back then. However, there have been many kernel changes since then, so I need to isolate the effect of your changes. I am not sure I can finish this today, but I will surely get to it next week after I'm back from vacation. Still, when I increase the scale (more rings, i.e. more cpus), I see that queued_spin_lock_slowpath gets to 60%+ cpu. Still high, but lower than it used to be. The root cause should be solved by the (orthogonal) changes planned in the network subsystem, which will change the SKB allocation/free scheme so that SKBs are released on the originating cpu.
Thanks, Tariq >>> Recall: I work with a modified driver that allocates a page (4K) per >>> packet >>> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps >>> NICs. >>> >>> Performance is good as long as pages are available in the allocating >>> cores's >>> PCP. >>> Issue is that pages are allocated in one core, then free'd in another, >>> making it's hard for the PCP to work efficiently, and both the allocator >>> core and the freeing core need to access the buddy allocator very often. >>> >>> I'd like to share with you some testing numbers: >>> >>> Test: ./super_netperf 128 -H 24.134.0.51 -l 1000 >>> >>> 100% cpu on all cores, top func in perf: >>>     84.98%  [kernel]             [k] queued_spin_lock_slowpath >>> >>> system wide (all cores) >>>             1135941      kmem:mm_page_alloc >>> >>>             2606629      kmem:mm_page_free >>> >>>                   0      kmem:mm_page_alloc_extfrag >>>             4784616      kmem:mm_page_alloc_zone_locked >>> >>>                1337      kmem:mm_page_free_batched >>> >>>             6488213      kmem:mm_page_pcpu_drain >>> >>>             8925503      net:napi_gro_receive_entry >>> >>> >>> Two types of cores: >>> A core mostly running napi (8 such cores): >>>              221875      kmem:mm_page_alloc >>> >>>               17100      kmem:mm_page_free >>> >>>                   0      kmem:mm_page_alloc_extfrag >>>              766584      kmem:mm_page_alloc_zone_locked >>> >>>                  16      kmem:mm_page_free_batched >>> >>>                  35      kmem:mm_page_pcpu_drain >>> >>>             1340139      net:napi_gro_receive_entry >>> >>> >>> Other core, mostly running user application (40 such): >>>                   2      kmem:mm_page_alloc >>> >>>               38922      kmem:mm_page_free >>> >>>                   0      kmem:mm_page_alloc_extfrag >>>                   1      kmem:mm_page_alloc_zone_locked >>> >>>                   8      kmem:mm_page_free_batched >>> >>>     
         107289      kmem:mm_page_pcpu_drain >>> >>>                  34      net:napi_gro_receive_entry >>> >>> >>> As you can see, sync overhead is enormous. >>> >>> PCP-wise, a key improvement in such scenarios would be reached if we >>> could >>> (1) keep and handle the allocated page on same cpu, or (2) somehow >>> get the >>> page back to the allocating core's PCP in a fast-path, without going >>> through >>> the regular buddy allocator paths.