From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andrei Vagin
Subject: Re: [PATCHv4 26/28] x86/vdso: Align VDSO functions by CPU L1 cache line
Date: Sat, 22 Jun 2019 22:26:48 -0700
Message-ID: <20190623052647.GA9838@gmail.com>
References: <20190612192628.23797-1-dima@arista.com> <20190612192628.23797-27-dima@arista.com>
Sender: linux-kernel-owner@vger.kernel.org
To: Thomas Gleixner
Cc: Dmitry Safonov, linux-kernel@vger.kernel.org, Adrian Reber,
 Andrei Vagin, Andy Lutomirski, Arnd Bergmann, Christian Brauner,
 Cyrill Gorcunov, Dmitry Safonov <0x7f454c46@gmail.com>,
 "Eric W. Biederman", "H. Peter Anvin", Ingo Molnar, Jann Horn,
 Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
 Vincenzo Frascino, containers@lists.linux-foundation.org,
 criu@openvz.org, linux-api@vger.kernel.org, x86@kernel.org
List-Id: linux-api@vger.kernel.org

On Fri, Jun 14, 2019 at 04:13:31PM +0200, Thomas Gleixner wrote:
> On Wed, 12 Jun 2019, Dmitry Safonov wrote:
>
> > From: Andrei Vagin
> >
> > After performance testing VDSO patches a noticeable 20% regression was
> > found on gettime_perf selftest with a cold cache.
> > As it turns to be, before time namespaces introduction, VDSO functions
> > were quite aligned to cache lines, but adding a new code to adjust
> > timens offset inside namespace created a small shift and vdso functions
> > become unaligned on cache lines.
> >
> > Add align to vdso functions with gcc option to fix performance drop.
> >
> > Coping the resulting numbers from cover letter:
> >
> > Hot CPU cache (more gettime_perf.c cycles - the better):
> >         | before     | CONFIG_TIME_NS=n | host        | inside timens
> > --------|------------|------------------|-------------|-------------
> > cycles  | 139887013  | 139453003        | 139899785   | 128792458
> > diff (%)| 100        | 99.7             | 100         | 92
>
> Why is CONFIG_TIME_NS=n behaving worse than current mainline and
> worse than 'host' mode?

We should have specified the precision of these numbers. The measurement
error is larger than this 0.3% difference, so at that time I decided that
there was nothing to worry about. I did these measurements a few months ago
for the second version of this series. I repeated the measurements for this
set of patches:

        | before    | CONFIG_TIME_NS=n | host      | inside timens
--------------------------------------------------------------------
        | 144645498 | 142916801        | 140364862 | 132378440
        | 143440633 | 141545739        | 140540053 | 132714190
        | 144876395 | 144650599        | 140026814 | 131843318
        | 143984551 | 144595770        | 140359260 | 131683544
        | 144875682 | 143799788        | 140692618 | 131300332
--------------------------------------------------------------------
avg     | 144364551 | 143501739        | 140396721 | 131983964
diff %  | 100       | 99.4             | 97.2      | 91.4
--------------------------------------------------------------------
stdev % | 0.4       | 0.9              | 0.1       | 0.4

>
> > Cold cache (lesser tsc per gettime_perf_cold.c cycle - the better):
> >         | before     | CONFIG_TIME_NS=n | host        | inside timens
> > --------|------------|------------------|-------------|-------------
> > tsc     | 6748       | 6718             | 6862        | 12682
> > diff (%)| 100        | 99.6             | 101.7       | 188
>
> Weird, now CONFIG_TIME_NS=n is better than current mainline and 'host' mode
> drops.

The precision of these numbers is much lower than that of the previous set.
These numbers are for the second version of this series, so I decided to
repeat the measurements for this version. When I ran the test, I found some
degradation compared with v5.0.
I bisected it and found that the problem is in 2b539aefe9e4 ("mm/resource:
Let walk_system_ram_range() search child resources"). At this point, I
realized that my test wasn't quite right. On each iteration, the test starts
a new process, then does start = rdtsc(); clock_gettime(); end = rdtsc() and
prints (end - start). The problem is that when clock_gettime() is called for
the first time, the vdso pages are not yet mapped into the process address
space, so the test measures how fast the vdso pages are mapped into the
process address space.

I modified the test; it now uses the clflush instruction to drop CPU caches.
Here are the results:

           | before | CONFIG_TIME_NS=n | host  | inside timens
----------------------------------------------------------------
tsc        | 434    | 433              | 437   | 477
stdev(tsc) | 5      | 5                | 5     | 3
diff (%)   | 1      | 1                | 100.1 | 109

Here is the source code for the modified test:
https://github.com/avagin/linux-task-diag/blob/wip/timens-rfc-v4/tools/testing/selftests/timens/gettime_perf_cold.c

This test does 10K iterations.
At first glance, the numbers look noisy, so I sort them and take only the
8K numbers in the middle:

$ ./gettime_perf_cold > raw
$ cat raw | sort -n | tail -n 9000 | head -n 8000 > results

>
> Either I'm misreading the numbers or missing something or I'm just confused
> as usual :)
>
> Thanks,
>
> 	tglx
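
The sort/tail/head pipeline above keeps the middle 8000 of the 10000 sorted
samples. A minimal C sketch of the same trimming, followed by the mean and
variance of what is left, could look like this. The function name
trimmed_stats() and the choice of the sample (n - 1) variance are
assumptions made for illustration. The square root of the variance gives a
value comparable to the stdev(tsc) row.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort() comparator for uint64_t that avoids subtraction overflow. */
static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Sort v[0..n), drop the `low` smallest and the `high` largest samples,
 * and store the mean and the sample variance of the remaining ones.
 * Returns the number of samples kept.
 */
static size_t trimmed_stats(uint64_t *v, size_t n, size_t low, size_t high,
			    double *mean, double *var)
{
	size_t i, kept;
	double sum = 0, sq = 0;

	assert(low + high < n);
	kept = n - low - high;
	qsort(v, n, sizeof(*v), cmp_u64);

	for (i = low; i < n - high; i++)
		sum += (double)v[i];
	*mean = sum / kept;

	for (i = low; i < n - high; i++) {
		double d = (double)v[i] - *mean;

		sq += d * d;
	}
	*var = kept > 1 ? sq / (kept - 1) : 0.0;

	return kept;
}
```

With 10000 samples, trimmed_stats(samples, 10000, 1000, 1000, &mean, &var)
selects the same values as "sort -n | tail -n 9000 | head -n 8000".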