From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 6 Jan 2012 12:05:20 +0100
From: Ingo Molnar
To: Jan Beulich
Cc: tglx@linutronix.de, hpa@zytor.com, linux-kernel@vger.kernel.org,
	Linus Torvalds, Andrew Morton
Subject: Re: [PATCH] x86-64: fix memset() to support sizes of 4Gb and above
Message-ID: <20120106110519.GA32673@elte.hu>
In-Reply-To: <4F05D992020000780006AA09@nat28.tlf.novell.com>
References: <4F05D992020000780006AA09@nat28.tlf.novell.com>
X-Mailing-List: linux-kernel@vger.kernel.org

* Jan Beulich wrote:

> While currently there doesn't appear to be any reachable
> in-tree case where such large memory blocks may be passed to
> memset() (alloc_bootmem() being the primary non-reachable one,
> as it gets called with suitably large sizes in FLATMEM
> configurations), we have recently hit the problem a second
> time in our Xen kernels. Rather than working around it a
> second time, prevent others from falling into the same trap by
> fixing this long-standing limitation.
>
> Signed-off-by: Jan Beulich

Have you checked the before/after size of the hotpath?
The patch suggests that it got shorter by 3 instructions:

> -	movl	%edx,%r8d
> -	andl	$7,%r8d
> -	movl	%edx,%ecx
> -	shrl	$3,%ecx
> +	movq	%rdx,%rcx
> +	andl	$7,%edx
> +	shrq	$3,%rcx
[...]
> 	movq	%rdi,%r10
> -	movq	%rdx,%r11
[...]
> -	movl	%r11d,%ecx
> -	andl	$7,%ecx
> +	andl	$7,%edx

Is that quick impression correct? I have not tried building or measuring it.

Also, note that we have some cool instrumentation tech upstream: lately we've added a way to measure the *kernel*'s memcpy routine performance in user-space, using perf bench:

 $ perf bench mem memcpy -r x86
 # Running mem/memcpy benchmark...
 Unknown routine:x86
 Available routines...
	default ... Default memcpy() provided by glibc
	x86-64-unrolled ... unrolled memcpy() in arch/x86/lib/memcpy_64.S

 $ perf bench mem memcpy -r x86-64-unrolled
 # Running mem/memcpy benchmark...
 # Copying 1MB Bytes ...

       1.888902 GB/Sec
      15.024038 GB/Sec (with prefault)

This builds the routine in arch/x86/lib/memcpy_64.S and measures its bandwidth in both the cache-cold and the cache-hot case.

It would be nice to add support for arch/x86/lib/memset_64.S as well, and to look at its before/after performance.

In user-space we can do much more accurate measurements of this kind:

 $ perf stat --repeat 1000 -e instructions -e cycles perf bench mem memcpy -r x86-64-unrolled >/dev/null

 Performance counter stats for 'perf bench mem memcpy -r x86-64-unrolled' (1000 runs):

         4,924,378 instructions              #    0.84  insns per cycle   ( +- 0.03% )
         5,892,603 cycles                    #    0.000 GHz               ( +- 0.06% )

       0.002208000 seconds time elapsed                                   ( +- 0.10% )

Note how the confidence interval of the measurement is in the 0.03% range, so even a single-cycle change in overhead caused by your patch should be measurable.

Note, you'll need to rebuild perf with the different memset routines.
For example, the kernel's memcpy routine is slightly faster than glibc's:

 $ perf stat --repeat 1000 -e instructions -e cycles perf bench mem memcpy -r default >/dev/null

 Performance counter stats for 'perf bench mem memcpy -r default' (1000 runs):

         4,927,173 instructions              #    0.83  insns per cycle   ( +- 0.03% )
         5,928,168 cycles                    #    0.000 GHz               ( +- 0.06% )

       0.002157349 seconds time elapsed                                   ( +- 0.10% )

If all such measurements suggest equal or better performance, and if there's no erratum in current CPUs that would make 4G string copies dangerous [which your research suggests should be fine], I have no objection in principle against this patch. I'd not be surprised (at all) if other OSs did larger-than-4GB memsets in certain circumstances.

Thanks,

	Ingo