From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: Nicolas Pitre <nico@cam.org>,
Andrew Morton <akpm@linux-foundation.org>,
torvalds@linux-foundation.org, dhowells@redhat.com,
mingo@elte.hu, a.p.zijlstra@chello.nl, linux-kernel@vger.k
Subject: Re: [PATCH] convert cnt32_to_63 to inline
Date: Tue, 11 Nov 2008 15:11:30 -0500 [thread overview]
Message-ID: <20081111201130.GA10801@Krystal> (raw)
In-Reply-To: <20081111191355.GA13724@flint.arm.linux.org.uk>
* Russell King (rmk+lkml@arm.linux.org.uk) wrote:
> On Tue, Nov 11, 2008 at 01:28:00PM -0500, Mathieu Desnoyers wrote:
> > Let's see what it gives once implemented. Only compile-tested. Assumes
> > pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() are used for
> > arm versatile.
>
> Versatile is also UP only.
>
> The following are results from PXA built with gcc 3.4.3:
>
> 1. two additional registers used in sched_clock()
> 2. 8 additional bytes of code (which are needless if gcc was more inteligent)
>
> both of these I put down to inefficiencies in gcc's register allocation.
>
> 3. worse instruction scheduling - two inter-dependent loads next to each
> other causing a pipeline stall
>
> Actual reading of variables/hardware is unaffected by this patch.
>
> Old code:
>
> c: e59f3050 ldr r3, [pc, #80] ; load address of oscr2ns_scale
> 10: e59fc050 ldr ip, [pc, #80] ; load address of __m_cnt_hi
> 14: e5932000 ldr r2, [r3] ; read oscr2ns_scale
> 18: e59f304c ldr r3, [pc, #76] ; load address of OSCR
> 1c: e59c1000 ldr r1, [ip] ; read __m_cnt_hi
> 20: e1a07002 mov r7, r2
> 24: e3a08000 mov r8, #0 ; 0x0
> 28: e5933000 ldr r3, [r3] ; read OSCR register
> ...
> 58: e1820b04 orr r0, r2, r4, lsl #22
> 5c: e1a01524 lsr r1, r4, #10
> 60: e89da9f0 ldm sp, {r4, r5, r6, r7, r8, fp, sp, pc}
>
>
> New code:
>
> c: e59f0058 ldr r0, [pc, #88] ; load address of oscr2ns_scale
> 10: e5901000 ldr r1, [r0] ; read oscr2ns_scale <= pipeline stall
> 14: e59f3054 ldr r3, [pc, #84] ; load address of __m_cnt_hi
> 18: e1a08001 mov r8, r1
> 1c: e5932000 ldr r2, [r3] ; read __m_cnt_hi
> 20: e59f304c ldr r3, [pc, #76] ; load address of OSCR
> 24: e1a09002 mov r9, r2
> 28: e3a0a000 mov sl, #0 ; 0x0
> 2c: e5933000 ldr r3, [r3] ; read OSCR
> ...
> 58: e1825b04 orr r5, r2, r4, lsl #22
> 5c: e1a06524 lsr r6, r4, #10
> 60: e1a01006 mov r1, r6
> 64: e1a00005 mov r0, r5
> 68: e89daff0 ldm sp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc}
>
> Versatile:
>
> 1. 12 additional bytes of code
> 2. same number of registers
> 3. worse instruction scheduling causing pipeline stall
>
> Actual reading of variables/hardware is unaffected by this patch.
>
> So, we have two platforms where this patch makes things visibly worse
> with no material benefit.
>
I think the added barrier() are causing these pipeline stalls. They
don't allow the compiler to read variables such as oscr2ns_scale before
the barrier because gcc cannot assume it won't be modified. However, to
insure that OSCR read is done after __m_cnt_hi read, this barrier seems
required to be safe against gcc optimizations.
Have you compared my patch to Nicolas'patch, which adds a smp_rmb() in
the macro or to a vanilla tree ?
I wonder if reading those values sooner would help gcc ?
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
WARNING: multiple messages have this Message-ID (diff)
From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: Nicolas Pitre <nico@cam.org>,
Andrew Morton <akpm@linux-foundation.org>,
torvalds@linux-foundation.org, dhowells@redhat.com,
mingo@elte.hu, a.p.zijlstra@chello.nl,
linux-kernel@vger.kernel.org, ralf@linux-mips.org,
benh@kernel.crashing.org, paulus@samba.org, davem@davemloft.net,
mingo@redhat.com, tglx@linutronix.de, rostedt@goodmis.org,
linux-arch@vger.kernel.org
Subject: Re: [PATCH] convert cnt32_to_63 to inline
Date: Tue, 11 Nov 2008 15:11:30 -0500 [thread overview]
Message-ID: <20081111201130.GA10801@Krystal> (raw)
Message-ID: <20081111201130.l9pQwlJxzckCMgXycxwShmRz1B5iZMYEH2eOmODF8rc@z> (raw)
In-Reply-To: <20081111191355.GA13724@flint.arm.linux.org.uk>
* Russell King (rmk+lkml@arm.linux.org.uk) wrote:
> On Tue, Nov 11, 2008 at 01:28:00PM -0500, Mathieu Desnoyers wrote:
> > Let's see what it gives once implemented. Only compile-tested. Assumes
> > pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() are used for
> > arm versatile.
>
> Versatile is also UP only.
>
> The following are results from PXA built with gcc 3.4.3:
>
> 1. two additional registers used in sched_clock()
> 2. 8 additional bytes of code (which are needless if gcc was more inteligent)
>
> both of these I put down to inefficiencies in gcc's register allocation.
>
> 3. worse instruction scheduling - two inter-dependent loads next to each
> other causing a pipeline stall
>
> Actual reading of variables/hardware is unaffected by this patch.
>
> Old code:
>
> c: e59f3050 ldr r3, [pc, #80] ; load address of oscr2ns_scale
> 10: e59fc050 ldr ip, [pc, #80] ; load address of __m_cnt_hi
> 14: e5932000 ldr r2, [r3] ; read oscr2ns_scale
> 18: e59f304c ldr r3, [pc, #76] ; load address of OSCR
> 1c: e59c1000 ldr r1, [ip] ; read __m_cnt_hi
> 20: e1a07002 mov r7, r2
> 24: e3a08000 mov r8, #0 ; 0x0
> 28: e5933000 ldr r3, [r3] ; read OSCR register
> ...
> 58: e1820b04 orr r0, r2, r4, lsl #22
> 5c: e1a01524 lsr r1, r4, #10
> 60: e89da9f0 ldm sp, {r4, r5, r6, r7, r8, fp, sp, pc}
>
>
> New code:
>
> c: e59f0058 ldr r0, [pc, #88] ; load address of oscr2ns_scale
> 10: e5901000 ldr r1, [r0] ; read oscr2ns_scale <= pipeline stall
> 14: e59f3054 ldr r3, [pc, #84] ; load address of __m_cnt_hi
> 18: e1a08001 mov r8, r1
> 1c: e5932000 ldr r2, [r3] ; read __m_cnt_hi
> 20: e59f304c ldr r3, [pc, #76] ; load address of OSCR
> 24: e1a09002 mov r9, r2
> 28: e3a0a000 mov sl, #0 ; 0x0
> 2c: e5933000 ldr r3, [r3] ; read OSCR
> ...
> 58: e1825b04 orr r5, r2, r4, lsl #22
> 5c: e1a06524 lsr r6, r4, #10
> 60: e1a01006 mov r1, r6
> 64: e1a00005 mov r0, r5
> 68: e89daff0 ldm sp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc}
>
> Versatile:
>
> 1. 12 additional bytes of code
> 2. same number of registers
> 3. worse instruction scheduling causing pipeline stall
>
> Actual reading of variables/hardware is unaffected by this patch.
>
> So, we have two platforms where this patch makes things visibly worse
> with no material benefit.
>
I think the added barrier() are causing these pipeline stalls. They
don't allow the compiler to read variables such as oscr2ns_scale before
the barrier because gcc cannot assume it won't be modified. However, to
insure that OSCR read is done after __m_cnt_hi read, this barrier seems
required to be safe against gcc optimizations.
Have you compared my patch to Nicolas'patch, which adds a smp_rmb() in
the macro or to a vanilla tree ?
I wonder if reading those values sooner would help gcc ?
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
next prev parent reply other threads:[~2008-11-11 20:11 UTC|newest]
Thread overview: 123+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-11-07 5:23 [RFC patch 00/18] Trace Clock v2 Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 01/18] get_cycles() : kconfig HAVE_GET_CYCLES Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 02/18] get_cycles() : x86 HAVE_GET_CYCLES Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 03/18] get_cycles() : sparc64 HAVE_GET_CYCLES Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 04/18] get_cycles() : powerpc64 HAVE_GET_CYCLES Mathieu Desnoyers
2008-11-07 14:56 ` Josh Boyer
2008-11-07 18:14 ` Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 05/18] get_cycles() : MIPS HAVE_GET_CYCLES_32 Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 06/18] Trace clock generic Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 07/18] Trace clock core Mathieu Desnoyers
2008-11-07 5:52 ` Andrew Morton
2008-11-07 6:16 ` Mathieu Desnoyers
2008-11-07 6:26 ` Andrew Morton
2008-11-07 16:12 ` Mathieu Desnoyers
2008-11-07 16:19 ` Andrew Morton
2008-11-07 18:16 ` Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 08/18] cnt32_to_63 should use smp_rmb() Mathieu Desnoyers
2008-11-07 6:05 ` Andrew Morton
2008-11-07 8:12 ` Nicolas Pitre
2008-11-07 8:38 ` Andrew Morton
2008-11-07 11:20 ` David Howells
2008-11-07 15:01 ` Nicolas Pitre
2008-11-07 15:50 ` Andrew Morton
2008-11-07 16:21 ` David Howells
2008-11-07 16:29 ` Andrew Morton
2008-11-07 17:10 ` David Howells
2008-11-07 17:26 ` Andrew Morton
2008-11-07 18:00 ` Mathieu Desnoyers
2008-11-07 18:21 ` Andrew Morton
2008-11-07 18:30 ` Harvey Harrison
2008-11-07 18:42 ` Mathieu Desnoyers
2008-11-07 18:33 ` Mathieu Desnoyers
2008-11-07 18:36 ` Linus Torvalds
2008-11-07 18:45 ` Andrew Morton
2008-11-07 16:47 ` Nicolas Pitre
2008-11-07 16:55 ` David Howells
2008-11-07 17:21 ` Andrew Morton
2008-11-07 20:03 ` Nicolas Pitre
2008-11-07 16:07 ` David Howells
2008-11-07 16:47 ` Mathieu Desnoyers
2008-11-07 17:04 ` David Howells
2008-11-07 17:17 ` Mathieu Desnoyers
2008-11-07 23:27 ` David Howells
2008-11-07 20:11 ` Russell King
2008-11-07 21:36 ` Mathieu Desnoyers
2008-11-07 21:36 ` Mathieu Desnoyers
2008-11-07 22:18 ` Russell King
2008-11-07 22:36 ` Mathieu Desnoyers
2008-11-07 22:36 ` Mathieu Desnoyers
2008-11-07 23:41 ` David Howells
2008-11-08 0:15 ` Russell King
2008-11-08 0:45 ` David Howells
2008-11-08 15:24 ` Nicolas Pitre
2008-11-08 23:20 ` [PATCH] clarify usage expectations for cnt32_to_63() Nicolas Pitre
2008-11-09 2:25 ` Mathieu Desnoyers
2008-11-09 2:54 ` Nicolas Pitre
2008-11-09 5:06 ` Nicolas Pitre
2008-11-09 5:27 ` [PATCH v2] " Nicolas Pitre
2008-11-09 6:48 ` Mathieu Desnoyers
2008-11-09 13:34 ` Nicolas Pitre
2008-11-09 13:43 ` Russell King
2008-11-09 16:22 ` Mathieu Desnoyers
2008-11-10 4:20 ` Nicolas Pitre
2008-11-10 4:42 ` Andrew Morton
2008-11-10 21:34 ` Nicolas Pitre
2008-11-10 21:58 ` Andrew Morton
2008-11-10 23:15 ` Nicolas Pitre
2008-11-10 23:22 ` Andrew Morton
2008-11-10 23:38 ` Steven Rostedt
2008-11-11 0:26 ` Nicolas Pitre
2008-11-11 18:28 ` [PATCH] convert cnt32_to_63 to inline Mathieu Desnoyers
2008-11-11 19:13 ` Russell King
2008-11-11 20:11 ` Mathieu Desnoyers [this message]
2008-11-11 20:11 ` Mathieu Desnoyers
2008-11-11 21:51 ` Russell King
2008-11-12 3:48 ` Mathieu Desnoyers
2008-11-12 3:48 ` Mathieu Desnoyers
2008-11-11 21:00 ` Nicolas Pitre
2008-11-11 21:13 ` Russell King
2008-11-11 22:31 ` David Howells
2008-11-11 22:37 ` Peter Zijlstra
2008-11-12 1:13 ` Steven Rostedt
2008-11-07 11:03 ` [RFC patch 08/18] cnt32_to_63 should use smp_rmb() David Howells
2008-11-07 16:51 ` Mathieu Desnoyers
2008-11-07 20:18 ` Nicolas Pitre
2008-11-07 23:55 ` David Howells
2008-11-07 10:59 ` David Howells
2008-11-07 10:55 ` David Howells
2008-11-07 17:09 ` Mathieu Desnoyers
2008-11-07 17:33 ` Steven Rostedt
2008-11-07 19:18 ` Mathieu Desnoyers
2008-11-07 19:32 ` Peter Zijlstra
2008-11-07 20:02 ` Mathieu Desnoyers
2008-11-07 20:45 ` Mathieu Desnoyers
2008-11-07 20:54 ` Paul E. McKenney
2008-11-07 21:04 ` Steven Rostedt
2008-11-08 0:34 ` Paul E. McKenney
2008-11-07 21:16 ` Mathieu Desnoyers
2008-11-07 23:50 ` David Howells
2008-11-08 0:55 ` Steven Rostedt
2008-11-09 11:51 ` David Howells
2008-11-09 14:31 ` Steven Rostedt
2008-11-09 16:18 ` Mathieu Desnoyers
2008-11-07 20:08 ` Steven Rostedt
2008-11-07 20:55 ` Paul E. McKenney
2008-11-07 21:27 ` Mathieu Desnoyers
2008-11-07 20:36 ` Nicolas Pitre
2008-11-07 20:55 ` Mathieu Desnoyers
2008-11-07 21:22 ` Nicolas Pitre
2008-11-07 5:23 ` [RFC patch 09/18] Powerpc : Trace clock Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 10/18] Sparc64 " Mathieu Desnoyers
2008-11-07 5:45 ` David Miller
2008-11-07 5:23 ` [RFC patch 11/18] LTTng timestamp sh Mathieu Desnoyers
2008-11-07 5:23 ` Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 12/18] LTTng - TSC synchronicity test Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 13/18] x86 : remove arch-specific tsc_sync.c Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 14/18] MIPS use tsc_sync.c Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 15/18] MIPS : export hpt frequency for trace_clock Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 16/18] MIPS create empty sync_core() Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 17/18] MIPS : Trace clock Mathieu Desnoyers
2008-11-07 11:53 ` Peter Zijlstra
2008-11-07 17:44 ` Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 18/18] x86 trace clock Mathieu Desnoyers
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20081111201130.GA10801@Krystal \
--to=mathieu.desnoyers@polymtl.ca \
--cc=a.p.zijlstra@chello.nl \
--cc=akpm@linux-foundation.org \
--cc=dhowells@redhat.com \
--cc=linux-kernel@vger.k \
--cc=mingo@elte.hu \
--cc=nico@cam.org \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.