From mboxrd@z Thu Jan 1 00:00:00 1970 From: Boqun Feng Subject: Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests Date: Mon, 15 Aug 2016 08:56:17 +0800 Message-ID: <20160815005607.GD18611@tardis.cn.ibm.com> References: <1469135662-31512-1-git-send-email-mathieu.desnoyers@efficios.com> <374861479.8581.1470957990793.JavaMail.zimbra@efficios.com> <20160812012854.GC1740@tardis.cn.ibm.com> <365498072.8664.1470971438957.JavaMail.zimbra@efficios.com> <20160812053008.GA18611@tardis.cn.ibm.com> <20160812163542.GB18611@tardis.cn.ibm.com> <434180831.9286.1471025505641.JavaMail.zimbra@efficios.com> <20160813012834.GC18611@tardis.cn.ibm.com> <832944127.9855.1471186940516.JavaMail.zimbra@efficios.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="AH+kv8CCoFf6qPuz" Return-path: Content-Disposition: inline In-Reply-To: <832944127.9855.1471186940516.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org To: Mathieu Desnoyers Cc: Dave Watson , Andrew Morton , Russell King , Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , linux-kernel , linux-api , Paul Turner , Andrew Hunter , Peter Zijlstra , Andy Lutomirski , Andi Kleen , Chris Lameter , Ben Maurer , rostedt , "Paul E. McKenney" , Josh Triplett , Linus Torvalds , Catalin Marinas , Will Deacon , Michael Kerrisk List-Id: linux-api@vger.kernel.org --AH+kv8CCoFf6qPuz Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sun, Aug 14, 2016 at 03:02:20PM +0000, Mathieu Desnoyers wrote: > ----- On Aug 12, 2016, at 9:28 PM, Boqun Feng boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote: >=20 > > On Fri, Aug 12, 2016 at 06:11:45PM +0000, Mathieu Desnoyers wrote: > >> ----- On Aug 12, 2016, at 12:35 PM, Boqun Feng boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wr= ote: > >>=20 > >> > On Fri, Aug 12, 2016 at 01:30:15PM +0800, Boqun Feng wrote: > >> > [snip] > >> >> > > Besides, do we allow userspace programs do read-only access to = the > >> >> > > memory objects modified by do_rseq(). If so, we have a problem = when > >> >> > > there are two writes in a do_rseq()(either in the rseq critical= section > >> >> > > or in the asm block), because in current implemetation, these t= wo writes > >> >> > > are unordered, which makes the readers outside a do_rseq() coul= d observe > >> >> > > the ordering of writes differently. > >> >> > >=20 > >> >> > > For rseq_finish2(), a simple solution would be making the "fina= l" write > >> >> > > a RELEASE. > >> >> >=20 > >> >> > Indeed, we would need a release semantic for the final store here= if this > >> >> > is the common use. Or we could duplicate the "flavors" of rseq_fi= nish2 and > >> >> > add a rseq_finish2_release. We should find a way to eliminate cod= e duplication > >> >>=20 > >> >> I'm in favor of a separate rseq_finish2_release(). > >> >>=20 > >> >> > there. I suspect we'll end up doing macros. > >> >> >=20 > >> >>=20 > >> >> Me too. Lemme have a try ;-) > >> >>=20 > >> >=20 > >> > How about this? Although a little messy, I separated the asm block i= nto > >> > several parts and implemented each part in a arch-diagnose way. > >>=20 > >> I find it rather hard to follow the per-arch assembly with this approa= ch. > >> It might prove to be troublesome if we want to do arch-specific optimi= zations > >> in the future. > >>=20 > >=20 > > It might be, but I was just trying to kill as much duplicate code as > > possible, because the more duplicate we have, the more maintain effort > > we need. > >=20 > > For example, PPC32 and PPC64 may have the same asm code to check the > > event counter, but different code to do the final store. Having the > > same RSEQ_CHECK_COUNTER() for PPC32 and PPC64 actually makes it easy if > > we come up a way to optimize the counter check code on PPC. > >=20 > > And if some arch wants to have some very specifical optimizations, > > it could always write the whole asm block again rather than use the > > helpers macros. >=20 > Creating macros for each assembly "operation" done in the restartable > sequence ends up requiring that people learn a new custom mini-language, > and implement those macros for each architecture. >=20 > I'd rather prefer to let each architecture maintainer express the > restartable sequence directly in assembly, which is already known to > them, than require them to learn a new small macro-based language. >=20 > Eliminating duplicated code is a goal I agree with, but there are > ways to achieve this which don't end up creating a macro-based custom > mini-language (such as what I proposed below). >=20 Fair point ;-) One more thing, do we want to use arch-specific header files to put arch-specific assembly code? For example, rseq-x86.h, rseq-powerpc.h, etc. This may save readers a lot of time if he or she is only interested in a particular arch, and also make maintaining a little easier(no need to worry about breaking other archs accidentally) [...] >=20 > In terms of fast-path, you would be trading: >=20 > (1) > "ldr r0, %[current_event_counter]\n\t" \ > "mov r1, #0\n\t" > "cmp %[start_event_counter], r0\n\t" \ > "bne %l[failure]\n\t" \ > "str %[to_write_final], [%[target_final]]\n\t" \ > "2:\n\t" \ > "str r1, [%[rseq_cs]]\n\t" \ > for >=20 > (2) > "ldr r0, %[current_event_counter]\n\t" \ > "cmp %[start_event_counter], r0\n\t" \ > "bne %l[failure]\n\t" \ > "str %[to_write_final], [%[target_final]]\n\t" \ > "2:\n\t" \ > "mov r0, #0\n\t" > "str r0, [%[rseq_cs]]\n\t" \ >=20 > Your proposal (2) saves a register (does not clobber r1), but this > is at the expense of a slower fast-path. In (1), loading the constant > 0 is done while the processor is stalled on the current_event_counter > load, which is needed by a following comparison. Therefore, we can > increase instruction-level parallelism by placing the immediate value > 0 load right after the ldr instruction. This, however, requires that > we use a different register than r0, because r0 is already used by the > ldr/cmp instructions. >=20 > Since this is a fast-path, achieving higher instruction throughput > is more important than saving a register. >=20 > I came up with this as an optimization while doing benchmarking > on a ARM32 Cubietruck as a reference architecture. >=20 Nice ;-) Better to put a comment there? I should try to investigate something similar for powerpc. > >=20 > >> + "b 4f\n\t" \ > >> + ".balign 32\n\t" \ > >> + "3:\n\t" \ > >> + ".word 1b, 0x0, 2b, 0x0, l[failure], 0x0, 0x0, 0x0\n\t" \ > >> + "4:\n\t" \ > >> + : /* no outputs */ \ > >> + : [start_event_counter]"r"((_start_value).event_counter), \ > >> + [current_event_counter]"m"((_start_value).rseqp->u.e.event_counte= r), \ > >> + [to_write_final]"r"(_to_write_final), \ > >> + [target_final]"r"(_target_final), \ > >> + [rseq_cs]"r"(&(_start_value).rseqp->rseq_cs) \ > >> + extra_input \ > >> + RSEQ_INJECT_INPUT \ > >> + : "r0", "r1", "memory", "cc" \ > >> + RSEQ_INJECT_CLOBBER \ > >> + : _failure \ > >> ); > >=20 > > [...] > >=20 > >> + > >> +#define RSEQ_FINISH2_RELEASE_SPECULATIVE_STORE_ASM() \ > >> + RSEQ_FINISH2_SPECULATIVE_STORE_ASM() \ > >> + "dmb\n\t" > >> + > >=20 > > Having a RELEASE barrier here may be OK for all current archs we > > support, but there are archs which rather than have a lightweight > > RELEASE barrier but use a special instruction for RELEASE operations, > > for example, AArch64. Do we need to take that into consideration and > > define a RSEQ_FINISH_ASM_RELEASE() rather than a > > RSEQ_FINISH2_SPECULATIVE_STORE_INPUT_ASM()? >=20 > Good point. We should introduce the barrier before the final > store to fit this scenario. This would also work if we want to > do many speculative stores followed by a final store: it really > makes sense to put the barrier just before the final store rather > than after each speculative store. >=20 > I just pushed a commit in my dev branch implementing this. Testing > is welcome. >=20 Sure, let me play around ;-) Regards, Boqun > Thanks! >=20 > Mathieu >=20 > >=20 > > [...] > >=20 > > Regards > > Boqun >=20 > --=20 > Mathieu Desnoyers > EfficiOS Inc. > http://www.efficios.com --AH+kv8CCoFf6qPuz Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQEcBAABCAAGBQJXsRMuAAoJEEl56MO1B/q4TpEH/2qlI+1O0uYgOpEEZ75zh6yO /p9lDZD03regU5iTco7mX9fAOxZkDIjNMhee/J70ycM2ZrFbqvgXzUIG3/cLc70N 4IZ6hVky2nfQ5VXO16ENiMk0vmssDhqp7T+UgU8eiswS7RDAFq2ISJk09lZrgdpq PND5nEtq8bh/9cZKs7XEl7t4z/EG8Hqh74OKOPGxXGz1sMsmwnTX1RHtKAR9y0b1 /ck+ayKjhClPboPv1+94Bw8KY1hGzxvp6BSli4gkQPQhCZfNzc3WFACBcxn39+nN QB/zWe3oOohVdRJle+JzIfgOq3Smxx69VfEUi5JNByX+7FOwnDSD3D9s2TsDpv0= =dsHo -----END PGP SIGNATURE----- --AH+kv8CCoFf6qPuz--