From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mathieu Desnoyers Subject: Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections Date: Fri, 11 Dec 2015 13:39:45 +0000 (UTC) Message-ID: <729444271.232573.1449841185204.JavaMail.zimbra@efficios.com> References: <20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com> <1070636085.232143.1449835536723.JavaMail.zimbra@efficios.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: In-Reply-To: <1070636085.232143.1449835536723.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org To: Paul Turner Cc: Peter Zijlstra , Andrew Hunter , Andy Lutomirski , Andi Kleen , Dave Watson , "Paul E. McKenney" , linux-api , linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Josh Triplett , Ingo Molnar , Chris Lameter , Linus Torvalds List-Id: linux-api@vger.kernel.org ----- On Dec 11, 2015, at 7:05 AM, Mathieu Desnoyers mathieu.desnoyers@= efficios.com wrote: > ----- On Oct 27, 2015, at 7:56 PM, Paul Turner commonly-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wro= te: >=20 >> This is an update to the previously posted series at: >> https://lkml.org/lkml/2015/6/24/665 >>=20 >> Dave Watson has posted a similar follow-up which allows additional c= ritical >> regions to be registered as well as single-step support at: >> https://lkml.org/lkml/2015/10/22/588 >>=20 >> This series is a new approach which introduces an alternate ABI that= does not >> depend on open-coded assembly nor a central 'repository' of rseq seq= uences. >> Sequences may now be inlined and the preparatory[*] work for the seq= uence can >> be written in a higher level language. >=20 > Hi Paul, >=20 > Thanks for sending this updated series! A few questions: >=20 > When you say that the preparatory work can be written in a higher > level language, does it mean it still has to be inlined, or function > calls are explicitly supported ? It would be good to clarify this in > the changelog/documentation. >=20 > Another point: in this changelog, it's unclear to me what the [*] ref= ers > to (after "preparatory"). Below there is a [*] before "A sequence ess= entially > has 3 steps", and and plain '*' before "0. Userspace stores current e= vent+cpu > counter values". >=20 >>=20 >> This new ABI has also been written to support debugger interaction i= n a way >> that the previous ABI could not. >>=20 >> [*] A sequence essentially has 3 steps: >> 1) Determine which cpu the sequence is being run on >> 2) Preparatory work specific to the state read in 1) >> 3) A single commit instruction which finalizes any state updates. >=20 > The link between those 3 steps and the following "progress" items > below seems a bit confusing, because those are two numbered lists. > Perhaps using the numbers above for the overall steps, and sub-number= s > after a dot could help exposing the relation, e.g.: >=20 > 1.1 Userspace stores current event+cpu counter values > 2.1 Userspace loads the rip to move to at failure into cx > 2.2 Userspace loads the rip of the instruction following > the critical section into a registered TLS address. > .... >=20 >=20 >>=20 >> We require a single instruction for (3) so that if it is interrupted= in any >> way, we can proceed from (1) once more [or potentially bail]. >>=20 >> This new ABI can be described as: >> Progress is ordered as follows: >> *0. Userspace stores current event+cpu counter values >=20 > This does not explain where those "current" values are > read from ? Moreover, step [3] below seems to refer to values > read at [0], but [0] is about storing values, not reading them. > There is probably a very simple explanation to this, but in its > current form, this is puzzling. Moreover, where is it stored ? >=20 > I would suggest to change event+cpu -> (event, cpu), since they > are not summed, but rather combined as a tuple. >=20 >> 1. Userspace loads the rip to move to at failure into cx >=20 > Is it really cx (16-bit), or ecx/rcx ? >=20 > "the rip to move to at failure" could be defined as > "failure address" or "restart address", and then referred > to afterward. This would clarify point [4] below rather than > saying "to the address from =C2=AD[1]". >=20 >> 2. Userspace loads the rip of the instruction following >> the critical section into a registered TLS address. >=20 > The critical section is not clearly defined. What is the > first instruction included in that c.s., and what is the > first instruction after the c.s. ? >=20 >> 3. Userspace loads the values read at [0] into a known >> location. >=20 > How can we "load the value ... into" ? We usually load from, > and store to a location. What I think you mean here is that > Userspace loads the values read at [0] into a known register. > (I would expect location target a memory address). >=20 >> 4. Userspace tests to see whether the current event and >> cpu counter values match those stored at 0. Manually >> jumping to the address from [1] in the case of a >> mismatch. >=20 > address from [1] -> "failure/restart address" >=20 >>=20 >> Note that if we are preempted or otherwise interrupted >> then the kernel can also now perform this comparison >> and conditionally jump us to [1]. >=20 > Does it jump us to [1] or to the failure/restart address ? >=20 >> 5. Our final instruction before [2] is then our commit. >> The critical section is self-terminating. [2] must >> also be cleared at this point. >=20 > Is it the step [2] or the instruction following the critical > section to you refer to here ? >=20 >>=20 >> For x86_64: >> [3] uses rdx to represent cpu and event counter as a >> single 64-bit value. >>=20 >> For i386: >> [3] uses ax for cpu and dx for the event_counter. >=20 > Having 16 bit only for CPU id limits us to 65536 CPUs. > Can this be an issue ? >=20 > Having 16 bit only for the event counter may start to become > tricky for counter overflows. >=20 > Are we strictly required to put both cpu id and seq counter > into the same register, or those could be put into two regs ? >=20 >>=20 >> Both: >> Instruction after commit: rseq_state->post_commit_instr >> Current event and cpu state: rseq_state->event_and_cpu >>=20 >> Exactly, for x86_64 this looks like: >> movq , rcx [1] >> movq $1f, [2] >> cmpq , [3] (start is in rcx) >=20 > As you stated yourself in a reply, rcx -> rdx. >=20 >> jnz (4) >> movq , () (5) >> 1: movq $0, >>=20 >> There has been some related discussion, which I am supportive of, in= which >> we use fs/gs instead of TLS. This maps naturally to the above and r= emoves >> the current requirement for per-thread initialization (this is a goo= d thing!). >=20 > It would be important to keep allowing other architectures to use TLS= -based > approaches though. >=20 >>=20 >> On debugger interactions: >>=20 >> There are some nice properties about this new style of API which all= ow it to >> actually support safe interactions with a debugger: >> a) The event counter is a per-cpu value. This means that we can not= advance >> it if no threads from the same process execute on that cpu. This >> naturally allows basic single step support with thread-isolation. >> b) Single-step can be augmented to evalute the ABI without increment= ing the >> event count. >> c) A debugger can also be augmented to evaluate this ABI and push re= starts >> on the kernel's behalf. >>=20 >> This is also compatible with David's approach of not single stepping= between >> 2-4 above. However, I think these are ultimately a little stronger = since true >> single-stepping and breakpoint support would be available. Which wo= uld be >> nice to allow actual debugging of sequences. >>=20 >> (Note that I haven't bothered implementing these in the current patc= hset as we >> are still winnowing down on the ABI and they just add complexity. I= t's >> important to note that they are possible however.) >=20 > Ensuring these restartable critical sections can be used on shared me= mory > across processes seems rather important. I wonder if the debuggabilit= y of the > critical section really justifies to introduce a "PRIVATE" and a "SHA= RED" mode > for the rseq (similar to sys_futex), or if we could just enforce that= no > breakpoint should be inserted within the critical section (except the= very > first instruction). One more thought about breakpoints: I now understand that with your new= approach, the critical section ranges are only tracked momentarily during the cri= tical section. There is no registry kept of all critical sections, which coul= d make ptrace peek tweaks challenging. An alternative approach might be to allow the debugger to put a breakpo= int within the critical section, but override the breakpoint handler in the= kernel so that it just resumes user-space execution, effectively ignoring the breakpoint. Thanks, Mathieu >=20 > Thoughts ? >=20 > Thanks, >=20 > Mathieu >=20 >=20 >>=20 >> Thanks, >>=20 >> - Paul >=20 > -- > Mathieu Desnoyers > EfficiOS Inc. > http://www.efficios.com --=20 Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com