From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mathieu Desnoyers
Subject: Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
Date: Thu, 3 May 2018 12:12:21 -0400 (EDT)
Message-ID: <1718748931.10084.1525363941807.JavaMail.zimbra@efficios.com>
References: <20180430224433.17407-1-mathieu.desnoyers@efficios.com>
 <660904075.9201.1525276988842.JavaMail.zimbra@efficios.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To:
Sender: linux-kernel-owner@vger.kernel.org
To: Daniel Colascione
Cc: Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Andy Lutomirski,
 Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
 Russell King, Thomas Gleixner, Ingo Molnar, "H. Peter Anvin",
 Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
 Josh Triplett, Linus Torvalds, Catalin Marinas
List-Id: linux-api@vger.kernel.org

----- On May 2, 2018, at 12:07 PM, Daniel Colascione dancol@google.com wrote:

> On Wed, May 2, 2018 at 9:03 AM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>> ----- On May 1, 2018, at 11:53 PM, Daniel Colascione dancol@google.com wrote:
>> [...]
>> >
>> > I think a small enhancement to rseq would let us build a perfect userspace
>> > mutex, one that spins on lock-acquire only when the lock owner is running
>> > and that sleeps otherwise, freeing userspace from both specifying ad-hoc
>> > spin counts and from trying to detect situations in which spinning is
>> > generally pointless.
>> >
>> > It'd work like this: in the per-thread rseq data structure, we'd include a
>> > description of a futex operation for the kernel would perform (in the
>> > context of the preempted thread) upon preemption, immediately before
>> > schedule(). If the futex operation itself sleeps, that's no problem: we
>> > will have still accomplished our goal of running some other thread instead
>> > of the preempted thread.
>
>> Hi Daniel,
>
>> I agree that the problem you are aiming to solve is important. Let's see
>> what prevents the proposed rseq implementation from doing what you envision.
>
>> The main issue here is touching userspace immediately before schedule().
>> At that specific point, it's not possible to take a page fault. In the proposed
>> rseq implementation, we get away with it by raising a task struct flag, and using
>> it in a return to userspace notifier (where we can actually take a fault), where
>> we touch the userspace TLS area.
>
>> If we can find a way to solve this limitation, then the rest of your design
>> makes sense to me.
>
> Thanks for taking a look!
>
> Why couldn't we take a page fault just before schedule? The reason we can't
> take a page fault in atomic context is that doing so might call schedule.
> Here, we're about to call schedule _anyway_, so what harm does it do to
> call something that might call schedule? If we schedule via that call, we
> can skip the manual schedule we were going to perform.

By the way, if we eventually find a way to enhance user-space mutexes in the
fashion you describe here, that state would belong in another TLS area, and
would be registered through a system call other than rseq. I proposed a more
generic "TLS area registration" system call a few years ago, but Linus told me
he wanted a system call that was specific to rseq. If we need to implement
other use-cases in a TLS area shared between kernel and user-space in a similar
fashion, the plan is to do it in a distinct system call.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com