Date: Sat, 05 Jun 2021 11:09:45 +1000
From: Nicholas Piggin
Subject: Re: [PATCH v4 00/15] Add futex2 syscalls
To: André Almeida
Cc: acme@kernel.org, Andrey Semashev, Sebastian Andrzej Siewior,
 corbet@lwn.net, Davidlohr Bueso, Darren Hart, fweimer@redhat.com,
 joel@joelfernandes.org, kernel@collabora.com, krisman@collabora.com,
 libc-alpha@sourceware.org, linux-api@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
 malteskarupke@fastmail.fm, Ingo Molnar, Peter Zijlstra,
 pgriffais@valvesoftware.com, Peter Oskolkov, Steven Rostedt,
 shuah@kernel.org, Thomas Gleixner, z.figura12@gmail.com
References: <20210603195924.361327-1-andrealmeid@collabora.com>
 <1622799088.hsuspipe84.astroid@bobo.none>
Message-Id: <1622853816.mokf23xgnt.astroid@bobo.none>
X-Mailing-List: linux-api@vger.kernel.org

Excerpts from André Almeida's message of June 5, 2021 6:01 am:
> At 08:36 on 04/06/21, Nicholas Piggin wrote:
>> Excerpts from André Almeida's message of June 4, 2021 5:59 am:
>>> Hi,
>>>
>>> This patch series introduces the futex2 syscalls.
>>>
>>> * What happened to the current futex()?
>>>
>>> For some years now, developers have been trying to add new features
>>> to futex, but maintainers have been reluctant to accept them, since
>>> the multiplexed interface is full of legacy features and makes big
>>> changes tricky. Some problems that people have tried to address with
>>> patchsets are: NUMA-awareness[0], smaller-sized futexes[1], and
>>> waiting on multiple futexes[2]. NUMA, for instance, just doesn't fit
>>> the current API in a reasonable way. Considering that, it's not
>>> possible to merge new features into the current futex.
>>>
>>> ** The NUMA problem
>>>
>>> In the current implementation, all futex kernel-side infrastructure
>>> is stored on a single node. Given that, all futex() calls issued by
>>> processors located on other nodes pay a memory access penalty.
>>>
>>> ** The 32-bit-sized futex problem
>>>
>>> Futexes are used to implement atomic operations in userspace.
>>> Supporting 8, 16, 32 and 64 bit sized futexes allows user libraries
>>> to implement all those sizes in a performant way. Thanks to the
>>> Boost devs for the feedback:
>>> https://lists.boost.org/Archives/boost/2021/05/251508.php
>>>
>>> Embedded systems, or anything with memory constraints, could benefit
>>> from using smaller sizes for the futex userspace integer.
>>>
>>> ** The wait-on-multiple problem
>>>
>>> The use case lies in the Wine implementation of the Windows NT
>>> interface WaitForMultipleObjects. This Windows API function allows a
>>> thread to sleep waiting for the first of a set of event sources
>>> (mutexes, timers, signals, console input, etc.) to signal.
>>> Considering that this is a primitive synchronization operation for
>>> Windows applications, being able to quickly signal events on the
>>> producer side and quickly go to sleep on the consumer side is
>>> essential for good performance of those running over Wine.
>>>
>>> [0] https://lore.kernel.org/lkml/20160505204230.932454245@linutronix.de/
>>> [1] https://lore.kernel.org/lkml/20191221155659.3159-2-malteskarupke@web.de/
>>> [2] https://lore.kernel.org/lkml/20200213214525.183689-1-andrealmeid@collabora.com/
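As a concrete illustration of the vectored wait this use case calls for
(the futex_waitv() syscall summarised further below), a consumer waiting
on the first of three event words might look like the following sketch.
The struct layout, flag name and return convention are assumptions taken
from this series' descriptions, and the futex_waitv() wrapper is assumed
to be a raw syscall wrapper, since libc exposes none of this:

    /* Sketch only: struct layout and FUTEX_32 follow this series'
     * descriptions; futex_waitv() is an assumed syscall wrapper. */
    struct futex_waitv {
            void *uaddr;         /* futex word to wait on */
            unsigned int val;    /* expected value: sleep while equal */
            unsigned int flags;  /* size flag, e.g. FUTEX_32 */
    };

    uint32_t events[3] = { 0, 0, 0 };   /* one word per event source */
    struct futex_waitv waiters[3];

    for (int i = 0; i < 3; i++) {
            waiters[i].uaddr = &events[i];
            waiters[i].val   = 0;          /* sleep while word == 0 */
            waiters[i].flags = FUTEX_32;
    }

    /* Sleeps until any event word is signalled and woken; the return
     * value identifies which waiter woke, or is a negative error. */
    long woken = futex_waitv(waiters, 3, 0, NULL);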
>>>
>>> * The solution
>>>
>>> As proposed by Peter Zijlstra and Florian Weimer[3], a new interface
>>> is required to solve this, one designed with those features in mind.
>>> futex2() is that interface. As opposed to the current multiplexed
>>> interface, the new one has one syscall per operation. This keeps the
>>> API maintainable as it gets extended, and helps users with type
>>> checking of arguments.
>>>
>>> In particular, the new interface is extended to support the ability
>>> to wait on any of a list of futexes at a time, which can be seen as
>>> a vectored extension of the FUTEX_WAIT semantics.
>>>
>>> [3] https://lore.kernel.org/lkml/20200303120050.GC2596@hirez.programming.kicks-ass.net/
>>>
>>> * The interface
>>>
>>> The new interface is presented in detail in the following patches,
>>> but this is a high-level summary of what it can do:
>>>
>>> - Supports wake/wait semantics, as in futex()
>>> - Supports requeue operations, similar to FUTEX_CMP_REQUEUE, but
>>>   with individual flags for each address
>>> - Supports waiting on a vector of futexes, using a new syscall named
>>>   futex_waitv()
>>> - Supports variable-sized futexes (8 bits, 16 bits, 32 bits and
>>>   64 bits)
>>> - Supports NUMA-aware operations, where the user can specify the
>>>   memory node they would like to operate on
>>
>> A few comments.
>>
>> - Do you need a syscall for wait and for waitv, or can wait just be a
>>   degenerate case of waitv (which could use the stack variable)? I
>>   guess it does save the user access unlock.
>
> Yes. waitv with one element has an overhead compared to wait, given
> the extra user copy. Given that waiting on a single futex is the
> common case, I think it's worth having both.

Okay.

>> - Did you consider a wakev interface? An example is a read-write
>>   mutex whose read-blocking futexes are split (e.g., per node) for
>>   scalability; the writer may then unlock and wake all readers with
>>   one call. We actually have some scalability challenges of this
>>   nature with certain large database programs.
>
> Not really, I hadn't heard any use case for that until now. It should
> be easy to implement, though, and I think you have an interesting use
> case here. Could you point me to some of those database programs?

Not source code, unfortunately. I know that's not a very good answer,
but they are far ahead of what most open source apps are doing
scalability-wise today, and they end up rolling their own complex
locking. Hopefully the example I give is simple enough to understand.
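To make that example concrete, a hypothetical futex_wakev() call (not
part of this series; the call below is invented for illustration, with
semantics parallel to futex_waitv() and reusing the struct sketched
earlier) could let the writer-unlock path of such a per-node read-write
mutex wake every per-node word in one syscall:

    /* Hypothetical sketch only: futex_wakev() does not exist. */
    #define MAX_NODES 8

    struct rw_mutex {
            uint32_t write_word;               /* writer lock word */
            uint32_t reader_block[MAX_NODES];  /* per-node blocking words */
    };

    void write_unlock(struct rw_mutex *m, int nr_nodes)
    {
            struct futex_waitv v[MAX_NODES];

            __atomic_store_n(&m->write_word, 0, __ATOMIC_RELEASE);

            /* One call wakes the readers parked on every per-node
             * word, instead of one futex_wake() syscall per node. */
            for (int i = 0; i < nr_nodes; i++) {
                    v[i].uaddr = &m->reader_block[i];
                    v[i].val   = INT_MAX;   /* wake all waiters here */
                    v[i].flags = FUTEX_32;
            }
            futex_wakev(v, nr_nodes, 0);
    }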
>> - Great to see 64-bit support in; it is helpful for doing some
>>   interesting things with locks without hacks (e.g., putting an
>>   address in the lock word).
>>
>> - Are we really keen on squashing the node ID into flags in this day
>>   and age? I guess it's okay, but it seems like it would be nice to
>>   allow a bit more space in general for the operations. I don't want
>>   to turn it into a whole big multiplexing nightmare again with lots
>>   of such flags, or propose complexity with no code behind it, but I
>>   think we need a bit of leeway for unforeseen locking innovations to
>>   be added carefully. The pthread locking of today is still fairly
>>   primitive really; I don't think we know what will work best for the
>>   next 10 years.
>
> In the interface I proposed, the node ID isn't part of the flags. You
> have a flag FUTEX_FLAG_NUMA, and when that is used, you pass in
> `void *uaddr` a pointer to a `struct futex_numa { int value, int hint }`,
> where hint should be the ID of the node you would like to work on, and
> value is just the userspace futex. This is documented in more detail
> in patch 7, "docs: locking: futex2: Add documentation".
>
> If you have any feedback about how this NUMA interface looks, I would
> like to hear it.
>
> Also, did something in my writing indicate that the node ID would be
> part of the flags? I'll improve it if so.

Oh, I did miss this, thank you. No, it wasn't your writing; I think it
was me trying to read through a lot of messages and getting confused
with some earlier conversations. I'll look a bit more at the NUMA
interface.
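For reference, usage of the FUTEX_FLAG_NUMA interface described above
might look like the sketch below. The struct follows André's
description; the futex_wait() signature is an assumption based on this
series, and the node lookup uses libnuma:

    /* Sketch of the NUMA interface as described above. */
    struct futex_numa {
            int value;      /* the userspace futex word */
            int hint;       /* ID of the node to operate on */
    };

    struct futex_numa f = {
            .value = 0,
            .hint  = numa_node_of_cpu(sched_getcpu()),  /* libnuma */
    };

    /* With FUTEX_FLAG_NUMA set, uaddr points at the futex_numa pair;
     * sleep while f.value == 0, keeping the kernel-side state on the
     * node named by f.hint. Signature assumed from this series. */
    futex_wait(&f, 0, FUTEX_32 | FUTEX_FLAG_NUMA, NULL);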
>>
>> One scalability issue we are starting to hit, and which will only get
>> worse, is futex queue spinlock contention. Perhaps this is better
>> addressed in userspace, but the kernel could play a part, so I like
>> to leave some doors open. One example is that the wait (or wake) side
>> may like to depend not just on the memory value, but on the success
>> of a cmpxchg, to avoid unqueueing and queueing spuriously, which
>> increases lock contention but also ends up putting the poor task on
>> the back of the list -- yes, RT priorities can help the realtime
>> case, but non-RT cases can get bad outlier latencies if lock stealing
>> is allowed (which can be very good for performance).
>
> Sorry, I'm not sure what you mean here. Are you proposing to have a
> cmpxchg on the kernel side, so the lock would be taken by the kernel,
> and not by userspace like it is now?

Yes. Only in slow paths, of course, to reduce starvation / erratic
latencies and spurious wait queue manipulations.

Actually, one other scalability thing while I remember it:

futex_wait currently requires that the lock word is tested under the
queue spin lock (to avoid consuming a wakeup). The problem with this is
that the lock word can be a very hot cache line if you have a lot of
concurrency, so accessing it under the queue lock can increase queue
lock hold time. I would prefer if the new API were relaxed to avoid
this restriction (e.g., any wait call may consume a wakeup, so it's up
to userspace to avoid that if it is a problem).

>> - The private global futex hash table sucks for various reasons, and
>>   having 128 waiters per thread makes it orders of magnitude easier
>>   for userspace to DoS stuff with hash collisions. NUMA doesn't fix
>>   that; the per-process hashing that Thomas suggested does fix the
>>   DoS, but the non-deterministic hash collisions still seem to be a
>>   problem for real-time response, and at the other end of the scale
>>   some apps (certain databases, etc.) can have ten thousand futex
>>   waiters at once, so the birthday paradox can also lead to
>>   guaranteed (low-level) variable behaviour within a single process.
>>
>>   I know the kernel in general is not very robust against this kind
>>   of DoS/nondeterminism, but it's a bit sad to introduce new APIs
>>   with the problem still there. Yes, we could address it later, but I
>>   think it's better done first, because the solution might influence
>>   what the best syscall API is.
>>
>>   For example, the idea of using the address as the handle for the
>>   wait queue _and_ the value is very convenient, but hashing is
>>   annoying for all the above reasons, and the shared wait queue case
>>   is pretty clunky. It's also constraining in some corner cases to
>>   have the wait queue associated with the address 1:1. For example, a
>>   type of NUMA mutex might want to have per-node waitqueues
>>   associated with a lock word, and wake local-node waiters 99% of the
>>   time. Not trivial to do with futexes, and it seems to at least
>>   require bouncing more cache lines, possibly more atomics, etc.
>>
>>   Could anything else be done?
>
> I wasn't aware that userspace doing DoS is something to be concerned
> about from the kernel point of view. Is this scenario considering a
> malicious actor? If so, there are plenty of resources to be denied, so
> I'm not sure how futex could be protected from this. Or is this just a
> program that uses tons of futexes?

Both, really. AFAIKS one of the efforts that prompted the futex
modernisation work was the RT latency issues raised by Thomas in 2016,
when the per-process table was proposed.

>> I'll be burned at the stake for suggesting it, but it would be great
>> if we could use file descriptors. At least for the shared futex case;
>> maybe private futexes could use a per-process futex allocator. It
>> solves all of the above, although I'm sure it has many problems of
>> its own. It may not play so nicely with the pthread mutex API because
>> of the whole static initialiser problem, but the first futex proposal
>> did use fds. But it's an example of an alternate API.
>
> FDs and futexes don't play well together, because for futex_wait() you
> need to tell the kernel the expected value at the futex address, to
> avoid sleeping on a free lock. FD operations (poll, select) don't have
> this `value` argument, so they could sleep forever, but I'm not sure
> if you had taken this into consideration.

I had. The futex wait API would take an additional fd. The only
difference is that the waitqueue used when a sleep or wake is required
is derived from the fd, not from an address.

I think the bigger sticking points would be whether it's too
heavyweight an object to use (which could be somewhat mitigated with a
simpler ida allocator, although that's difficult to do with shared
futexes), and whether libc could sanely use fds, due to the static
initialiser problem of pthread mutexes.

Thanks,
Nick
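The fd-augmented wait described above could be pictured as the sketch
below. The futex_fd_create(), futex_wait_fd() and futex_wake_fd() calls
and their signatures are purely hypothetical, invented to illustrate
deriving the waitqueue from an fd while keeping the expected-value
check on the memory word:

    /* Purely hypothetical sketch: no such syscalls exist. The fd
     * names the waitqueue; the memory word still carries the
     * expected-value check, so sleeping on a free lock is avoided. */
    int wq = futex_fd_create(0);          /* kernel waitqueue object */
    uint32_t lock_word = 0;

    /* Wait side: sleep on the waitqueue named by wq, but only if
     * lock_word still holds the expected value 0. */
    futex_wait_fd(wq, &lock_word, 0, FUTEX_32, NULL);

    /* Wake side: no hashing of &lock_word; the fd resolves directly
     * to the waitqueue, so hash collisions cannot occur. */
    futex_wake_fd(wq, 1, FUTEX_32);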