Date: Sat, 19 Aug 2023 06:08:34 +0000
Message-ID: <20230819060915.3001568-1-jstultz@google.com>
Subject: [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5
From: John Stultz
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
    Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
    Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
    Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira, Will Deacon,
    Waiman Long, Boqun Feng, "Paul E. McKenney", kernel-team@android.com
X-Mailing-List: linux-kernel@vger.kernel.org

Getting this iteration of the patch series out to the list has taken
way longer than I had hoped!

Right before I sent out the v4 release, I had found that while the
patch series was fairly stable with lower cpu counts, I could very
easily run into crashes with higher cpu counts during bootup. Not being
able to quickly resolve the crashes I was seeing, I went ahead and sent
out v4 for review, and figured I’d debug the issue and send out v5 as a
quick follow on.

But unfortunately trying to diagnose the failures and fix them started
uncovering other crashes. Additionally, the v4 patch series wasn’t
properly bisectable, as earlier changes were missing fixes from later
in the series, which made debugging issues quite difficult.

So after playing constant whack-a-mole with bugs, I took the entire
engine apart and laid all the bits on the floor, and started slowly
re-assembling things, testing each step as I went.

This has been very laborious, as cutting small chunks off the larger
patches would run into self-caused issues that I spent much time
debugging, because the logic they needed was still in the larger patch.
But slowly I’ve managed to get almost all the fine-grained patches to
boot and run.

In fact, the patch series here is coarser than what I’ve created, as
there are a number of small “test” steps to help validate behavior I
changed, which would then be replaced by the real logic afterwards.
Including those here would just cause more work for reviewers, but if
you’re interested you can find the fine-grained tree here:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v5-6.5-rc6-fine-grained
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v5-6.5-rc6-fine-grained

Having this fine-grained tree really helped me chase down and resolve a
large number of bugs in the logic.

In some cases I’ve significantly reworked parts of the changes. For
instance, I found the previous versions of the patch had a fair amount
of logic changing the blocked_on state in the try_to_wake_up paths,
which was racy, particularly with the return-migration. I’ve moved the
return-migration logic into __schedule() so we make sure we don’t run
proxy-migrated blocked tasks on an incorrect cpu when they are
unblocked. This change has its own complexity, so feedback/suggestions
for improvements would be appreciated. But as a result, the patch
series is much more stable (particularly the earlier components).

Now, I have run into a few issues still, particularly around the
enqueuing of tasks on deactivated/sleeping owners, and especially with
the test-ww_mutex logic (ww_mutexes have been difficult as they break
assumptions that tasks are unblocked along the blocked_on chain in an
orderly fashion - instead a task in the middle of the chain may
suddenly become runnable because it got an EDEADLK from its ww_mutex).
Additionally I haven’t been able to test and debug chain migration. So
I’ve left those patches out for now, but will re-include them in the
next revision. Given the number of patches here, I suspect reviewers
won’t mind. :)

As mentioned previously, this Proxy Execution series has a long
history: First described in a paper[2] by Watkins, Straub, Niehaus,
then in patches from Peter Zijlstra, extended with lots of work by
Juri Lelli, Valentin Schneider, and Connor O'Brien. (And thank you to
Steven Rostedt for providing additional details here!)

So again, many thanks to those above, as all the credit for this series
really is due to them - while the mistakes are likely mine.

Overview:
---------
Proxy Execution is a generalized form of priority inheritance. Classic
priority inheritance works well for real-time tasks where there is a
straightforward priority order to how things are run.
But it breaks down when used between CFS or DEADLINE tasks, as there
are lots of parameters involved outside of just the task’s nice value
when selecting the next task to run (via pick_next_task()). So ideally
we want to imbue the mutex holder with all the scheduler attributes of
the blocked waiting task.

Proxy Execution does this via a few changes:
* Keeping tasks that are blocked on a mutex *on* the runqueue
* Keeping additional tracking of which mutex a task is blocked on, and
  which task holds a specific mutex.
* Special handling for when we select a blocked task to run, so that we
  instead run the mutex holder.

The first of these is the most difficult to grasp (I do get the mental
friction here: blocked tasks on the *run*queue sounds like nonsense!
Personally I like to think of the runqueue in this model more like a
“task-selection queue”).

By leaving blocked tasks on the runqueue, we allow pick_next_task() to
choose the task that should run next (even if it’s blocked waiting on a
mutex). If we do select a blocked task, we look at the task’s
blocked_on mutex and from there look at the mutex’s owner task. And in
the simple case, the task which owns the mutex is what we then choose
to run, allowing it to release the mutex.

This means that instead of just tracking “curr”, the scheduler needs to
track both the scheduler context (what was picked and all the state
used for scheduling decisions), and the execution context (what we’re
actually running).

In this way, the mutex owner is run “on behalf” of the blocked task
that was picked to run, essentially inheriting the scheduler context of
the waiting blocked task.
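
As a rough illustration of the selection step described above, here is
a small userspace sketch. It is not code from the series: struct task,
struct mutex, find_exec_ctx() and MAX_CHAIN_DEPTH are made-up names,
and all locking, migration and sleeping-owner handling is left out. It
only shows the idea of starting from the picked task (the scheduler
context) and following blocked_on -> owner links until a task that can
actually run (the execution context) is found.

```c
#include <assert.h>
#include <stddef.h>

struct task;

/* Hypothetical, simplified mutex: it only records its current owner. */
struct mutex {
	struct task *owner;		/* NULL when the mutex is free */
};

/* Hypothetical, simplified task. */
struct task {
	const char *name;
	struct mutex *blocked_on;	/* NULL when the task is not blocked */
};

/* Arbitrary bound for this sketch so a cyclic chain cannot loop forever. */
#define MAX_CHAIN_DEPTH 64

/*
 * Given the task picked by the scheduler (the scheduler context),
 * return the task that should actually be run (the execution context).
 *
 * Returns NULL if the chain cannot be resolved here: the mutex has no
 * owner, or the chain is longer than MAX_CHAIN_DEPTH (e.g. a cycle).
 */
static struct task *find_exec_ctx(struct task *selected)
{
	struct task *p = selected;
	int depth;

	assert(selected != NULL);

	for (depth = 0; depth < MAX_CHAIN_DEPTH; depth++) {
		if (!p->blocked_on)
			return p;	/* not blocked: this is what we run */

		p = p->blocked_on->owner;
		if (!p)
			return NULL;	/* mutex was released meanwhile */
	}

	return NULL;
}
```

With a waiter blocked on a mutex held by a runnable owner,
find_exec_ctx(&waiter) returns &owner, while the waiter stays the task
whose scheduling parameters were used for the pick. A task that is not
blocked is simply returned as-is.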

As Connor outlined in a previous submission of this patch series, this
raises a number of complicated situations: The mutex owner might itself
be blocked on another mutex, or it could be sleeping, running on a
different CPU, in the process of migrating between CPUs, etc.

But the functionality provided by Proxy Execution is useful, as in
Android we have a large number of cases where we are seeing priority
inversion (not unbounded, but longer than we’d like) between
“foreground” and “background” SCHED_NORMAL applications, so having a
generalized solution would be very useful.

New in v5:
---------
* Broke the patch series up into fine-grained changes
* Heavily reworked the return-migration handling by moving it out of
  try_to_wake_up() and into __schedule().
* Reworked blocked_on tracking logic (so it is handled completely in
  mutex code), and added a blocked_on_waking value, so we can
  distinguish when tasks have not acquired the mutex, but need to wake
  up to try to do so.
* Resolved lost-wakeup issues caused by the wake_qs being managed in
  the ctx structures instead of being on the stack. I went back to an
  earlier version of that patch from Juri, though re-adding some fixes
  from Connor’s version.
* Resolved null se pointer crashes seen at bootup, caused by incorrect
  put_prev_task() calls
* Fixes to the test-ww_mutex test logic that was causing livelocks &
  hangs. (Sent as a separate series)
* And more I’ve likely forgotten

Also, I know Peter didn’t like the blocked_on wrappers, so I dropped
them last time, but I found them (and the debug checks within) crucial
to working out issues in this patch series. I’m fine to consider
dropping them later if they are still objectionable, but their utility
was too high at this point to get rid of them.
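
To make the blocked_on / blocked_on_waking distinction and the kind of
checking the wrappers provide a bit more concrete, here is a userspace
sketch. It is an illustration only, not the code in the patches: the
struct layout and the helper names (set_blocked_on(),
set_blocked_on_waking(), clear_blocked_on(), task_is_blocked()) are
assumptions made for this example, plain assert() stands in for the
debug checks, and the locking that serializes these updates is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task;

/* Hypothetical, simplified mutex. */
struct mutex {
	struct task *owner;
};

/* Hypothetical, simplified task state for this sketch. */
struct task {
	struct mutex *blocked_on;	/* mutex we are waiting on, or NULL */
	bool blocked_on_waking;		/* woken to retry, mutex not yet acquired */
};

/* Task starts waiting on @m; it must not already be waiting on a mutex. */
static void set_blocked_on(struct task *p, struct mutex *m)
{
	assert(m != NULL);
	assert(p->blocked_on == NULL);
	p->blocked_on = m;
	p->blocked_on_waking = false;
}

/* Task is woken to retry acquiring @m; it still does not own it. */
static void set_blocked_on_waking(struct task *p, struct mutex *m)
{
	assert(p->blocked_on == m);
	p->blocked_on_waking = true;
}

/* Task acquired @m (or gave up on it); it no longer waits on anything. */
static void clear_blocked_on(struct task *p, struct mutex *m)
{
	assert(p->blocked_on == m);
	p->blocked_on = NULL;
	p->blocked_on_waking = false;
}

/* True only while the task waits on a mutex and has not been woken. */
static bool task_is_blocked(const struct task *p)
{
	return p->blocked_on != NULL && !p->blocked_on_waking;
}
```

In this sketch a task that has been woken to retry still has blocked_on
set, but task_is_blocked() reports false, so it would be run directly
rather than being treated as a waiter to proxy for. The asserts catch
mismatched set/clear pairs, which is the sort of misuse the debug checks
are meant to flag.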

Issues still to address:
------------------------
* As mentioned above, the deactivated owner handling (where we
  deactivate waiting tasks and enqueue them onto a list, then
  reactivate them when the owner wakes up) has some major issues with
  ww_mutexes. Additionally I think there are other races possible:
  since everyone is using a single blocked_entry list_head, anyone who
  wakes up may think it’s at the front of the chain and wake everyone
  else up (not just those waiting on it). I believe if two processes
  are woken at the same time, they could fight trying to consume and
  activate tasks onto different cpus.
* Still need to validate and re-add the chain migration patches.
* Seen some rare crashes around rt runqueue accounting. Will be
  investigating this.
* “rq_selected()” naming. Peter doesn’t like it, but I’ve not thought
  of a better name. Open to suggestions.
* As discussed at OSPM[1], I want to split pick_next_task() up into two
  phases, selecting and setting the next task, as currently
  pick_next_task() assumes the returned task will be run, which results
  in various side-effects in sched class logic when it’s run.
* CFS load balancing. Blocked tasks may carry forward load (PELT) to
  the lock owner's CPU, so that CPU may look like it is overloaded.
* I still want to push down the split scheduler and execution context
  awareness further through the scheduling code, as lots of logic still
  assumes there’s only a single “rq->curr” task.
* Optimization to avoid migrating blocked tasks (allowing for
  optimistic spinning) if the runnable lock-owner at the end of the
  blocked_on chain is already running (though this is difficult, as the
  locking rules to traverse the blocked_on chain across run queues
  aren’t trivial).

Performance:
------------
This patch series switches mutexes to use handoff mode rather than
optimistic spinning. This is a potential concern where locks are under
high contention.
However, earlier performance analysis (on both x86 and mobile devices)
did not see major regressions. That said, Chenyu did report a
regression[3], which I’ll need to look further into. I also briefly
re-tested with this v5 series and saw some average latencies grow,
suggesting the changes to return-migration (and extra locking) have
some impact. I’ll be digging more there.

As mentioned above, there may be some additional optimizations that can
help here, but my focus is on getting the code working well before I
spend time optimizing.

Review and feedback would be greatly appreciated!

If folks find it easier to test/tinker with, this patch series can also
be found here:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v5-6.5-rc6
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v5-6.5-rc6

Thanks so much!
-john

[1] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
[2] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
[3] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Youssef Esmat
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: kernel-team@android.com

John Stultz (8):
  locking/mutex: Removes wakeups from under mutex::wait_lock
  locking/mutex: Split blocked_on logic into two states (blocked_on and
    blocked_on_waking)
  sched: Fix runtime accounting w/ split exec & sched contexts
  sched: Unnest ttwu_runnable in prep for proxy-execution
  sched: Split out __sched() deactivate task logic into a helper
  sched: Add a very simple proxy() function
  sched: Add proxy deactivate helper
  sched: Handle blocked-waiter migration (and return migration)

Juri Lelli (2):
  locking/mutex: make mutex::wait_lock irq safe
  locking/mutex: Expose __mutex_owner()

Peter Zijlstra (7):
  sched: Unify runtime accounting across classes
  locking/mutex: Rework task_struct::blocked_on
  locking/mutex: Add task_struct::blocked_lock to serialize changes to
    the blocked_on state
  locking/mutex: Switch to mutex handoffs for CONFIG_PROXY_EXEC
  sched: Split scheduler execution context
  sched: Start blocked_on chain processing in proxy()
  sched: Add blocked_donor link to task for smarter mutex handoffs

Valentin Schneider (2):
  locking/mutex: Add p->blocked_on wrappers for correctness checks
  sched: Fix proxy/current (push,pull)ability

 include/linux/sched.h        |  33 ++-
 init/Kconfig                 |   7 +
 init/init_task.c             |   1 +
 kernel/Kconfig.locks         |   2 +-
 kernel/fork.c                |   6 +-
 kernel/locking/mutex-debug.c |   9 +-
 kernel/locking/mutex.c       | 128 ++++++---
 kernel/locking/mutex.h       |  25 ++
 kernel/locking/ww_mutex.h    |  72 +++--
 kernel/sched/core.c          | 522 +++++++++++++++++++++++++++++++----
 kernel/sched/deadline.c      |  50 ++--
 kernel/sched/fair.c          | 104 +++++--
 kernel/sched/rt.c            |  78 +++---
 kernel/sched/sched.h         |  61 +++-
 kernel/sched/stop_task.c     |  13 +-
 15 files changed, 868 insertions(+), 243 deletions(-)

-- 
2.42.0.rc1.204.g551eb34607-goog