From mboxrd@z Thu Jan 1 00:00:00 1970 Message-ID: Date: Fri, 2 Jan 2026 12:38:58 +0000 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Content-Language: 
en-GB From: Ryan Roberts To: Mel Gorman , "Peter Zijlstra (Intel)" Cc: x86@kernel.org, linux-kernel@vger.kernel.org, Aishwarya TCV References: <20251112122521.1331238-3-mgorman@techsingularity.net> <176339661525.498.7070393041762616565.tip-bot2@tip-bot2> <4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com> In-Reply-To: <4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Hi, I appreciate I sent this report just before Xmas so most likely you haven't had a chance to look, but wanted to bring it back to the top of your mailbox in case it was missed. Happy new year! Thanks, Ryan On 22/12/2025 10:57, Ryan Roberts wrote: > Hi Mel, Peter, > > We are building out a kernel performance regression monitoring lab at Arm, and > I've noticed some fairly large performance regressions in real-world workloads, > for which bisection has fingered this patch. > > We are looking at performance changes between v6.18 and v6.19-rc1, and by > reverting this patch on top of v6.19-rc1 many regressions are resolved. (We plan > to move the testing to linux-next over the next couple of quarters so hopefully > we will be able to deliver this sort of news prior to merging in future). > > All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean > statistically significant regression/improvement, where "statistically > significant" means the 95% confidence intervals do not overlap. > > The below is a large scale mysql workload, running across 2 AWS instances (a > load generator and the mysql server). We have a partner for whom this is a very > important workload. Performance regresses by 1.3% between 6.18 and 6.19-rc1 > (where the patch is added). 
By reverting the patch, the regression is not only > fixed but performance is now nearly 6% better than v6.18: > > +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ > | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert-next-buddy | > +=================================+====================================================+=================+==============+===================+ > | repro-collection/mysql-workload | db transaction rate (transactions/min) | 646267.33 | (R) -1.33% | (I) 5.87% | > | | new order rate (orders/min) | 213256.50 | (R) -1.32% | (I) 5.87% | > +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ > > > Next are a bunch of benchmarks all running on a single system. specjbb is the > SPEC Java Business Benchmark. The mysql one is the same as above but this time > both loadgen and server are on the same system. pgbench is the PostgreSQL > benchmark. > > I'm showing hackbench for completeness, but I don't consider it a high priority > issue. > > Interestingly, nginx improves significantly with the patch. 
> > +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ > | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert-next-buddy | > +=================================+====================================================+=================+==============+===================+ > | specjbb/composite | critical-jOPS (jOPS) | 94700.00 | (R) -5.10% | -0.90% | > | | max-jOPS (jOPS) | 113984.50 | (R) -3.90% | -0.65% | > +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ > | repro-collection/mysql-workload | db transaction rate (transactions/min) | 245438.25 | (R) -3.88% | -0.13% | > | | new order rate (orders/min) | 80985.75 | (R) -3.78% | -0.07% | > +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ > | pts/pgbench | Scale: 1 Clients: 1 Read Only (TPS) | 63124.00 | (I) 2.90% | 0.74% | > | | Scale: 1 Clients: 1 Read Only - Latency (ms) | 0.016 | (I) 5.49% | 1.05% | > | | Scale: 1 Clients: 1 Read Write (TPS) | 974.92 | 0.11% | -0.08% | > | | Scale: 1 Clients: 1 Read Write - Latency (ms) | 1.03 | 0.12% | -0.06% | > | | Scale: 1 Clients: 250 Read Only (TPS) | 1915931.58 | (R) -2.25% | (I) 2.12% | > | | Scale: 1 Clients: 250 Read Only - Latency (ms) | 0.13 | (R) -2.37% | (I) 2.09% | > | | Scale: 1 Clients: 250 Read Write (TPS) | 855.67 | -1.36% | -0.14% | > | | Scale: 1 Clients: 250 Read Write - Latency (ms) | 292.39 | -1.31% | -0.08% | > | | Scale: 1 Clients: 1000 Read Only (TPS) | 1534130.08 | (R) -11.37% | 0.08% | > | | Scale: 1 Clients: 1000 Read Only - Latency (ms) | 0.65 | (R) -11.38% | 0.08% | > | | Scale: 1 Clients: 1000 Read Write (TPS) | 578.75 | -1.11% | 2.15% | > | | Scale: 1 Clients: 1000 Read Write - Latency (ms) | 1736.98 | -1.26% | 2.47% | > | | Scale: 100 Clients: 1 Read Only (TPS) | 
57170.33 | 1.68% | 0.10% | > | | Scale: 100 Clients: 1 Read Only - Latency (ms) | 0.018 | 1.94% | 0.00% | > | | Scale: 100 Clients: 1 Read Write (TPS) | 836.58 | -0.37% | -0.41% | > | | Scale: 100 Clients: 1 Read Write - Latency (ms) | 1.20 | -0.37% | -0.40% | > | | Scale: 100 Clients: 250 Read Only (TPS) | 1773440.67 | -1.61% | 1.67% | > | | Scale: 100 Clients: 250 Read Only - Latency (ms) | 0.14 | -1.40% | 1.56% | > | | Scale: 100 Clients: 250 Read Write (TPS) | 5505.50 | -0.17% | -0.86% | > | | Scale: 100 Clients: 250 Read Write - Latency (ms) | 45.42 | -0.17% | -0.85% | > | | Scale: 100 Clients: 1000 Read Only (TPS) | 1393037.50 | (R) -10.31% | -0.19% | > | | Scale: 100 Clients: 1000 Read Only - Latency (ms) | 0.72 | (R) -10.30% | -0.17% | > | | Scale: 100 Clients: 1000 Read Write (TPS) | 5085.92 | 0.27% | 0.07% | > | | Scale: 100 Clients: 1000 Read Write - Latency (ms) | 196.79 | 0.23% | 0.05% | > +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ > | mmtests/hackbench | hackbench-process-pipes-1 (seconds) | 0.14 | -1.51% | -1.05% | > | | hackbench-process-pipes-4 (seconds) | 0.44 | (I) 6.49% | (I) 5.42% | > | | hackbench-process-pipes-7 (seconds) | 0.68 | (R) -18.36% | (I) 3.40% | > | | hackbench-process-pipes-12 (seconds) | 1.24 | (R) -19.89% | -0.45% | > | | hackbench-process-pipes-21 (seconds) | 1.81 | (R) -8.41% | -1.22% | > | | hackbench-process-pipes-30 (seconds) | 2.39 | (R) -9.06% | (R) -2.95% | > | | hackbench-process-pipes-48 (seconds) | 3.18 | (R) -11.68% | (R) -4.10% | > | | hackbench-process-pipes-79 (seconds) | 3.84 | (R) -9.74% | (R) -3.25% | > | | hackbench-process-pipes-110 (seconds) | 4.68 | (R) -6.57% | (R) -2.12% | > | | hackbench-process-pipes-141 (seconds) | 5.75 | (R) -5.86% | (R) -3.44% | > | | hackbench-process-pipes-172 (seconds) | 6.80 | (R) -4.28% | (R) -2.81% | > | | hackbench-process-pipes-203 (seconds) | 7.94 | (R) -4.01% | (R) -3.00% | > 
| | hackbench-process-pipes-234 (seconds) | 9.02 | (R) -3.52% | (R) -2.81% | > | | hackbench-process-pipes-256 (seconds) | 9.78 | (R) -3.24% | (R) -2.81% | > | | hackbench-process-sockets-1 (seconds) | 0.29 | 0.50% | 0.26% | > | | hackbench-process-sockets-4 (seconds) | 0.76 | (I) 17.44% | (I) 16.31% | > | | hackbench-process-sockets-7 (seconds) | 1.16 | (I) 12.10% | (I) 9.78% | > | | hackbench-process-sockets-12 (seconds) | 1.86 | (I) 10.19% | (I) 9.83% | > | | hackbench-process-sockets-21 (seconds) | 3.12 | (I) 9.38% | (I) 9.20% | > | | hackbench-process-sockets-30 (seconds) | 4.30 | (I) 6.43% | (I) 6.11% | > | | hackbench-process-sockets-48 (seconds) | 6.58 | (I) 3.00% | (I) 2.19% | > | | hackbench-process-sockets-79 (seconds) | 10.56 | (I) 2.87% | (I) 3.31% | > | | hackbench-process-sockets-110 (seconds) | 13.85 | -1.15% | (I) 2.33% | > | | hackbench-process-sockets-141 (seconds) | 19.23 | -1.40% | (I) 14.53% | > | | hackbench-process-sockets-172 (seconds) | 26.33 | (I) 3.52% | (I) 30.37% | > | | hackbench-process-sockets-203 (seconds) | 30.27 | 1.10% | (I) 27.20% | > | | hackbench-process-sockets-234 (seconds) | 35.12 | 1.60% | (I) 28.24% | > | | hackbench-process-sockets-256 (seconds) | 38.74 | 0.70% | (I) 28.74% | > | | hackbench-thread-pipes-1 (seconds) | 0.17 | -1.32% | -0.76% | > | | hackbench-thread-pipes-4 (seconds) | 0.45 | (I) 6.91% | (I) 7.64% | > | | hackbench-thread-pipes-7 (seconds) | 0.74 | (R) -7.51% | (I) 5.26% | > | | hackbench-thread-pipes-12 (seconds) | 1.32 | (R) -8.40% | (I) 2.32% | > | | hackbench-thread-pipes-21 (seconds) | 1.95 | (R) -2.95% | 0.91% | > | | hackbench-thread-pipes-30 (seconds) | 2.50 | (R) -4.61% | 1.47% | > | | hackbench-thread-pipes-48 (seconds) | 3.32 | (R) -5.45% | (I) 2.15% | > | | hackbench-thread-pipes-79 (seconds) | 4.04 | (R) -5.53% | 1.85% | > | | hackbench-thread-pipes-110 (seconds) | 4.94 | (R) -2.33% | 1.51% | > | | hackbench-thread-pipes-141 (seconds) | 6.04 | (R) -2.47% | 1.15% | > | | 
hackbench-thread-pipes-172 (seconds) | 7.15 | -0.91% | 1.48% | > | | hackbench-thread-pipes-203 (seconds) | 8.31 | -1.29% | 0.77% | > | | hackbench-thread-pipes-234 (seconds) | 9.49 | -1.03% | 0.77% | > | | hackbench-thread-pipes-256 (seconds) | 10.30 | -0.80% | 0.42% | > | | hackbench-thread-sockets-1 (seconds) | 0.31 | 0.05% | -0.05% | > | | hackbench-thread-sockets-4 (seconds) | 0.79 | (I) 18.91% | (I) 16.82% | > | | hackbench-thread-sockets-7 (seconds) | 1.16 | (I) 12.57% | (I) 10.63% | > | | hackbench-thread-sockets-12 (seconds) | 1.87 | (I) 12.65% | (I) 12.26% | > | | hackbench-thread-sockets-21 (seconds) | 3.16 | (I) 11.62% | (I) 12.74% | > | | hackbench-thread-sockets-30 (seconds) | 4.32 | (I) 7.35% | (I) 8.89% | > | | hackbench-thread-sockets-48 (seconds) | 6.45 | (I) 2.69% | (I) 3.06% | > | | hackbench-thread-sockets-79 (seconds) | 10.15 | (I) 3.30% | 1.98% | > | | hackbench-thread-sockets-110 (seconds) | 13.45 | -0.25% | (I) 3.68% | > | | hackbench-thread-sockets-141 (seconds) | 17.87 | (R) -2.18% | (I) 8.46% | > | | hackbench-thread-sockets-172 (seconds) | 24.38 | 1.02% | (I) 24.33% | > | | hackbench-thread-sockets-203 (seconds) | 28.38 | -0.99% | (I) 24.20% | > | | hackbench-thread-sockets-234 (seconds) | 32.75 | -0.42% | (I) 24.35% | > | | hackbench-thread-sockets-256 (seconds) | 36.49 | -1.30% | (I) 26.22% | > +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ > | pts/nginx | Connections: 200 (Requests Per Second) | 252332.60 | (I) 17.54% | -0.53% | > | | Connections: 1000 (Requests Per Second) | 248591.29 | (I) 20.41% | 0.10% | > +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ > > All of the benchmarks have been run multiple times and I have high confidence in > the results. I can share min/mean/max/stdev/ci95 stats if that's helpful though. 
> > I'm not providing the data, but we also see similar regressions on AmpereOne > (another arm64 server system). And we have seen a few functional tests (kvm > selftests) that have started to time out due to this patch slowing things down on > arm64. > > I'm hoping you can advise on the best way to proceed? We have a bigger library > than what I'm showing, but the only improvement I see due to this patch is > nginx. So based on that, my preference would be to revert the patch upstream > until the issues can be worked out. I'm guessing the story is quite different > for x86 though? > > Thanks, > Ryan > > > > On 17/11/2025 16:23, tip-bot2 for Mel Gorman wrote: >> The following commit has been merged into the sched/core branch of tip: >> >> Commit-ID: e837456fdca81899a3c8e47b3fd39e30eae6e291 >> Gitweb: https://git.kernel.org/tip/e837456fdca81899a3c8e47b3fd39e30eae6e291 >> Author: Mel Gorman >> AuthorDate: Wed, 12 Nov 2025 12:25:21 >> Committer: Peter Zijlstra >> CommitterDate: Mon, 17 Nov 2025 17:13:15 +01:00 >> >> sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals >> >> Reimplement NEXT_BUDDY preemption to take into account the deadline and >> eligibility of the wakee with respect to the waker. In the event >> multiple buddies could be considered, the one with the earliest deadline >> is selected. >> >> Sync wakeups are treated differently to every other type of wakeup. The >> WF_SYNC assumption is that the waker promises to sleep in the very near >> future. This is violated in enough cases that WF_SYNC should be treated >> as a suggestion instead of a contract. If a waker does go to sleep almost >> immediately then the delay in wakeup is negligible. In other cases, it's >> throttled based on the accumulated runtime of the waker so there is a >> chance that some batched wakeups have been issued before preemption. >> >> For all other wakeups, preemption happens if the wakee has an earlier >> deadline than the waker and is eligible to run. 
>> >> While many workloads were tested, the two main targets were a modified >> dbench4 benchmark and hackbench because they are on opposite ends of the >> spectrum -- one prefers throughput by avoiding preemption and the other >> relies on preemption. >> >> First is the dbench throughput data; it is a poor metric but >> it is the default one. The test machine is a 2-socket machine and the >> backing filesystem is XFS as a lot of the IO work is dispatched to kernel >> threads. It's important to note that these results are not representative >> across all machines, especially Zen machines, as different bottlenecks >> are exposed on different machines and filesystems. >> >> dbench4 Throughput (misleading but traditional) >> 6.18-rc1 6.18-rc1 >> vanilla sched-preemptnext-v5 >> Hmean 1 1268.80 ( 0.00%) 1269.74 ( 0.07%) >> Hmean 4 3971.74 ( 0.00%) 3950.59 ( -0.53%) >> Hmean 7 5548.23 ( 0.00%) 5420.08 ( -2.31%) >> Hmean 12 7310.86 ( 0.00%) 7165.57 ( -1.99%) >> Hmean 21 8874.53 ( 0.00%) 9149.04 ( 3.09%) >> Hmean 30 9361.93 ( 0.00%) 10530.04 ( 12.48%) >> Hmean 48 9540.14 ( 0.00%) 11820.40 ( 23.90%) >> Hmean 79 9208.74 ( 0.00%) 12193.79 ( 32.42%) >> Hmean 110 8573.12 ( 0.00%) 11933.72 ( 39.20%) >> Hmean 141 7791.33 ( 0.00%) 11273.90 ( 44.70%) >> Hmean 160 7666.60 ( 0.00%) 10768.72 ( 40.46%) >> >> As throughput is misleading, the benchmark is modified to use a short >> loadfile and report the completion time in milliseconds. 
>> >> dbench4 Loadfile Execution Time >> 6.18-rc1 6.18-rc1 >> vanilla sched-preemptnext-v5 >> Amean 1 14.62 ( 0.00%) 14.69 ( -0.46%) >> Amean 4 18.76 ( 0.00%) 18.85 ( -0.45%) >> Amean 7 23.71 ( 0.00%) 24.38 ( -2.82%) >> Amean 12 31.25 ( 0.00%) 31.87 ( -1.97%) >> Amean 21 45.12 ( 0.00%) 43.69 ( 3.16%) >> Amean 30 61.07 ( 0.00%) 54.33 ( 11.03%) >> Amean 48 95.91 ( 0.00%) 77.22 ( 19.49%) >> Amean 79 163.38 ( 0.00%) 123.08 ( 24.66%) >> Amean 110 243.91 ( 0.00%) 175.11 ( 28.21%) >> Amean 141 343.47 ( 0.00%) 239.10 ( 30.39%) >> Amean 160 401.15 ( 0.00%) 283.73 ( 29.27%) >> Stddev 1 0.52 ( 0.00%) 0.51 ( 2.45%) >> Stddev 4 1.36 ( 0.00%) 1.30 ( 4.04%) >> Stddev 7 1.88 ( 0.00%) 1.87 ( 0.72%) >> Stddev 12 3.06 ( 0.00%) 2.45 ( 19.83%) >> Stddev 21 5.78 ( 0.00%) 3.87 ( 33.06%) >> Stddev 30 9.85 ( 0.00%) 5.25 ( 46.76%) >> Stddev 48 22.31 ( 0.00%) 8.64 ( 61.27%) >> Stddev 79 35.96 ( 0.00%) 18.07 ( 49.76%) >> Stddev 110 59.04 ( 0.00%) 30.93 ( 47.61%) >> Stddev 141 85.38 ( 0.00%) 40.93 ( 52.06%) >> Stddev 160 96.38 ( 0.00%) 39.72 ( 58.79%) >> >> That is still looking good and the variance is reduced quite a bit. >> Finally, fairness is a concern so the next report tracks how many >> milliseconds it takes for all clients to complete a workfile. This >> one is tricky because dbench makes no effort to synchronise clients so >> the durations at benchmark start time differ substantially from typical >> runtimes. This problem could be mitigated by warming up the benchmark >> for a number of minutes but it's a matter of opinion whether that >> counts as an evasion of inconvenient results. 
>> >> dbench4 All Clients Loadfile Execution Time >> 6.18-rc1 6.18-rc1 >> vanilla sched-preemptnext-v5 >> Amean 1 15.06 ( 0.00%) 15.07 ( -0.03%) >> Amean 4 603.81 ( 0.00%) 524.29 ( 13.17%) >> Amean 7 855.32 ( 0.00%) 1331.07 ( -55.62%) >> Amean 12 1890.02 ( 0.00%) 2323.97 ( -22.96%) >> Amean 21 3195.23 ( 0.00%) 2009.29 ( 37.12%) >> Amean 30 13919.53 ( 0.00%) 4579.44 ( 67.10%) >> Amean 48 25246.07 ( 0.00%) 5705.46 ( 77.40%) >> Amean 79 29701.84 ( 0.00%) 15509.26 ( 47.78%) >> Amean 110 22803.03 ( 0.00%) 23782.08 ( -4.29%) >> Amean 141 36356.07 ( 0.00%) 25074.20 ( 31.03%) >> Amean 160 17046.71 ( 0.00%) 13247.62 ( 22.29%) >> Stddev 1 0.47 ( 0.00%) 0.49 ( -3.74%) >> Stddev 4 395.24 ( 0.00%) 254.18 ( 35.69%) >> Stddev 7 467.24 ( 0.00%) 764.42 ( -63.60%) >> Stddev 12 1071.43 ( 0.00%) 1395.90 ( -30.28%) >> Stddev 21 1694.50 ( 0.00%) 1204.89 ( 28.89%) >> Stddev 30 7945.63 ( 0.00%) 2552.59 ( 67.87%) >> Stddev 48 14339.51 ( 0.00%) 3227.55 ( 77.49%) >> Stddev 79 16620.91 ( 0.00%) 8422.15 ( 49.33%) >> Stddev 110 12912.15 ( 0.00%) 13560.95 ( -5.02%) >> Stddev 141 20700.13 ( 0.00%) 14544.51 ( 29.74%) >> Stddev 160 9079.16 ( 0.00%) 7400.69 ( 18.49%) >> >> This is more of a mixed bag but it at least shows that fairness >> is not crippled. >> >> The hackbench results are more neutral but this is still important. >> It's possible to boost the dbench figures by a large amount but only by >> crippling the performance of a workload like hackbench. The WF_SYNC >> behaviour is important for these workloads and is why the WF_SYNC >> changes are not a separate patch. 
>> >> hackbench-process-pipes >> 6.18-rc1 6.18-rc1 >> vanilla sched-preemptnext-v5 >> Amean 1 0.2657 ( 0.00%) 0.2150 ( 19.07%) >> Amean 4 0.6107 ( 0.00%) 0.6060 ( 0.76%) >> Amean 7 0.7923 ( 0.00%) 0.7440 ( 6.10%) >> Amean 12 1.1500 ( 0.00%) 1.1263 ( 2.06%) >> Amean 21 1.7950 ( 0.00%) 1.7987 ( -0.20%) >> Amean 30 2.3207 ( 0.00%) 2.5053 ( -7.96%) >> Amean 48 3.5023 ( 0.00%) 3.9197 ( -11.92%) >> Amean 79 4.8093 ( 0.00%) 5.2247 ( -8.64%) >> Amean 110 6.1160 ( 0.00%) 6.6650 ( -8.98%) >> Amean 141 7.4763 ( 0.00%) 7.8973 ( -5.63%) >> Amean 172 8.9560 ( 0.00%) 9.3593 ( -4.50%) >> Amean 203 10.4783 ( 0.00%) 10.8347 ( -3.40%) >> Amean 234 12.4977 ( 0.00%) 13.0177 ( -4.16%) >> Amean 265 14.7003 ( 0.00%) 15.5630 ( -5.87%) >> Amean 296 16.1007 ( 0.00%) 17.4023 ( -8.08%) >> >> Processes using pipes are impacted but the variance (not presented) indicates >> it's close to noise and the results are not always reproducible. If executed >> across multiple reboots, it may show neutral or small gains so the worst >> measured results are presented. >> >> Hackbench using sockets is more reliably neutral as the wakeup >> mechanisms are different between sockets and pipes. 
>> >> hackbench-process-sockets >> 6.18-rc1 6.18-rc1 >> vanilla sched-preemptnext-v2 >> Amean 1 0.3073 ( 0.00%) 0.3263 ( -6.18%) >> Amean 4 0.7863 ( 0.00%) 0.7930 ( -0.85%) >> Amean 7 1.3670 ( 0.00%) 1.3537 ( 0.98%) >> Amean 12 2.1337 ( 0.00%) 2.1903 ( -2.66%) >> Amean 21 3.4683 ( 0.00%) 3.4940 ( -0.74%) >> Amean 30 4.7247 ( 0.00%) 4.8853 ( -3.40%) >> Amean 48 7.6097 ( 0.00%) 7.8197 ( -2.76%) >> Amean 79 14.7957 ( 0.00%) 16.1000 ( -8.82%) >> Amean 110 21.3413 ( 0.00%) 21.9997 ( -3.08%) >> Amean 141 29.0503 ( 0.00%) 29.0353 ( 0.05%) >> Amean 172 36.4660 ( 0.00%) 36.1433 ( 0.88%) >> Amean 203 39.7177 ( 0.00%) 40.5910 ( -2.20%) >> Amean 234 42.1120 ( 0.00%) 43.5527 ( -3.42%) >> Amean 265 45.7830 ( 0.00%) 50.0560 ( -9.33%) >> Amean 296 50.7043 ( 0.00%) 54.3657 ( -7.22%) >> >> As schbench has been mentioned in numerous bugs recently, the results >> are interesting. A test case that represents the default schbench >> behaviour is >> >> schbench Wakeup Latency (usec) >> 6.18.0-rc1 6.18.0-rc1 >> vanilla sched-preemptnext-v5 >> Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 ( 16.28%) >> Amean Wakeup-90th-80 46.56 ( 0.00%) 19.78 ( 57.52%) >> Amean Wakeup-99th-80 119.61 ( 0.00%) 89.94 ( 24.80%) >> Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 328.22 ( 89.72%) >> >> schbench Requests Per Second (ops/sec) >> 6.18.0-rc1 6.18.0-rc1 >> vanilla sched-preemptnext-v5 >> Hmean RPS-20th-80 8900.91 ( 0.00%) 9176.78 ( 3.10%) >> Hmean RPS-50th-80 8987.41 ( 0.00%) 9217.89 ( 2.56%) >> Hmean RPS-90th-80 9123.73 ( 0.00%) 9273.25 ( 1.64%) >> Hmean RPS-max-80 9193.50 ( 0.00%) 9301.47 ( 1.17%) >> >> Signed-off-by: Mel Gorman >> Signed-off-by: Peter Zijlstra (Intel) >> Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net >> --- >> kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++++------- >> 1 file changed, 130 insertions(+), 22 deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 071e07f..c6e5c64 100644 >> --- a/kernel/sched/fair.c >> 
+++ b/kernel/sched/fair.c >> @@ -929,6 +929,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect) >> if (cfs_rq->nr_queued == 1) >> return curr && curr->on_rq ? curr : se; >> >> + /* >> + * Picking the ->next buddy will affect latency but not fairness. >> + */ >> + if (sched_feat(PICK_BUDDY) && >> + cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { >> + /* ->next will never be delayed */ >> + WARN_ON_ONCE(cfs_rq->next->sched_delayed); >> + return cfs_rq->next; >> + } >> + >> if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr))) >> curr = NULL; >> >> @@ -1167,6 +1177,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se) >> return delta_exec; >> } >> >> +static void set_next_buddy(struct sched_entity *se); >> + >> /* >> * Used by other classes to account runtime. >> */ >> @@ -5466,16 +5478,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq) >> { >> struct sched_entity *se; >> >> - /* >> - * Picking the ->next buddy will affect latency but not fairness. >> - */ >> - if (sched_feat(PICK_BUDDY) && >> - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { >> - /* ->next will never be delayed */ >> - WARN_ON_ONCE(cfs_rq->next->sched_delayed); >> - return cfs_rq->next; >> - } >> - >> se = pick_eevdf(cfs_rq); >> if (se->sched_delayed) { >> dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); >> @@ -6988,8 +6990,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) >> hrtick_update(rq); >> } >> >> -static void set_next_buddy(struct sched_entity *se); >> - >> /* >> * Basically dequeue_task_fair(), except it can deal with dequeue_entity() >> * failing half-way through and resume the dequeue later. >> @@ -8676,16 +8676,81 @@ static void set_next_buddy(struct sched_entity *se) >> } >> } >> >> +enum preempt_wakeup_action { >> + PREEMPT_WAKEUP_NONE, /* No preemption. */ >> + PREEMPT_WAKEUP_SHORT, /* Ignore slice protection. */ >> + PREEMPT_WAKEUP_PICK, /* Let __pick_eevdf() decide. 
*/ >> + PREEMPT_WAKEUP_RESCHED, /* Force reschedule. */ >> +}; >> + >> +static inline bool >> +set_preempt_buddy(struct cfs_rq *cfs_rq, int wake_flags, >> + struct sched_entity *pse, struct sched_entity *se) >> +{ >> + /* >> + * Keep existing buddy if the deadline is sooner than pse. >> + * The older buddy may be cache cold and completely unrelated >> + * to the current wakeup but that is unpredictable whereas >> + * obeying the deadline is more in line with EEVDF objectives. >> + */ >> + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) >> + return false; >> + >> + set_next_buddy(pse); >> + return true; >> +} >> + >> +/* >> + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not >> + * strictly enforced because the hint is either misunderstood or >> + * multiple tasks must be woken up. >> + */ >> +static inline enum preempt_wakeup_action >> +preempt_sync(struct rq *rq, int wake_flags, >> + struct sched_entity *pse, struct sched_entity *se) >> +{ >> + u64 threshold, delta; >> + >> + /* >> + * WF_SYNC without WF_TTWU is not expected so warn if it happens even >> + * though it is likely harmless. >> + */ >> + WARN_ON_ONCE(!(wake_flags & WF_TTWU)); >> + >> + threshold = sysctl_sched_migration_cost; >> + delta = rq_clock_task(rq) - se->exec_start; >> + if ((s64)delta < 0) >> + delta = 0; >> + >> + /* >> + * WF_RQ_SELECTED implies the tasks are stacking on a CPU when they >> + * could run on other CPUs. Reduce the threshold before preemption is >> + * allowed to an arbitrary lower value as it is more likely (but not >> + * guaranteed) the waker requires the wakee to finish. >> + */ >> + if (wake_flags & WF_RQ_SELECTED) >> + threshold >>= 2; >> + >> + /* >> + * As WF_SYNC is not strictly obeyed, allow some runtime for batch >> + * wakeups to be issued. 
>> + */ >> + if (entity_before(pse, se) && delta >= threshold) >> + return PREEMPT_WAKEUP_RESCHED; >> + >> + return PREEMPT_WAKEUP_NONE; >> +} >> + >> /* >> * Preempt the current task with a newly woken task if needed: >> */ >> static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags) >> { >> + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK; >> struct task_struct *donor = rq->donor; >> struct sched_entity *se = &donor->se, *pse = &p->se; >> struct cfs_rq *cfs_rq = task_cfs_rq(donor); >> int cse_is_idle, pse_is_idle; >> - bool do_preempt_short = false; >> >> if (unlikely(se == pse)) >> return; >> @@ -8699,10 +8764,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int >> if (task_is_throttled(p)) >> return; >> >> - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { >> - set_next_buddy(pse); >> - } >> - >> /* >> * We can come here with TIF_NEED_RESCHED already set from new task >> * wake up path. >> */ >> @@ -8734,7 +8795,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int >> * When non-idle entity preempt an idle entity, >> * don't give idle entity slice protection. >> */ >> - do_preempt_short = true; >> + preempt_action = PREEMPT_WAKEUP_SHORT; >> goto preempt; >> } >> >> @@ -8753,21 +8814,68 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int >> * If @p has a shorter slice than current and @p is eligible, override >> * current's slice protection in order to allow preemption. >> */ >> - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); >> + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { >> + preempt_action = PREEMPT_WAKEUP_SHORT; >> + goto pick; >> + } >> >> /* >> + * Ignore wakee preemption on WF_FORK as it is less likely that >> + * there is shared data as exec often follows fork. 
Do not >> + * preempt for tasks that are sched_delayed as it would violate >> + * EEVDF to forcibly queue an ineligible task. >> + */ >> + if ((wake_flags & WF_FORK) || pse->sched_delayed) >> + return; >> + >> + /* >> + * If @p potentially is completing work required by current then >> + * consider preemption. >> + * >> + * Reschedule if waker is no longer eligible. */ >> + if (in_task() && !entity_eligible(cfs_rq, se)) { >> + preempt_action = PREEMPT_WAKEUP_RESCHED; >> + goto preempt; >> + } >> + >> + /* Prefer picking wakee soon if appropriate. */ >> + if (sched_feat(NEXT_BUDDY) && >> + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) { >> + >> + /* >> + * Decide whether to obey WF_SYNC hint for a new buddy. Old >> + * buddies are ignored as they may not be relevant to the >> + * waker and less likely to be cache hot. >> + */ >> + if (wake_flags & WF_SYNC) >> + preempt_action = preempt_sync(rq, wake_flags, pse, se); >> + } >> + >> + switch (preempt_action) { >> + case PREEMPT_WAKEUP_NONE: >> + return; >> + case PREEMPT_WAKEUP_RESCHED: >> + goto preempt; >> + case PREEMPT_WAKEUP_SHORT: >> + fallthrough; >> + case PREEMPT_WAKEUP_PICK: >> + break; >> + } >> + >> +pick: >> + /* >> * If @p has become the most eligible task, force preemption. >> */ >> - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) >> + if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse) >> goto preempt; >> >> - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) >> + if (sched_feat(RUN_TO_PARITY)) >> update_protect_slice(cfs_rq, se); >> >> return; >> >> preempt: >> - if (do_preempt_short) >> + if (preempt_action == PREEMPT_WAKEUP_SHORT) >> cancel_protect_slice(se); >> >> resched_curr_lazy(rq); >> >