public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Haris Okanovic <harisokn@amazon.com>
To: <linux-kernel@vger.kernel.org>, <linux-pm@vger.kernel.org>,
	<linux-assembly@vger.kernel.org>
Cc: <peterz@infradead.org>, Haris Okanovic <harisokn@amazon.com>,
	Ali Saidi <alisaidi@amazon.com>,
	Geoff Blake <blakgeof@amazon.com>,
	Brian Silver <silverbr@amazon.com>
Subject: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
Date: Mon, 1 Apr 2024 20:47:06 -0500	[thread overview]
Message-ID: <20240402014706.3969151-3-harisokn@amazon.com> (raw)
In-Reply-To: <20240402014706.3969151-1-harisokn@amazon.com>

An arm64 cpuidle driver with two states: (1) First polls for new runable
tasks up to 100 us (by default) before (2) a wfi idle and awoken by
interrupt (the current arm64 behavior). It allows CPUs to return from
idle more quickly by avoiding the longer interrupt wakeup path, which
may require EL1/EL2 transition in certain VM scenarios.

Poll duration is optionally configured at load time via the poll_limit
module parameter.

The default 100 us duration was experimentally chosen, by measuring QPS
(queries per sec) of the MLPerf bert inference benchmark, which seems
particularly susceptible to this change; see procedure below. 100 us is
the inflection point where QPS stopped growing in a range of tested
values. All results are from AWS m7g.16xlarge instances (Graviton3 SoC)
with dedicated tenancy (dedicated hardware).

| before | 10us  | 25us | 50us | 100us | 125us | 150us | 200us | 300us |
| 5.87   | 5.91  | 5.96 | 6.01 | 6.06  | 6.07  | 6.06  | 6.06  | 6.06  |

Perf's scheduler benchmarks also improve with a range of poll_limit
values >= 10 us. Higher limits produce near identical results within a
3% noise margin. The following tables are `perf bench sched` results,
run times in seconds.

`perf bench sched messaging -l 80000`
| AWS instance  | SoC       | Before | After  | % Change |
| c6g.16xl (VM) | Graviton2 | 18.974 | 18.400 | none     |
| c7g.16xl (VM) | Graviton3 | 13.852 | 13.859 | none     |
| c6g.metal     | Graviton2 | 17.621 | 16.744 | none     |
| c7g.metal     | Graviton3 | 13.430 | 13.404 | none     |

`perf bench sched pipe -l 2500000`
| AWS instance  | SoC       | Before | After  | % Change |
| c6g.16xl (VM) | Graviton2 | 30.158 | 15.181 | -50%     |
| c7g.16xl (VM) | Graviton3 | 18.289 | 12.067 | -34%     |
| c6g.metal     | Graviton2 | 17.609 | 15.170 | -14%     |
| c7g.metal     | Graviton3 | 14.103 | 12.304 | -13%     |

`perf bench sched seccomp-notify -l 2500000`
| AWS instance  | SoC       | Before | After  | % Change |
| c6g.16xl (VM) | Graviton2 | 28.784 | 13.754 | -52%     |
| c7g.16xl (VM) | Graviton3 | 16.964 | 11.430 | -33%     |
| c6g.metal     | Graviton2 | 15.717 | 13.536 | -14%     |
| c7g.metal     | Graviton3 | 13.301 | 11.491 | -14%     |

Steps to run MLPerf bert inference on Ubuntu 22.04:
 sudo apt install build-essential python3 python3-pip
 pip install "pybind11[global]" tensorflow  transformers
 export TF_ENABLE_ONEDNN_OPTS=1
 export DNNL_DEFAULT_FPMATH_MODE=BF16
 git clone https://github.com/mlcommons/inference.git --recursive
 cd inference
 git checkout v2.0
 cd loadgen
 CFLAGS="-std=c++14" python3 setup.py bdist_wheel
 pip install dist/*.whl
 cd ../language/bert
 make setup
 python3 run.py --backend=tf --scenario=SingleStream

Suggested-by: Ali Saidi <alisaidi@amazon.com>
Reviewed-by: Ali Saidi <alisaidi@amazon.com>
Reviewed-by: Geoff Blake <blakgeof@amazon.com>
Cc: Brian Silver <silverbr@amazon.com>
Signed-off-by: Haris Okanovic <harisokn@amazon.com>
---
 drivers/cpuidle/Kconfig.arm           |  13 ++
 drivers/cpuidle/Makefile              |   1 +
 drivers/cpuidle/cpuidle-arm-polling.c | 171 ++++++++++++++++++++++++++
 3 files changed, 185 insertions(+)
 create mode 100644 drivers/cpuidle/cpuidle-arm-polling.c

diff --git a/drivers/cpuidle/Kconfig.arm b/drivers/cpuidle/Kconfig.arm
index a1ee475d180d..484666dda38d 100644
--- a/drivers/cpuidle/Kconfig.arm
+++ b/drivers/cpuidle/Kconfig.arm
@@ -14,6 +14,19 @@ config ARM_CPUIDLE
 	  initialized by calling the CPU operations init idle hook
 	  provided by architecture code.
 
+config ARM_POLL_CPUIDLE
+	bool "ARM64 CPU idle Driver with polling"
+	depends on ARM64
+	depends on ARM_ARCH_TIMER_EVTSTREAM
+	select CPU_IDLE_MULTIPLE_DRIVERS
+	help
+	  Select this to enable a polling cpuidle driver for ARM64:
+	  The first state polls TIF_NEED_RESCHED for best latency on short
+	  sleep intervals. The second state falls back to arch_cpu_idle() to
+	  wait for interrupt. This is can be helpful in workloads that
+	  frequently block/wake at short intervals or VMs where wakeup IPIs
+	  are more expensive.
+
 config ARM_PSCI_CPUIDLE
 	bool "PSCI CPU idle Driver"
 	depends on ARM_PSCI_FW
diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
index d103342b7cfc..23c21422792d 100644
--- a/drivers/cpuidle/Makefile
+++ b/drivers/cpuidle/Makefile
@@ -22,6 +22,7 @@ obj-$(CONFIG_ARM_U8500_CPUIDLE)         += cpuidle-ux500.o
 obj-$(CONFIG_ARM_AT91_CPUIDLE)          += cpuidle-at91.o
 obj-$(CONFIG_ARM_EXYNOS_CPUIDLE)        += cpuidle-exynos.o
 obj-$(CONFIG_ARM_CPUIDLE)		+= cpuidle-arm.o
+obj-$(CONFIG_ARM_POLL_CPUIDLE)		+= cpuidle-arm-polling.o
 obj-$(CONFIG_ARM_PSCI_CPUIDLE)		+= cpuidle-psci.o
 obj-$(CONFIG_ARM_PSCI_CPUIDLE_DOMAIN)	+= cpuidle-psci-domain.o
 obj-$(CONFIG_ARM_TEGRA_CPUIDLE)		+= cpuidle-tegra.o
diff --git a/drivers/cpuidle/cpuidle-arm-polling.c b/drivers/cpuidle/cpuidle-arm-polling.c
new file mode 100644
index 000000000000..bca128568114
--- /dev/null
+++ b/drivers/cpuidle/cpuidle-arm-polling.c
@@ -0,0 +1,171 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ARM64 CPU idle driver using wfe polling
+ *
+ * Copyright 2024 Amazon.com, Inc. or its affiliates. All rights reserved.
+ *
+ * Authors:
+ *   Haris Okanovic <harisokn@amazon.com>
+ *   Brian Silver <silverbr@amazon.com>
+ *
+ * Based on cpuidle-arm.c
+ * Copyright (C) 2014 ARM Ltd.
+ * Author: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
+ */
+
+#include <linux/cpu.h>
+#include <linux/cpu_cooling.h>
+#include <linux/cpuidle.h>
+#include <linux/sched/clock.h>
+
+#include <asm/cpuidle.h>
+#include <asm/readex.h>
+
+#include "dt_idle_states.h"
+
+/* Max duration of the wfe() poll loop in us, before transitioning to
+ * arch_cpu_idle()/wfi() sleep.
+ */
+#define DEFAULT_POLL_LIMIT_US 100
+static unsigned int poll_limit __read_mostly = DEFAULT_POLL_LIMIT_US;
+
+/*
+ * arm_idle_wfe_poll - Polls state in wfe loop until reschedule is
+ * needed or timeout
+ */
+static int __cpuidle arm_idle_wfe_poll(struct cpuidle_device *dev,
+				struct cpuidle_driver *drv, int idx)
+{
+	u64 time_start, time_limit;
+
+	time_start = local_clock();
+	dev->poll_time_limit = false;
+
+	local_irq_enable();
+
+	if (current_set_polling_and_test())
+		goto end;
+
+	time_limit = cpuidle_poll_time(drv, dev);
+
+	do {
+		// exclusive read arms the monitor for wfe
+		if (__READ_ONCE_EX(current_thread_info()->flags) & _TIF_NEED_RESCHED)
+			goto end;
+
+		// may exit prematurely, see ARM_ARCH_TIMER_EVTSTREAM
+		wfe();
+	} while (local_clock() - time_start < time_limit);
+
+	dev->poll_time_limit = true;
+
+end:
+	current_clr_polling();
+	return idx;
+}
+
+/*
+ * arm_idle_wfi - Places cpu in lower power state until interrupt,
+ * a fallback to polling
+ */
+static int __cpuidle arm_idle_wfi(struct cpuidle_device *dev,
+				struct cpuidle_driver *drv, int idx)
+{
+	if (current_clr_polling_and_test()) {
+		local_irq_enable();
+		return idx;
+	}
+	arch_cpu_idle();
+	return idx;
+}
+
+static struct cpuidle_driver arm_poll_idle_driver __initdata = {
+	.name = "arm_poll_idle",
+	.owner = THIS_MODULE,
+	.states = {
+		{
+			.enter			= arm_idle_wfe_poll,
+			.exit_latency		= 0,
+			.target_residency	= 0,
+			.exit_latency_ns	= 0,
+			.power_usage		= UINT_MAX,
+			.flags			= CPUIDLE_FLAG_POLLING,
+			.name			= "WFE",
+			.desc			= "ARM WFE",
+		},
+		{
+			.enter			= arm_idle_wfi,
+			.exit_latency		= DEFAULT_POLL_LIMIT_US,
+			.target_residency	= DEFAULT_POLL_LIMIT_US,
+			.power_usage		= UINT_MAX,
+			.name			= "WFI",
+			.desc			= "ARM WFI",
+		},
+	},
+	.state_count = 2,
+};
+
+/*
+ * arm_poll_init_cpu - Initializes arm cpuidle polling driver for one cpu
+ */
+static int __init arm_poll_init_cpu(int cpu)
+{
+	int ret;
+	struct cpuidle_driver *drv;
+
+	drv = kmemdup(&arm_poll_idle_driver, sizeof(*drv), GFP_KERNEL);
+	if (!drv)
+		return -ENOMEM;
+
+	drv->cpumask = (struct cpumask *)cpumask_of(cpu);
+	drv->states[1].exit_latency = poll_limit;
+	drv->states[1].target_residency = poll_limit;
+
+	ret = cpuidle_register(drv, NULL);
+	if (ret) {
+		pr_err("failed to register driver: %d, cpu %d\n", ret, cpu);
+		goto out_kfree_drv;
+	}
+
+	pr_info("registered driver cpu %d\n", cpu);
+
+	cpuidle_cooling_register(drv);
+
+	return 0;
+
+out_kfree_drv:
+	kfree(drv);
+	return ret;
+}
+
+/*
+ * arm_poll_init - Initializes arm cpuidle polling driver
+ */
+static int __init arm_poll_init(void)
+{
+	int cpu, ret;
+	struct cpuidle_driver *drv;
+	struct cpuidle_device *dev;
+
+	for_each_possible_cpu(cpu) {
+		ret = arm_poll_init_cpu(cpu);
+		if (ret)
+			goto out_fail;
+	}
+
+	return 0;
+
+out_fail:
+	pr_info("de-register all");
+	while (--cpu >= 0) {
+		dev = per_cpu(cpuidle_devices, cpu);
+		drv = cpuidle_get_cpu_driver(dev);
+		cpuidle_unregister(drv);
+		kfree(drv);
+	}
+
+	return ret;
+}
+
+module_param(poll_limit, uint, 0444);
+device_initcall(arm_poll_init);
-- 
2.34.1


  parent reply	other threads:[~2024-04-02  1:47 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-02  1:47 [PATCH 1/3] arm64: Add TIF_POLLING_NRFLAG Haris Okanovic
2024-04-02  1:47 ` [PATCH 2/3] arm64: add __READ_ONCE_EX() Haris Okanovic
2024-04-02 16:48   ` Mark Rutland
2024-04-08 14:51   ` David Laight
2024-04-02  1:47 ` Haris Okanovic [this message]
2024-04-02  2:30   ` [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle Okanovic, Haris
2024-04-02 17:23   ` Mark Rutland
2024-04-02 23:17     ` Ankur Arora
2024-04-05 19:36       ` Okanovic, Haris
2024-04-05 20:22         ` Ankur Arora
2024-04-05 20:05     ` Okanovic, Haris

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240402014706.3969151-3-harisokn@amazon.com \
    --to=harisokn@amazon.com \
    --cc=alisaidi@amazon.com \
    --cc=blakgeof@amazon.com \
    --cc=linux-assembly@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=silverbr@amazon.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox