* [PATCH v2 0/2] powerpc/rtas: improved busy and extended delay status handling
@ 2021-11-17 6:02 Nathan Lynch
2021-11-17 6:02 ` [PATCH v2 1/2] powerpc/rtas: rtas_busy_delay() improvements Nathan Lynch
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: Nathan Lynch @ 2021-11-17 6:02 UTC (permalink / raw)
To: linuxppc-dev; +Cc: tyreld, ajd, aik, cheloha, ldufour
This can be considered a successor to:
https://lore.kernel.org/linuxppc-dev/20210504030358.1715034-1-nathanl@linux.ibm.com/
which tried to address both the suboptimal delay behavior as well as API
issues. This version achieves the performance improvements and leaves major
API changes for another time.
Changes since v1:
* Drop major API changes.
* Avoid division when calculating the arguments to usleep_range() (Alexey).
* Improve kernel-doc for rtas_busy_delay(), rtas_busy_delay_time().
Nathan Lynch (2):
powerpc/rtas: rtas_busy_delay() improvements
powerpc/rtas: rtas_busy_delay_time() kernel-doc
arch/powerpc/include/asm/rtas.h | 2 +-
arch/powerpc/kernel/rtas.c | 95 +++++++++++++++++++++++++++++----
2 files changed, 87 insertions(+), 10 deletions(-)
--
2.31.1
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH v2 1/2] powerpc/rtas: rtas_busy_delay() improvements
2021-11-17 6:02 [PATCH v2 0/2] powerpc/rtas: improved busy and extended delay status handling Nathan Lynch
@ 2021-11-17 6:02 ` Nathan Lynch
2021-11-17 6:13 ` Nathan Lynch
2021-11-17 6:02 ` [PATCH v2 2/2] powerpc/rtas: rtas_busy_delay_time() kernel-doc Nathan Lynch
2021-11-25 9:36 ` [PATCH v2 0/2] powerpc/rtas: improved busy and extended delay status handling Michael Ellerman
2 siblings, 1 reply; 5+ messages in thread
From: Nathan Lynch @ 2021-11-17 6:02 UTC (permalink / raw)
To: linuxppc-dev; +Cc: tyreld, ajd, aik, cheloha, ldufour
Generally RTAS cannot block, and in PAPR it is required to return control
to the OS within a few tens of microseconds. In order to support operations
which may take longer to complete, many RTAS primitives can return
intermediate -2 ("busy") or 990x ("extended delay") values, which indicate
that the OS should reattempt the same call with the same arguments at some
point in the future.
Current versions of PAPR are less than clear about this, but the intended
meanings of these values in more detail are:
RTAS_BUSY (-2): RTAS has suspended a potentially long-running operation in
order to meet its latency obligation and give the OS the opportunity to
perform other work. RTAS can resume making progress as soon as the OS
reattempts the call.
RTAS_EXTENDED_DELAY_{MIN...MAX} (9900-9905): RTAS must wait for an external
event to occur or for internal contention to resolve before it can complete
the requested operation. The value encodes a non-binding hint as to roughly
how long the OS should wait before calling again, but the OS is allowed to
reattempt the call sooner or even immediately.
Linux of course must take its own CPU scheduling obligations into account
when handling these statuses; e.g. a task which receives an RTAS_BUSY
status should check whether to reschedule before it attempts the RTAS call
again to avoid starving other tasks.
rtas_busy_delay() is a helper function that "consumes" a busy or extended
delay status. Common usage:
int rc;
do {
rc = rtas_call(rtas_token("some-function"), ...);
} while (rtas_busy_delay(rc));
/* convert rc to Linux error value, etc */
If rc is a busy or extended delay status, the caller can rely on
rtas_busy_delay() to perform an appropriate sleep or reschedule and return
nonzero. Other statuses are handled normally by the caller.
The current implementation of rtas_busy_delay() both oversleeps and
overuses the CPU:
* It performs msleep() for all 990x and even when no delay is
suggested (-2), but this is understood to actually sleep for two jiffies
minimum in practice (20ms with HZ=100). 9900 (1ms) and 9901 (10ms)
appear to be the most common extended delay statuses, and the
oversleeping measurably lengthens DLPAR operations, which perform
many RTAS calls.
* It does not sleep on 990x unless need_resched() is true, causing code
like the loop above to needlessly retry, wasting CPU time.
Alter the logic to align better with the intended meanings:
* When passed RTAS_BUSY, perform cond_resched() and return without
sleeping. The caller should reattempt immediately
* Always sleep when passed an extended delay status, using usleep_range()
for precise shorter sleeps. Limit the sleep time to one second even
though there are higher architected values.
Change rtas_busy_delay()'s return type to bool to better reflect its usage,
and add kernel-doc.
rtas_busy_delay_time() is unchanged, even though it "incorrectly" returns 1
for RTAS_BUSY. There are users of that API with open-coded delay loops in
sensitive contexts that will have to be taken on an individual basis.
Brief results for addition and removal of 5GB memory on a small P9 PowerVM
partition follow. Load was generated with stress-ng --cpu N. For add,
elapsed time is greatly reduced without significant change in the number of
RTAS calls or time spent on CPU. For remove, elapsed time is modestly
reduced, with significant reductions in RTAS calls and time spent on CPU.
With no competing workload (- before, + after):
Performance counter stats for 'bash -c echo "memory add count 20" > /sys/kernel/dlpar' (10 runs):
- 1,935 probe:rtas_call # 0.003 M/sec ( +- 0.22% )
- 609.99 msec task-clock # 0.183 CPUs utilized ( +- 0.19% )
+ 1,956 probe:rtas_call # 0.003 M/sec ( +- 0.17% )
+ 618.56 msec task-clock # 0.278 CPUs utilized ( +- 0.11% )
- 3.3322 +- 0.0670 seconds time elapsed ( +- 2.01% )
+ 2.2222 +- 0.0416 seconds time elapsed ( +- 1.87% )
Performance counter stats for 'bash -c echo "memory remove count 20" > /sys/kernel/dlpar' (10 runs):
- 6,224 probe:rtas_call # 0.008 M/sec ( +- 2.57% )
- 750.36 msec task-clock # 0.190 CPUs utilized ( +- 2.01% )
+ 843 probe:rtas_call # 0.003 M/sec ( +- 0.12% )
+ 250.66 msec task-clock # 0.068 CPUs utilized ( +- 0.17% )
- 3.9394 +- 0.0890 seconds time elapsed ( +- 2.26% )
+ 3.678 +- 0.113 seconds time elapsed ( +- 3.07% )
With all CPUs 100% busy (- before, + after):
Performance counter stats for 'bash -c echo "memory add count 20" > /sys/kernel/dlpar' (10 runs):
- 2,979 probe:rtas_call # 0.003 M/sec ( +- 0.12% )
- 1,096.62 msec task-clock # 0.105 CPUs utilized ( +- 0.10% )
+ 2,981 probe:rtas_call # 0.003 M/sec ( +- 0.22% )
+ 1,095.26 msec task-clock # 0.154 CPUs utilized ( +- 0.21% )
- 10.476 +- 0.104 seconds time elapsed ( +- 1.00% )
+ 7.1124 +- 0.0865 seconds time elapsed ( +- 1.22% )
Performance counter stats for 'bash -c echo "memory remove count 20" > /sys/kernel/dlpar' (10 runs):
- 2,702 probe:rtas_call # 0.004 M/sec ( +- 4.00% )
- 722.71 msec task-clock # 0.067 CPUs utilized ( +- 2.41% )
+ 1,246 probe:rtas_call # 0.003 M/sec ( +- 0.25% )
+ 487.73 msec task-clock # 0.049 CPUs utilized ( +- 0.20% )
- 10.829 +- 0.163 seconds time elapsed ( +- 1.51% )
+ 9.9887 +- 0.0866 seconds time elapsed ( +- 0.87% )
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
---
Notes:
This could be considered a sequel to:
https://lore.kernel.org/linuxppc-dev/20210504030358.1715034-1-nathanl@linux.ibm.com/
which tried to address both the suboptimal delay behavior as well as naming
issues. The present change achieves the performance improvement and leaves
the matter of naming for another time.
I've incorporated Alexey's feedback from that series to avoid division when
calculating the arguments to usleep_range().
arch/powerpc/include/asm/rtas.h | 2 +-
arch/powerpc/kernel/rtas.c | 74 +++++++++++++++++++++++++++++----
2 files changed, 68 insertions(+), 8 deletions(-)
diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 9dc97d2f9d27..82e5b055fa2a 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -264,7 +264,7 @@ extern void rtas_get_rtc_time(struct rtc_time *rtc_time);
extern int rtas_set_rtc_time(struct rtc_time *rtc_time);
extern unsigned int rtas_busy_delay_time(int status);
-extern unsigned int rtas_busy_delay(int status);
+bool rtas_busy_delay(int status);
extern int early_init_dt_scan_rtas(unsigned long node,
const char *uname, int depth, void *data);
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index ff80bbad22a5..d686834fe7f5 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -513,17 +513,77 @@ unsigned int rtas_busy_delay_time(int status)
}
EXPORT_SYMBOL(rtas_busy_delay_time);
-/* For an RTAS busy status code, perform the hinted delay. */
-unsigned int rtas_busy_delay(int status)
+/**
+ * rtas_busy_delay() - helper for RTAS busy and extended delay statuses
+ *
+ * @status: a value returned from rtas_call() or similar APIs which return
+ * the status of a RTAS function call.
+ *
+ * Context: Process context. May sleep or schedule.
+ *
+ * Return:
+ * * true - @status is RTAS_BUSY or an extended delay hint. The
+ * caller may assume that the CPU has been yielded if necessary,
+ * and that an appropriate delay for @status has elapsed.
+ * Generally the caller should reattempt the RTAS call which
+ * yielded @status.
+ *
+ * * false - @status is not @RTAS_BUSY nor an extended delay hint. The
+ * caller is responsible for handling @status.
+ */
+bool rtas_busy_delay(int status)
{
unsigned int ms;
+ bool ret;
- might_sleep();
- ms = rtas_busy_delay_time(status);
- if (ms && need_resched())
- msleep(ms);
+ switch (status) {
+ case RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX:
+ ret = true;
+ ms = rtas_busy_delay_time(status);
+ /*
+ * The extended delay hint can be as high as 100 seconds.
+ * Surely any function returning such a status is either
+ * buggy or isn't going to be significantly slowed by us
+ * polling at 1HZ. Clamp the sleep time to one second.
+ */
+ ms = clamp(ms, 1U, 1000U);
+ /*
+ * The delay hint is an order-of-magnitude suggestion, not
+ * a minimum. It is fine, possibly even advantageous, for
+ * us to pause for less time than hinted. For small values,
+ * use usleep_range() to ensure we don't sleep much longer
+ * than actually needed.
+ *
+ * See Documentation/timers/timers-howto.rst for
+ * explanation of the threshold used here. In effect we use
+ * usleep_range() for 9900 and 9901, msleep() for
+ * 9902-9905.
+ */
+ if (ms <= 20)
+ usleep_range(ms * 100, ms * 1000);
+ else
+ msleep(ms);
+ break;
+ case RTAS_BUSY:
+ ret = true;
+ /*
+ * We should call again immediately if there's no other
+ * work to do.
+ */
+ cond_resched();
+ break;
+ default:
+ ret = false;
+ /*
+ * Not a busy or extended delay status; the caller should
+ * handle @status itself. Ensure we warn on misuses in
+ * atomic context regardless.
+ */
+ might_sleep();
+ break;
+ }
- return ms;
+ return ret;
}
EXPORT_SYMBOL(rtas_busy_delay);
--
2.31.1
^ permalink raw reply related [flat|nested] 5+ messages in thread
* [PATCH v2 2/2] powerpc/rtas: rtas_busy_delay_time() kernel-doc
2021-11-17 6:02 [PATCH v2 0/2] powerpc/rtas: improved busy and extended delay status handling Nathan Lynch
2021-11-17 6:02 ` [PATCH v2 1/2] powerpc/rtas: rtas_busy_delay() improvements Nathan Lynch
@ 2021-11-17 6:02 ` Nathan Lynch
2021-11-25 9:36 ` [PATCH v2 0/2] powerpc/rtas: improved busy and extended delay status handling Michael Ellerman
2 siblings, 0 replies; 5+ messages in thread
From: Nathan Lynch @ 2021-11-17 6:02 UTC (permalink / raw)
To: linuxppc-dev; +Cc: tyreld, ajd, aik, cheloha, ldufour
Provide API documentation for rtas_busy_delay_time(), explaining why we
return the same value for 9900 and -2.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
---
arch/powerpc/kernel/rtas.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index d686834fe7f5..4cfe9f93a9cd 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -492,8 +492,25 @@ int rtas_call(int token, int nargs, int nret, int *outputs, ...)
}
EXPORT_SYMBOL(rtas_call);
-/* For RTAS_BUSY (-2), delay for 1 millisecond. For an extended busy status
- * code of 990n, perform the hinted delay of 10^n (last digit) milliseconds.
+/**
+ * rtas_busy_delay_time() - From an RTAS status value, calculate the
+ * suggested delay time in milliseconds.
+ *
+ * @status: a value returned from rtas_call() or similar APIs which return
+ * the status of a RTAS function call.
+ *
+ * Context: Any context.
+ *
+ * Return:
+ * * 100000 - If @status is 9905.
+ * * 10000 - If @status is 9904.
+ * * 1000 - If @status is 9903.
+ * * 100 - If @status is 9902.
+ * * 10 - If @status is 9901.
+ * * 1 - If @status is either 9900 or -2. This is "wrong" for -2, but
+ * some callers depend on this behavior, and the worst outcome
+ * is that they will delay for longer than necessary.
+ * * 0 - If @status is not a busy or extended delay value.
*/
unsigned int rtas_busy_delay_time(int status)
{
--
2.31.1
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH v2 1/2] powerpc/rtas: rtas_busy_delay() improvements
2021-11-17 6:02 ` [PATCH v2 1/2] powerpc/rtas: rtas_busy_delay() improvements Nathan Lynch
@ 2021-11-17 6:13 ` Nathan Lynch
0 siblings, 0 replies; 5+ messages in thread
From: Nathan Lynch @ 2021-11-17 6:13 UTC (permalink / raw)
To: linuxppc-dev; +Cc: aik, tyreld, cheloha, ldufour, ajd
Nathan Lynch <nathanl@linux.ibm.com> writes:
>
> Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
> ---
>
> Notes:
> This could be considered a sequel to:
>
> https://lore.kernel.org/linuxppc-dev/20210504030358.1715034-1-nathanl@linux.ibm.com/
>
> which tried to address both the suboptimal delay behavior as well as naming
> issues. The present change achieves the performance improvement and leaves
> the matter of naming for another time.
>
> I've incorporated Alexey's feedback from that series to avoid division when
> calculating the arguments to usleep_range().
Oops, this part is redundant with information in the cover. Git should
discard it if/when applied, though.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2 0/2] powerpc/rtas: improved busy and extended delay status handling
2021-11-17 6:02 [PATCH v2 0/2] powerpc/rtas: improved busy and extended delay status handling Nathan Lynch
2021-11-17 6:02 ` [PATCH v2 1/2] powerpc/rtas: rtas_busy_delay() improvements Nathan Lynch
2021-11-17 6:02 ` [PATCH v2 2/2] powerpc/rtas: rtas_busy_delay_time() kernel-doc Nathan Lynch
@ 2021-11-25 9:36 ` Michael Ellerman
2 siblings, 0 replies; 5+ messages in thread
From: Michael Ellerman @ 2021-11-25 9:36 UTC (permalink / raw)
To: Nathan Lynch, linuxppc-dev; +Cc: aik, tyreld, cheloha, ldufour, ajd
On Wed, 17 Nov 2021 00:02:57 -0600, Nathan Lynch wrote:
> This can be considered a successor to:
>
> https://lore.kernel.org/linuxppc-dev/20210504030358.1715034-1-nathanl@linux.ibm.com/
>
> which tried to address both the suboptimal delay behavior as well as API
> issues. This version achieves the performance improvements and leaves major
> API changes for another time.
>
> [...]
Applied to powerpc/next.
[1/2] powerpc/rtas: rtas_busy_delay() improvements
https://git.kernel.org/powerpc/c/38f7b7067dae0c101be573106018e8af22a90fdf
[2/2] powerpc/rtas: rtas_busy_delay_time() kernel-doc
https://git.kernel.org/powerpc/c/dd5cde457a5eb77088d1d9eecface47c0563cd43
cheers
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2021-11-25 9:51 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-11-17 6:02 [PATCH v2 0/2] powerpc/rtas: improved busy and extended delay status handling Nathan Lynch
2021-11-17 6:02 ` [PATCH v2 1/2] powerpc/rtas: rtas_busy_delay() improvements Nathan Lynch
2021-11-17 6:13 ` Nathan Lynch
2021-11-17 6:02 ` [PATCH v2 2/2] powerpc/rtas: rtas_busy_delay_time() kernel-doc Nathan Lynch
2021-11-25 9:36 ` [PATCH v2 0/2] powerpc/rtas: improved busy and extended delay status handling Michael Ellerman
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).