From: "hch@lst.de" <hch@lst.de>
To: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
Cc: "linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
"hch@lst.de" <hch@lst.de>,
"kbusch@kernel.org" <kbusch@kernel.org>,
"axboe@fb.com" <axboe@fb.com>,
"sagi@grimberg.me" <sagi@grimberg.me>,
Andy Lutomirski <luto@kernel.org>
Subject: Re: [PATCH 1/1] nvme: extend and modify the APST configuration algorithm
Date: Wed, 28 Apr 2021 14:42:56 +0200
Message-ID: <20210428124256.GB28566@lst.de>
In-Reply-To: <BY5PR04MB704131DBB47254C9F1FF12B38B409@BY5PR04MB7041.namprd04.prod.outlook.com>
Adding Andy, who wrote the original APST code.
On Wed, Apr 28, 2021 at 09:27:36AM +0000, Alexey Bogoslavsky wrote:
> From: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
>
> The algorithm that was used until now for building the APST configuration
> table has been found to produce entries with excessively long ITPT
> (idle time prior to transition) for devices declaring relatively long
> entry and exit latencies for non-operational power states. This leads
> to unnecessary waste of power and, as a result, failure to pass
> mandatory power consumption tests on Chromebook platforms.
>
> The new algorithm is based on two predefined ITPT values and two
> predefined latency tolerances. Based on these values, as well as on
> exit and entry latencies reported by the device, the algorithm looks
> for up to 2 suitable non-operational power states to use as primary
> and secondary APST transition targets. The predefined values are
> supplied to the nvme driver as module parameters:
>
> - apst_primary_timeout_ms (default: 100)
> - apst_secondary_timeout_ms (default: 2000)
> - apst_primary_latency_tol_us (default: 15000)
> - apst_secondary_latency_tol_us (default: 100000)
>
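The selection described above can be sketched in userspace C. This is only an illustration under stated assumptions: the helper mirrors the logic of nvme_apst_get_transition_time() from the diff further down, the defaults are the four values listed above, and the power state table (example_states, EXAMPLE_NPSS) and the pick_targets() walker are hypothetical names that do not appear in the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Defaults quoted in the patch description. */
static const uint64_t primary_timeout_ms = 100;
static const uint64_t secondary_timeout_ms = 2000;
static const uint64_t primary_latency_tol_us = 15000;
static const uint64_t secondary_latency_tol_us = 100000;

/*
 * Same decision logic as nvme_apst_get_transition_time() in the patch:
 * each tolerance level may be claimed at most once, and the secondary
 * level can only be claimed before any other level has been claimed.
 */
static bool get_transition_time(uint64_t total_latency_us,
				uint64_t *transition_ms, unsigned *last_index)
{
	if (total_latency_us <= primary_latency_tol_us) {
		if (*last_index == 1)
			return false;
		*last_index = 1;
		*transition_ms = primary_timeout_ms;
		return true;
	}
	if (secondary_timeout_ms &&
	    total_latency_us <= secondary_latency_tol_us) {
		if (*last_index <= 2)
			return false;
		*last_index = 2;
		*transition_ms = secondary_timeout_ms;
		return true;
	}
	return false;
}

/* Hypothetical power state table, not taken from any real device. */
struct example_ps {
	bool non_operational;
	uint64_t total_latency_us;	/* enlat + exlat */
};

#define EXAMPLE_NPSS 4
static const struct example_ps example_states[EXAMPLE_NPSS + 1] = {
	{ false, 0 }, { false, 0 }, { false, 0 },
	{ true, 4000 }, { true, 50000 },
};

/*
 * Walk from the deepest state to the shallowest and record the idle
 * timeout chosen for each state (0 means "not used as a target").
 */
static void pick_targets(uint64_t chosen_ms[EXAMPLE_NPSS + 1])
{
	unsigned last_index = UINT_MAX;

	for (int state = EXAMPLE_NPSS; state >= 0; state--) {
		uint64_t ms = 0;

		chosen_ms[state] = 0;
		if (!example_states[state].non_operational)
			continue;
		if (get_transition_time(example_states[state].total_latency_us,
					&ms, &last_index))
			chosen_ms[state] = ms;
	}
}
```

For this made-up table, calling pick_targets() selects PS4 as the secondary target (2000 ms idle) and PS3 as the primary target (100 ms idle); the operational states are never selected.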
> The algorithm echoes the approach used by Intel's and Microsoft's drivers
> on Windows. The specific default parameter values are also based on those
> drivers. Yet, this patch doesn't introduce the ability to dynamically
> regenerate the APST table in the event of switching the power source from
> AC to battery and back. Adding this functionality may be considered in the
> future. In the meantime, the timeouts and tolerances reflect a compromise
> between values used by Microsoft for AC and battery scenarios.
>
> On most NVMe devices the new algorithm results in a more
> aggressive power saving policy. While beneficial in most cases, this
> sometimes comes at the price of a higher IO processing latency in certain
> scenarios as well as at the price of a potential impact on the drive's
> endurance (due to more frequent context saving when entering deep non-
> operational states). So in order to provide a fallback for systems where
> these regressions cannot be tolerated, the patch allows reverting to
> the legacy behavior by setting either the apst_primary_timeout_ms or the
> apst_primary_latency_tol_us parameter to 0. Eventually (and possibly after
> fine tuning the default values of the module parameters) the legacy behavior
> can be removed.
>
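For reference, the legacy behavior mentioned here is the "2% of the time" heuristic kept in the else branch of the diff: wait 50 * (enlat + exlat) microseconds, which in milliseconds is the total latency divided by 20, rounded up and clamped to the 24-bit ITPT field. A minimal userspace sketch of that arithmetic follows; the function and macro names are illustrative only, and plain 64-bit division stands in for the kernel's do_div().

```c
#include <assert.h>
#include <stdint.h>

/* ITPT is a 24-bit field, so the idle time is clamped to 2^24 - 1 ms. */
#define EXAMPLE_MAX_ITPT_MS ((1u << 24) - 1)

/*
 * 50 * total_latency_us microseconds expressed in milliseconds:
 * 50 * us / 1000 == us / 20, rounded up by adding 19 first.
 */
static uint64_t legacy_transition_ms(uint64_t total_latency_us)
{
	uint64_t ms = (total_latency_us + 19) / 20;

	return ms > EXAMPLE_MAX_ITPT_MS ? EXAMPLE_MAX_ITPT_MS : ms;
}
```

A state with 50000 us of total latency therefore gets a 2500 ms idle timeout under the legacy heuristic, compared with the fixed 100 ms or 2000 ms timeouts of the new scheme.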
> TESTING.
>
> The new algorithm has been extensively tested. Initially, simulations were
> used to compare APST tables generated by old and new algorithms for a wide
> range of devices. After that, power consumption, performance and latencies
> were measured under different workloads on devices from multiple vendors
> (WD, Intel, Samsung, Hynix, Kioxia). Below is the description of the tests
> and the findings.
>
> General observations.
> The effect the patch has on the APST table varies depending on the entry and
> exit latencies advertised by the devices. For some devices, the effect is
> negligible (e.g. Kioxia KBG40ZNS), for some significant, making the
> transitions to PS3 and PS4 much quicker (e.g. WD SN530, Intel 760P), or making
> the sleep deeper, PS4 rather than PS3 after a similar amount of time (e.g.
> SK Hynix BC511). For some devices (e.g. Samsung PM991) the effect is mixed:
> the initial transition happens after a longer idle time, but takes the device
> to a lower power state.
>
> Workflows.
> In order to evaluate the patch's effect on the power consumption and latency,
> 7 workflows were used for each device. The workflows were designed to test
> the scenarios where significant differences between the old and new behaviors
> are most likely. Each workflow was tested twice: with the new and with the
> old APST table generation implementation. Power consumption, performance and
> latency were measured in the process. The following workflows were used:
> 1) Consecutive write at the maximum rate with IO depth of 2, with no pauses
> 2) Repeated pattern of 1000 consecutive writes of 4K packets followed by 50ms
> idle time
> 3) Repeated pattern of 1000 consecutive writes of 4K packets followed by 150ms
> idle time
> 4) Repeated pattern of 1000 consecutive writes of 4K packets followed by 500ms
> idle time
> 5) Repeated pattern of 1000 consecutive writes of 4K packets followed by 1.5s
> idle time
> 6) Repeated pattern of 1000 consecutive writes of 4K packets followed by 5s
> idle time
> 7) Repeated pattern of a single random read of a 4K packet followed by 150ms
> idle time
>
> Power consumption
> Actual power consumption measurements produced predictable results in
> accordance with the APST mechanism's theory of operation.
> Devices with long entry and exit latencies, such as WD SN530, showed a large
> improvement of up to 62% in scenarios 4, 5 and 6. Devices such as Kioxia
> KBG40ZNS, where the resulting APST table looks virtually identical with
> both legacy and new algorithms, showed little or no change in the average power
> consumption on all workflows. Devices with extra short latencies, such as
> Samsung PM991, showed a moderate increase in power consumption of up to 18% in
> the worst case scenarios.
> In addition, on Intel and Samsung devices a more complex impact was observed
> on scenarios 3, 4 and 7. Our understanding is that, due to the longer stay in deep
> non-operational states between the writes, the devices start performing background
> operations, leading to an increase in power consumption. With the old APST tables,
> part of these operations is delayed until the scenario is over and a longer idle
> period begins, but eventually this extra power is consumed anyway.
>
> Performance.
> In terms of performance measured on sustained write or read scenarios, the
> effect of the patch is minimal as in this case the device doesn't enter low power
> states.
>
> Latency
> As expected, in devices where the patch causes a more aggressive power saving
> policy (e.g. WD SN530, Intel 760P), an increase in latency was observed in
> certain scenarios. Workflow number 7, specifically designed to simulate the
> worst case scenario as far as latency is concerned, indeed shows a sharp
> increase in average latency (~2ms -> ~53ms on Intel 760P and 0.6ms -> 10ms on
> WD SN530). The latency increase on other workloads and other devices is much
> milder or non-existent.
>
> Signed-off-by: Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
> ---
> drivers/nvme/host/core.c | 89 +++++++++++++++++++++++++++++++++++-----
> 1 file changed, 78 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 2f45e8fcdd7c..9768d2e84562 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -57,6 +57,26 @@ static bool force_apst;
> module_param(force_apst, bool, 0644);
> MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
>
> +static unsigned long apst_primary_timeout_ms = 100;
> +module_param(apst_primary_timeout_ms, ulong, 0644);
> +MODULE_PARM_DESC(apst_primary_timeout_ms,
> + "primary APST timeout in ms");
> +
> +static unsigned long apst_secondary_timeout_ms = 2000;
> +module_param(apst_secondary_timeout_ms, ulong, 0644);
> +MODULE_PARM_DESC(apst_secondary_timeout_ms,
> + "secondary APST timeout in ms");
> +
> +static unsigned long apst_primary_latency_tol_us = 15000;
> +module_param(apst_primary_latency_tol_us, ulong, 0644);
> +MODULE_PARM_DESC(apst_primary_latency_tol_us,
> + "primary APST latency tolerance in us");
> +
> +static unsigned long apst_secondary_latency_tol_us = 100000;
> +module_param(apst_secondary_latency_tol_us, ulong, 0644);
> +MODULE_PARM_DESC(apst_secondary_latency_tol_us,
> + "secondary APST latency tolerance in us");
> +
> static bool streams;
> module_param(streams, bool, 0644);
> MODULE_PARM_DESC(streams, "turn on support for Streams write directives");
> @@ -2185,14 +2205,54 @@ static int nvme_configure_acre(struct nvme_ctrl *ctrl)
> return ret;
> }
>
> +/*
> + * The function checks whether the given total (exlat + enlat) latency of
> + * a power state allows the latter to be used as an APST transition target.
> + * It does so by comparing the latency to the primary and secondary latency
> + * tolerances defined by module params. If there's a match, the corresponding
> + * timeout value is returned and the matching tolerance index (1 or 2) is
> + * reported.
> + */
> +static bool nvme_apst_get_transition_time(u64 total_latency,
> +		u64 *transition_time, unsigned *last_index)
> +{
> +	if (total_latency <= apst_primary_latency_tol_us) {
> +		if (*last_index == 1)
> +			return false;
> +		*last_index = 1;
> +		*transition_time = apst_primary_timeout_ms;
> +		return true;
> +	}
> +	if (apst_secondary_timeout_ms &&
> +		total_latency <= apst_secondary_latency_tol_us) {
> +		if (*last_index <= 2)
> +			return false;
> +		*last_index = 2;
> +		*transition_time = apst_secondary_timeout_ms;
> +		return true;
> +	}
> +	return false;
> +}
> +
> /*
> * APST (Autonomous Power State Transition) lets us program a table of power
> * state transitions that the controller will perform automatically.
> - * We configure it with a simple heuristic: we are willing to spend at most 2%
> - * of the time transitioning between power states. Therefore, when running in
> - * any given state, we will enter the next lower-power non-operational state
> - * after waiting 50 * (enlat + exlat) microseconds, as long as that state's exit
> - * latency is under the requested maximum latency.
> + *
> + * Depending on module params, one of the two supported techniques will be used:
> + *
> + * - If the parameters provide explicit timeouts and tolerances, they will be
> + * used to build a table with up to 2 non-operational states to transition to.
> + * The default parameter values were selected based on the values used by
> + * Microsoft's and Intel's NVMe drivers. Yet, since we don't implement dynamic
> + * regeneration of the APST table in the event of switching between external
> + * and battery power, the timeouts and tolerances reflect a compromise
> + * between values used by Microsoft for AC and battery scenarios.
> + * - If not, we'll configure the table with a simple heuristic: we are willing
> + * to spend at most 2% of the time transitioning between power states.
> + * Therefore, when running in any given state, we will enter the next
> + * lower-power non-operational state after waiting 50 * (enlat + exlat)
> + * microseconds, as long as that state's exit latency is under the requested
> + * maximum latency.
> *
> * We will not autonomously enter any non-operational state for which the total
> * latency exceeds ps_max_latency_us.
> @@ -2208,6 +2268,7 @@ static int nvme_configure_apst(struct nvme_ctrl *ctrl)
> int max_ps = -1;
> int state;
> int ret;
> + unsigned last_lt_index = UINT_MAX;
>
> /*
> * If APST isn't supported or if we haven't been initialized yet,
> @@ -2266,13 +2327,19 @@ static int nvme_configure_apst(struct nvme_ctrl *ctrl)
> le32_to_cpu(ctrl->psd[state].entry_lat);
>
> /*
> - * This state is good. Use it as the APST idle target for
> - * higher power states.
> + * This state is good. It can be used as the APST idle target
> + * for higher power states.
> */
> -			transition_ms = total_latency_us + 19;
> -			do_div(transition_ms, 20);
> -			if (transition_ms > (1 << 24) - 1)
> -				transition_ms = (1 << 24) - 1;
> +			if (apst_primary_timeout_ms && apst_primary_latency_tol_us) {
> +				if (!nvme_apst_get_transition_time(total_latency_us,
> +						&transition_ms, &last_lt_index))
> +					continue;
> +			} else {
> +				transition_ms = total_latency_us + 19;
> +				do_div(transition_ms, 20);
> +				if (transition_ms > (1 << 24) - 1)
> +					transition_ms = (1 << 24) - 1;
> +			}
>
> target = cpu_to_le64((state << 3) | (transition_ms << 8));
> if (max_ps == -1)
> --
> 2.17.1
---end quoted text---
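As a side note on the last context line of the quoted hunk, `(state << 3) | (transition_ms << 8)` packs one APST table entry: the target power state goes into bits 7:3 and the idle time in milliseconds into bits 31:8. The sketch below shows that packing and the matching unpacking in userspace C. The helper names are hypothetical, and the cpu_to_le64() conversion done by the driver is left out, so this only models the host-endian value.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a target power state and idle time (ms) the way the hunk does. */
static uint64_t example_apst_entry(unsigned state, uint64_t transition_ms)
{
	return ((uint64_t)state << 3) | (transition_ms << 8);
}

/* Bits 7:3 hold the idle transition power state. */
static unsigned example_apst_state(uint64_t entry)
{
	return (unsigned)((entry >> 3) & 0x1f);
}

/* Bits 31:8 hold the idle time prior to transition, in milliseconds. */
static uint64_t example_apst_idle_ms(uint64_t entry)
{
	return (entry >> 8) & 0xffffff;
}
```

With the proposed secondary default, a transition to PS4 after 2000 ms would be packed as example_apst_entry(4, 2000), and the two accessors recover 4 and 2000 from it.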