public inbox for linux-nvme@lists.infradead.org
* [PATCH 1/1] nvme: extend and modify the APST configuration algorithm
@ 2021-04-28  9:27 Alexey Bogoslavsky
  2021-04-28 12:42 ` hch
  2021-05-19  7:00 ` hch
  0 siblings, 2 replies; 5+ messages in thread
From: Alexey Bogoslavsky @ 2021-04-28  9:27 UTC (permalink / raw)
  To: linux-nvme@lists.infradead.org
  Cc: hch@lst.de, kbusch@kernel.org, axboe@fb.com, sagi@grimberg.me

From: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>

The algorithm that was used until now for building the APST configuration
table has been found to produce entries with excessively long ITPT
(idle time prior to transition) for devices declaring relatively long
entry and exit latencies for non-operational power states. This leads
to unnecessary waste of power and, as a result, failure to pass
mandatory power consumption tests on Chromebook platforms.

The new algorithm is based on two predefined ITPT values and two
predefined latency tolerances. Based on these values, as well as on
exit and entry latencies reported by the device, the algorithm looks
for up to 2 suitable non-operational power states to use as primary
and secondary APST transition targets. The predefined values are
supplied to the nvme driver as module parameters:

 - apst_primary_timeout_ms (default: 100)
 - apst_secondary_timeout_ms (default: 2000)
 - apst_primary_latency_tol_us (default: 15000)
 - apst_secondary_latency_tol_us (default: 100000)
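
Illustratively, the selection logic described above can be sketched in
userspace C as follows. This is a simplified model, not the driver code
itself: the constants mirror the module parameter defaults, and last_index
tracks which tier has already been used (states are assumed to be visited
from deepest to shallowest, so a secondary target, if any, is chosen before
the primary one):

```c
#include <stdint.h>

/* Defaults of the module parameters described above. */
static const uint64_t primary_timeout_ms       = 100;
static const uint64_t secondary_timeout_ms     = 2000;
static const uint64_t primary_latency_tol_us   = 15000;
static const uint64_t secondary_latency_tol_us = 100000;

/*
 * Returns the ITPT in ms for a non-operational state with the given total
 * (enlat + exlat) latency, or 0 if the state is not a suitable transition
 * target. *last_index remembers which tolerance tier was already matched
 * (UINT32_MAX = none yet), so at most one secondary and then one primary
 * target are picked.
 */
static uint64_t pick_transition_ms(uint64_t total_latency_us,
				   uint32_t *last_index)
{
	if (total_latency_us <= primary_latency_tol_us) {
		if (*last_index == 1)	/* primary tier already taken */
			return 0;
		*last_index = 1;
		return primary_timeout_ms;
	}
	if (secondary_timeout_ms &&
	    total_latency_us <= secondary_latency_tol_us) {
		if (*last_index <= 2)	/* a target was already picked */
			return 0;
		*last_index = 2;
		return secondary_timeout_ms;
	}
	return 0;
}
```

With the default values, a state whose total latency fits the primary
tolerance gets a 100ms ITPT, a deeper state fitting the secondary tolerance
gets 2000ms, and everything else is skipped.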

The algorithm echoes the approach used by Intel's and Microsoft's drivers
on Windows. The specific default parameter values are also based on those
drivers. However, this patch does not introduce the ability to dynamically
regenerate the APST table when the power source switches between AC and
battery. Adding this functionality may be considered in the future. In the
meantime, the timeouts and tolerances reflect a compromise between the values
used by Microsoft for AC and battery scenarios.

In most NVMe devices the new algorithm results in a more aggressive power
saving policy. While beneficial in most cases, this sometimes comes at the
price of higher IO processing latency in certain scenarios, as well as a
potential impact on the drive's endurance (due to more frequent context
saving when entering deep non-operational states). To provide a fallback for
systems where these regressions cannot be tolerated, the patch allows
reverting to the legacy behavior by setting either the apst_primary_timeout_ms
or the apst_primary_latency_tol_us parameter to 0. Eventually (and possibly
after fine-tuning the default values of the module parameters) the legacy
behavior can be removed.
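
For reference, the legacy heuristic that this fallback preserves derives the
ITPT as 50 * (enlat + exlat), i.e. at most roughly 2% of the time is spent
transitioning, capped to the 24-bit ITPT field. A minimal userspace sketch of
that computation (not the driver code):

```c
#include <stdint.h>

/*
 * Legacy ITPT computation kept as the fallback path:
 * (total_latency_us + 19) / 20 rounds up us/20, which equals
 * 50 * (enlat + exlat) expressed in milliseconds.
 */
static uint64_t legacy_transition_ms(uint64_t total_latency_us)
{
	uint64_t ms = (total_latency_us + 19) / 20;

	if (ms > (1u << 24) - 1)	/* ITPT is a 24-bit field */
		ms = (1u << 24) - 1;
	return ms;
}
```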

TESTING.

The new algorithm has been extensively tested. Initially, simulations were
used to compare APST tables generated by old and new algorithms for a wide
range of devices. After that, power consumption, performance and latencies
were measured under different workloads on devices from multiple vendors
(WD, Intel, Samsung, SK Hynix, Kioxia). Below is a description of the tests
and the findings.

General observations.
The effect the patch has on the APST table varies depending on the entry and
exit latencies advertised by the devices. For some devices the effect is
negligible (e.g. Kioxia KBG40ZNS); for others it is significant, making the
transitions to PS3 and PS4 much quicker (e.g. WD SN530, Intel 760P) or making
the sleep deeper, i.e. PS4 rather than PS3 after a similar idle time (e.g.
SK Hynix BC511). For some devices (e.g. Samsung PM991) the effect is mixed:
the initial transition happens after a longer idle time, but takes the device
to a lower power state.

Workflows.
In order to evaluate the patch's effect on the power consumption and latency,
7 workflows were used for each device. The workflows were designed to test
the scenarios where significant differences between the old and new behaviors
are most likely. Each workflow was tested twice: with the new and with the
old APST table generation implementation. Power consumption, performance and
latency were measured in the process. The following workflows were used:
1) Consecutive write at the maximum rate with IO depth of 2, with no pauses
2) Repeated pattern of 1000 consecutive writes of 4K packets followed by 50ms
   idle time
3) Repeated pattern of 1000 consecutive writes of 4K packets followed by 150ms
   idle time
4) Repeated pattern of 1000 consecutive writes of 4K packets followed by 500ms
   idle time
5) Repeated pattern of 1000 consecutive writes of 4K packets followed by 1.5s
   idle time
6) Repeated pattern of 1000 consecutive writes of 4K packets followed by 5s
   idle time
7) Repeated pattern of a single random read of a 4K packet followed by 150ms
   idle time

Power consumption.
Actual power consumption measurements produced predictable results in
accordance with the APST mechanism's theory of operation.
Devices with long entry and exit latencies, such as the WD SN530, showed a
huge improvement of up to 62% in scenarios 4, 5 and 6. Devices such as the
Kioxia KBG40ZNS, where the resulting APST table looks virtually identical
under both the legacy and the new algorithm, showed little or no change in
average power consumption on all workflows. Devices with extra-short
latencies, such as the Samsung PM991, showed a moderate increase in power
consumption of up to 18% in worst-case scenarios.
In addition, on Intel and Samsung devices a more complex impact was observed
in scenarios 3, 4 and 7. Our understanding is that, due to the longer time
spent in deep non-operational states between writes, the devices start
performing background operations, leading to increased power consumption.
With the old APST tables, part of these operations is delayed until the
scenario is over and a longer idle period begins, but this extra power is
eventually consumed anyway.

Performance.
In terms of performance measured in sustained write or read scenarios, the
effect of the patch is minimal, since in these cases the device does not
enter low-power states.

Latency.
As expected, on devices where the patch causes a more aggressive power saving
policy (e.g. WD SN530, Intel 760P), an increase in latency was observed in
certain scenarios. Workflow number 7, specifically designed to simulate the
worst-case scenario as far as latency is concerned, indeed shows a sharp
increase in average latency (~2ms -> ~53ms on the Intel 760P and ~0.6ms ->
~10ms on the WD SN530). The latency increase on other workloads and other
devices is much milder or non-existent.

Signed-off-by: Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
---
 drivers/nvme/host/core.c | 89 +++++++++++++++++++++++++++++++++++-----
 1 file changed, 78 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2f45e8fcdd7c..9768d2e84562 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -57,6 +57,26 @@ static bool force_apst;
 module_param(force_apst, bool, 0644);
 MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
 
+static unsigned long apst_primary_timeout_ms = 100;
+module_param(apst_primary_timeout_ms, ulong, 0644);
+MODULE_PARM_DESC(apst_primary_timeout_ms,
+	"primary APST timeout in ms");
+
+static unsigned long apst_secondary_timeout_ms = 2000;
+module_param(apst_secondary_timeout_ms, ulong, 0644);
+MODULE_PARM_DESC(apst_secondary_timeout_ms,
+	"secondary APST timeout in ms");
+
+static unsigned long apst_primary_latency_tol_us = 15000;
+module_param(apst_primary_latency_tol_us, ulong, 0644);
+MODULE_PARM_DESC(apst_primary_latency_tol_us,
+	"primary APST latency tolerance in us");
+
+static unsigned long apst_secondary_latency_tol_us = 100000;
+module_param(apst_secondary_latency_tol_us, ulong, 0644);
+MODULE_PARM_DESC(apst_secondary_latency_tol_us,
+	"secondary APST latency tolerance in us");
+
 static bool streams;
 module_param(streams, bool, 0644);
 MODULE_PARM_DESC(streams, "turn on support for Streams write directives");
@@ -2185,14 +2205,54 @@ static int nvme_configure_acre(struct nvme_ctrl *ctrl)
 	return ret;
 }
 
+/*
+ * The function checks whether the given total (exlat + enlat) latency of
+ * a power state allows the latter to be used as an APST transition target.
+ * It does so by comparing the latency to the primary and secondary latency
+ * tolerances defined by module params. If there's a match, the corresponding
+ * timeout value is returned and the matching tolerance index (1 or 2) is
+ * reported.
+ */
+static bool nvme_apst_get_transition_time(u64 total_latency,
+		u64 *transition_time, unsigned *last_index)
+{
+	if (total_latency <= apst_primary_latency_tol_us) {
+		if (*last_index == 1)
+			return false;
+		*last_index = 1;
+		*transition_time = apst_primary_timeout_ms;
+		return true;
+	}
+	if (apst_secondary_timeout_ms &&
+		total_latency <= apst_secondary_latency_tol_us) {
+		if (*last_index <= 2)
+			return false;
+		*last_index = 2;
+		*transition_time = apst_secondary_timeout_ms;
+		return true;
+	}
+	return false;
+}
+
 /*
  * APST (Autonomous Power State Transition) lets us program a table of power
  * state transitions that the controller will perform automatically.
- * We configure it with a simple heuristic: we are willing to spend at most 2%
- * of the time transitioning between power states.  Therefore, when running in
- * any given state, we will enter the next lower-power non-operational state
- * after waiting 50 * (enlat + exlat) microseconds, as long as that state's exit
- * latency is under the requested maximum latency.
+ *
+ * Depending on module params, one of the two supported techniques will be used:
+ *
+ * - If the parameters provide explicit timeouts and tolerances, they will be
+ *   used to build a table with up to 2 non-operational states to transition to.
+ *   The default parameter values were selected based on the values used by
+ *   Microsoft's and Intel's NVMe drivers. Yet, since we don't implement dynamic
+ *   regeneration of the APST table in the event of switching between external
+ *   and battery power, the timeouts and tolerances reflect a compromise
+ *   between values used by Microsoft for AC and battery scenarios.
+ * - If not, we'll configure the table with a simple heuristic: we are willing
+ *   to spend at most 2% of the time transitioning between power states.
+ *   Therefore, when running in any given state, we will enter the next
+ *   lower-power non-operational state after waiting 50 * (enlat + exlat)
+ *   microseconds, as long as that state's exit latency is under the requested
+ *   maximum latency.
  *
  * We will not autonomously enter any non-operational state for which the total
  * latency exceeds ps_max_latency_us.
@@ -2208,6 +2268,7 @@ static int nvme_configure_apst(struct nvme_ctrl *ctrl)
 	int max_ps = -1;
 	int state;
 	int ret;
+	unsigned last_lt_index = UINT_MAX;
 
 	/*
 	 * If APST isn't supported or if we haven't been initialized yet,
@@ -2266,13 +2327,19 @@ static int nvme_configure_apst(struct nvme_ctrl *ctrl)
 			le32_to_cpu(ctrl->psd[state].entry_lat);
 
 		/*
-		 * This state is good.  Use it as the APST idle target for
-		 * higher power states.
+		 * This state is good. It can be used as the APST idle target
+		 * for higher power states.
 		 */
-		transition_ms = total_latency_us + 19;
-		do_div(transition_ms, 20);
-		if (transition_ms > (1 << 24) - 1)
-			transition_ms = (1 << 24) - 1;
+		if (apst_primary_timeout_ms && apst_primary_latency_tol_us) {
+			if (!nvme_apst_get_transition_time(total_latency_us,
+					&transition_ms, &last_lt_index))
+				continue;
+		} else {
+			transition_ms = total_latency_us + 19;
+			do_div(transition_ms, 20);
+			if (transition_ms > (1 << 24) - 1)
+				transition_ms = (1 << 24) - 1;
+		}
 
 		target = cpu_to_le64((state << 3) | (transition_ms << 8));
 		if (max_ps == -1)
-- 
2.17.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


* Re: [PATCH 1/1] nvme: extend and modify the APST configuration algorithm
  2021-04-28  9:27 [PATCH 1/1] nvme: extend and modify the APST configuration algorithm Alexey Bogoslavsky
@ 2021-04-28 12:42 ` hch
  2021-04-28 14:44   ` Andy Lutomirski
  2021-05-19  7:00 ` hch
  1 sibling, 1 reply; 5+ messages in thread
From: hch @ 2021-04-28 12:42 UTC (permalink / raw)
  To: Alexey Bogoslavsky
  Cc: linux-nvme@lists.infradead.org, hch@lst.de, kbusch@kernel.org,
	axboe@fb.com, sagi@grimberg.me, Andy Lutomirski

Adding Andy who wrote the original APST code.




* Re: [PATCH 1/1] nvme: extend and modify the APST configuration algorithm
  2021-04-28 12:42 ` hch
@ 2021-04-28 14:44   ` Andy Lutomirski
  2021-04-28 15:45     ` Alexey Bogoslavsky
  0 siblings, 1 reply; 5+ messages in thread
From: Andy Lutomirski @ 2021-04-28 14:44 UTC (permalink / raw)
  To: hch@lst.de
  Cc: Alexey Bogoslavsky, linux-nvme@lists.infradead.org,
	kbusch@kernel.org, axboe@fb.com, sagi@grimberg.me,
	Andy Lutomirski

On Wed, Apr 28, 2021 at 5:43 AM hch@lst.de <hch@lst.de> wrote:
>
> Adding Andy who wrote the original APST code.

Can you give an example of the APST states and latencies on a device
for which this is useful?

I'm not opposed to adjusting the algorithm, but I'd like to understand
what we're up against.  If Linux were the only game in town, I would
say that the approach in this patch is unfortunate because of the
arbitrary thresholds it introduces, but if it tracks Windows, then
we're probably okay.

--Andy



* RE: [PATCH 1/1] nvme: extend and modify the APST configuration algorithm
  2021-04-28 14:44   ` Andy Lutomirski
@ 2021-04-28 15:45     ` Alexey Bogoslavsky
  0 siblings, 0 replies; 5+ messages in thread
From: Alexey Bogoslavsky @ 2021-04-28 15:45 UTC (permalink / raw)
  To: Andy Lutomirski, hch@lst.de
  Cc: linux-nvme@lists.infradead.org, kbusch@kernel.org, axboe@fb.com,
	sagi@grimberg.me


>  Can you give an example of the APST states and latencies on a device
>  for which this is useful?

Sure. Originally, we faced this problem with the WD SN530 device, which failed
to pass Google's power consumption tests. The device reports the following
latencies:
PS3: entry: 3900us, exit: 11000us (translates to a 745ms ITPT)
PS4: entry: 5000us, exit: 39000us (translates to a 2200ms ITPT)

Then we started looking at other devices and found more with a similar problem,
e.g. the Crucial P5:
PS3: entry: 10000us, exit: 2500us (translates to a 625ms ITPT)
PS4: entry: 12000us, exit: 35000us (translates to a 2350ms ITPT)
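
Plugging these figures into both schemes makes the difference concrete. The
sketch below (userspace C, not driver code) hardcodes the new parameter
defaults and omits the real driver's tier-exclusion bookkeeping for brevity:

```c
#include <stdint.h>

/* Legacy ITPT: 50 * (enlat + exlat), i.e. total latency in us divided by
 * 20 (rounded up) gives the idle time in ms. */
static uint64_t legacy_itpt_ms(uint64_t enlat_us, uint64_t exlat_us)
{
	return (enlat_us + exlat_us + 19) / 20;
}

/* New ITPT under the default tolerances; 0 means "not a target". */
static uint64_t new_itpt_ms(uint64_t enlat_us, uint64_t exlat_us)
{
	uint64_t total = enlat_us + exlat_us;

	if (total <= 15000)	/* apst_primary_latency_tol_us */
		return 100;	/* apst_primary_timeout_ms */
	if (total <= 100000)	/* apst_secondary_latency_tol_us */
		return 2000;	/* apst_secondary_timeout_ms */
	return 0;
}
```

For the SN530, PS3 (3900 + 11000 = 14900us total) moves from a 745ms to a
100ms ITPT, and PS4 (44000us total) from 2200ms to 2000ms, which is what
makes the transitions to the non-operational states much quicker.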

>  I'm not opposed to adjusting the algorithm, but I'd like to understand
>  what we're up against.  If Linux were the only game in town, I would
>  say that the approach in this patch is unfortunate because of the
>  arbitrary thresholds it introduces, but if it tracks Windows, then
>  we're probably okay.
>--Andy

I agree. Using arbitrary numbers would be a very bad idea. But the numbers I'm
suggesting are indeed based on the schemes used on Windows, so they have proved
viable on a huge number of devices.

Regards,
Alexey



* Re: [PATCH 1/1] nvme: extend and modify the APST configuration algorithm
  2021-04-28  9:27 [PATCH 1/1] nvme: extend and modify the APST configuration algorithm Alexey Bogoslavsky
  2021-04-28 12:42 ` hch
@ 2021-05-19  7:00 ` hch
  1 sibling, 0 replies; 5+ messages in thread
From: hch @ 2021-05-19  7:00 UTC (permalink / raw)
  To: Alexey Bogoslavsky
  Cc: linux-nvme@lists.infradead.org, hch@lst.de, kbusch@kernel.org,
	axboe@fb.com, sagi@grimberg.me, Andy Lutomirski

Thanks,

applied to nvme-5.14.


